9 Automating OMOP ETL: Tools, AI approaches, and what’s still missing
The landscape of automated OMOP CDM transformation has evolved significantly: LLM-based concept mapping now achieves 77-96% accuracy on complex terminologies, compared with 52-70% for the traditional Usagi tool. While full end-to-end automation remains elusive, a rich ecosystem of semi-automated tools has emerged, spanning open-source frameworks, commercial platforms, and research prototypes. The median OMOP ETL implementation still takes 358 days (EHDEN data), with vocabulary mapping consuming the most effort, but new AI-powered approaches promise to compress this timeline dramatically.
9.1 Summary of existing tools and approaches
The table below consolidates the major tools and approaches for OMOP ETL and concept mapping automation:
| Tool/Project | Automation Level | Approach | Source Data Types | Status | Repository/Link |
|---|---|---|---|---|---|
| White Rabbit/Rabbit-in-a-Hat | Semi (design only) | Rule-based | CSV, SQL databases, SAS | Active | github.com/OHDSI/WhiteRabbit |
| Usagi | Semi (suggestions + review) | TF-IDF similarity | Code lists, CSV | Active | github.com/OHDSI/Usagi |
| Perseus | Semi (web UI) | Rule-based + SQL | Generic uploads | Active dev | github.com/OHDSI/Perseus |
| ETL-CDMBuilder | Full (pre-built) | Vendor-specific | Claims, EHR | Beta | github.com/OHDSI/ETL-CDMBuilder |
| TOKI | Semi | Deep learning (transformers) | Code descriptions | Research | PMC8279781 |
| Llettuce | Semi | Local LLM + vector search | Medical terms | Active | arXiv:2410.09076 |
| Jackalope Plus | Semi | GPT-4o + SNOMED post-coord | Complex terminology | Published | Nature s41598-025-04046-9 |
| Carrot | Semi | Rules + LLM (Lettuce) | GDPR-compliant data | Active | github.com/Health-Informatics-UoN/carrot |
| Rabbit-in-a-Blender | High | Convention-over-config CLI | Hospital EMR | Active | github.com/RADar-AZDelta/Rabbit-in-a-Blender |
| dbt-synthea | Full (pre-built) | SQL transformations | Synthea CSV | Active | github.com/OHDSI/dbt-synthea |
| IQVIA OMOP Converter | Full | Spark-based, visual UI | EHR, claims | Commercial | iqvia.com |
| InterSystems OMOP | Full | No-code pipelines | FHIR, EHR | Commercial | intersystems.com |
| Microsoft Fabric Healthcare | Full | Pre-built notebooks | FHIR → OMOP | Commercial | Azure |
| Google Healthcare Data Harmonization | High | Whistle config language | FHIR → OMOP | Active | GCP |
| AWS Comprehend Medical | Full | NLP extraction | Clinical notes | Commercial | AWS |
9.2 Three distinct categories of automation approaches
Rule-based and configuration-driven systems form the foundation of the OHDSI ecosystem. White Rabbit, Rabbit-in-a-Hat, and Usagi remain the most widely deployed tools, with Usagi’s Apache Lucene-based term matching achieving 90% accuracy for common medications but dropping to 70% for less frequent terms. Perseus represents the next evolution—a web-based platform integrating all core tools with visual mapping interfaces and pre-built SQL functions. The key limitation: these tools generate documentation and specifications, not executable ETL code. Configuration-driven frameworks like dbt-synthea and the clinical-ai/omop-etl package bridge this gap by using YAML schemas and SQL transformations, though they require technical expertise to configure.
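To make the traditional approach concrete, the sketch below approximates TF-IDF term matching in the style of Usagi using scikit-learn. It is illustrative only: Usagi itself indexes the full standardized vocabulary with Apache Lucene, and the concept list and source terms here are simplified stand-ins.

```python
# Minimal TF-IDF concept-matching sketch in the style of Usagi (illustrative only;
# Usagi uses Apache Lucene over the full OHDSI vocabulary, not this toy list).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical slice of the standard vocabulary: (concept_id, concept_name)
standard_concepts = [
    (313217, "Atrial fibrillation"),
    (201826, "Type 2 diabetes mellitus"),
    (316866, "Hypertensive disorder"),
    (4329847, "Myocardial infarction"),
]

source_terms = ["afib, chronic", "diabetes type II", "high blood pressure"]

# Character n-grams give some tolerance to misspellings and abbreviations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
concept_matrix = vectorizer.fit_transform(name for _, name in standard_concepts)

for term in source_terms:
    scores = cosine_similarity(vectorizer.transform([term]), concept_matrix)[0]
    best = scores.argmax()
    concept_id, concept_name = standard_concepts[best]
    # In practice, every suggestion below a confidence threshold still needs human review.
    print(f"{term!r} -> {concept_id} {concept_name} (score={scores[best]:.2f})")
```

Purely lexical scoring of this kind degrades quickly once source terms stop sharing surface forms with concept names, which is exactly where the accuracy figures above fall off.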
Machine learning and deep learning approaches have demonstrated substantial improvements over traditional methods. TOKI (2021) pioneered the use of sentence embeddings with dual-input neural networks, achieving 91% top-100 accuracy—a 10%+ improvement over Usagi. Custom sentence-transformer models trained on drug vocabularies reached 96.5% accuracy for common medications in 2024. These models excel at capturing semantic similarity that TF-IDF misses, particularly for synonymous terms and multilingual data. The Spanish-SapBERT model for clinical entity linking improved SNOMED-CT mapping by 40+ points over prior benchmarks, demonstrating the value of domain-specific fine-tuning.
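A minimal sketch of the embedding-based alternative follows, assuming the sentence-transformers library and a general-purpose checkpoint; the papers above fine-tune domain-specific models, which this example does not reproduce.

```python
# Semantic concept matching with sentence embeddings (illustrative sketch).
# A general-purpose checkpoint stands in for the fine-tuned clinical models
# described above; the concept list is a hypothetical vocabulary slice.
from sentence_transformers import SentenceTransformer, util

standard_concepts = [
    (313217, "Atrial fibrillation"),
    (201826, "Type 2 diabetes mellitus"),
    (316866, "Hypertensive disorder"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
concept_embeddings = model.encode(
    [name for _, name in standard_concepts],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

def suggest(term: str, top_k: int = 2):
    """Return the top_k candidate standard concepts for a source term."""
    query = model.encode([term], convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query, concept_embeddings, top_k=top_k)[0]
    return [(standard_concepts[h["corpus_id"]], float(h["score"])) for h in hits]

# Semantic match despite sharing no tokens with "Hypertensive disorder".
print(suggest("high blood pressure"))
```

Swapping in a clinically fine-tuned encoder (such as the SapBERT variants mentioned above) is the step that yields the domain-specific gains these papers report.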
LLM-based approaches represent the newest frontier with impressive early results. GPT-3 embeddings with three-tiered semantic matching achieved AUC 0.9975 for clinical trial term mapping. Jackalope Plus, using GPT-4o mini for SNOMED CT post-coordination, reached 77.5% accuracy versus Usagi’s 52.5% for complex medical terminology while delivering 50% time savings. The MCP (Model Context Protocol) agentic framework demonstrated 100% retrieval success for OMOP concept mapping when properly integrated, compared to 0% without structured tool access. These approaches particularly excel at handling ambiguous terms, rare diseases, and complex post-coordinated expressions where pre-coordinated concepts are insufficient.
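The retrieve-then-rerank pattern these tools share can be sketched as follows. This is a conceptual example rather than the Jackalope Plus or MCP implementation: it assumes the OpenAI Python SDK with an OPENAI_API_KEY in the environment, placeholder model names, and a toy concept list standing in for a full vocabulary index.

```python
# Retrieve-then-rerank concept mapping with an LLM (conceptual sketch).
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical vocabulary slice; in practice the whole standard vocabulary is
# embedded once and stored in a vector index.
standard_concepts = [
    (313217, "Atrial fibrillation"),
    (201826, "Type 2 diabetes mellitus"),
    (316866, "Hypertensive disorder"),
    (4329847, "Myocardial infarction"),
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

concept_vecs = embed([name for _, name in standard_concepts])

def retrieve(term, k=3):
    """Vector search: return the k nearest standard concepts to the source term."""
    q = embed([term])[0]
    sims = concept_vecs @ q / (np.linalg.norm(concept_vecs, axis=1) * np.linalg.norm(q))
    return [standard_concepts[i] for i in np.argsort(sims)[::-1][:k]]

def rerank(source_term, candidates):
    """Ask the LLM to pick the best candidate, keeping it grounded in real concept_ids."""
    options = "\n".join(f"{cid}: {name}" for cid, name in candidates)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Source term: {source_term}\n"
                       f"Candidate OMOP standard concepts:\n{options}\n"
                       "Reply with the single best concept_id, or NONE if no candidate fits.",
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(rerank("heart attack", retrieve("heart attack")))
```

Constraining the model to choose among retrieved concept_ids, rather than generating codes freely, is the grounding step that the MCP results above highlight.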
9.3 Automation specifically for ETL code generation
True ETL code generation—automatically producing executable transformation scripts from source schemas—remains the least developed area. No tool currently generates complete, production-ready ETL code from scratch. The closest approaches are described below.
The ETL-CDMBuilder provides pre-written, vendor-specific ETLs (developed at Janssen R&D) for claims and EHR systems, but these are fixed templates rather than adaptive generators. Rabbit-in-a-Hat can export SQL skeletons with field mappings as comments, but the actual transformation logic must be written by hand. Perseus advances this by offering pre-built SQL functions and visual configuration, but still requires significant manual intervention.
The Rabbit-in-a-Blender tool from AZ Delta hospital takes a different approach—convention-over-configuration design where folder structures define CDM tables and Usagi CSV files drive mappings. This reduces code writing but isn’t true code generation. Similarly, YAML-based ETL frameworks (used in Australian hospital implementations and SWERI/CDMS projects) separate configuration from execution, enabling reusability without dynamic code synthesis.
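A toy illustration of the configuration-driven pattern follows. The YAML layout is hypothetical and does not follow the clinical-ai/omop-etl or Rabbit-in-a-Blender conventions; it only shows how a declarative mapping spec can be rendered mechanically into SQL.

```python
# Configuration-driven ETL in miniature: render INSERT ... SELECT statements from a
# declarative mapping spec. The YAML format below is invented for this sketch.
import yaml

CONFIG = """
person:
  source_table: patients
  columns:
    person_id: patient_id
    gender_concept_id: "CASE WHEN sex = 'F' THEN 8532 WHEN sex = 'M' THEN 8507 ELSE 0 END"
    year_of_birth: "EXTRACT(YEAR FROM birth_date)"
condition_occurrence:
  source_table: diagnoses
  columns:
    person_id: patient_id
    condition_concept_id: mapped_concept_id
    condition_start_date: diagnosis_date
"""

def render_sql(spec: dict) -> str:
    """Turn each CDM table mapping into an INSERT ... SELECT statement."""
    statements = []
    for cdm_table, mapping in spec.items():
        targets = ", ".join(mapping["columns"].keys())
        sources = ",\n       ".join(str(expr) for expr in mapping["columns"].values())
        statements.append(
            f"INSERT INTO {cdm_table} ({targets})\n"
            f"SELECT {sources}\n"
            f"FROM {mapping['source_table']};"
        )
    return "\n\n".join(statements)

print(render_sql(yaml.safe_load(CONFIG)))
```

The appeal is that the mapping spec stays declarative and reviewable while the SQL is rendered mechanically; the limitation, as noted above, is that the transformation expressions themselves still have to be authored by hand.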
Cloud platforms offer the most automated pipelines for specific pathways. Microsoft Fabric Healthcare provides ready-to-run notebooks transforming FHIR to OMOP v5.4 with drug era generation. Google’s Whistle configuration language enables declarative FHIR-to-OMOP transformations. AWS Comprehend Medical extracts entities from clinical notes and maps them directly to OMOP NOTE_NLP tables. However, these are all fixed-pattern transformations rather than adaptive code generators.
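For orientation, the following is a heavily simplified example of the FHIR-to-OMOP direction, mapping a single Condition resource to a condition_occurrence row. Real pipelines such as the Fabric notebooks or Whistle configurations also handle terminology translation against the full vocabulary, visit linkage, and era derivation; the lookup table and type concept here are hard-coded assumptions.

```python
# Simplified FHIR Condition -> OMOP condition_occurrence mapping (illustrative only).
from datetime import date

fhir_condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/123"},
    "code": {"coding": [{"system": "http://snomed.info/sct", "code": "44054006",
                         "display": "Type 2 diabetes mellitus"}]},
    "onsetDateTime": "2023-06-01",
}

# Assumption: a SNOMED-code-to-standard-concept lookup (normally drawn from the
# OMOP concept/concept_relationship tables) is available; hard-coded for the sketch.
snomed_to_concept_id = {"44054006": 201826}

coding = fhir_condition["code"]["coding"][0]
condition_occurrence_row = {
    "person_id": int(fhir_condition["subject"]["reference"].split("/")[-1]),
    "condition_concept_id": snomed_to_concept_id.get(coding["code"], 0),
    "condition_start_date": date.fromisoformat(fhir_condition["onsetDateTime"]),
    "condition_type_concept_id": 32817,  # "EHR" type concept; confirm against the vocabulary
    "condition_source_value": coding["code"],
}
print(condition_occurrence_row)
```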
9.4 What’s still missing: gap analysis
End-to-end automation from source to CDM remains the largest gap. Current tools address individual steps—profiling (White Rabbit), design (Rabbit-in-a-Hat), concept mapping (Usagi), validation (DQD)—but no system chains these into a unified automated workflow. The median EHDEN implementation required 358 days with 52% of sites citing vocabulary mapping as most challenging. An integrated pipeline that ingests source schemas and produces validated OMOP CDM output with minimal intervention doesn’t exist.
Non-English language support is severely underdeveloped. Beyond Spanish-SapBERT and Brazilian SIGTAP mapping experiments, most tools assume English input. German ICD-10-GM, Korean EDI codes, and other national vocabularies require creating custom vocabulary extensions. The OHDSI community recommends Google Translate as a preprocessing step for non-English Usagi inputs—hardly a robust solution.
Complex post-coordination handling represents a semantic bottleneck. Standard OMOP mapping forces pre-coordinated concepts, losing clinical nuance when source data describes complex conditions (e.g., “severe bilateral knee osteoarthritis with morning stiffness”). Jackalope Plus addresses this for SNOMED CT, but analogous solutions for RxNorm drug combinations, LOINC qualifiers, and other vocabularies are lacking.
Low-code/no-code solutions for non-technical users are nascent. Clinical researchers without informatics support struggle to transform their Excel/CSV datasets to OMOP. The NIH HEAL Initiative’s GPT-3 embedding tool explicitly targets this gap, but production-ready accessible tools remain scarce. Perseus aims to fill this niche but is still in active development.
Real-time and incremental ETL is operationally critical but poorly supported. Hospital implementations report daily/weekly batch updates (Epic OMOP Anywhere achieves this), but streaming transformations for real-time clinical decision support are not addressed by standard OHDSI tools.
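Where incremental refresh is implemented today, it typically follows a generic watermark pattern like the sketch below, which is not part of the standard OHDSI tool chain. It assumes the source extract carries an updated_at audit column (which not every EHR extract provides) and uses an in-memory SQLite database purely so the example is self-contained; keys and columns are simplified relative to the real CDM tables.

```python
# Watermark-based incremental refresh (generic sketch; not an OHDSI tool).
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE diagnoses (diagnosis_id INT, patient_id INT, mapped_concept_id INT,
                            diagnosis_date TEXT, updated_at TEXT);
    CREATE TABLE condition_occurrence (condition_occurrence_id INT PRIMARY KEY,
                                       person_id INT, condition_concept_id INT,
                                       condition_start_date TEXT);
    CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_run TEXT);
    INSERT INTO diagnoses VALUES (1, 42, 201826, '2024-03-02', '2024-03-02T09:00:00');
""")

def incremental_refresh(conn: sqlite3.Connection) -> None:
    cur = conn.cursor()
    row = cur.execute(
        "SELECT last_run FROM etl_watermark WHERE table_name = 'condition_occurrence'"
    ).fetchone()
    watermark = row[0] if row else "1970-01-01T00:00:00"

    # Re-extract only rows touched since the last successful run, then upsert them
    # using the stable source key so repeated runs stay idempotent.
    changed = cur.execute(
        "SELECT diagnosis_id, patient_id, mapped_concept_id, diagnosis_date "
        "FROM diagnoses WHERE updated_at > ?", (watermark,)
    ).fetchall()
    cur.executemany(
        "INSERT OR REPLACE INTO condition_occurrence VALUES (?, ?, ?, ?)", changed
    )
    cur.execute(
        "INSERT OR REPLACE INTO etl_watermark VALUES ('condition_occurrence', ?)",
        (datetime.now(timezone.utc).isoformat(),),
    )
    conn.commit()

incremental_refresh(conn)
print(conn.execute("SELECT * FROM condition_occurrence").fetchall())
```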
9.5 Key papers worth investigating
The following publications represent the most significant recent advances:
- “Deep-Learning-Based Automated Terminology Mapping in OMOP-CDM” (JAMIA 2021) — Introduced TOKI, demonstrating transformer superiority over TF-IDF for concept matching
- “Jackalope Plus” (Nature Scientific Reports 2025) — Validated GPT-4o mini achieving 77.5% accuracy with SNOMED post-coordination
- “Breaking Digital Health Barriers Through LLM-Based OMOP Mapping Tool” (JMIR 2025) — Three-tiered GPT-3 embedding approach with AUC 0.9975
- “OHDSI Standardized Vocabularies” (JAMIA 2024) — Comprehensive description of the vocabulary ecosystem, spanning 331+ source vocabularies and underpinning data on 2.1 billion patients
- “EHDEN Learnings” (JAMIA 2024) — Real-world data from 25 European implementations documenting timelines, challenges, success factors
- “MCP-Based Agentic Framework” (arXiv 2025) — Demonstrates 100% concept retrieval using Model Context Protocol with LLMs
9.6 Critical open-source repositories
The most actively maintained and practically useful repositories include github.com/OHDSI/WhiteRabbit (805 commits, active), github.com/OHDSI/Perseus (3,107 commits, under development), github.com/Health-Informatics-UoN/carrot (GDPR-compliant mapping without data access), github.com/RADar-AZDelta/Rabbit-in-a-Blender (CLI-driven hospital ETL), and github.com/clinical-ai/omop-etl (YAML-configured academic framework). For FHIR pathways, github.com/OHDSI/FhirToCdm and github.com/GoogleCloudPlatform/healthcare-data-harmonization provide production-ready converters.
9.7 Real-world implementation patterns
The EHDEN network provides the most comprehensive implementation data: 25 sites across 11 European countries, representing 133+ million patients, completed OMOP transformations with highly variable timelines (172-622 days). Vocabulary mapping alone ranged from 4 to 348 days depending on source terminology complexity. Sites that started immediately after project kickoff were twice as likely to finish on time (63% vs 31%), highlighting that organizational factors often outweigh technical ones.
Epic EHR implementations demonstrate the “OMOP Anywhere” approach—leveraging native Caboodle ETL infrastructure to produce 25 OMOP tables with 97-98% Data Quality Dashboard passing rates and daily refresh capability. This vendor-integrated model minimizes custom development but requires Epic licensing. German university hospitals in the MIRACUM consortium achieved 98.8% mapping rates for ICD-10-GM by distributing pre-configured virtual machines with standard tooling.
The All of Us Research Program (NIH) exemplifies scale challenges: 50+ healthcare organizations with 16+ EHR vendors transform local data to OMOP. Their solution uses OMOP as the canonical model with local transformation responsibility—no central ETL automation, but standardized validation via Achilles and DQD.
9.8 Conclusion
The OMOP automation landscape has matured substantially, with LLM-based concept mapping now approaching clinical utility (77-96% accuracy) and configuration-driven ETL frameworks reducing but not eliminating manual effort. The most promising developments combine semantic embeddings with retrieval-augmented generation—Jackalope Plus and MCP-based agentic frameworks represent the current state of the art. Commercial platforms from IQVIA, InterSystems, and major cloud providers offer the most automated end-to-end pathways, particularly for FHIR-to-OMOP transformations.
The critical gaps remain automatic ETL code generation (no tool synthesizes executable transformations from arbitrary source schemas), non-English language support (limited beyond English and Spanish), and accessible tools for non-technical users (clinical researchers still struggle with CSV-to-OMOP workflows). Organizations planning OMOP implementations should budget 12-18 months for complex sources, prioritize team composition and data governance preparation over tool selection, and consider LLM-augmented concept mapping for vocabularies where Usagi underperforms. The trajectory suggests full automation remains 3-5 years away, but semi-automated approaches using current tools can reduce manual mapping effort by 50% or more compared to purely manual processes.