21 Radiology Report Search
In my radiology department unit, the radiologist write a radiology report and stored as HTML text, which will appears in RESULT_TEXT_HTML column in a SQL-based relational database. However, there is currently no structured system to query radiology report other than simple keyword search via SQL or regex-based report query.
21.0.0.1 Proposed Solution
One of the solution is:
I would like to implement a keyword tagging (e.g., ICD-10, RadLex) system to each radiology report by using a large-language model to read the text in radiology report and tag it.
21.0.0.2 Use Cases
The use case of this might be:
- Research query: To find multiple cases in bulk (e.g., query for “ischemic stroke”) based on radiology report for a research study.
- Report search system: A radiologist or radiology resident might use this to quickly find a cases with similar diagnosis to be an example of how to write a radiology report.
21.0.0.3 Question
What would be your recommended database framework/technologies to support this?
Below is a recommended architecture and set of technologies to help you implement keyword (ICD-10, RadLex) tagging and improve the querying of your radiology reports. The solution aims to complement your existing SQL-based relational database while enabling advanced text search, tagging, and research functionality.
21.1 1. Keep the Existing Relational Database for Core Data
Your current SQL database can remain the source of truth for all patient-related data (patient ID, exam dates, metadata), including storing the HTML text of each radiology report. This preserves: - Familiar relational structure for patient demographics and exam records. - Existing query capabilities for standard transactional or simple reporting needs.
Action:
- Add a new relational table or extend your existing table to store the LLM-generated tags (ICD-10 codes, RadLex terms, etc.) in a structured format, e.g.:
CREATE TABLE radiology_report_tags (
INT,
report_id -- or a separate many-to-many table
icd10_codes TEXT[], -- same approach
radlex_terms TEXT[], ...
PRIMARY KEY (report_id)
);
-- Alternatively, store them in JSONB (if using PostgreSQL).
This structure makes it straightforward to do:
SELECT *
FROM radiology_report_tags
WHERE icd10_codes @> '{"I63"}'; -- example ICD-10 for ischemic stroke
(Exact syntax depends on your SQL dialect.)
21.2 2. Add a Full-Text or Semantic Search Layer
21.2.1 Option A: Built-In Full-Text Search (e.g., PostgreSQL)
If you can migrate or already use PostgreSQL, you can leverage its native full-text search (FTS) features or its pg_trgm extension for fuzzy matching. This allows you to: - Create an FTS index over the RESULT_TEXT_HTML
column. - Combine SQL filtering (e.g., by patient ID, date range) with textual relevance queries.
Pros:
- Single DB for both structured data and text search.
- Less operational overhead than a separate search engine.
- Great for moderate-sized data.
Cons:
- May become slow at very large data scales if you have billions of reports.
- FTS queries, while powerful, might be less flexible than specialized search engines or vector search.
21.2.2 Option B: Dedicated Search Engine (e.g., Elasticsearch / OpenSearch)
For more advanced text search (fuzzy matching, advanced relevancy scoring, synonyms, etc.), you can: 1. Ingest radiology reports (and tags) into Elasticsearch or OpenSearch indexes. 2. Use synonym dictionaries for medical terms, ICD-10 expansions, or RadLex synonyms.
3. Perform bool queries or fuzzy queries to find relevant reports.
Pros:
- Scales well for large data volumes.
- Rich querying and analytics tools (Kibana, etc.).
- Efficient indexing and near real-time search.
Cons:
- Additional infrastructure to maintain (separate from the relational DB).
- Requires a pipeline to keep Elasticsearch/OpenSearch in sync with the primary DB.
21.2.3 Option C: Vector (Semantic) Search (e.g., pgvector, Milvus, Pinecone)
If you’re using a Large Language Model (LLM) to generate embeddings (representations of text in a dense vector space), you can store those embeddings in: - A vector database (Milvus, Pinecone, Qdrant, Weaviate)
- PostgreSQL with the pgvector extension
Then you can do semantic search queries (e.g., “find reports most semantically similar to this text: ‘ischemic stroke in right MCA territory’”) even if the exact keywords aren’t present.
Pros:
- More robust “conceptual” matching (synonyms, paraphrases).
- Great for LLM-based search or question answering pipelines.
Cons:
- Requires maintaining embeddings and re-indexing them if your data or model changes.
- Not as straightforward for pure structured queries; typically used in conjunction with relational or full-text approaches.
21.4 4. Example Stack Combinations
Layer | Option A (Simplest) | Option B (Scalable) | Option C (Advanced Semantic) |
---|---|---|---|
Relational DB | PostgreSQL (with RESULT_TEXT_HTML & tags) |
Same, or MS SQL Server / Oracle for structured data | Same, or any SQL DB for core structured data |
Text Indexing | PostgreSQL FTS + pg_trgm |
Elasticsearch / OpenSearch | Vector DB (Milvus/Pinecone/pgvector) for embeddings |
Tag Storage | JSONB or relational table | Same (can also push tags to ES for indexing) | Same, plus embeddings stored in vector DB |
Query Interface | SQL (with FTS) | ES DSL or Kibana, plus DB queries | Vector similarity search + DB queries |
Recommendation:
- If you want a simpler approach to get started: use PostgreSQL with the full-text extensions and store the tags in a JSONB or a separate table.
- If you need advanced analytics, fuzzy matching, and you expect large scale: use Elasticsearch/OpenSearch or a Vector DB for better performance and flexible search. Keep relational data in your SQL DB, and push the textual data + tags + embeddings to the search engine.
21.5 5. Radiology Use Case Summary
- Research Query
- Researchers can quickly find relevant imaging studies (e.g., “ischemic stroke” or ICD-10 = I63) across thousands of archived reports.
- LLM-based tagging ensures “stroke” synonyms (e.g., CVA, infarct) all map to the correct ICD code or RadLex concept.
- Researchers can quickly find relevant imaging studies (e.g., “ischemic stroke” or ICD-10 = I63) across thousands of archived reports.
- Report Search System
- Radiologists can search for “similar cases” using a semantic search approach (like vector embeddings).
- They can also do a direct keyword or tag-based query if they want narrower results.
- Radiologists can search for “similar cases” using a semantic search approach (like vector embeddings).
21.5.1 Example Flow for “Ischemic Stroke” Search
- LLM tags each report with
["I63", "stroke", …]
. - The user enters “Find me all ischemic stroke reports from 2024 that mention subacute changes.”
- Your application queries the structured tags (ICD-10 = “I63”) plus any mention of “subacute” in the text index.
- The system returns a curated list of matching reports (with optional preview snippets).
21.6 Final Recommendation
- Retain your current SQL-based system as the canonical data store for patient records and raw HTML reports.
- Introduce an LLM-based pipeline that generates and stores structured tags (ICD-10, RadLex) in a separate table or JSON column.
- Add a specialized search layer:
- For moderate scale: PostgreSQL Full Text Search (simple, all-in-one).
- For large scale or more advanced text analytics: Elasticsearch or OpenSearch.
- For semantic queries: a vector store (pgvector, Milvus, Pinecone) to store embeddings for more concept-based searches.
This hybrid approach preserves your relational architecture while enabling advanced text/semantic queries and powerful tagging for research and day-to-day radiologist searches.