NLP Tools for Free-Text Clinical Notes
This page provides resources for processing free-text clinical notes, including de-identification and section extraction tools.
De-identification Tools
Tools for removing protected health information (PHI) from clinical text to ensure patient privacy.
NLM Scrubber
A clinical text de-identification tool developed by the National Library of Medicine (NLM).
- Website: NLM Scrubber
- Type: Rule-based
Philter
A rule-based de-identification tool developed by UCSF.
- GitHub: philter-ucsf
- Type: Rule-based
BERT-DEID
A deep learning approach to clinical text de-identification using BERT.
- GitHub: bert-deid
- Type: Deep learning
OBI DEID BERT (i2b2)
A BERT model fine-tuned on i2b2 de-identification data.
- Hugging Face: obi/deid_bert_i2b2
- Type: Deep learning
Sectioning Tools
Tools for extracting structured sections from clinical notes.
LLM-based Section Extraction
This section describes the process of extracting structured sections from clinical notes using OpenAI's API.
Overview
The system identifies and extracts specific sections from clinical notes. It uses OpenAI's API with carefully crafted prompts that instruct the model to copy the original content without modifying or interpreting it.
Dependencies
import pandas as pd
import numpy as np
from openai import OpenAI
from pydantic import BaseModel
import json
Workflow
1. Load Clinical Note
The system reads a clinical note from a text file:
with open('path_to_your_file/your_file_name.txt', 'r', encoding='utf-8') as f:
    content = f.read()
2. Initialize OpenAI Client
client = OpenAI()
3. Prompts
General Section Extraction Prompt
The main prompt (prompt) is designed for comprehensive section extraction with the following key features:
- Critical Rules:
  - Extract original consecutive text exactly as it appears
  - Do not rephrase, summarize, or interpret content
  - Do not insert or supply missing information
  - Copy text exactly including typos, abbreviations, spacing, and formatting
- Instructions:
  - Match sections to provided header list
  - Extract exact text under each section header
  - Combine multiple occurrences with line breaks
  - Set missing sections to "None"
  - Preserve original punctuation, capitalization, and medical abbreviations
  - Place unmatched sections into "other"
- Header Mapping Rules:
  - Include "subjective" and "interval events" in "history_of_present_illness"
  - Include "orders" in "assessment_and_plan"
  - Include "cardiac_tests" and "blood_pressure" in "labs_and_results"
  - Include "vital" in "physical_exam"
  - Include "habits" in "social_history"
  - Include "reason_for_visit" in "chief_complaint"
- Header List:
  - allergy
  - chief_complaint
  - history_of_present_illness
  - past_medical_history
  - family_history
  - social_history
  - surgical_history
  - physical_exam
  - mental_exam
  - labs_and_results
  - medications
  - neurological
  - review_of_systems
  - assessment_and_plan
  - patient_instructions
  - problem_list
  - diagnoses
  - hospital_course
  - imaging
  - other
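The mapping and merging rules above are applied by the model through the prompt, but they can also be sketched deterministically, for example to post-process or sanity-check model output. The following is a minimal illustration only, not the code used by this system: the names `HEADERS`, `ALIASES`, `canonical_header`, and `merge_sections` are hypothetical, as are the lowercase/underscore normalization and the blank-line separator.

```python
# Canonical headers, copied from the Header List.
HEADERS = [
    "allergy", "chief_complaint", "history_of_present_illness",
    "past_medical_history", "family_history", "social_history",
    "surgical_history", "physical_exam", "mental_exam", "labs_and_results",
    "medications", "neurological", "review_of_systems", "assessment_and_plan",
    "patient_instructions", "problem_list", "diagnoses", "hospital_course",
    "imaging", "other",
]

# Aliases, copied from the Header Mapping Rules.
ALIASES = {
    "subjective": "history_of_present_illness",
    "interval_events": "history_of_present_illness",
    "orders": "assessment_and_plan",
    "cardiac_tests": "labs_and_results",
    "blood_pressure": "labs_and_results",
    "vital": "physical_exam",
    "habits": "social_history",
    "reason_for_visit": "chief_complaint",
}


def canonical_header(raw_header: str) -> str:
    """Map a raw header to a canonical one; unmatched headers go to 'other'."""
    # Assumed normalization: trim, lowercase, spaces to underscores.
    key = raw_header.strip().lower().replace(" ", "_")
    key = ALIASES.get(key, key)
    return key if key in HEADERS else "other"


def merge_sections(pairs):
    """Combine (raw_header, text) pairs into one body per canonical header."""
    merged = {header: [] for header in HEADERS}
    for raw_header, text in pairs:
        merged[canonical_header(raw_header)].append(text)
    # Repeated sections are joined with line breaks (a blank line is assumed
    # here); sections with no text are set to the string "None".
    return [
        {"header": header, "body": "\n\n".join(chunks) if chunks else "None"}
        for header, chunks in merged.items()
    ]
```

Calling `merge_sections([("Habits", "Denies tobacco."), ("social_history", "Lives alone.")])` would place both texts under `social_history` and mark every other header as "None".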
Neurological Section Extraction Prompt
A specialized prompt (prompt_1) focuses specifically on extracting neurological sections:
- Same critical rules as the general prompt
- Returns only the "neurological" section
- Includes example responses showing various neurological examination formats
4. API Call
The system makes an API call to extract neurological sections:
response = client.responses.create(
    model="gpt-5-mini",
    input=[
        {
            "role": "system",
            "content": [
                {
                    "type": "input_text",
                    "text": prompt_1
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": content
                }
            ]
        }
    ],
    # temperature=0
)
print(response.output_text)
Example Output:
[{"header": "neurological", "body": "Neurology RESOLVED: History of CVA (cerebrovascular accident) Relevant Orders CBC W/ PLATELETS+ DIFF (COMPLETE) COMPREHENSIVE CHEM PANEL HEMOGLOBIN A1C LIPID PROFILE TSH (THYROID STIM HORMONE),3RD GENERATION URIC ACID LEVEL AUTO DIFFERENTIAL\n\nNeurological: Negative for dizziness, seizures, light-headedness, numbness and headaches. Memory loss\n\nNeurological: General: No focal deficit present. Mental Status: He is alert and oriented to person, place, and time."}]
5. Save Results
The extracted sections are saved to a JSON file:
json_file = json.loads(response.output_text)
with open('sectioned_refined.json', 'w', encoding='utf-8') as f:
    json.dump(json_file, f, indent=2, ensure_ascii=False)
6. Read and Verify
The saved JSON can be read back for verification:
with open("sectioned_refined.json", "r", encoding="utf-8") as f:
    data = json.load(f)
print(data[0]["body"])  # body text of the first extracted section
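Because the prompts ask for verbatim extraction, one possible additional check is to confirm that each extracted chunk actually occurs in the source note. The sketch below is an assumption-laden illustration, not part of the original workflow: `unmatched_chunks` is a hypothetical helper, it assumes combined occurrences are separated by a blank line, and the `ignore_whitespace` option is included because line breaks may be collapsed in the returned text.

```python
import re


def _squash(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def unmatched_chunks(body: str, note: str, ignore_whitespace: bool = False) -> list:
    """Return the chunks of an extracted body that do not occur in the note."""
    if body == "None":  # a section marked as missing has nothing to verify
        return []
    # Assumed convention: multiple occurrences are joined by a blank line.
    chunks = [chunk for chunk in body.split("\n\n") if chunk.strip()]
    if ignore_whitespace:
        haystack = _squash(note)
        return [chunk for chunk in chunks if _squash(chunk) not in haystack]
    return [chunk for chunk in chunks if chunk not in note]
```

With the variables from the steps above, `unmatched_chunks(data[0]["body"], content)` returns an empty list when every chunk is found verbatim. A non-empty result points at text that may have been altered and is worth reviewing by hand.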
Key Features
- Exact Extraction: The prompts instruct the model to preserve the original text exactly as written, including typos and formatting
- No Interpretation: The model is told never to rephrase, summarize, or interpret content
- Structured Output: Results are returned in a consistent JSON format
- Flexible Mapping: Header mapping rules allow for variations in clinical note structure
- Specialized Extraction: Separate prompts can be used for specific section types (e.g., neurological)
Output Format
The system returns a JSON array with the following structure:
[
{
"header": "section_name",
"body": "exact_text_from_note_or_None"
}
]
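Model output is not guaranteed to follow this structure, so it can help to validate it before saving. The following is a minimal sketch using only the standard library; `parse_sections` is a hypothetical helper name, and the strictness (exactly the two keys, both strings) is an assumption based on the structure shown above.

```python
import json


def parse_sections(raw: str) -> list:
    """Parse model output and check it is a list of {"header", "body"} objects."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of sections")
    for index, item in enumerate(data):
        if not isinstance(item, dict) or set(item) != {"header", "body"}:
            raise ValueError(f"item {index} must have exactly 'header' and 'body' keys")
        if not isinstance(item["header"], str) or not isinstance(item["body"], str):
            raise ValueError(f"item {index}: 'header' and 'body' must be strings")
    return data
```

In the workflow above, `parse_sections(response.output_text)` could stand in for the bare `json.loads` call, so that malformed responses fail early instead of being written to disk.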
Notes
- The system uses the gpt-5-mini model for extraction
- Temperature can be set to 0 for more deterministic outputs (the parameter is commented out in the API call above)
- The output is saved with UTF-8 encoding and proper indentation for readability
- Missing sections are explicitly marked as "None" rather than omitted