NLP Tools for Free-Text Clinical Notes

This page provides resources for processing free-text clinical notes, including de-identification and section extraction tools.

De-identification Tools

Tools for removing protected health information (PHI) from clinical text to ensure patient privacy.

NLM Scrubber

A clinical text de-identification tool developed by the National Library of Medicine (NLM).

Website: NLM Scrubber
Type: Rule-based

Philter

A rule-based de-identification tool developed by UCSF.

GitHub: philter-ucsf
Type: Rule-based

BERT-DEID

A deep learning approach to clinical text de-identification using BERT.

GitHub: bert-deid
Type: Deep learning

OBI DEID BERT (i2b2)

A BERT model fine-tuned on i2b2 de-identification data.

Hugging Face: obi/deid_bert_i2b2
Type: Deep learning

Sectioning Tools

Tools for extracting structured sections from clinical notes.

LLM-based Section Extraction

This section describes the process of extracting structured sections from clinical notes using OpenAI's API.

Overview

The system identifies and extracts specific sections from clinical notes. It uses OpenAI's API with prompts written to return the original text without modification or interpretation.

Dependencies

import pandas as pd
import numpy as np
from openai import OpenAI
from pydantic import BaseModel
import json

Workflow

1. Load Clinical Note

The system reads a clinical note from a text file:

with open('path_to_your_file/your_file_name.txt', 'r', encoding='utf-8') as f:
    content = f.read()

2. Initialize OpenAI Client

client = OpenAI()

3. Prompts

General Section Extraction Prompt

The main prompt (prompt) is designed for comprehensive section extraction with the following key features:

Critical Rules:
Extract original consecutive text exactly as it appears
Do not rephrase, summarize, or interpret content
Do not insert or supply missing information
Copy text exactly including typos, abbreviations, spacing, and formatting
Instructions:
Match sections to provided header list
Extract exact text under each section header
Combine multiple occurrences with line breaks
Set missing sections to "None"
Preserve original punctuation, capitalization, and medical abbreviations
Place unmatched sections into "other"
Header Mapping Rules:
Include "subjective" and "interval events" in "history_of_present_illness"
Include "orders" in "assessment_and_plan"
Include "cardiac_tests" and "blood_pressure" in "labs_and_results"
Include "vital" in "physical_exam"
Include "habits" in "social_history"
Include "reason_for_visit" in "chief_complaint"
Header List:
allergy
chief_complaint
history_of_present_illness
past_medical_history
family_history
social_history
surgical_history
physical_exam
mental_exam
labs_and_results
medications
neurological
review_of_systems
assessment_and_plan
patient_instructions
problem_list
diagnoses
hospital_course
imaging
other
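
The header list and mapping rules above can also be expressed as plain data, which is handy for normalizing headers after extraction. The following is a minimal sketch: `HEADERS`, `HEADER_ALIASES`, and `canonical_header` are hypothetical names that do not appear in the original code, and the alias entries are copied from the mapping rules listed above.

```python
# Canonical headers, copied from the header list above.
HEADERS = [
    "allergy", "chief_complaint", "history_of_present_illness",
    "past_medical_history", "family_history", "social_history",
    "surgical_history", "physical_exam", "mental_exam", "labs_and_results",
    "medications", "neurological", "review_of_systems", "assessment_and_plan",
    "patient_instructions", "problem_list", "diagnoses", "hospital_course",
    "imaging", "other",
]

# Aliases taken from the header mapping rules (hypothetical dictionary form).
HEADER_ALIASES = {
    "subjective": "history_of_present_illness",
    "interval_events": "history_of_present_illness",
    "orders": "assessment_and_plan",
    "cardiac_tests": "labs_and_results",
    "blood_pressure": "labs_and_results",
    "vital": "physical_exam",
    "habits": "social_history",
    "reason_for_visit": "chief_complaint",
}


def canonical_header(raw: str) -> str:
    """Map a raw section title to a canonical header, or "other" if unmatched."""
    key = raw.strip().lower().replace(" ", "_")
    if key in HEADERS:
        return key
    return HEADER_ALIASES.get(key, "other")
```

This sketch matches exact names only, so a variant such as "vitals" would fall through to "other" unless it is added to the alias table.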

Neurological Section Extraction Prompt

A specialized prompt (prompt_1) focuses specifically on extracting neurological sections:

Same critical rules as the general prompt
Returns only the "neurological" section
Includes example responses showing various neurological examination formats

4. API Call

The system makes an API call to extract neurological sections:

response = client.responses.create(
  model="gpt-5-mini",
  input=[
    {
      "role": "system",
      "content": [
        {
          "type": "input_text",
          "text": prompt_1
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": content
        }
      ]
    }
  ],
  #temperature=0
)
print(response.output_text)

Example Output:

[{"header": "neurological", "body": "Neurology    RESOLVED: History of CVA (cerebrovascular accident)    Relevant Orders    CBC W/ PLATELETS+ DIFF (COMPLETE)    COMPREHENSIVE CHEM PANEL    HEMOGLOBIN A1C    LIPID PROFILE    TSH (THYROID STIM HORMONE),3RD GENERATION    URIC ACID LEVEL    AUTO DIFFERENTIAL\n\nNeurological: Negative for dizziness, seizures, light-headedness, numbness and headaches.        Memory loss\n\nNeurological:      General: No focal deficit present.      Mental Status: He is alert and oriented to person, place, and time."}]
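
Because the general prompt and the neurological prompt use the same message structure, the `input` payload could be built by a small helper. This is only a sketch: `build_input` is a hypothetical name, and the payload shape simply mirrors the call shown above.

```python
def build_input(system_prompt: str, note_text: str) -> list[dict]:
    """Return the two-message payload (system prompt, then note) used above."""
    return [
        {"role": "system",
         "content": [{"type": "input_text", "text": system_prompt}]},
        {"role": "user",
         "content": [{"type": "input_text", "text": note_text}]},
    ]
```

It could then be passed as `input=build_input(prompt_1, content)` for the neurological prompt, or `input=build_input(prompt, content)` for the general one.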

5. Save Results

The extracted sections are saved to a JSON file:

# Parse the model output; json.loads raises json.JSONDecodeError if it is not valid JSON
sections = json.loads(response.output_text)
with open('sectioned_refined.json', 'w', encoding='utf-8') as f:
    json.dump(sections, f, indent=2, ensure_ascii=False)
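
Model output is not guaranteed to be valid JSON, so the parsing step can fail. A defensive parser might look like the sketch below. `parse_sections` is a hypothetical helper, and the fence-stripping step rests on the assumption that a model sometimes wraps its answer in a Markdown code fence, which the original workflow does not state.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_sections(raw_text: str) -> list:
    """Parse model output into a list of sections, tolerating a code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of sections")
    return data
```

With this helper, `parse_sections(response.output_text)` would take the place of the bare `json.loads` call.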

6. Read and Verify

The saved JSON can be read back for verification:

with open("sectioned_refined.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data[0]["body"])  # body text of the first extracted section
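
Since the prompts require verbatim extraction, a further check is whether each extracted body actually occurs in the source note. The sketch below is one assumed way to do this and is not part of the original workflow. `is_verbatim` is a hypothetical helper that compares whitespace-normalized text, and it splits the body on blank lines because multiple occurrences of a section are combined with line breaks.

```python
import re


def _normalize(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(body: str, note: str) -> bool:
    """Return True if every blank-line-separated chunk of body occurs in note."""
    if body == "None":  # missing sections are marked with the string "None"
        return True
    source = _normalize(note)
    chunks = [c for c in re.split(r"\n\s*\n", body) if c.strip()]
    return all(_normalize(chunk) in source for chunk in chunks)
```

For example, `is_verbatim(data[0]["body"], content)` would flag a body that was paraphrased. Because whitespace is normalized, this check is looser than a byte-for-byte comparison.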

Key Features

Exact Extraction: The prompts instruct the model to preserve the original text exactly as written, including typos and formatting
No Interpretation: The prompts forbid rephrasing, summarizing, or interpreting content
Structured Output: Results are returned in a consistent JSON format
Flexible Mapping: Header mapping rules allow for variations in clinical note structure
Specialized Extraction: Separate prompts can be used for specific section types (e.g., neurological)

Output Format

The system returns a JSON array with the following structure:

[
  {
    "header": "section_name",
    "body": "exact_text_from_note_or_None"
  }
]
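
A lightweight structural check on the parsed output can catch malformed responses early. The following sketch uses only the standard library. `validate_sections` is a hypothetical name, and the rules it enforces (a list of objects with string `header` and `body` fields) are inferred from the structure shown above.

```python
def validate_sections(data) -> list[dict]:
    """Check that data is a list of {"header": str, "body": str} objects."""
    if not isinstance(data, list):
        raise TypeError("expected a list of sections")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise TypeError(f"item {i} is not an object")
        for key in ("header", "body"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"item {i}: '{key}' must be a string")
    return data
```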

Notes

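The system uses the gpt-5-mini model for extraction
Temperature can be set to 0 to reduce output variability (the parameter is commented out in the example call); this does not guarantee identical outputs across runs, and some models may not accept the parameter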
The output is saved with UTF-8 encoding and proper indentation for readability
Missing sections are explicitly marked as "None" rather than omitted
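
Because missing sections carry the literal string "None", downstream code usually needs to tell them apart from real text. A minimal sketch, with `sections_to_dict` as a hypothetical helper name:

```python
from typing import Optional


def sections_to_dict(sections: list[dict]) -> dict[str, Optional[str]]:
    """Map header -> body, converting the "None" marker to Python None."""
    result: dict[str, Optional[str]] = {}
    for section in sections:
        body = section["body"]
        result[section["header"]] = None if body == "None" else body
    return result
```

If the same header appeared more than once in the array, this sketch would keep only the last occurrence.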