NLP Tools for Free-Text Clinical Notes

This page provides resources for processing free-text clinical notes, including de-identification and section extraction tools.


De-identification Tools

Tools for removing protected health information (PHI) from clinical text to ensure patient privacy.

NLM Scrubber

A clinical text de-identification tool developed by the National Library of Medicine (NLM).

Philter

A rule-based de-identification tool developed by UCSF.

BERT-DEID

A deep learning approach to clinical text de-identification using BERT.

OBI DEID BERT (i2b2)

A BERT model fine-tuned on i2b2 de-identification data.


Sectioning Tools

Tools for extracting structured sections from clinical notes.

LLM-based Section Extraction

This section describes the process of extracting structured sections from clinical notes using OpenAI's API.

Overview

The system identifies and extracts specific sections from clinical notes. It calls OpenAI's API with prompts that instruct the model to copy the original text without modifying or interpreting it.

Dependencies

import pandas as pd
import numpy as np
from openai import OpenAI
from pydantic import BaseModel
import json

Workflow

1. Load Clinical Note

The system reads a clinical note from a text file:

with open('path_to_your_file/your_file_name.txt', 'r', encoding='utf-8') as f:
    content = f.read()

2. Initialize OpenAI Client

client = OpenAI()  # by default, reads the API key from the OPENAI_API_KEY environment variable

3. Prompts

General Section Extraction Prompt

The main prompt (prompt) is designed for comprehensive section extraction with the following key features:

  • Critical Rules:
      • Extract original consecutive text exactly as it appears
      • Do not rephrase, summarize, or interpret content
      • Do not insert or supply missing information
      • Copy text exactly including typos, abbreviations, spacing, and formatting

  • Instructions:
      • Match sections to provided header list
      • Extract exact text under each section header
      • Combine multiple occurrences with line breaks
      • Set missing sections to "None"
      • Preserve original punctuation, capitalization, and medical abbreviations
      • Place unmatched sections into "other"

  • Header Mapping Rules:
      • Include "subjective" and "interval events" in "history_of_present_illness"
      • Include "orders" in "assessment_and_plan"
      • Include "cardiac_tests" and "blood_pressure" in "labs_and_results"
      • Include "vital" in "physical_exam"
      • Include "habits" in "social_history"
      • Include "reason_for_visit" in "chief_complaint"

  • Header List:
      • allergy
      • chief_complaint
      • history_of_present_illness
      • past_medical_history
      • family_history
      • social_history
      • surgical_history
      • physical_exam
      • mental_exam
      • labs_and_results
      • medications
      • neurological
      • review_of_systems
      • assessment_and_plan
      • patient_instructions
      • problem_list
      • diagnoses
      • hospital_course
      • imaging
      • other
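
The mapping rules and header list above can also be mirrored in code, for example to check the headers the model returns. The following is a minimal sketch and not part of the original pipeline: the names HEADERS, HEADER_ALIASES, and canonical_header are hypothetical, and the normalization step (lowercasing, replacing spaces with underscores, dropping a trailing colon) is an assumption.

```python
# Canonical headers, copied from the Header List above.
HEADERS = [
    "allergy", "chief_complaint", "history_of_present_illness",
    "past_medical_history", "family_history", "social_history",
    "surgical_history", "physical_exam", "mental_exam", "labs_and_results",
    "medications", "neurological", "review_of_systems", "assessment_and_plan",
    "patient_instructions", "problem_list", "diagnoses", "hospital_course",
    "imaging", "other",
]

# Aliases, copied from the Header Mapping Rules above.
HEADER_ALIASES = {
    "subjective": "history_of_present_illness",
    "interval_events": "history_of_present_illness",
    "orders": "assessment_and_plan",
    "cardiac_tests": "labs_and_results",
    "blood_pressure": "labs_and_results",
    "vital": "physical_exam",
    "habits": "social_history",
    "reason_for_visit": "chief_complaint",
}


def canonical_header(raw: str) -> str:
    """Map a raw header to a canonical one; unmatched headers become "other"."""
    # Assumed normalization: trim, drop a trailing colon, lowercase,
    # and join words with underscores.
    key = "_".join(raw.strip().rstrip(":").lower().split())
    if key in HEADERS:
        return key
    return HEADER_ALIASES.get(key, "other")
```

For instance, canonical_header("Interval Events") resolves to "history_of_present_illness", while a header outside both tables falls through to "other".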

Neurological Section Extraction Prompt

A specialized prompt (prompt_1) focuses specifically on extracting neurological sections:

  • Same critical rules as the general prompt
  • Returns only the "neurological" section
  • Includes example responses showing various neurological examination formats

4. API Call

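The system calls the Responses API with prompt_1 as the system message and the note text as the user message to extract the neurological section:

response = client.responses.create(
    model="gpt-5-mini",
    input=[
        # System message: the neurological extraction prompt
        {
            "role": "system",
            "content": [{"type": "input_text", "text": prompt_1}],
        },
        # User message: the full text of the clinical note
        {
            "role": "user",
            "content": [{"type": "input_text", "text": content}],
        },
    ],
    # temperature=0,  # optional; left disabled in this example
)
print(response.output_text)  # the model's text output, expected to be a JSON array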
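
The nested message structure in the call above is the same whichever prompt is used, so it could be factored into a small helper. This is only a sketch: build_input is a hypothetical name and is not part of the OpenAI SDK.

```python
def build_input(system_prompt: str, note_text: str) -> list[dict]:
    """Return the two-message input list in the shape used by the call above."""

    def message(role: str, text: str) -> dict:
        # Each message carries a single "input_text" content item.
        return {"role": role, "content": [{"type": "input_text", "text": text}]}

    return [message("system", system_prompt), message("user", note_text)]
```

With this helper, the call above could be written as client.responses.create(model="gpt-5-mini", input=build_input(prompt_1, content)), and the general prompt could be swapped in by passing prompt instead of prompt_1.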

Example Output:

[{"header": "neurological", "body": "Neurology    RESOLVED: History of CVA (cerebrovascular accident)    Relevant Orders    CBC W/ PLATELETS+ DIFF (COMPLETE)    COMPREHENSIVE CHEM PANEL    HEMOGLOBIN A1C    LIPID PROFILE    TSH (THYROID STIM HORMONE),3RD GENERATION    URIC ACID LEVEL    AUTO DIFFERENTIAL\n\nNeurological: Negative for dizziness, seizures, light-headedness, numbness and headaches.        Memory loss\n\nNeurological:      General: No focal deficit present.      Mental Status: He is alert and oriented to person, place, and time."}]

5. Save Results

The extracted sections are saved to a JSON file:

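# Parse the model's output; raises json.JSONDecodeError if it is not valid JSON
sections = json.loads(response.output_text)
with open('sectioned_refined.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    json.dump(sections, f, indent=2, ensure_ascii=False)

6. Read and Verify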

The saved JSON can be read back for verification:

with open("sectioned_refined.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data[0]["body"])  # body of the first extracted section

Key Features

  1. Exact Extraction: The prompts instruct the model to preserve the original text exactly as written, including typos and formatting
  2. No Interpretation: The prompts forbid rephrasing, summarizing, or interpreting content
  3. Structured Output: Results are returned in a consistent JSON format
  4. Flexible Mapping: Header mapping rules allow for variations in clinical note structure
  5. Specialized Extraction: Separate prompts can be used for specific section types (e.g., neurological)
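
Since exact extraction depends on the model following the prompt, it can be worth checking an extracted body against the source note. The sketch below is one possible check and is not part of the original workflow: non_verbatim_chunks is a hypothetical helper, and it assumes multiple occurrences are joined with blank lines, as in the example output.

```python
def non_verbatim_chunks(note: str, body: str) -> list[str]:
    """Return the chunks of an extracted body that do not appear verbatim in the note."""
    # Assumption: occurrences are joined with "\n\n"; adjust the separator
    # if your prompt combines them differently.
    chunks = [c for c in body.split("\n\n") if c.strip()]
    return [c for c in chunks if c not in note]


note = (
    "Neurological: Negative for dizziness.\n"
    "Plan: follow up.\n"
    "Neurological: No focal deficit present."
)
body = "Neurological: Negative for dizziness.\n\nNeurological: No focal deficit."

# The second chunk was shortened, so it is reported as non-verbatim.
print(non_verbatim_chunks(note, body))
```

An empty list means every chunk was found in the note. A body of "None" (a missing section) would be reported by this check, so such sections should be skipped before calling it.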

Output Format

The system returns a JSON array with the following structure:

[
  {
    "header": "section_name",
    "body": "exact_text_from_note_or_None"
  }
]
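
Model output is not guaranteed to match this structure, so a shape check before saving may help. The following is a minimal standard-library sketch: validate_sections and its allowed_headers parameter are hypothetical names. The pydantic BaseModel imported under Dependencies could express similar checks, but how the original code uses it is not shown here.

```python
import json


def validate_sections(raw: str, allowed_headers=None) -> list[dict]:
    """Parse model output and check it matches the documented array shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of sections")
    for i, item in enumerate(data):
        if not isinstance(item, dict) or set(item) != {"header", "body"}:
            raise ValueError(f"item {i}: expected exactly 'header' and 'body' keys")
        if not isinstance(item["header"], str) or not isinstance(item["body"], str):
            raise ValueError(f"item {i}: 'header' and 'body' must be strings")
        # Optional check against a caller-supplied set of permitted headers.
        if allowed_headers is not None and item["header"] not in allowed_headers:
            raise ValueError(f"item {i}: unexpected header {item['header']!r}")
    return data
```

Since json.JSONDecodeError is a subclass of ValueError, a caller can catch ValueError to handle both malformed JSON and an unexpected structure.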

Notes

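  • The system uses the gpt-5-mini model for extraction
  • Temperature can be set to 0 to reduce run-to-run variation, where the model accepts the parameter; it is commented out in the example call, and a value of 0 does not guarantee identical outputs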
  • The output is saved with UTF-8 encoding and proper indentation for readability
  • Missing sections are explicitly marked as "None" rather than omitted
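
Downstream code may prefer a real null value over the "None" string. The sketch below shows one way to convert the saved array into a header-to-body mapping. The name sections_to_dict is hypothetical, and treating any body equal to "None" as missing is an assumption.

```python
def sections_to_dict(sections: list[dict]) -> dict:
    """Map header -> body, turning the "None" sentinel into Python None."""
    result = {}
    for item in sections:
        body = item["body"]
        # Assumption: a body equal to "None" (ignoring surrounding
        # whitespace) marks a missing section.
        result[item["header"]] = None if body.strip() == "None" else body
    return result
```

One caveat of the sentinel: a section whose text in the note is literally "None" (for example, an allergy section) cannot be told apart from a missing section. If the array contains the same header more than once, later entries overwrite earlier ones in this sketch.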