Data de-identification workflow in DigitalTWINS platform

UK Biobank de-identification

De-Identification protocol
Purpose: Ensures participant data released to researchers is de-identified to prevent inadvertent re-identification, while complying with legal and ethical obligations.
Data Types:
- Identifiable Data (not shared with researchers): Names, addresses, NHS numbers, contact details.
- De-identified Data (shared with researchers): Phenotypes, genomic data, biomarkers, imaging, health records, and questionnaire responses.
Data Management:
- Identifiable data is stored separately and encrypted, accessible only to authorized personnel.
- A PID (unique identifier derived from NHS numbers) links datasets securely.
- EID (project-specific encrypted identifier) replaces the PID in datasets shared with researchers.
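
The PID-to-EID step can be illustrated with a keyed hash. This is a minimal sketch under stated assumptions, not UK Biobank's documented scheme: the source describes the EID as an encrypted identifier, whereas the sketch swaps in HMAC-SHA256 (a one-way keyed hash) to show the project-specific property. The function name `derive_eid`, the example PID, and the per-project keys are all hypothetical.

```python
import hashlib
import hmac


def derive_eid(pid: str, project_key: bytes) -> str:
    """Derive a project-specific identifier from a PID using HMAC-SHA256."""
    digest = hmac.new(project_key, pid.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; the length is an assumption


# Hypothetical per-project keys; in practice these would come from a key store.
key_project_a = b"project-a-secret-key"
key_project_b = b"project-b-secret-key"

# The same participant receives a different identifier in each project,
# so datasets released to different projects cannot be joined on the ID.
eid_a = derive_eid("PID-0001", key_project_a)
eid_b = derive_eid("PID-0001", key_project_b)
print(eid_a != eid_b)
```

Because the derivation is deterministic for a given key, repeated releases to the same project stay linkable, while releases to different projects do not.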

HIPAA’s De-Identification Methods

Under the U.S. Health Insurance Portability and Accountability Act (HIPAA), two methods are recognized:

Safe Harbor Method:
- Requires removal of 18 specific identifiers, including:
  - Names, addresses, phone numbers, email addresses.
  - Dates (e.g., birthdates, admission dates) except year.
  - Geographic subdivisions smaller than a state (e.g., full ZIP codes).
  - Social Security numbers, medical record numbers, health plan numbers.
  - Biometric identifiers (e.g., fingerprints), facial photos, unique device IDs.
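- Once all 18 identifiers are removed, and the covered entity has no actual knowledge that the remaining information could identify an individual, the data is considered de-identified.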
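
The removal step can be sketched for a flat record as follows. This is an illustration only: the field names are hypothetical, the sets cover just a subset of the 18 identifiers, and geographic generalization (e.g., ZIP codes) is not handled.

```python
from typing import Any

# Hypothetical field names for a flat record; a real schema will differ.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "address", "phone", "email", "ssn",
    "medical_record_number", "health_plan_number", "device_id", "photo",
}
DATE_FIELDS = {"birth_date", "admission_date"}


def safe_harbor_strip(record: dict[str, Any]) -> dict[str, Any]:
    """Drop listed identifier fields and reduce listed date fields to the year."""
    cleaned: dict[str, Any] = {}
    for field_name, value in record.items():
        if field_name in DIRECT_IDENTIFIER_FIELDS:
            continue  # removed entirely
        if field_name in DATE_FIELDS and isinstance(value, str):
            cleaned[field_name] = value[:4]  # assumes ISO 8601 "YYYY-MM-DD"
        else:
            cleaned[field_name] = value
    return cleaned


record = {
    "name": "Jane Roe",
    "ssn": "000-00-0000",
    "birth_date": "1970-03-02",
    "diagnosis": "I10",
}
print(safe_harbor_strip(record))
```
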
Expert Determination Method:
- A qualified expert statistically certifies that the risk of re-identification is "very small."
- Requires documentation of methods and results.
Key Differences
- Date Handling:
  - HIPAA Safe Harbor requires dates to be reduced to year only.
  - UK Biobank truncates birthdates to month and year, which would not comply with HIPAA Safe Harbor.
- Unique Identifiers:
  - HIPAA Safe Harbor requires removal of medical record numbers, while UK Biobank uses encrypted PIDs/EIDs (allowed under Expert Determination if risk is minimized).
- Genetic Data:
  - HIPAA does not explicitly address genomic data risks, whereas UK Biobank acknowledges genetic re-identification risks but deems them low.
- Compliance Framework:
  - HIPAA is legally mandated for U.S. entities; UK Biobank follows GDPR and contractual obligations (e.g., MTAs).

EUCanSHare

EUCanSHare, a project focused on cardiovascular data sharing across Europe, employs a comprehensive strategy to handle de-identified data, balancing research utility with privacy protection. Its approach is summarized below.

De-identification Techniques:
- Pseudonymization: Primary method where direct identifiers (e.g., names, IDs) are replaced with codes. A secure, separate key is maintained by a trusted third party or data custodian, allowing re-identification only under strict conditions.
- Anonymization: Pursued where feasible, ensuring data cannot be linked back to individuals (e.g., aggregation, removal of indirect identifiers). This aligns with GDPR standards for irreversible anonymization.
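
Pseudonymization with a separately held key can be sketched as a lookup table that only the custodian keeps. This is a minimal sketch under assumptions, not EUCanSHare's actual implementation: the `Pseudonymizer` class, the `P-` prefix, and the in-memory storage are hypothetical, and a real key table would live in a separate, access-controlled store.

```python
import secrets


class Pseudonymizer:
    """Custodian-side mapping between direct identifiers and random codes."""

    def __init__(self) -> None:
        self._code_by_id: dict[str, str] = {}
        self._id_by_code: dict[str, str] = {}

    def pseudonym_for(self, identifier: str) -> str:
        """Return a stable random code for an identifier, creating it on first use."""
        if identifier not in self._code_by_id:
            code = "P-" + secrets.token_hex(8)
            while code in self._id_by_code:  # regenerate on the unlikely collision
                code = "P-" + secrets.token_hex(8)
            self._code_by_id[identifier] = code
            self._id_by_code[code] = identifier
        return self._code_by_id[identifier]

    def reidentify(self, code: str) -> str:
        """Reverse lookup; only the key holder can perform this step."""
        return self._id_by_code[code]


custodian = Pseudonymizer()
code = custodian.pseudonym_for("patient-1")
print(code, custodian.reidentify(code))
```

Because the codes are random rather than derived from the identifier, a recipient without the table has nothing to reverse, which is what distinguishes this from a plain hash.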
Regulatory Compliance:
- GDPR Adherence: Ensures all processes meet EU data protection standards. Pseudonymized data is treated as personal data, requiring safeguards like access controls and encryption.
- Ethical and Legal Frameworks: Collaborates with ethics boards to ensure compliance with regional regulations and study-specific consent terms.
Data Governance and Access:
- Tiered Access Model:
  - Open Access: Fully anonymized datasets available publicly.
  - Controlled Access: Pseudonymized data accessible via secure platforms, requiring ethical approval and data use agreements.
- Data Use Agreements: Prohibit re-identification attempts and mandate secure handling.
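
The tiered access model can be expressed as a small decision function. The names `Tier`, `AccessRequest`, and `may_access` are hypothetical, and the rule encodes only the two conditions listed above (ethical approval and a data use agreement); an actual access platform would check more.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN = "open"              # fully anonymized, public
    CONTROLLED = "controlled"  # pseudonymized, gated


@dataclass(frozen=True)
class AccessRequest:
    has_ethics_approval: bool = False
    has_signed_dua: bool = False  # DUA = data use agreement


def may_access(tier: Tier, request: AccessRequest) -> bool:
    """Open data is always accessible; controlled data needs both approvals."""
    if tier is Tier.OPEN:
        return True
    return request.has_ethics_approval and request.has_signed_dua


print(may_access(Tier.CONTROLLED, AccessRequest(has_ethics_approval=True)))
```
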
Technical Safeguards:
- Secure Infrastructure: Uses encrypted storage and transfer (e.g., HTTPS, VPNs) on platforms like the European Genome-Phenome Archive (EGA).
- Metadata Management: Removes or obscures indirect identifiers (e.g., dates, rare diagnoses) in metadata to prevent linkage attacks.
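
Obscuring rare diagnoses can be sketched as frequency-based suppression: values that occur fewer than a threshold number of times are replaced with a generic placeholder. The function name, the threshold of 5, and the `OTHER` placeholder are assumptions chosen for illustration, not values taken from EUCanSHare.

```python
from collections import Counter


def suppress_rare_values(
    values: list[str], min_count: int = 5, placeholder: str = "OTHER"
) -> list[str]:
    """Replace values that occur fewer than min_count times with a placeholder."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]


diagnoses = ["I10"] * 5 + ["E11"] * 5 + ["Q87.4"]
print(suppress_rare_values(diagnoses)[-1])
```
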
Operational Protocols:
- Centralized De-identification: Standardized processes applied at data hubs to ensure consistency across diverse sources (cohorts, registries).
- Provenance Tracking: Maintains records of data transformations to preserve integrity without compromising privacy.
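
One way to track provenance without storing sensitive values is to log which operation touched which fields, and never the data itself. This is a hypothetical sketch: `ProvenanceLog` and its entry layout are assumptions, not a EUCanSHare format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceLog:
    """Append-only record of transformations; stores field names, never values."""

    entries: list = field(default_factory=list)

    def record(self, dataset_id: str, operation: str, fields: list) -> dict:
        entry = {
            "dataset_id": dataset_id,
            "operation": operation,
            "fields": sorted(fields),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.entries.append(entry)
        return entry


log = ProvenanceLog()
log.record("cohort-a", "generalize_to_year", ["birthDate"])
log.record("cohort-a", "remove", ["telecom", "name"])
print(len(log.entries))
```
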
Ethical and Training Measures:
- Informed Consent: Ensures participant consent covers de-identified data sharing, with studies audited for compliance.
- Researcher Training: Provides guidelines on ethical data handling, GDPR, and de-identification best practices.
Risk Mitigation:
- Regular Audits: Monitors data access and usage to prevent breaches.
- Re-identification Risk Assessments: Evaluates datasets for residual risks, applying additional anonymization if needed.
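
A common way to quantify residual risk is k-anonymity: the size of the smallest group of records sharing the same combination of quasi-identifiers. The sketch below is one possible metric under assumptions, not the assessment EUCanSHare necessarily uses; the function name and field names are hypothetical.

```python
from collections import Counter


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the smallest equivalence-class size over the quasi-identifiers."""
    groups = Counter(
        tuple(record.get(q) for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


records = [
    {"gender": "male", "birth_year": "1985"},
    {"gender": "male", "birth_year": "1985"},
    {"gender": "female", "birth_year": "1990"},
]
print(k_anonymity(records, ["gender", "birth_year"]))
```

A result of 1 means at least one record is unique on those attributes, which would be a signal to generalize or suppress further before release.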

Using the EUCanSHare approach to de-identify FHIR Patient resource data

Scenario

We have FHIR Patient resources stored on our server and want to share de-identified data for research via EUCanSHare while complying with privacy standards.

Identify Direct and Indirect Identifiers:
- Direct identifiers (must be pseudonymized or removed):
  - Patient.id (internal system ID)
  - Patient.identifier (e.g., national ID, medical record number)
  - Patient.name
  - Patient.telecom (email/phone)
  - Patient.address
  - Patient.birthDate (if precise)
- Indirect identifiers (may require generalization/removal):
  - Patient.gender
  - Patient.birthDate (year only)
  - Rare conditions (e.g., in Patient.extension or linked resources).
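
The classification above can be turned into a simple pre-release check that reports which direct identifiers are still present in a Patient resource. This is a sketch: the function name is hypothetical, and treating any `birthDate` longer than four characters as precise is an assumption based on the list above.

```python
# FHIR Patient elements treated as direct identifiers in the list above.
DIRECT_IDENTIFIER_ELEMENTS = ("id", "identifier", "name", "telecom", "address")


def find_direct_identifiers(patient: dict) -> list[str]:
    """List the direct-identifier elements that are present and non-empty."""
    found = [element for element in DIRECT_IDENTIFIER_ELEMENTS if patient.get(element)]
    birth_date = patient.get("birthDate", "")
    # A birthDate more precise than "YYYY" is counted as a direct identifier.
    if isinstance(birth_date, str) and len(birth_date) > 4:
        found.append("birthDate")
    return found


patient = {
    "resourceType": "Patient",
    "id": "12345",
    "name": [{"given": ["John"], "family": "Doe"}],
    "gender": "male",
    "birthDate": "1985-07-15",
}
print(find_direct_identifiers(patient))
```
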

Apply De-identification Techniques:

Original FHIR Patient Resource:

json

{
    "resourceType": "Patient",
    "id": "12345",
    "identifier": [{
        "system": "urn:oid:2.16.840.1.113883.4.1",
        "value": "SSN-987-65-4321"
    }],
    "name": [{
        "given": ["John"],
        "family": "Doe"
    }],
    "telecom": [{
        "system": "phone",
        "value": "+1-555-123-4567"
    }],
    "gender": "male",
    "birthDate": "1985-07-15",
    "address": [{
        "city": "Amsterdam",
        "country": "Netherlands"
    }]
}

De-identified FHIR Patient Resource:

json

{
    "resourceType": "Patient",
    "id": "pseud-9a8b7c6d", // Pseudonymized ID Patient.id: "12345" → "pseud-9a8b7c6d" (using a secure hashing algorithm like SHA-256 with a salt).
    "identifier": [{
        "system": "urn:oid:2.16.840.1.113883.4.1",
        "value": "pseud-a1b2c3d4" // Pseudonymized SSN
    }],
    "gender": "male",
    "birthDate": "1985", // Generalized to year only
    "address": [{
        "country": "Netherlands" // Retain country, remove city
    }]
}

Code for de-identification via EUCanSHare

import json
import hashlib
import secrets
from datetime import datetime

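# Generate a secure salt. A fresh salt on every run produces different pseudonyms
# each time, so in production load a persistent secret from secure storage instead.
SALT = secrets.token_hex(16)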


def pseudonymize(value: str) -> str:
    """Pseudonymize a value using SHA-256 with salt"""
    if not value:
        return ""
    return hashlib.sha256(f"{SALT}{value}".encode()).hexdigest()[:12]


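def generalize_birthdate(date_str: str) -> str:
    """Generalize a FHIR birthDate (YYYY-MM-DD, YYYY-MM, or YYYY) to year only"""
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y")
        except (ValueError, TypeError):
            continue  # Try the next, less precise format
    return ""  # Handle invalid/missing dates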


def deidentify_patient(patient_data: dict) -> dict:
    """Main de-identification function for FHIR Patient resources"""
    deidentified = {"resourceType": "Patient"}

    # Pseudonymize direct identifiers
    deidentified["id"] = pseudonymize(patient_data.get("id", ""))

    # Handle identifiers (e.g., SSN); .get() tolerates entries without a system or value
    if "identifier" in patient_data:
        deidentified["identifier"] = [{
            "system": ident.get("system", ""),
            "value": pseudonymize(ident.get("value", ""))
        } for ident in patient_data["identifier"]]

    # Remove name and telecom completely: the output is built field by field,
    # so elements that are never copied (name, telecom) cannot leak

    # Generalize birthdate
    if "birthDate" in patient_data:
        deidentified["birthDate"] = generalize_birthdate(patient_data["birthDate"])

    # Generalize address (keep country only)
    if "address" in patient_data:
        deidentified["address"] = [{
            "country": addr.get("country", "")
        } for addr in patient_data["address"]]

    # Keep gender (low-risk)
    deidentified["gender"] = patient_data.get("gender", "unknown")

    return deidentified


# Example Usage
if __name__ == "__main__":
    # Original FHIR Patient data
    original_patient = {
        "resourceType": "Patient",
        "id": "12345",
        "identifier": [{
            "system": "urn:oid:2.16.840.1.113883.4.1",
            "value": "SSN-987-65-4321"
        }],
        "name": [{
            "given": ["John"],
            "family": "Doe"
        }],
        "telecom": [{
            "system": "phone",
            "value": "+1-555-123-4567"
        }],
        "gender": "male",
        "birthDate": "1985-07-15",
        "address": [{
            "city": "Amsterdam",
            "country": "Netherlands"
        }]
    }

    # De-identify the patient; telecom is never copied into the output,
    # so no separate deletion step is needed
    deidentified_patient = deidentify_patient(original_patient)

    print("De-identified Patient:")
    print(json.dumps(deidentified_patient, indent=2))
