Understanding PII (Personally Identifiable Information)
What is PII?
Personally Identifiable Information (PII) refers to any information that can be used to identify an individual. Examples include names, addresses, Social Security numbers, and credit card details. Protecting PII is crucial to prevent unauthorized access and maintain user privacy.
Example of PII:
Consider a medical record entry:
Dear Dr. Ralph,
Patient: Jane Doe
Medical Record Number: 987654
DOB: 01/15/1980
Address: 456 Medical Avenue, Cityville, State 12345
In this scenario, “Jane Doe,” the medical record number “987654,” and the address “456 Medical Avenue, Cityville, State 12345” are all examples of healthcare-related PII.
Concerns with Exposing PII
1. Privacy Violation: Revealing PII without proper masking can lead to privacy violations, potentially exposing individuals to identity theft, fraud, or other malicious activities.
2. Regulatory Compliance: Many industries are subject to strict regulations regarding PII handling, such as GDPR, HIPAA, or CCPA. Failure to comply with these regulations can result in severe legal consequences.
3. Trust Issues: Users expect their sensitive information to be handled with care. Any breach of trust can damage relationships and reputation.
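To make the idea of masking concrete before bringing in an LLM, here is a minimal rule-based sketch (the patterns and `mask_pii` helper are illustrative, not part of any library). Regexes catch well-formatted fields like SSNs and dates, but miss free-text PII such as names — which is exactly why an LLM-based approach helps.

```python
import re

# Illustrative patterns for a few well-structured PII fields.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Replace every match with a [LABEL] placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Jane Doe, SSN 123-45-6789, DOB 01/15/1980, jane@example.com"
print(mask_pii(record))
# -> Jane Doe, SSN [SSN], DOB [DATE], [EMAIL]
# Note that "Jane Doe" slips through: names need smarter detection.
```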
The Solution: PII Masking with LlamaIndex 🦙
Now that we understand the risks associated with exposing PII, let’s explore a solution using LlamaIndex for PII masking.
LlamaIndex lets you use an LLM to mask PII in your text before it is indexed, so that sensitive information never reaches the downstream query pipeline in plain form.
NOTE: The following example uses OpenAI, but a locally hosted LLM is preferred, since the raw text (including the PII) is sent to the masking model.
You’ll need to install llama_index and get an OpenAI API key to run the code below. To set up llama_index and learn more about it, check out LlamaIndex Boilerplate.
The code would look like this:
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.schema import TextNode, NodeWithScore
from llama_index.indices.postprocessor import PIINodePostprocessor
import openai

# Set the OpenAI API key
openai.api_key = "OPENAI_API_KEY"
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)
text = """
Dear Dr. Ralph,
Patient: Jane Doe
Medical Record Number: 987654
DOB: 01/15/1980
Address: 456 Medical Avenue, Cityville, State 12345
"""
# Create a node using the above text
node = TextNode(text=text)
# Create PII processor
processor = PIINodePostprocessor(service_context=service_context)
# Process the node with the PII postprocessor
new_nodes = processor.postprocess_nodes([NodeWithScore(node=node)])
# check if the masking worked or not!
print("\nMasked Text\n", new_nodes[0].node.get_text())
print("\nMetadata PII:\n", new_nodes[0].node.metadata["__pii_node_info__"])
# You can add the masked node into the index and query against it.
# feed into index
index = VectorStoreIndex([n.node for n in new_nodes])
response = index.as_query_engine().query(
"What is the patient's home address?"
)
print("\nQuery Output: \n", str(response))
Run the .py file and check the output. If everything worked, you will see something like this in the terminal:
Masked Text:
Dear [Doctor's Name],
Patient: [NAME]
Medical Record Number: [MEDICAL_RECORD_NUMBER]
DOB: [DATE_OF_BIRTH]
Address: [ADDRESS]
Metadata PII:
{
'NAME': 'Jane Doe',
'MEDICAL_RECORD_NUMBER': '987654',
'DATE_OF_BIRTH': '01/15/1980',
'ADDRESS': '456 Medical Avenue, Cityville, State 12345'
}
Query Output:
[ADDRESS]
As you can see, all the important patient information is now masked, and the query engine will never send the raw PII to the LLM.
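Note that the original values never leave your machine: PIINodePostprocessor keeps them in `node.metadata["__pii_node_info__"]`. If you need to restore them in a response locally, a simple substitution over that mapping works. A minimal sketch (the `unmask` helper is mine, not a LlamaIndex API; a plain dict stands in for the node metadata):

```python
def unmask(masked_text: str, pii_info: dict) -> str:
    # Replace each [LABEL] placeholder with the original value kept locally.
    for label, value in pii_info.items():
        masked_text = masked_text.replace(f"[{label}]", value)
    return masked_text

# Mirrors the shape of node.metadata["__pii_node_info__"] from the run above.
pii_info = {
    "NAME": "Jane Doe",
    "ADDRESS": "456 Medical Avenue, Cityville, State 12345",
}

print(unmask("Patient [NAME] lives at [ADDRESS].", pii_info))
# -> Patient Jane Doe lives at 456 Medical Avenue, Cityville, State 12345.
```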
LlamaIndex is a great framework for interacting with LLMs using your own data, and the team ships something new and exciting all the time! Check them out here: LlamaIndex Docs
Keep learning. Keep Debugging!