I'm having a hard time anonymizing PII with Presidio for a project I am working on. For example, when I clean the data and pass in an address (e.g. 123 Sesame Street, Los Angeles, California), I get back:
123 Sesame Street, <LOCATION>, <LOCATION>.
While this is a step in the right direction, how can I get it to also anonymize 123 Sesame Street?
I tried adding context clues such as "I live at", "My address is", and "is my address", hoping the analyzer would take a closer look at the surrounding text, but it did not help.
Code:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Default engines with the built-in recognizers
analyzer_engine = AnalyzerEngine()
anonymizer_engine = AnonymizerEngine()

# Context words I hoped would help the analyzer pick up the street address
pii_context_clues = ['name', 'phone', 'address is', 'my address', 'live at', 'live in']
text = 'My address is 123 Sesame Street, Los Angeles, California'

# Detect PII, then replace each detected span with its entity type
analysis_results = analyzer_engine.analyze(text=text, language='en', context=pii_context_clues)
redacted_text = anonymizer_engine.anonymize(text=text, analyzer_results=analysis_results)
print(redacted_text)
Output:
text: My address is 123 Sesame Street, <LOCATION>, <LOCATION>
items:
[
{'start': 45, 'end': 55, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},
{'start': 33, 'end': 43, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'}
]
Desired Output:
text: My address is <LOCATION>, <LOCATION>, <LOCATION>
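One fallback I am considering, as a rough sketch rather than a proper fix, is a plain regex pass over the text for the street portion. The pattern, the suffix list, and the helper name mask_street_addresses below are my own assumptions for illustration (they are not part of Presidio), and the pattern only covers simple "number + name + suffix" addresses. It uses nothing but Python's standard re module:

```python
import re

# Hypothetical pattern (an assumption, not part of Presidio): a house number,
# one to three capitalized words, then a common street suffix.
# The suffix list is illustrative and far from exhaustive.
STREET_ADDRESS = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\b\.?"
)


def mask_street_addresses(text: str, placeholder: str = "<LOCATION>") -> str:
    """Replace every span matching STREET_ADDRESS with the placeholder."""
    return STREET_ADDRESS.sub(placeholder, text)


# Applied to the text Presidio already returned:
print(mask_street_addresses("My address is 123 Sesame Street, <LOCATION>, <LOCATION>"))
# -> My address is <LOCATION>, <LOCATION>, <LOCATION>
```

This gives the desired output for my example, but it feels brittle (no unit numbers, PO boxes, or lowercase street names), so I would prefer a way to have Presidio itself recognize the street address.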