
I have a question about identifying GDPR (General Data Protection Regulation) related sentences. Is there a tool or method in Python, Java, etc. that identifies whether a database column contains personally identifiable information from its description alone?

We could use word embeddings to get the "most_similar" or "most_similar_cosmul" words for a sentence and then look for GDPR-related keywords (biometric, personal, id, photo...), but the results depend on the robustness of the word embedding model.

Thank you in advance,

Amr

1 Answer


There is no such thing as "personally identifiable information" in GDPR. The term (from GDPR article 4(1)) is "personal data", defined as:

any information relating to an identified or identifiable natural person

and it doesn't itself have to be identifying to qualify. What's an "identifiable natural person"? GDPR says:

an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

The key thing that turns regular "data" into "personal data" here is that "one or more factors" phrase. A single field, such as a phone number, could reasonably be considered as uniquely identifying a person. By itself a postal code probably doesn't, but when combined with a street address and a first name, we'd be very close to being able to identify someone, and hence all other data would become "personal". It's hard to evaluate whether a collection of fields is enough to uniquely identify someone or not – you might think that first name and city might not identify an individual, given "John" and "London", but "Esmerelda" and "Ulaanbaatar" might be pretty easy to track down, and it's the "worst case" that counts.
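The "worst case" point can be made concrete with a rough sketch. The following assumes a small, made-up list of user rows (the data and the `smallest_group_size` helper are hypothetical, for illustration only) and counts how many rows share each combination of field values. A smallest group of size 1 means at least one person is singled out by that combination alone.

```python
from collections import Counter

def smallest_group_size(rows, fields):
    """Size of the smallest group of rows sharing the same values for `fields`.

    Assumes `rows` is a non-empty list of dicts that all contain `fields`.
    """
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values())

# Made-up sample data, not taken from any real system
users = [
    {"first_name": "John", "city": "London"},
    {"first_name": "John", "city": "London"},
    {"first_name": "John", "city": "London"},
    {"first_name": "Esmerelda", "city": "Ulaanbaatar"},
]

# The three Johns in London hide among each other, but the worst case is 1
print(smallest_group_size(users, ["first_name", "city"]))  # 1
```

This only measures uniqueness within one table; it says nothing about what can be joined from elsewhere, so a result greater than 1 does not show that the fields are safe.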

For a simpler example: A colour value such as #663399 by itself is just plain "data", is not "personal data", and is not subject to GDPR. That exact same value stored as "favourite colour" in a field in a table linking that data to a person is personal data. "City" in a table of cities is not personal data, but a "city" field in a user table is.

In short, you're not going to be able to do what you want. You can't tell whether a field is personal data or not from its name because you have insufficient context.
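To illustrate the limitation, here is a minimal sketch of the simplest version of the idea: flag a description if it contains a word from a keyword list. The keyword list and the `looks_personal` function are assumptions made up for this example, not anything defined by GDPR or by an existing library.

```python
import re

# Hypothetical keyword list, chosen for illustration only
PERSONAL_KEYWORDS = {"birth", "name", "phone", "address", "biometric", "photo"}

def looks_personal(description):
    """Return True if the description contains any keyword from the list."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return bool(words & PERSONAL_KEYWORDS)

# An explicit description is caught
print(looks_personal("This field gathers information located in users' birth certificates"))  # True

# An innocuous-sounding field is missed, whatever ends up stored in it
print(looks_personal("notes"))  # False

# A personal-sounding word is flagged even in a plain table of cities
print(looks_personal("name of the city"))  # True
```

Swapping the keyword match for embedding similarity would change which descriptions get flagged, but not the underlying problem: the description does not say what the field is linked to.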

Synchro
  • Thank you for your response. Suppose I do have context in my sentences, for example "This field gathers information located in users' birth certificates". After removing stopwords we get "information, located, user, birth, certificate", so this column will presumably contain personal information. How could I use external pretrained models (gensim, fasttext...), trained on Wikimedia data for instance, to tag this column as containing personal information? Are there other tools or techniques? I thought about using word similarity and then defining a "personal information" area – Amr Jul 28 '20 at 13:47
  • I just don't think you can tell reliably enough to be useful – there's nothing stopping a database naming its columns "a", "b", "c" and storing sensitive data in them, or having something innocuous-sounding like "notes" into which someone has pasted a medical history. I don't think even a human could form a reliable opinion by looking at field names alone, so you're going to have a very hard time getting a machine to do it. The obvious ones might be easy, but there will be a ton of others that will be impossible. Going the other way, fields that sound personal might not be. – Synchro Jul 28 '20 at 13:54
  • If you already have definitions that give you that much metadata about an individual field, you can presumably flag it as being identifying or not at the same time, in which case the machine learning is redundant. – Synchro Jul 28 '20 at 13:56
  • Thank you very much for your help. – Amr Jul 28 '20 at 14:14