I have Python code that generates SQL queries from English queries. At prediction time, however, the English query I send to the model may contain sensitive data. I want to mask sensitive information such as nouns and numbers in the English query, and then unmask that data in the predicted query I receive back.

In short, I need a Python program that can mask the nouns and numbers in a string and unmask them later whenever I need to. The masked values can be replaced with anything you suggest.

Sample English Query:

How many Chocolate Orders for a customer with ID 123456?

Masking Expected Output:

How many xxxxxxxxxx Orders for a customer with ID xxxxxxxxx? 

My algorithm will create a query like:

Select count(1) from `sample-bucket` as d where d.Type ='xxxxxxxx' and d.CustId = 'xxxxxxx'

Now I need the unmasked query like below:

Select count(1) from `sample-bucket` as d where d.Type ='Chocolate' and d.CustId = '123456'
Aadhi Verma

1 Answer

You can use the code below to mask and unmask a string. It stores each masked word in a dictionary, so the original words can be restored later when you unmask the string. I think this can be helpful for people who use third-party tools.

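import base64
import re

import nltk

# Tagger model required by nltk.pos_tag (newer NLTK releases may instead
# need 'averaged_perceptron_tagger_eng').
nltk.download('averaged_perceptron_tagger')

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def base_64_encoding(text):
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")

def base_64_decoding(text):
    return base64.b64decode(text.encode("utf-8")).decode("utf-8")

masked_element = {}  # original word -> masked (Base64) form
english_query = "How many Chocolate Orders for a customer with ID 123456?"
print("English Query: ", english_query)

# Split into words and the separators between them, so that "123456?"
# yields the word "123456" and the digit check below can match it.
tokens = re.findall(r"\w+|\W+", english_query)
words = [t for t in tokens if re.fullmatch(r"\w+", t)]
# Tag all words in one call so the tagger sees each word in context.
tagged = iter(nltk.pos_tag(words))

masked_tokens = []
for token in tokens:
    if re.fullmatch(r"\w+", token):
        tag = next(tagged)[1]
        if token.isdigit() or tag in NOUN_TAGS:
            masked = base_64_encoding(token)
            masked_element[token] = masked
            token = masked
    masked_tokens.append(token)
# Rebuild from tokens instead of str.replace, which would also hit substrings.
english_query = "".join(masked_tokens)
print("Masked Query: ", english_query)

# Unmask: swap each masked form back to the original word.
for key, val in masked_element.items():
    english_query = english_query.replace(val, key)
print("Unmasked English Query: ", english_query)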

Below is the output of the above program: [screenshot of the program output]
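
The question also asks for the predicted SQL to be unmasked, not just the English query, so here is a rough sketch of that round trip. It is only one possible approach and rests on a few assumptions: the `__MASK_n__` placeholder format, the `mask` and `unmask` helpers, and the `is_sensitive` check are all hypothetical names made up for this example, and it assumes the model copies the placeholders into its output unchanged. Opaque placeholders are used instead of Base64 because Base64 is an encoding rather than encryption, so anyone who receives a Base64 token can decode it. In a real program `is_sensitive` would be backed by the POS tagger; here it is a hard-coded stand-in so the example runs without NLTK.

```python
import re

# Matches the hypothetical placeholder format used in this sketch.
PLACEHOLDER = re.compile(r"__MASK_\d+__")

def mask(text, is_sensitive):
    """Return (masked_text, mapping), where mapping is placeholder -> original."""
    placeholders = {}  # original word -> placeholder

    def substitute(match):
        word = match.group(0)
        if not is_sensitive(word):
            return word
        if word not in placeholders:
            placeholders[word] = f"__MASK_{len(placeholders)}__"
        return placeholders[word]

    masked = re.sub(r"\w+", substitute, text)
    mapping = {p: w for w, p in placeholders.items()}
    return masked, mapping

def unmask(text, mapping):
    """Swap every known placeholder in text back to its original value."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)

# Stand-in for a POS-based noun check (assumption: only "Chocolate" is sensitive).
SENSITIVE_WORDS = {"Chocolate"}

def is_sensitive(word):
    return word.isdigit() or word in SENSITIVE_WORDS

masked, mapping = mask(
    "How many Chocolate Orders for a customer with ID 123456?", is_sensitive
)
print(masked)

# Pretend this came back from the model with the placeholders intact.
predicted = (
    "Select count(1) from `sample-bucket` as d "
    "where d.Type ='__MASK_0__' and d.CustId = '__MASK_1__'"
)
print(unmask(predicted, mapping))
```

Because the mapping is keyed by placeholder and never leaves your side, the third party only ever sees `__MASK_0__`-style tokens. The trailing `__` keeps `__MASK_1__` from being confused with `__MASK_10__` during unmasking.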

Aadhi Verma