We have a lot of documents in an Elasticsearch index and we are doing full-text searches on them at the moment. My next requirement in the project is to find all credit card data in the documents. In the future, users will also be able to define their own regular-expression search rules dynamically. But with the standard analyzer it is not possible to search for credit card numbers or any user-defined pattern. For instance, say a document contains a credit card number such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as four separate tokens, as seen below:
"tokens" : [
{
"token" : "4321",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "4321",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "4321",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "4321",
"start_offset" : 15,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 3
}
]
I am just ignoring the Luhn algorithm for now. If I do a basic regular-expression search for a credit card with the regexp "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing, because the data is not analyzed and indexed for that kind of search. I thought that for this purpose I need to define a custom analyzer for regular-expression searches and store another version of the data in another field or index. But as I said, in the future users will define their own rule patterns for searching.

How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that? With an ngram tokenizer I think I could search for all of the defined regular-expression rules, but is that reasonable? The project has to work with huge amounts of data without performance problems (a company's entire file system will be indexed).

Do you have any other suggestions for this type of data-discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
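To make the question a bit more concrete, this is roughly the kind of query I tried against the standard-analyzed field (the index and field names are just placeholders). Since the regexp runs against the individual indexed tokens, which are all just "4321", it matches nothing:

GET my_documents/_search
{
  "query": {
    "regexp": {
      "content": "([0-9]{4}[- ]){3}[0-9]{4}"
    }
  }
}

And this is the sort of custom ngram analyzer I have in mind, indexed as a second sub-field next to the normal one (again, the names are placeholders and I'm assuming a 7.x-style mapping; the max_ngram_diff setting has to be raised because of the 2-to-20 gap):

PUT my_documents
{
  "settings": {
    "index.max_ngram_diff": 18,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "ngrams": {
            "type": "text",
            "analyzer": "my_ngram_analyzer"
          }
        }
      }
    }
  }
}

My worry is the size of an index like this once a whole file system's worth of content goes through a 2-to-20 ngram tokenizer.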