
We have lots of documents in an Elasticsearch index and we are doing full-text searches on them at the moment. My next requirement in a project is to find all credit card data in the documents. The user will also be able to define some regular-expression-based search rules dynamically in the future. But with the standard analyzer it is not possible to search for credit card info or any user-defined rule. For instance, let's say a document contains credit card info such as 4321-4321-4321-4321 or 4321 4321 4321 4321. Elasticsearch indexes this data as 4 separate tokens, as seen below.
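(The output below can be reproduced with a standard-analyzer _analyze call like this one:)

POST _analyze
{
  "analyzer": "standard",
  "text": "4321-4321-4321-4321"
}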

  "tokens" : [
    {
      "token" : "4321",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "4321",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "4321",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "4321",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}


I'm not taking the Luhn algorithm into account for now. If I do a basic regular expression search for a credit card with the regexp "([0-9]{4}[- ]){3}[0-9]{4}", it returns nothing, because the data is not analyzed and indexed for that. I thought that for this purpose I would need to define a custom analyzer for regular expression searches and store another version of the data in another field or index. But as I said before, in the future the user will define his/her own custom rule patterns for searching. How should I define the custom analyzer? Should I define an ngram tokenizer (min: 2, max: 20) for that, as in the sketch below? With an ngram tokenizer I think I could search for all defined regular expression rules. But is it reasonable? The project has to work with huge amounts of data without any performance problems (a company's whole file system will be indexed). Do you have any other suggestion for this type of data discovery problem? My main purpose is finding credit cards at the moment. Thanks for helping.
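For illustration, here is the kind of regexp query that comes back empty (the index and field names are placeholders); the regexp is applied to individual tokens such as "4321", never to the original text, so it cannot match:

POST my_index/_search
{
  "query": {
    "regexp": {
      "text": "([0-9]{4}[- ]){3}[0-9]{4}"
    }
  }
}

And this is a minimal sketch of the ngram setup I have in mind (all names are placeholders; note that recent Elasticsearch versions also require raising index.max_ngram_diff when max_gram - min_gram exceeds 1):

PUT my_index
{
  "settings": {
    "index.max_ngram_diff": 18,
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}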

Thorux
  • Besides the fact that it's not exactly a good idea to store credit card numbers in ES (but that's not the point here), will the user be able to search for any prefix/infix/suffix substring in the credit card number or only a full credit card number? – Val Oct 15 '19 at 14:19
  • My aim here is to detect documents which include sensitive data, so that I can take action on these types of documents. I'm not interested in any substring of the credit card number. – Thorux Oct 15 '19 at 14:39
  • So you confirm you're only ever going to search for 16 digits (plus possibly some separator symbols)? – Val Oct 15 '19 at 14:40
  • For credit cards, yes. But in the future there will be other rules too, like finding documents which include a social security number. More patterns will be added to the system. That's why I thought of analyzing with ngrams. – Thorux Oct 15 '19 at 14:50
  • Ok, but that will be in a different field, right? Or are those numbers in a big bulk of text? – Val Oct 15 '19 at 14:51
  • Could you please explain your question more? What do you mean by a different field? Actually, I have all the company's file system documents and I need a strategy for doing data discovery in an efficient way across all these documents. – Thorux Oct 16 '19 at 05:42
  • Do you have a field `"card_number": "2435 3526 3527 3728"` or are card numbers buried inside a bigger text body such as in `"text": "the card number for Mr XYZ is 4637-2342-3442-3224 blablabla"`? – Val Oct 16 '19 at 05:51
  • The second one. I only have text content for the documents. – Thorux Oct 16 '19 at 05:53

1 Answer


Ok, here is a pair of custom analyzers that can help you detect credit card numbers and social security numbers. Feel free to adapt the regular expressions as you see fit (by adding/removing other character separators that you will find in your data).

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "card_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "card_number"
          ]
        },
        "ssn_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "social_number"
          ]
        }
      },
      "filter": {
        "card_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3$4"
        },
        "social_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{3})[\s\.\-]+(\d{2})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "card": {
            "type": "text",
            "analyzer": "card_analyzer"
          },
          "ssn": {
            "type": "text",
            "analyzer": "ssn_analyzer"
          }
        }
      }
    }
  }
}

Let's test this.

POST test/_analyze
{
  "analyzer": "card_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}

Will yield a nice digit-only credit card number (the keyword tokenizer emits the whole text as a single token, which the pattern_replace filter then rewrites in place, hence the offsets spanning the entire input):

{
  "tokens" : [
    {
      "token" : "3526472847236374",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}

Similarly for SSN:

POST test/_analyze
{
  "analyzer": "ssn_analyzer",
  "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374"
}

Will yield a nice digit-only social security number:

{
  "tokens" : [
    {
      "token" : "442231452",
      "start_offset" : 0,
      "end_offset" : 86,
      "type" : "word",
      "position" : 0
    }
  ]
}

And now we can search for either a credit card number or an SSN. Let's say we have the following two documents. The SSN and credit card numbers are the same, yet they use different character separators:

POST test/_doc
{ "text": "Mr XYZ whose SSN is 442-23-1452 has a credit card whose number was 3526 4728 4723 6374" }

POST test/_doc
{ "text": "SSN is 442.23.1452 belongs to Mr. XYZ. He paid $20 via credit card number 3526-4728-4723-6374" }

You can now find both documents by looking for the credit card number and/or SSN in any format:

POST test/_search 
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723 6374"
    }
  }
}

POST test/_search 
{
  "query": {
    "match": {
      "text.card": "3526 4728 4723-6374"
    }
  }
}

POST test/_search 
{
  "query": {
    "match": {
      "text.ssn": "442 23-1452"
    }
  }
}

All the above queries will match and return both documents.
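Since more user-defined rules will be added later, the same recipe extends naturally: one pattern_replace filter plus one sub-field per rule. Here is a minimal sketch of what a hypothetical phone-number rule could look like (the index name, filter name, and regex are illustrations only, not part of the setup above). Keep in mind that analyzers can only be defined at index creation time or on a closed index, and that existing documents must be reindexed before a new sub-field becomes searchable:

PUT test-with-phone
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "phone_number"
          ]
        }
      },
      "filter": {
        "phone_number": {
          "type": "pattern_replace",
          "preserve_original": false,
          "pattern": """.*(\d{3})[\s\.\-]+(\d{3})[\s\.\-]+(\d{4}).*""",
          "replacement": "$1$2$3"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "phone": {
            "type": "text",
            "analyzer": "phone_analyzer"
          }
        }
      }
    }
  }
}

A match query on text.phone then behaves exactly like the card and SSN queries above.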

Val
  • Thanks for the guidance. My implementation is slightly different but your answer helped me a lot. – Thorux Nov 08 '19 at 13:53