
I am working on a file system project (like Dropbox). For the file system, I have data indexed in Elasticsearch for full-text search. I have lots of large documents and searching works really well. But now my requirement is to use this data to query for some regexes. We have an admin panel for the customer, and the regexes will be defined dynamically by the customer in the admin panel.

I know I can do regex searches in Elasticsearch, but the problem here is the tokenizer. For instance, let's assume the user wants to create a regex pattern to search for 3 letters, '-' and 2 digits, such as "ABC-12" or "ASD-34". The problem is my tokenizer: the defined tokenizer omits the character '-' and indexes "ABC" and "12" separately. You may say I should not omit the '-' character. But the user may also want to search a pattern with 3 letters, whitespace and 2 digits to retrieve "ABC 12", and here the whitespace is the problem. I have to use some tokenizer, and no single tokenizer can cover all dynamic regexes. So searching in the index does not solve my problem.
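To illustrate the tokenizer issue, here is a minimal sketch using the official Python client (the endpoint is a placeholder, and I'm assuming the `standard` analyzer for the demonstration; your own analyzer may differ):

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Ask ES how the standard analyzer tokenizes the sample value.
# The '-' is treated as a separator, so the original "ABC-12" is lost:
# only the tokens "abc" and "12" end up in the index.
resp = es.indices.analyze(body={"analyzer": "standard", "text": "ABC-12"})
print([t["token"] for t in resp["tokens"]])  # ['abc', '12']
```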

Actually, for this type of search I have another option, which is to query all data with match_all. With the search scroll API, I can fetch all original documents page by page. After each response from the scroll API, I can run my regex finder in a separate thread, so that the desired data is ready once the scrolling operation finishes. Do you think this option is viable for big data? I think I will need serious CPU power and RAM. I know it is not an elegant solution, but I cannot find any effective alternative for my requirement. I am open to better solutions. Thanks.
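For reference, a minimal sketch of that scroll-and-match approach (Python client; the endpoint, index name and field name are placeholders):

```python
import re
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical endpoint
pattern = re.compile(r"[A-Za-z]{3}-\d{2}")    # a customer-defined regex
INDEX, FIELD = "documents", "content"         # hypothetical names

# Page through every document with the scroll API and match client-side.
resp = es.search(index=INDEX, scroll="2m",
                 body={"query": {"match_all": {}},
                       "_source": [FIELD], "size": 1000})
matches = []
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        matches.extend(pattern.findall(hit["_source"].get(FIELD, "")))
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
es.clear_scroll(scroll_id=resp["_scroll_id"])  # free server-side resources
```

Even with threading, every byte still has to travel from ES to the client, so this approach scales with the total corpus size rather than with the number of matches.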

Thorux
  • I would definitely not query with match_all as it boils down to getting a full dump and doing the matching on the client-side, which defeats the purpose of ES in the first place. – Val Dec 16 '19 at 10:00
  • Why include the _regex_ tag if this is not a regex question? – Dec 16 '19 at 21:58

1 Answer


I believe ES allows you to analyse the same field multiple times; the documentation states that new analysers can be added to existing fields later:

New multi-fields can be added to existing fields using the PUT mapping API.

This opens up a possibility to dynamically add new analysers (and tokenisers, for that matter) as you find out what sort of regexes your users are after. I am not sure how trivial it will be for your particular use case, but this seems like an avenue to explore.
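A rough sketch of what that could look like (index and field names are made up; note that a `regexp` query against a `keyword` field must match the entire value, hence the `.*` anchors, and for very long values the `wildcard` field type introduced in ES 7.9 may be a better fit):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Add an untokenized keyword sub-field to the existing "content" field.
# Existing documents must be reindexed/updated before the new
# sub-field is populated.
es.indices.put_mapping(index="documents", body={
    "properties": {
        "content": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}}
        }
    }
})

# Lucene regexes are anchored to the whole term, so wrap the user's
# pattern in .* to find it anywhere inside the raw value.
resp = es.search(index="documents", body={
    "query": {"regexp": {"content.raw": ".*[A-Za-z]{3}-[0-9]{2}.*"}}
})
print(resp["hits"]["total"])
```

The leading `.*` makes the regex expensive to evaluate, but it keeps the matching inside ES instead of shipping every document to the client.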

timur