I am working on a file system project (like Dropbox). For full-text search, the file contents are indexed in Elasticsearch. I have lots of large documents, and searching works really well. But now my requirement is to query this data with regexes. We have an admin panel for the customer, and the regexes will be defined dynamically by the customer in that panel.
I know I can do regex searches in Elasticsearch, but the problem here is the tokenizer. For instance, assume the user wants a pattern of 3 letters, a '-', and 2 digits, matching "ABC-12" or "ASD-34". My tokenizer strips the '-' character and indexes "ABC" and "12" as separate tokens. You might say "then don't strip '-'", but the user may equally want a pattern of 3 letters, a whitespace, and 2 digits to match "ABC 12", and now the whitespace is the problem. Whatever tokenizer I choose, it cannot cover all dynamically defined regexes, so searching the analyzed index does not solve my problem.
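One direction I have considered (sketched below; the index and field names are made up for illustration) is keeping an untokenized copy of the content in a sub-field, e.g. of type `wildcard` (or `keyword`, though that has practical length limits for large documents), and running the `regexp` query against that sub-field. Note that Lucene regexes are anchored to the whole value, hence the leading and trailing `.*`:

```json
PUT /files
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "raw": { "type": "wildcard" }
        }
      }
    }
  }
}

GET /files/_search
{
  "query": {
    "regexp": {
      "content.raw": {
        "value": ".*[A-Z]{3}[- ][0-9]{2}.*"
      }
    }
  }
}
```

Since the `raw` sub-field is never tokenized, the '-' and the whitespace both survive, but I am not sure how well this performs on very large document bodies.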
Actually, for this type of search I have another option: query all data with `match_all`. With the scroll API, I can retrieve all original documents in batches. After each scroll response, I can run my regex finder in a separate thread, so the desired data is ready once the scrolling finishes. Do you think this option is viable for big data? I expect it will need a lot of CPU power and RAM. I know it is not an elegant solution, but I cannot find a more effective one for this requirement. I am open to better solutions. Thanks.
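To make the scroll-plus-regex idea concrete, here is a minimal Python sketch of the matching side. The scroll part is stubbed out with a local generator (in production it would be the Elasticsearch scroll API, e.g. `elasticsearch.helpers.scan`); the document IDs and field names are hypothetical:

```python
import re
from concurrent.futures import ThreadPoolExecutor


def scroll_batches():
    """Stub standing in for the Elasticsearch scroll API:
    yields documents in fixed-size batches."""
    docs = [
        {"id": 1, "content": "invoice ABC-12 attached"},
        {"id": 2, "content": "meeting notes ABC 12"},
        {"id": 3, "content": "nothing relevant here"},
    ]
    batch_size = 2
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]


def match_batch(pattern, batch):
    # Run the customer-defined regex over the raw document text.
    return [doc["id"] for doc in batch if pattern.search(doc["content"])]


def regex_scan(regex):
    pattern = re.compile(regex)
    hits = []
    # Overlap regex matching with scrolling by handing each batch
    # to a worker thread as soon as it arrives.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(match_batch, pattern, batch)
                   for batch in scroll_batches()]
        for future in futures:
            hits.extend(future.result())
    return sorted(hits)


# 3 letters, then '-' or a space, then 2 digits
print(regex_scan(r"[A-Z]{3}[- ][0-9]{2}"))  # → [1, 2]
```

Since the regex runs on the stored source rather than on tokens, any customer-defined pattern works; the trade-off is exactly the CPU and memory cost mentioned above, which grows with the corpus size.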