We use Spark for our day-to-day processing. During processing we want to extract geo information from the description column using regular expressions. While building rules to extract CITY information, we ended up with hundreds of regular expressions per city (CA, NY, etc.).
We created a mapping of regular expressions per city (CA, NY, and so on), loaded that data into Spark as a broadcast variable, and used these rules in custom UDFs to extract the city information.
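To make the setup concrete, here is a minimal sketch of the approach described above. The rule names and city regexes (`CITY_RULES`, `extract_city`) are hypothetical stand-ins for the real mapping of hundreds of expressions; the Spark wiring is shown in comments.

```python
import re

# Hypothetical rule set: in the real job this mapping holds hundreds of
# per-city regular expressions and is broadcast to the executors.
CITY_RULES = {
    "CA": [re.compile(r"\bSan Francisco\b", re.I), re.compile(r"\bLos Angeles\b", re.I)],
    "NY": [re.compile(r"\bNew York\b", re.I), re.compile(r"\bBrooklyn\b", re.I)],
}

def extract_city(description):
    """Return the first city code whose rules match the description, else None."""
    if description is None:
        return None
    for city, patterns in CITY_RULES.items():
        if any(p.search(description) for p in patterns):
            return city
    return None

# In the Spark job this function is wrapped as a UDF and the rules are
# broadcast, roughly like:
#   rules_bc = spark.sparkContext.broadcast(CITY_RULES)
#   extract_city_udf = F.udf(extract_city, StringType())
#   df = df.withColumn("city", extract_city_udf(F.col("description")))
```

Note that every row still scans the full rule list sequentially inside the UDF, which is why the runtime grows linearly with the number of rules.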
The problem is that as the number of rules increases, the execution time increases with it, so I am looking for a way to execute the rules in a distributed fashion.
We may extend the same rule-based data extraction to other fields as well.
I have also tried integrating Drools with Spark; if I don't find a more optimised solution, I may go with that.
Looking forward to your suggestions!
R, Krish