
We use Spark for our day-to-day processing. As part of it, we want to extract geo information from the description column using regular expressions. We tried to work out the regular expressions needed to extract CITY information, and we ended up with hundreds of regular expressions for each city (CA, NY, etc.).

We created a mapping of regular expressions per city (CA, NY, and so on), loaded that data into Spark via broadcasting, and used these rules in custom UDFs to extract the city information.
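For context, here is a minimal sketch of the kind of setup we have (the rule map, patterns, and column name below are simplified placeholders, not our real rules):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import scala.util.matching.Regex

object CityExtraction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CityExtraction").getOrCreate()
    import spark.implicits._

    // Hypothetical rule set: each city label maps to a list of compiled patterns.
    val cityRules: Map[String, Seq[Regex]] = Map(
      "CA" -> Seq("(?i)\\bcalifornia\\b".r, "(?i)\\bsan francisco\\b".r),
      "NY" -> Seq("(?i)\\bnew york\\b".r, "(?i)\\bmanhattan\\b".r)
    )

    // Broadcast the compiled rules once so every executor reuses them.
    val rulesBc = spark.sparkContext.broadcast(cityRules)

    // UDF: return the first city whose patterns match the description, else null.
    val extractCity = udf { description: String =>
      Option(description).flatMap { text =>
        rulesBc.value.collectFirst {
          case (city, patterns) if patterns.exists(_.findFirstIn(text).isDefined) => city
        }
      }
    }

    val df = Seq("Offices in San Francisco", "HQ in New York City").toDF("description")
    df.withColumn("city", extractCity(col("description"))).show(false)
  }
}
```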

The problem is that as the number of rules grows, the execution time keeps increasing, so we are looking for an option where the rules can be executed in a distributed way.

We may extend the same rule-based data extraction to other fields as well.

I also tried "Drools" integration spark, incase if I don't find any optimised solution I may go with this.

Looking forward to your suggestions!

R, Krish

  • you could try to minimise the number of regexes by using string interpolation where possible and/or writing more compact regexes. Then you can use DataFrame map and pattern matching without the need for UDFs, as shown [here](https://stackoverflow.com/questions/60014546/how-to-create-a-generic-regular-expression-so-that-all-group-result-can-be-extra/60020373#60020373) – abiratsis Feb 05 '20 at 14:22
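A minimal sketch of the UDF-free approach suggested in the comment above, assuming a DataFrame `df` with a `description` column and hypothetical consolidated patterns (one combined, case-insensitive pattern per label instead of hundreds of separate expressions):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// Hypothetical consolidated rules: one combined pattern per city label.
val combinedRules: Seq[(String, String)] = Seq(
  "CA" -> "(?i)\\b(california|san francisco|los angeles)\\b",
  "NY" -> "(?i)\\b(new york|manhattan|brooklyn)\\b"
)

// Turn each rule into a WHEN clause; the first matching rule wins, else null.
// This stays inside Catalyst, so no serialization round-trip through a UDF.
val candidates: Seq[Column] = combinedRules.map { case (label, pattern) =>
  when(col("description").rlike(pattern), lit(label))
}
val withCity = df.withColumn("city", coalesce(candidates: _*))
```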

1 Answer


Please make sure that your Spark job is using a high degree of parallelism; without it, even a small amount of slowness will feel bigger. Theoretically, regex processing shouldn't be that heavy, and if it runs on each record independently of the others, it scales well too. Avoid running a regex over one large document; instead, run it over parts of the document, or over many small documents, in parallel.

Please check that your data is partitioned into more than 3X the number of CPU cores used by the Spark job, but also avoid many tiny partitions.
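For illustration, a rough way to check and adjust the partition count against that 3X rule of thumb (assuming the `spark` session and `df` from the question; `defaultParallelism` only approximates the cores available to the job):

```scala
// Compare current partitioning against the available cores.
val totalCores = spark.sparkContext.defaultParallelism
val current = df.rdd.getNumPartitions
println(s"partitions = $current, defaultParallelism = $totalCores")

// Repartition only if the data is under-partitioned relative to the 3X rule of thumb.
val tuned = if (current < 3 * totalCores) df.repartition(3 * totalCores) else df
```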

If the Spark job has already been optimally parallelized and it is simply a matter of too many regexes being run, then get a bigger cluster and parallelize even more.

  • We have made sure the Spark job utilizes all the cores; it started slowing down the moment we added these regular expressions. Even though the job processes records in a distributed way, each record has to go through multiple regex patterns, so it slows down. – Krish Feb 06 '20 at 04:47
  • if that's the case then just get a bigger cluster and parallelize even more. – Salim Feb 06 '20 at 16:06
  • I know, that will be the last option if I don't find any other solution. – Krish Feb 21 '20 at 09:06