4

I am using Hive and trying to clean up data that may have multiple unwanted data. Would like to know if I can use a single regexp_replace or translate function for multiple conditions.

For Eg. If the string is 2000 Helen St - DO NOT USE I need to replace/trim - DO NOT USE and also change St to Street. The output should be 2000 Helen Street.

Similarly

  • 3000 Cross St should be 3000 Cross Street,
  • 4000 Mascot Dr should be 4000 Mascot Drive
  • etc..

There are about 10 or more such clean up conditions, and the input string could either meet 0 or multiple conditions criteria. Can these conditions be set up in a single regexp_replace?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Suraj
  • 575
  • 1
  • 9
  • 23
  • can you show some examples of clean up conditions? also include your attempt so far. – Vamsi Prabhala Mar 13 '18 at 23:53
  • For me, it seems that you are going to have a set of transformation rule and using `regexp_replace` multiple times should not be a `clean` solution. Rather think about writing a simple `GenericUDF`, which would take care of all scenario. And it comes with power of Java, so there are endless possibility of cleaning all type of data. – Gyanendra Dwivedi Mar 14 '18 at 19:50

1 Answers1

1

You can easily combine templates for removing substrings or replacing with the same value using OR (|). In other cases, use multiple regexp_replace:

For example, if you want to remove 'TEST' and '- DO NOT USE' and replace 'Dr' with 'Street':

hive> select regexp_replace(
                  regexp_replace('TEST 2000 Helen St - DO NOT USE',
                                 '[- ]*DO NOT USE|TEST( *|$)',''),
                          'St([ ]*|$)','Street'  
                  );
OK
2000 Helen Street
Time taken: 0.055 seconds, Fetched: 1 row(s)

It removes 'TEST' and '- DO NOT USE' - these combined into single template. 'St' replaced with 'Street' - this is second regexp_replace. Of course this is example only.

leftjoin
  • 36,950
  • 8
  • 57
  • 116