0

I am trying to isolate several substrings from a specific column of a parquet file that contains text (string). The substrings are all in an array and I want to keep only those rows that contain one or more of these substrings - words, while I keep a new column with the substrings that where found at the text.

I have currently used the following transformations:

  1. source: which is the parquet file I use
  2. derived column: where I create a new column (words) which contains an array of the words-substrings that are contained in the text, by using the following expression intersect(split(text_column, ' '), ['array','of','words'])
  3. filter: where I want to filter the derived column that was created at the previous transformation and exclude those rows that are either Null or contain an empty array
  4. sink

I have currently stuck to the 3rd transformation where I cannot filter and discard those rows that the 2 arrays do not intersect. I think that when intersect doesn't find any common element it returns an empty string array which I have not find the right condition that filters it out. I have tried: 1. not(isNull(words)) 2. word != array('') 3. not(isNull(word[1])) but none of them worked.

Any suggestions regarding the whole process or the filtering of the empty string array will be perfect. Thank you in advance.

I was expecting to get back only the rows that contain at least one of the substrings, but I get all the rows regardless if they contain one of the substrings.

ClockWork
  • 1
  • 3

1 Answers1

1

You can check the size of the array and remove the rows with array size=0. In filter transformation, filter on size(words)!=0.

  • I repro'd this with sample inputs. enter image description here

  • Derived column transformation with same expression is given. intersect(split(text_column, ' '), ['array','of','words'])

enter image description here

  • Then in Filter transformation, condition is given as filter on size(words)!=0

enter image description here

By this way, we can remove the empty array.

Reference: MS document on size expression.

Aswin
  • 4,090
  • 2
  • 4
  • 16