I am trying to isolate several substrings from a specific column of a parquet file that contains text (string). The substrings are all in an array and I want to keep only those rows that contain one or more of these substrings - words, while I keep a new column with the substrings that where found at the text.
I have currently used the following transformations:
- source: which is the parquet file I use
- derived column: where I create a new column (words) which contains an array of the words-substrings that are contained in the text, by using the following expression intersect(split(text_column, ' '), ['array','of','words'])
- filter: where I want to filter the derived column that was created at the previous transformation and exclude those rows that are either Null or contain an empty array
- sink
I have currently stuck to the 3rd transformation where I cannot filter and discard those rows that the 2 arrays do not intersect. I think that when intersect doesn't find any common element it returns an empty string array which I have not find the right condition that filters it out. I have tried: 1. not(isNull(words)) 2. word != array('') 3. not(isNull(word[1])) but none of them worked.
Any suggestions regarding the whole process or the filtering of the empty string array will be perfect. Thank you in advance.
I was expecting to get back only the rows that contain at least one of the substrings, but I get all the rows regardless if they contain one of the substrings.