Azure Data Flows isolate-find substrings form text field

Question

I am trying to isolate several substrings from a specific column of a parquet file that contains text (string). The substrings are all in an array and I want to keep only those rows that contain one or more of these substrings - words, while I keep a new column with the substrings that where found at the text.

I have currently used the following transformations:

source: which is the parquet file I use
derived column: where I create a new column (words) which contains an array of the words-substrings that are contained in the text, by using the following expression intersect(split(text_column, ' '), ['array','of','words'])
filter: where I want to filter the derived column that was created at the previous transformation and exclude those rows that are either Null or contain an empty array
sink

I have currently stuck to the 3rd transformation where I cannot filter and discard those rows that the 2 arrays do not intersect. I think that when intersect doesn't find any common element it returns an empty string array which I have not find the right condition that filters it out. I have tried: 1. not(isNull(words)) 2. word != array('') 3. not(isNull(word[1])) but none of them worked.

Any suggestions regarding the whole process or the filtering of the empty string array will be perfect. Thank you in advance.

I was expecting to get back only the rows that contain at least one of the substrings, but I get all the rows regardless if they contain one of the substrings.

In filter transformation, give condition as `size(words)!=0` — Aswin, Jan 03 '23 at 08:57

score 1 · Answer 1 · answered Jan 03 '23 at 08:56

You can check the size of the array and remove the rows with array size=0. In filter transformation, filter on size(words)!=0.

I repro'd this with sample inputs.
Derived column transformation with same expression is given. intersect(split(text_column, ' '), ['array','of','words'])

enter image description here

Then in Filter transformation, condition is given as filter on size(words)!=0

enter image description here

By this way, we can remove the empty array.

Reference: MS document on size expression.

Azure Data Flows isolate-find substrings form text field

1 Answers1