Suppose I have the following columns for a csv that I read through a 'File Reader' node:
id, name, city, income
After reading it, I notice that the column 'city' contains a huge number of unique values. I want to:
- Know which values are the 'k' most frequent for 'city'
- Modify those which are not the 'k' most frequent to hold something like 'other'
Example:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Toronto, 90.000
3, Person 3, New York, 50.000
4, Person 4, Seattle, 60.000
Choosing k to be 1, I want to produce the following table:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Other, 90.000
3, Person 3, New York, 50.000
4, Person 4, Other, 60.000
It happens because 'New York' is the '1' most frequent value for 'city' in the original table.
Do you know how I can do that using Knime?
Thanks a lot!