Google refine cross-reference between row and column

Question

I'm not sure if this can be achieved in Google Refine at all. But basically, I have data like this.

enter image description here

The first table is the table of all the users. The second table show all the friends. However, in the second table in "friends" column not all the id exists in the first table which I want to get rid of. So, how can I search each id in friends column in the second table and get rid of the id that doesn't exists in the table 1?

score 4 · Accepted Answer · answered Jan 31 '13 at 23:58

4

Put the two tables in different projects (we'll call them Table1 and Table2).

In Table2 on on the friends column:

use "split multi-valued cells" to get each value on a separate row
convert the visitors column to numbers (or conversely user_id in Table1 to string)
use "add a new column based on this column" with the expression cross(cell,'Table1','user_id').length()

This will return 0 if there's no match, 1 if there's a match or N>1 if there are duplicates in Table1

If you want the data back in the original format, set up a facet to filter on the validity column, blank out all the bad values and then use "join multi-valued cells" to reverse the split operation you did up front.

I fixed some caching bugs with cross() for OpenRefine 2.6, so if the cross doesn't work, try stopping and restarting the Refine server.

answered Jan 31 '13 at 23:58

Tom Morris

10,490
32
53

Hi, btw I forgot to mention that "friends" column has around 3000 ids, so when I do that I got Heap error, even though I started with -m 6000m. Not sure if it's too much for Google Refine to work around this much of data? – toy Feb 01 '13 at 11:44
What is the total size of the two tables? Number of cells? One thing you could do is create a variant of Table1 which consists of just the user_id column to save memory. Also, Refine isn't very aggressive about evicting projects from memory, so you could restart the server to make sure no other projects are in memory. – Tom Morris Feb 01 '13 at 13:39
The first table is just 30m, but the second is 300m and they both only have 30000 records. I couldn't even do the "split multi-valued cells". Because Table2 column "friends" has around 3000 ids. – toy Feb 01 '13 at 21:31
You could use `forEach(value.split(','),v,if(,v,null).join(',')` to pull apart the friends string, validate each piece, and then put it back together again, but it's not something I have the time to put together for you right now. – Tom Morris Feb 01 '13 at 22:42
I will try and figure it out. Really appreciate your help. :-) – toy Feb 01 '13 at 22:46

Google refine cross-reference between row and column

1 Answers1