Count by value distributed across multiple columns in google cloud dataprep

Question

I have a somewhat complex data transformation task that I can not figure out in Google Cloud Data prep. The source data is voter file information. The CSV has 10 columns (among many others) that contain a voter's election participation history. See screenshot. In short, the most recent election you voted in is included as text_election_code_1, the second most recent is in text_election_code_2 and so on. The value of the cell is the code of election itself i.e. GN2016 = 2016 general election.

Ideally I would like to transform this into a lookup matrix to answer questions like "Did voter with id# vote in GN2016?" and "How many people total voted in GN2012?"

As the data stands now it is extremely tough to count by election code because "GN2012" might be in any 1 of 10 columns. I.e. in screenshot below GN2012 is in column 3 for the first 2 rows and column 2 for the 3rd row.

I've done this before with SQL, but I can not figure out how to do this within cloud dataprep. Can anyone steer me in the right direction?

Current data shape (other P.I.I. columns omitted from screenshot)

Ideal data shape (maybe)

score 1 · Answer 1 · answered Aug 30 '18 at 01:29

I decide against a "wide" table in favor of a "long" one. It was pretty easy to accomplish after all with the "unpivot" option that converted the column values into rows. This example was very helpful: https://cloud.google.com/dataprep/docs/html/Analyze-across-Multiple-Columns_57344575

Count by value distributed across multiple columns in google cloud dataprep

1 Answers1