Faster CSV + trying to find unique items

Question

I have a CSV file in which I'm trying to find all the unique values in the columns after column 1, across rows where column 1 has the same value, and consolidate them into a new CSV file. I know that sounds confusing, so here's an example:

a sample of the original file foo.csv:

"Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity"
"Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height","Platform Capacity"
"Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height"
"Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"
"Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Extension"
"Scissor Lifts","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"

the ideal outcome bar.csv:

"Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity","Up & Over Height","Platform Capacity",,,
"Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"
"Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"

The rows vary in length and it's a pretty huge file (over 5k lines). I'm totally scratching my head on how to do the matching / string manipulation. And yes, some of those lines have trailing commas where there are 'empty cells'. I've been using Faster CSV, so if there is a way to do this with that, it would be great.

pointers? preferably something that won't make my mbp come to a screeching halt?

So, a) first column can be treated as a key, and b) all subsequent columns can be treated as values in a list, where in the end you want this list to contain unique values...? That last row in bar.csv repeats "Overall Dimension" and "Platform Extensions". Are repeated values OK? — buruzaemon, Dec 06 '11 at 04:23
My bad, Overall Dimensions and Platform Extension should NOT be repeated. I'd like to use FasterCSV to read in one file, foo.csv, and spit out another, bar.csv. Thanks. — MarkL, Dec 06 '11 at 21:01

score 1 · Answer 1 · answered Dec 06 '11 at 04:51

Assuming you can get it into a 2d array with Faster CSV:

a = [
  ["Boom Lifts","Model Number","Manufacturer","Platform Height","Horizontal Outreach","Lift Capacity"],
  ["Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height","Platform Capacity"],
  ["Boom Lifts","Model Number","Platform Height","Horizontal Outreach","Up & Over Height"],
  ["Pusharound Lifts","Model Number","Manufacturer","Platform Height","Stowed Height"],
  ["Scissor Lifts","Model Number","Manufacturer","Platform Height","Stowed Height","Overall Dimensions","Platform Extension"],
  ["Scissor Lifts","Overall Dimensions","Platform Size","Platform Extension","Lift Capacity"]
]

a.group_by {|e| e[0]}.map {|e| e.flatten.uniq}

gets you:

[
  ["Boom Lifts", "Model Number", "Manufacturer", "Platform Height", "Horizontal Outreach", "Lift Capacity", "Up & Over Height", "Platform Capacity"],
  ["Pusharound Lifts", "Model Number", "Manufacturer", "Platform Height", "Stowed Height"],
  ["Scissor Lifts", "Model Number", "Manufacturer", "Platform Height", "Stowed Height", "Overall Dimensions", "Platform Extension", "Platform Size", "Lift Capacity"]
]

Won't be instantaneous but shouldn't bring your MBP down.
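
For the foo.csv to bar.csv round trip the question asks about, a minimal sketch might look like the following. It assumes Ruby 1.9 or later, where FasterCSV became the standard csv library and hashes keep insertion order. The method names consolidate and consolidate_file are made up for illustration. It also drops the empty cells produced by trailing commas, which flatten.uniq on its own would carry through as a single nil.

```ruby
require "csv"

# Hypothetical helper: merge rows that share the same first column into one
# row, keeping each value once, in order of first appearance.
def consolidate(rows)
  merged = Hash.new { |hash, key| hash[key] = [] }
  rows.each do |row|
    key, *values = row
    next if key.nil? # skips blank lines (parsed as []) and rows with no key
    # Trailing commas parse as nil (or "" when written as ""), so drop both.
    merged[key].concat(values.reject { |value| value.nil? || value.empty? })
  end
  merged.map { |key, values| [key, *values.uniq] }
end

# Hypothetical wrapper: read input_path and write the consolidated rows
# to output_path.
def consolidate_file(input_path, output_path)
  rows = CSV.read(input_path)
  # force_quotes: true quotes every field, matching the sample files.
  CSV.open(output_path, "w", force_quotes: true) do |csv|
    consolidate(rows).each { |row| csv << row }
  end
end
```

Calling consolidate_file("foo.csv", "bar.csv") would then write one row per distinct first-column value. CSV.read loads the whole file into memory, which should be fine at around 5k lines.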
