I need to take a pipe that has a column of labels with associated values and pivot it so that there is one column per label, holding the corresponding values. For example, if I have this:

Id  Label Value 
1   Red   5
1   Blue  6
2   Red   7
2   Blue  8
3   Red   9
3   Blue  10

I need to turn it into this:

ID Red Blue
1  5   6
2  7   8
3  9   10

I know how to do this using the pivot command, but that requires knowing the label values explicitly. How can I dynamically read the labels from the 'label column into a list that I can then pass to the pivot command? I have tried to create a list with:

pipe.groupBy('id) {_.toList('label) }

but I get a type mismatch saying it found a Symbol but expected (cascading.tuple.Fields, cascading.tuple.Fields). Also, from reading online, it sounds like using toList is frowned upon. The number of distinct values in 'label is finite and fairly small (perhaps 30-50), but it may differ depending on which sample of data I am working with.

Any suggestions you have would be great. Thanks very much!

Greg Guida
J Calbreath

2 Answers

I think you're on the right track; you just need to map the desired values to Symbols:

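val newHeaders = lines
    // split each line on single spaces
    .map(_.split(" "))
    // keep the second field, i.e. the label
    .map(a => a(1))
    .distinct
    // pivot expects Symbols for the new column names
    .map(f => Symbol(f))
    .toList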

The Execution type will help you combine this step with the subsequent pivot, which matters for performance.

Note that I'm using a TypedPipe for the lines variable.

If you want your code to be super-concise, you could combine the two map calls into one, but that's just a stylistic choice:

map(_.split(" ")(1))
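To make the two-step idea concrete (first discover the labels, then pivot against them), here is a minimal sketch in plain Scala collections rather than Scalding. The object name PivotSketch and the tuple layout (id, label, value) are assumptions for illustration; the data is the sample from the question.

```scala
object PivotSketch {
  // Rows of (id, label, value), mirroring the sample data in the question.
  val rows: List[(Int, String, Int)] = List(
    (1, "Red", 5), (1, "Blue", 6),
    (2, "Red", 7), (2, "Blue", 8),
    (3, "Red", 9), (3, "Blue", 10)
  )

  // Step 1: read the labels from the data instead of hard-coding them.
  // distinct keeps first-occurrence order.
  val labels: List[String] = rows.map(_._2).distinct

  // Step 2: one output row per id, one cell per discovered label.
  // A label missing for some id shows up as None.
  val pivoted: List[(Int, List[Option[Int]])] =
    rows.groupBy(_._1).toList.sortBy(_._1).map { case (id, group) =>
      val byLabel = group.map { case (_, label, value) => label -> value }.toMap
      (id, labels.map(l => byLabel.get(l)))
    }
}
```

In Scalding the same shape applies: the label list has to be computed from the data first, and only then can it be handed to the pivot step.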
Tristan Reid
Try using Execution to get the list of values from the data. More info on executions: https://github.com/twitter/scalding/wiki/Calling-Scalding-from-inside-your-application
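As a rough sketch of what that could look like with the typed API: the snippet below assumes the input is already a TypedPipe of (id, label, value) tuples, and the names LabelDiscovery and distinctLabels are hypothetical. It only covers reading the labels; it does not perform the pivot itself.

```scala
import com.twitter.scalding._

object LabelDiscovery {
  // Assumed input shape: (id, label, value). Returns the distinct labels,
  // sorted so the resulting column order is stable between runs.
  def distinctLabels(rows: TypedPipe[(Int, String, Int)]): Execution[List[String]] =
    rows
      .map { case (_, label, _) => label }
      .distinct
      // materialize the (small) set of labels on the submitting side
      .toIterableExecution
      .map(_.toList.sorted)
}
```

The resulting Execution can then be chained with flatMap into a second step that converts the labels to Symbols and runs the pivot, so both steps are described as one Execution.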

Dan Osipov