
I have data in format:

"header1","header2","header3",...
"value11","value12","value13",...
"value21","value22","value23",...
....

What is the best way to parse it in Scalding? I have over 50 columns altogether, but I am only interested in some of them. I tried importing it with Csv("file"), but that doesn't work.

The only solution that comes to mind is to parse it manually with TextLine and disregard the line with offset == 0. But I'm sure there must be a better solution.

Savage Reader
  • *I tried importing it with Csv("file"), but that doesn't work.* -- You should probably explain a bit more about why it doesn't work. For example, does it result in a compile-time or run-time error? – DaoWen Jul 28 '14 at 16:53
  • I tried importing it providing a schema consisting of the one field I was interested in, for which I got: "did not parse correct number of values from input data, expected: 1, got: 88: "x","xxx",...". When I did not provide a schema I got: "could not select fields: [{1}:'fieldName'], from: [{?}:UNKNOWN]" – Savage Reader Jul 28 '14 at 16:59

2 Answers


It looks like you have 88 fields in your data-set (well over the 22-field tuple limit), not just 1. Have a read of:

https://github.com/twitter/scalding/wiki/Frequently-asked-questions#what-if-i-have-more-than-22-fields-in-my-data-set

The relevant text from that link:

What if I have more than 22 fields in my data-set?

Many of the examples (e.g. in the tutorial/ directory) show that the fields argument is specified as a Scala Tuple when reading a delimited file. However Scala Tuples are currently limited to a maximum of 22 elements. To read-in a data-set with more than 22 fields, you can use a List of Symbols as fields specifier. E.g.

 val mySchema = List('first, 'last, 'phone, 'age, 'country)
 val input = Csv("/path/to/file.txt", separator = ",", fields = mySchema)
 val output = Tsv("/path/to/out.txt")
 input.read
   .project('age, 'country)
   .write(output)

Another way to specify fields is using Scala Enumerations, which is available in the develop branch (as of Apr 2, 2013), as demonstrated in Tutorial 6:

object Schema extends Enumeration {
  val first, last, phone, age, country = Value // arbitrary number of fields
}

import Schema._

Csv("tutorial/data/phones.txt", separator = " ", fields = Schema)
  .read
  .project(first, age)
  .write(Tsv("tutorial/data/output6.tsv"))

So when reading your file, supply a schema with all 88 fields using either a List or an Enumeration (see the link/quote above).
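You don't have to name all 88 columns by hand, though. A minimal sketch of generating a placeholder schema (the 'col1, 'col5 names are made up for illustration; only the positions you later project need meaningful names):

```scala
// Generate an 88-column placeholder schema; the names are arbitrary,
// only the positions you later project actually matter.
val mySchema: List[Symbol] = (1 to 88).map(i => Symbol("col" + i)).toList

// Then, assuming you want the 1st and 5th columns:
// val input = Csv("/path/to/file.txt", separator = ",",
//                 fields = mySchema, skipHeader = true)
// input.read.project('col1, 'col5).write(Tsv("/path/to/out.txt"))
```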

For skipping the header, you can additionally supply skipHeader = true in the Csv constructor.

Csv("tutorial/data/phones.txt", fields = Schema, skipHeader = true)
technotring
  • I've read that. The point is, I am only interested in a couple of fields and not going to create a huge schema just for that. – Savage Reader Jul 31 '14 at 12:38
  • I appreciate you want to project only the fields that you are after without supplying a schema, but given the current Scala tuple limitation you can only do that if you have up to 22 fields. So as a workaround, supply the schema to parse the file, then use the project function to keep only the fields you are interested in. – technotring Jul 31 '14 at 13:12
  • Is there an equivalent for the type safe api TypedTsv? – Marsellus Wallace Aug 02 '17 at 13:51

In the end I solved it by parsing each line manually as follows:

def tipPipe = TextLine("tip").read.mapTo('line -> ('field1, 'field5)) {
  line: String =>
    val arr = line.split("\",\"")
    (arr(0).replace("\"", ""), if (arr.size >= 88) arr(4) else "unknown")
}
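For reference, here is how that split behaves on a line in the question's format (the sample line is made up): splitting on the inner "," delimiters leaves a stray quote on the first and last elements, which is why the answer strips them with replace.

```scala
// A made-up sample line in the "v1","v2","v3" format from the question
val line = "\"value11\",\"value12\",\"value13\""

// Splitting on "," leaves a leading quote on the first element
// and a trailing quote on the last one
val arr = line.split("\",\"")

// Stripping the leftover quotes yields the clean values
val cleaned = arr.map(_.replace("\"", ""))
// cleaned: Array("value11", "value12", "value13")
```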
Savage Reader