
I have a flat file on HDFS containing a list of companies:

CompanyA
CompanyA Description
April '12
San Fran
11-50
CompanyB
...

and I want to map this into a Company case class:

case class Company(company: String, 
                   desc: String, 
                   founded: Date, 
                   location: String, 
                   employees: String)

I have tried the following, but it doesn't seem to map properly:

val companiesText = sc.textFile(...)

val companies = companiesText.map(
   lines => Company(
        lines(0).toString.replaceAll("\"", ""),
        lines(1).toString.replaceAll("\"", ""),
        lines(2).toString.replaceAll("\"", ""),
        lines(3).toString.replaceAll("\"", ""),
        lines(4).toString.replaceAll("\"", ""),
        lines(5).toString.replaceAll("\"", "")
    )
)

I know I am not handling the date properly here, but that is not the issue.
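(For reference, a string like "April '12" can be parsed into a `java.util.Date` with `SimpleDateFormat`; the pattern below is an assumption inferred from the sample data, where two single quotes escape the literal apostrophe:)

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// Pattern is an assumption based on the sample line "April '12":
// full month name, a space, a literal apostrophe ('' escapes it), two-digit year.
val fmt = new SimpleDateFormat("MMMM ''yy", Locale.ENGLISH)
val founded: Date = fmt.parse("April '12")
```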

  • What is the error message? – Yuval Itzchakov Apr 06 '16 at 12:58
  • If I do a companies.count I get org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 67, sandbox.hortonworks.com): java.lang.StringIndexOutOfBoundsException: String index out of range: 5 – Eoin Lane Apr 06 '16 at 13:01
  • Unless you tell it somehow, Spark has no idea your file has one record per 5 lines. So in your map, each element is one line. So lines(0) is the first *character* of your string, not the first line. Eventually, you get a line with less than 6 characters, and so a StringIndexOutOfBoundsException – The Archetypal Paul Apr 06 '16 at 13:23
  • For another thing, your `case class` takes only 5 parameters, you are passing in 6. That won't work either. Not that that's your problem. – David Griffin Apr 06 '16 at 13:23
  • You need `sc.hadoopFile` and the n-line input format: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html – The Archetypal Paul Apr 06 '16 at 13:31
  • Details here: http://stackoverflow.com/questions/36052480/how-to-read-multiple-line-elements-in-spark – The Archetypal Paul Apr 06 '16 at 13:34
  • Mine was more fun to write. – David Griffin Apr 06 '16 at 13:36
  • Can't disagree with that :) – The Archetypal Paul Apr 06 '16 at 13:38
  • David, do you mind posting your answer again, please? I don't need this to run quickly. I am sure the newAPIHadoopFile example is better, but it seems complex to me. – Eoin Lane Apr 06 '16 at 13:58
  • I voted to undelete, but really, the `hadoopFile` isn't complex. All you need is in the answer in the linked-to question. – The Archetypal Paul Apr 06 '16 at 16:26
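As the comments explain, `sc.textFile` produces one element per *line*, so indexing a line gives characters, not fields. A minimal sketch of the grouping approach discussed above, assuming exactly five lines per record and preserving line order via `zipWithIndex`; the HDFS path and the date pattern are placeholders, not from the original post:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

case class Company(company: String, desc: String, founded: Date,
                   location: String, employees: String)

// Each element of the RDD is one line; pair every line with its index
// and bucket each run of five consecutive lines into one record.
val companies = sc.textFile("hdfs:///path/to/companies.txt")  // placeholder path
  .zipWithIndex()
  .map { case (line, idx) => (idx / 5, (idx % 5, line)) }     // 5 lines per record
  .groupByKey()
  .map { case (_, numbered) =>
    // restore within-record order, then strip quotes as in the question
    val f = numbered.toSeq.sortBy(_._1).map(_._2.replaceAll("\"", ""))
    val fmt = new SimpleDateFormat("MMMM ''yy", Locale.ENGLISH) // assumed pattern for "April '12"
    Company(f(0), f(1), fmt.parse(f(2)), f(3), f(4))
  }
```

This incurs a shuffle from `groupByKey`; for large files the `hadoopFile`/NLineInputFormat route linked in the comments avoids it.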

0 Answers