
I'm coming from R, am new to SparkR, and am trying to split a SparkDataFrame column of JSON strings into separate columns. The columns in the SparkDataFrame are string arrays with a schema like this:

> printSchema(tst)
root
 |-- FromStation: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ToStation: array (nullable = true)
 |    |-- element: string (containsNull = true)

If I look at the data in the viewer with `View(head(tst$FromStation))`, I can see that each row of the SparkDataFrame's FromStation column has this form:

list("{\"Code\":\"ABCDE\",\"Name\":\"StationA\"}", "{\"Code\":\"WXYZP\",\"Name\":\"StationB\"}", "{...

Where the ... indicates the pattern repeats an unknown number of times.

My Question

How do I extract this information and put it in a flat dataframe? Ideally, I would like to make a FromStationCode and a FromStationName column for each observation in the nested array column. I have tried various combinations of explode and getItem, but to no avail: I keep getting a data type mismatch error. I've searched through examples of other people facing this challenge in Spark, but SparkR examples are scarcer. I'm hoping someone with more experience with Spark/SparkR can provide some insight.
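For concreteness, one of the combinations I tried looks roughly like this (a sketch from memory, not working code):

# Fails with a data type mismatch, presumably because FromStation is
# array<string>: the elements are raw JSON strings, not structs, so
# there is no "Code" field for getItem to fetch.
head(select(tst, getItem(tst$FromStation, "Code")))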

Many thanks, nate

  • I have to wonder if I would have these problems if I could specify a schema with array types in SparkR? I dream of being able to do something like `structType(a bunch of regular old structField("blah", "string"), ...)`, then when I hit the nested fields, use another nested structType...or maybe use a list to denote an array with subfields specified by the structFields...so maybe: `structType(boringStructFieldsHere, list("FromStation", structField("Name", "string"), structField("Code", "string")))`. I also need to test out using `flatMap` with some sort of `strsplit` today. – nate Apr 05 '17 at 13:58
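As it happens, SparkR's structField() appears to accept complex type strings such as array<struct<...>>, so a nested schema along those lines can be written directly. An untested sketch, with the field names taken from the JSON above and a placeholder path:

# Untested sketch: declare the array-of-structs inline as a type string
stationType <- "array<struct<Code:string,Name:string>>"
schema <- structType(
  structField("FromStation", stationType),
  structField("ToStation", stationType)
)
tst2 <- read.df("path/to/stations.json", source = "json", schema = schema)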

1 Answer


I guess you need to convert tst into a regular R object:

df = collect(tst)

Then you can operate on df like any other R data.frame.
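For example, a sketch assuming the jsonlite package is available (each cell of df$FromStation is a list of JSON strings):

library(jsonlite)
# parse each row's list of JSON strings and row-bind the results
# into a flat data.frame with Code and Name columns
from <- do.call(rbind, lapply(df$FromStation, function(js)
  do.call(rbind, lapply(js, function(s)
    as.data.frame(fromJSON(s), stringsAsFactors = FALSE)))))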

  • But doesn't that defeat the whole point of SparkR? If I do a collect and then operate on it like a regular dataframe, don't I lose out on parallel processing? What if the data is too big to collect onto a single machine? – nate Apr 04 '17 at 15:54
  • Right, but then what do you mean by "flat dataframe"? – Sergio Alyoshkin Apr 04 '17 at 22:43
  • By flat dataframe, I mean one that doesn't have a nested schema/fields. My goal is to have something akin to what `jsonlite::fromJSON(myJSONobject, flatten=TRUE)` would produce. In this case, instead of an RDD with two columns (`FromStation` and `ToStation`), I would like an RDD with 4 columns named `FromStationCode`, `FromStationName`, `ToStationCode`, `ToStationName`... After that, I'd like to merge the resulting 4-column dataframe with other ID variables, but that goes beyond the core of my problem: getting at objects nested within an array of JSON strings in SparkR – nate Apr 05 '17 at 13:38
  • I've done that in Scala, but the R API and its examples lag a bit behind. I haven't seen the same in SparkR. – Sergio Alyoshkin Apr 05 '17 at 19:44
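For the record, a rough SparkR translation of that approach might look like the following. This is an untested sketch: explode() gives each JSON string its own row, and get_json_object() pulls fields out of it; both are in the SparkR 2.x API. Spark allows only one explode per select, so each array column would be handled separately.

# explode the array so every JSON string lands on its own row
from <- select(tst, alias(explode(tst$FromStation), "js"))
# extract the Code and Name fields from each JSON string
from <- select(from,
               alias(get_json_object(from$js, "$.Code"), "FromStationCode"),
               alias(get_json_object(from$js, "$.Name"), "FromStationName"))
head(from)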