
I have a column in a dataframe that contains a JSON object. For each row in my dataframe, I'd like to extract the JSON, parse it and pull out certain fields. Once extracted, I'd like to append the fields to the row as new column elements.
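
For example (the column and field names here are invented for illustration), a row whose JSON column holds `{"name": "alice", "age": 42}` should gain two new columns, `name` and `age`, alongside its existing ones.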

I've looked at the explode() methods available on DataFrame as well as foreach(), flatMap() and map(), but have not been able to discern which is more appropriate for this type of processing.

dmux

1 Answer


`map` is likely what you need. With it, you can parse the JSON, pull out the fields you want, and return a new row that carries them as additional columns.

In general, `map` is used for user-defined functions that are 1:1 (i.e., one output row for each input row). `flatMap` is used for user-defined functions that are 1:n (each input row may return any number of output rows).
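
A rough Java sketch of that approach (everything specific here is an assumption for illustration: the input column name `json`, the extracted fields `name` and `age`, and org.json as the parser; Spark 1.6-era DataFrame API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.json.JSONObject;

public class ExtractJsonFields {

    // Parses the string column "json" in every row and appends the
    // extracted "name" and "age" values as two new columns.
    public static DataFrame withJsonFields(SQLContext sqlContext, DataFrame df) {
        // Output schema = input schema + the two extracted columns.
        List<StructField> fields = new ArrayList<>(Arrays.asList(df.schema().fields()));
        fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        fields.add(DataTypes.createStructField("age", DataTypes.LongType, true));
        StructType newSchema = DataTypes.createStructType(fields);

        // map() is 1:1 -- exactly one output Row per input Row.
        JavaRDD<Row> parsed = df.javaRDD().map(new Function<Row, Row>() {
            @Override
            public Row call(Row row) throws Exception {
                JSONObject obj = new JSONObject(row.<String>getAs("json"));
                Object[] values = new Object[row.length() + 2];
                for (int i = 0; i < row.length(); i++) {
                    values[i] = row.get(i); // keep the existing columns
                }
                values[row.length()] = obj.optString("name");
                values[row.length() + 1] = obj.optLong("age");
                return RowFactory.create(values);
            }
        });

        // Reassemble a DataFrame from the mapped rows and the wider schema.
        return sqlContext.createDataFrame(parsed, newSchema);
    }
}
```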

David
  • David, thanks for the tip. Even though map is 1:1, does it expect the row length to remain consistent? – dmux Apr 13 '16 at 18:57
  • The output rows do not need to be the same length as the input rows. But with a data frame, all rows must have the same fields (so all of your output rows must have the same structure/be the same length) – David Apr 13 '16 at 18:59
  • Doesn't `map` convert it to an `RDD` first? You can skip the conversion entirely by using `withColumn` and a `UDF` (see the sketch after these comments). – David Griffin Apr 13 '16 at 19:00
  • @David, does it make sense that the map() method takes in a Function1 object? When creating a new anonymous class that implements Function1's interface, I'm required to override 60+ methods. – dmux Apr 13 '16 at 19:41
  • That sounds like Java -- I don't really know Java. I use Scala. – David Griffin Apr 13 '16 at 19:43
  • I'm a pyspark user, but that doesn't sound right. You should only have to create the function that is passed to map. This is a function that would take 1 row and output 1 row. Again, not a Java Spark user. But check to see if the code from this question helps clear it up for you: http://stackoverflow.com/questions/29790417/java-spark-sql-dataframe-map-function-is-not-working – David Apr 13 '16 at 19:59
  • @David, thanks for the link. It appears one of the commenters there used the `DataFrame.javaRDD().map()` method, which requires a much simpler Function interface (only have to override `call()`). – dmux Apr 13 '16 at 20:39
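
For reference, a minimal sketch of the `withColumn` + UDF route David Griffin suggests above (the UDF name `jsonName`, the `json` column, and the `name` field are all assumptions, and org.json is just one possible parser):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.json.JSONObject;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class ExtractWithUDF {

    // Adds a "name" column parsed from the string column "json",
    // staying in the DataFrame API the whole time (no RDD round trip).
    public static DataFrame addNameColumn(SQLContext sqlContext, DataFrame df) {
        sqlContext.udf().register("jsonName", new UDF1<String, String>() {
            @Override
            public String call(String json) throws Exception {
                return new JSONObject(json).optString("name");
            }
        }, DataTypes.StringType);

        return df.withColumn("name", callUDF("jsonName", col("json")));
    }
}
```

Spark 1.6 also ships a built-in `get_json_object` function in `org.apache.spark.sql.functions`, which can pull a single field out by path (e.g. `$.name`) without writing a UDF at all.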