
In order to run a few ML algorithms, I need to create some extra columns of data. Each of these columns involves some fairly intense calculations: keeping moving averages and recording information as I go through each row (and updating it along the way). I've done a mock run-through with a simple Python script and it works, and I am currently looking to translate it to a Scala Spark script that could be run on a larger data set.

The issue is that, for this to be highly efficient with Spark SQL, it seems preferable to use the built-in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression looks like a very thought-intensive process, so I'm wondering what the downsides would be if I instead create the new column values manually by iterating through each row, keeping track of variables, and inserting the column value at the end.
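
For concreteness, the built-in, SQL-like route being weighed here would typically be a window function. A minimal sketch of a trailing moving average, where the DataFrame df and the column names ts and value are made up purely for illustration:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Average of the current row and the 3 preceding rows, ordered by "ts".
// Note: without a partitionBy, this window pulls all rows into a single partition.
val w = Window.orderBy("ts").rowsBetween(-3, 0)
val withMovingAvg = df.withColumn("movingAvg", avg("value").over(w))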

Eric Staner
    This question is too broad to answer. Please review your question with specifics about what you have tried and also a [MVCE](http://stackoverflow.com/help/mcve) – eliasah Jun 07 '16 at 05:40
  • "Each of these columns involves some fairly intense calculations that involves keeping moving averages and recording information as you go through each row (and updating it meanwhile)" -- it seems that your calculations require state across row. How will that work in a distributed environment where your data is in multiple partitions? If you don't require global state, what's the definition of the "window" for managing state (in the sense of SQL window functions)? – Sim Jun 12 '16 at 17:48

1 Answer


You can use map on the DataFrame to process each row as you wish, then convert the resulting RDD back into a DataFrame. If you only need to add a single new column, you can use withColumn, but it adds one column at a time and its expression is applied to the entire DataFrame.
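
For the single-column case, a withColumn call might look like the following; the column name and the expression are only placeholders:

import org.apache.spark.sql.functions.col

// Adds one derived column, evaluated over the whole DataFrame in one pass.
val withOneColumn = mainDataFrame.withColumn("newcolName1", col("someExistingCol") * 2)

If you want several columns to be added in a single pass, then inside the map method: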

a. Gather the new values based on your calculations

b. Append these new column values to each row of the main RDD, as below

val newColumns: Seq[Any] = Seq(newcol1, newcol2)
Row.fromSeq(row.toSeq ++ newColumns) // append the new values after the row's existing values

Here row is the Row being processed inside the map function.

c. Create new schema as below

val newColumnsStructType = StructType(Seq(
  StructField("newcolName1", IntegerType),
  StructField("newColName2", IntegerType)))

d. Add to the old schema

val newSchema = StructType(mainDataFrame.schema ++ newColumnsStructType)

e. Create new dataframe with new columns

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
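
The one piece not spelled out above is the map call itself, which produces the newRDD used in the last line. A minimal sketch, assuming the Spark 1.x API (where map on a DataFrame returns an RDD[Row]) and with the per-row calculations for newcol1 and newcol2 left as placeholders:

import org.apache.spark.sql.Row

// Steps a and b: compute the new values for each row and append them.
val newRDD = mainDataFrame.map { row =>
  val newcol1 = 0 // placeholder for your per-row calculation
  val newcol2 = 0 // placeholder for your per-row calculation
  Row.fromSeq(row.toSeq ++ Seq(newcol1, newcol2))
}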
Ramzy
  • The thing is I am not sure whether the SQL query or the map function will be very efficient. I am using a "dynamic programming"-ish style to keep recurring variables that are used to make the traversal more efficient. – Eric Staner Jun 07 '16 at 15:46
  • The DataFrames API is the one with the good performance improvements. Spark SQL is closer to using Spark for SQL and Hive queries. There are definitely a few optimizations done underneath, given that DataFrames carry a schema. – Ramzy Jun 07 '16 at 17:23