I have two DataFrames in Apache Spark, each with a column named "JoinValue". JoinValue is numeric and has the same meaning in both DataFrames.

I need the combination of both DataFrames as input (training and test set) for a machine learning algorithm. Is it correct that I first need to combine them into a single DataFrame before using it in an ML Pipeline?

Example:

> df1.show()
+---------+---------+
|        a|JoinValue|
+---------+---------+
|A value 0|        0|
|A value 1|        5|
|A value 2|       10|
|A value 3|       15|
|A value 4|       20|
|A value 5|       25|
|A value 6|       30|
+---------+---------+

and

> df2.show()
+---------+---------+
|        b|JoinValue|
+---------+---------+
|B value 0|        0|
|B value 1|        7|
|B value 2|       14|
|B value 3|       21|
|B value 4|       28|
+---------+---------+

An outer join followed by an orderBy yields the following results:

> df1.join(df2, 'JoinValue', 'outer').orderBy('JoinValue').show()
+---------+---------+---------+
|JoinValue|        a|        b|
+---------+---------+---------+
|        0|A value 0|B value 0|
|        5|A value 1|     null|
|        7|     null|B value 1|
|       10|A value 2|     null|
|       14|     null|B value 2|
|       15|A value 3|     null|
|       20|A value 4|     null|
|       21|     null|B value 3|
|       25|A value 5|     null|
|       28|     null|B value 4|
|       30|A value 6|     null|
+---------+---------+---------+

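What I actually want is this, without nulls: each null in a or b is replaced by the most recent non-null value of that column, in JoinValue order: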

+---------+---------+---------+
|JoinValue|        a|        b|
+---------+---------+---------+
|        0|A value 0|B value 0|
|        5|A value 1|B value 0|
|        7|A value 1|B value 1|
|       10|A value 2|B value 1|
|       14|A value 2|B value 2|
|       15|A value 3|B value 2|
|       20|A value 4|B value 2|
|       21|A value 4|B value 3|
|       25|A value 5|B value 3|
|       28|A value 5|B value 4|
|       30|A value 6|B value 4|
+---------+---------+---------+

What is the best way to use JoinValue, a and b, which come from multiple DataFrames, as features and labels in a machine learning algorithm?

pvoosten
  • What you are asking for is not a standard join operation. I guess what you need is a follow-up forward fill (the equivalent of pandas ffill); [this](http://stackoverflow.com/questions/36019847/pyspark-forward-fill-with-last-observation-for-a-dataframe) might help. – muon Apr 08 '17 at 02:36
  • Thanks, muon. I'll try the second answer: using [sparkts](http://sryza.github.io/spark-timeseries/0.3.0/index.html) – pvoosten Apr 10 '17 at 08:49
