
I have two tab-separated data files like the ones below:

file 1:

number  type    data_present
 1       a        yes
 2       b        no

file 2:

type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

I want to merge these two files so that the output file looks like this:

number  type    data_present    group   recorded
  1      a         yes           NULL    NULL
  2      b         no            NULL    NULL
  10     d         NULL           aa     true
  20     c         NULL           cc     false

As you can see, for columns that are not present in the other file, I'm filling those places with NULL.

Any ideas on how to do this in Scala/Spark?

Vamsi
  • This is a type of cross join or perhaps a union but prepping the data is the tedious part. How far have you gotten? Can you post some spark code? – Paul Aug 04 '15 at 09:54

3 Answers


Create two files for your data set:

$ cat file1.csv 
number  type    data_present
 1       a        yes
 2       b        no

$ cat file2.csv
type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

Convert them to CSV:

$ sed -e 's/^[ \t]*//' file1.csv | tr -s ' \t' ',' > f1.csv
$ sed -e 's/^[ \t]*//' file2.csv | tr -s ' \t' ',' > f2.csv

Use the spark-csv module to load the CSV files as DataFrames:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f1.csv", "header" -> "true"))
val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f2.csv", "header" -> "true"))
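
On Spark 2.x and later the CSV reader is built in, so the external package is not needed; a minimal equivalent, assuming a SparkSession named spark:

// Spark 2.x+: CSV support is bundled, no --packages flag required
val df1 = spark.read.option("header", "true").csv("f1.csv")
val df2 = spark.read.option("header", "true").csv("f2.csv")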

Now perform an outer join on number and type (<=> is Spark's null-safe equality operator):

scala> df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer").show()

+------+----+------------+----+-----+------+--------+
|number|type|data_present|type|group|number|recorded|
+------+----+------------+----+-----+------+--------+
|     1|   a|         yes|null| null|  null|    null|
|     2|   b|          no|null| null|  null|    null|
|  null|null|        null|   d|   aa|    10|    true|
|  null|null|        null|   c|   cc|    20|   false|
+------+----+------------+----+-----+------+--------+
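
The outer join keeps both copies of the join columns. To collapse them into the single number and type columns from the question, and to show the literal string NULL instead of null, you can coalesce the duplicated columns; a minimal sketch using the same DataFrames as above:

import org.apache.spark.sql.functions.coalesce

val merged = df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer")
  .select(
    coalesce(df1("number"), df2("number")).as("number"),
    coalesce(df1("type"), df2("type")).as("type"),
    df1("data_present"),
    df2("group"),
    df2("recorded"))
  .na.fill("NULL")  // replace nulls in the string columns with the literal "NULL"

merged.show()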

For more details, see the spark-csv documentation and the Spark SQL DataFrame API docs.

tuxdna

This will give you the desired output:

val output = file1.join(file2, Seq("number","type"), "outer")
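
A minimal usage sketch, assuming file1 and file2 are the DataFrames loaded from the two inputs and a Spark version that has the Seq-based join overload with a join type (older releases only accept a Column here, which is what the comment below ran into):

// join on the shared columns, keeping a single copy of "number" and "type"
val output = file1.join(file2, Seq("number", "type"), "outer")
output.na.fill("NULL").show()  // render missing values as the literal NULL
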
Erwin Bolwidt
khrystal
  • Facing a similar problem; tried this but getting: found: Seq[String], required: org.apache.spark.sql.Column (see http://stackoverflow.com/questions/38149483/error-while-doing-outer-join-on-2-data-frames-spark-scala-sqlcontext) –  Jul 01 '16 at 17:01
  • @user3407267 I think your question has already been answered by zero323. Are you still facing the problem? – khrystal Jul 21 '16 at 10:34

Simply convert all columns to String, then do a union of the two DataFrames (after giving them matching columns); a sketch of this approach follows.
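
A minimal sketch of that approach, assuming df1 and df2 are loaded as in the accepted answer: give both DataFrames the same columns (filling the missing ones with the literal "NULL"), align the column order, then union them.

import org.apache.spark.sql.functions.lit

val cols = Seq("number", "type", "data_present", "group", "recorded")
// add the columns each side is missing as string literals
val left  = df1.withColumn("group", lit("NULL")).withColumn("recorded", lit("NULL"))
val right = df2.withColumn("data_present", lit("NULL"))
// select in the same order, then union (unionAll on Spark 1.x)
val result = left.select(cols.head, cols.tail: _*)
  .union(right.select(cols.head, cols.tail: _*))
result.show()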

Jhon