14

Given the following Dataset values as inputData:

column0 column1 column2 column3
A       88      text    99
Z       12      test    200
T       120     foo     12

In Spark, what is an efficient way to compute a new hash column and append it to a new Dataset, hashedData, where hash is defined as the application of MurmurHash3 over each row's values in inputData?

Specifically, hashedData would look like:

column0 column1 column2 column3 hash
A       88      text    99      MurmurHash3.arrayHash(Array("A", 88, "text", 99))
Z       12      test    200     MurmurHash3.arrayHash(Array("Z", 12, "test", 200))
T       120     foo     12      MurmurHash3.arrayHash(Array("T", 120, "foo", 12))
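
For reproduction, here is a minimal sketch of how inputData could be constructed (column names and values are taken from the table above; the local SparkSession setup is just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hash-example").getOrCreate()
import spark.implicits._

// Sample rows matching the tables above; toDF assigns the column names.
val inputData = Seq(
  ("A", 88, "text", 99),
  ("Z", 12, "test", 200),
  ("T", 120, "foo", 12)
).toDF("column0", "column1", "column2", "column3")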

Please let me know if any more specifics are necessary.

Any help is appreciated. Thanks!

asked by Jesús Zazueta

2 Answers

21

One way is to use the withColumn function:

import org.apache.spark.sql.functions.{col, hash}
dataset.withColumn("hash", hash(dataset.columns.map(col):_*))
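
Applied to the question's example (a sketch; the inputData and hashedData names are assumed from the question):

import org.apache.spark.sql.functions.{col, hash}

// hash(...) takes Column arguments, so mapping each column name through
// col(...) hashes the row values, not the name strings themselves.
val hashedData = inputData.withColumn("hash", hash(inputData.columns.map(col): _*))
hashedData.printSchema()  // the appended hash column is an integer column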
answered by soote
  • thanks! But I think that line passes the column name strings into `MurmurHash3` (i.e. `Array("column0", "column1", "column2", "column3")`). I'll try to find a way to extract the actual row values in the context of the mapping function. – Jesús Zazueta Nov 07 '17 at 17:26
  • @JesúsZazueta Just noting that I don't think this solution passes only the column names. Also, there is a neat function for taking multiple columns and generating a new column with their content: `df.withColumn("concat", concat(df.columns.map(col):_*))`. There are other variants too, such as one for [specifying the join separator](https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#concat_ws-java.lang.String-org.apache.spark.sql.Column...-). – Lo-Tan Sep 18 '18 at 23:03
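
For reference, a minimal sketch of the concat_ws variant mentioned in the comment above (the "|" separator and the df name are illustrative):

import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws joins the values of all columns with the given separator.
val concatenated = df.withColumn("concat", concat_ws("|", df.columns.map(col): _*))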
7

It turns out that Spark already has this implemented as the hash function inside the org.apache.spark.sql.functions package:

/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}

And in my case, applied as:

import org.apache.spark.sql.functions.{col, hash}

val newDs = typedRows.withColumn("hash", hash(typedRows.columns.map(col): _*))
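
As a quick sanity check (a sketch, reusing the names above), the appended column should come back as an int column, matching the doc comment:

import org.apache.spark.sql.types.IntegerType

// The column produced by functions.hash is a 32-bit int column.
assert(newDs.schema("hash").dataType == IntegerType)

Worth noting: since Spark hashes its own internal row representation, the resulting values are not guaranteed to match scala.util.hashing.MurmurHash3.arrayHash byte-for-byte.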

I truly have a lot to learn about Spark SQL. :(

Leaving this here in case someone else needs it. Thanks!

answered by Jesús Zazueta