
I have two PySpark DataFrames. I want to select all records from voutdf whose "hash" does not exist in vindf.tx_hash.

How can I do this using the PySpark DataFrame API? I tried a semi join, but I am running into out-of-memory errors.

voutdf = sqlContext.createDataFrame(voutRDD,["hash", "value","n","pubkey"])

vindf = sqlContext.createDataFrame(vinRDD,["txid", "tx_hash","vout"])

1 Answer


You can do it with a left anti join:

df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", 'left_anti')

A left anti join keeps every row from the left dataset that has no matching row in the right dataset.
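A minimal, self-contained sketch of the same pattern, assuming an existing SparkSession named `spark` and small illustrative data (the values are made up; only the column names come from the question):

    # Minimal sketch, assuming a SparkSession named `spark` and illustrative data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("anti-join-example").getOrCreate()

    voutdf = spark.createDataFrame(
        [("h1", 1.0, 0, "pk1"), ("h2", 2.0, 1, "pk2"), ("h3", 3.0, 0, "pk3")],
        ["hash", "value", "n", "pubkey"],
    )
    vindf = spark.createDataFrame(
        [("t1", "h1", 0), ("t2", "h3", 1)],
        ["txid", "tx_hash", "vout"],
    )

    # Keep only the voutdf rows whose hash has no match in vindf.tx_hash
    df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", "left_anti")
    df.show()  # only the row with hash == "h2" remains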

  • Thanks Adam, just to confirm: would that be the equivalent of `select * from voutdf where hash not in (select tx_hash from vindf)`? – user11534659 Dec 04 '19 at 10:24
  • Thanks a ton, Adam. Also, to get the voutdf records where hash is in (select tx_hash from vindf), would left_semi be OK? Thanks. `df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", 'left_semi')` – user11534659 Dec 04 '19 at 14:30
  • Yes, exactly. You can also register these two DataFrames as a temptable, and run SQL queries, if you are more familiar with that: `df.registerTempTable("voutdf")` `df.registerTempTable("vindf")` `spark.sql("select * from voutdf where hash not in (select tx_hash from vindf)")` – Adam Dukkon Dec 04 '19 at 14:35
  • I get an error when I try to run it in SQL form; maybe I am missing some relevant imports. Is this the right way to call it? `sc = pyspark.SparkContext()` `sqlContext = SQLContext(sc)` `sess = SparkSession(sc)` `sess.sql("select * from voutdf where hash not in (select tx_hash from vindf)")` – user11534659 Dec 04 '19 at 15:03
  • 19/12/04 14:56:54 WARN metadata.Hive: Failed to register all functions. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1775) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:80) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy – user11534659 Dec 04 '19 at 16:09
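Following up on the temp-view/SQL approach discussed in the comments above, here is a minimal sketch, assuming the same voutdf/vindf DataFrames as in the earlier example and a plain SparkSession built without Hive support (no enableHiveSupport), which sidesteps the metastore setup. It also uses createOrReplaceTempView, the current replacement for registerTempTable:

    # Minimal sketch of the SQL form, assuming voutdf/vindf already exist
    # and a SparkSession built without Hive support.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("anti-join-sql").getOrCreate()

    # createOrReplaceTempView supersedes the older registerTempTable
    voutdf.createOrReplaceTempView("voutdf")
    vindf.createOrReplaceTempView("vindf")

    spark.sql(
        "select * from voutdf where hash not in (select tx_hash from vindf)"
    ).show()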