I'm seeing strange behavior when I use JavaRDD subtract to compare two DataFrames.

This is what I'm doing: I check whether two DataFrames (A, B) are equal by converting them to JavaRDDs and then subtracting A from B and B from A. If they are equal (contain the same data), then both results should be an empty JavaRDD.

The results are not empty:

DataFrame A = someFunctionRespondWithDF(param);
DataFrame B = sqlContext.read().json("src/test/resources/expected/exp.json");
Assert.assertTrue(B.toJavaRDD().subtract(A.toJavaRDD()).isEmpty());
Assert.assertTrue(A.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());

...the assertions fail.

If I write the data to disk and read it back into another DataFrame, then it's fine.

A.write().json("target/result.json");
DataFrame AA = sqlContext.read().json("target/result.json");
Assert.assertTrue(B.toJavaRDD().subtract(AA.toJavaRDD()).isEmpty());
Assert.assertTrue(AA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());

...the assertions pass.

I also tried to force evaluation by calling count(), cache() or persist() on the DataFrame (based on this answer), but without success.

DataFrame AAA = A.cache();
Assert.assertTrue(B.toJavaRDD().subtract(AAA.toJavaRDD()).isEmpty());
Assert.assertTrue(AAA.toJavaRDD().subtract(B.toJavaRDD()).isEmpty());

Has anybody experienced the same thing? What am I missing here?

Spark version: 1.6.1

kecso

1 Answer

OK, I can answer my own question:

The assertion fails because the column types differ once a DataFrame is read from JSON. For example, if the original DataFrame had an Integer column, then after reading it back from JSON (without a schema file!) it becomes a Long. Solution: use a format that describes the schema, such as Avro.
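
The type mismatch can be pictured with a plain-Java analogy. This is only a sketch: it does not use Spark at all, rows are modeled as List<Object>, and the class and method names (TypeMismatchDemo, subtract, widenIntegers) are made up for illustration. It shows that a subtraction based on equals() treats an Integer 1 and a Long 1 as different values, and that the difference disappears once both sides use the same type. How Spark compares rows internally may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the DataFrame comparison; no Spark involved.
class TypeMismatchDemo {

    // Returns the rows of `left` that have no equal row in `right`
    // (equality is decided by List.equals, i.e. element-wise equals()).
    static List<List<Object>> subtract(List<List<Object>> left, List<List<Object>> right) {
        List<List<Object>> result = new ArrayList<>(left);
        result.removeAll(right);
        return result;
    }

    // Copies the rows, converting every Integer value to a Long,
    // so that both sides carry the same numeric type.
    static List<List<Object>> widenIntegers(List<List<Object>> rows) {
        List<List<Object>> widened = new ArrayList<>();
        for (List<Object> row : rows) {
            List<Object> copy = new ArrayList<>();
            for (Object value : row) {
                if (value instanceof Integer) {
                    copy.add(Long.valueOf(((Integer) value).longValue()));
                } else {
                    copy.add(value);
                }
            }
            widened.add(copy);
        }
        return widened;
    }

    public static void main(String[] args) {
        // "In-memory" rows holding Integer values.
        List<List<Object>> inMemory = Arrays.asList(
                Arrays.<Object>asList("a", 1),
                Arrays.<Object>asList("b", 2));
        // Rows as they might look after a JSON round trip: same numbers, but Long.
        List<List<Object>> fromJson = Arrays.asList(
                Arrays.<Object>asList("a", 1L),
                Arrays.<Object>asList("b", 2L));

        // Integer.equals(Long) is false, so nothing is subtracted.
        System.out.println(subtract(inMemory, fromJson).isEmpty()); // false

        // After aligning the types, the difference is empty in both directions.
        List<List<Object>> aligned = widenIntegers(inMemory);
        System.out.println(subtract(aligned, fromJson).isEmpty());  // true
        System.out.println(subtract(fromJson, aligned).isEmpty());  // true
    }
}
```

In the Spark case, comparing the output of A.printSchema() and B.printSchema() is a quick way to check whether the two DataFrames really have the same column types before subtracting them.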

kecso