When I try a simple read in PySpark from Elasticsearch 2.4 with Spark 2.1.1, via the elasticsearch-spark connector 5.1.2 (ES_READ_FIELD_EXCLUDE and ES_READ_FIELD_AS_ARRAY_INCLUDE are environment variables; the rest are variables passed as arguments to my reading function or held on the self object):
from pyspark.sql.functions import monotonically_increasing_id

df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("es.net.proxy.http.host", self.server) \
    .option("es.net.proxy.http.port", self.port) \
    .option("es.net.http.auth.user", self.username) \
    .option("es.net.http.auth.pass", self.password) \
    .option("es.net.proxy.http.user", self.username) \
    .option("es.net.proxy.http.pass", self.password) \
    .option("query", qparam) \
    .option("pushdown", "true") \
    .option("es.read.field.exclude", ES_READ_FIELD_EXCLUDE) \
    .option("es.read.field.as.array.include", ES_READ_FIELD_AS_ARRAY_INCLUDE) \
    .load(self.index) \
    .limit(limit) \
    .select(*fields) \
    .withColumn("id", monotonically_increasing_id())
I'm getting this ClassCastException error (from Double to Long):
WARN scheduler.TaskSetManager: Lost task 42.0 in stage ...: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:105) ...
The strange thing is that sometimes it works and sometimes it doesn't. I suspect that reading documents with NULL values, or documents that are missing some fields, causes the problem, but that is only a hypothesis and I may be wrong.
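To test that hypothesis, I was thinking of restricting the query to documents where a suspect field is missing and checking whether those alone reproduce the failure, along these lines ("price" is just a placeholder for whichever field I suspect, and the rest of the option chain is the same as above):

import json

# Placeholder: substitute the field I suspect of being missing or of mixed type.
suspect_field = "price"

# Only documents where the field is missing -- if these alone trigger the
# ClassCastException, the NULL/missing-field hypothesis looks more plausible.
qparam_missing = json.dumps({
    "query": {
        "bool": {
            "must_not": {"exists": {"field": suspect_field}}
        }
    }
})

# Same read as above, only the query changes (proxy/auth options omitted here).
df_missing = spark.read.format("org.elasticsearch.spark.sql") \
    .option("query", qparam_missing) \
    .load(self.index)

But that only bisects the data, it doesn't tell me which field Spark is choking on.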
Is there a way to trace the error more precisely? I don't know where to look.
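The only other idea I have so far is to compare the schema Spark infers with the mapping Elasticsearch reports for the index, roughly like this (ES_HOST is a placeholder for the actual Elasticsearch endpoint; my reads go through a proxy, so the host/port may differ from self.server/self.port):

import json
import requests

# Schema as inferred by the connector: a field printed as "long" that actually
# holds floating-point values in some documents would explain the Double -> Long cast.
df.printSchema()

# Mapping as stored in Elasticsearch, for comparison (proxy settings omitted here).
mapping = requests.get(
    "http://ES_HOST:9200/{0}/_mapping".format(self.index),
    auth=(self.username, self.password)
).json()
print(json.dumps(mapping, indent=2))

Is that a reasonable approach, or is there a better way to see which field causes the cast?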