
When I try a simple read in pyspark from Spark 2.1.1 to Elasticsearch 2.4 via the elasticsearch-spark connector 5.1.2 (ES_READ_FIELD_EXCLUDE and ES_READ_FIELD_AS_ARRAY_INCLUDE are environment variables; the rest are variables passed as arguments to my reading function or contained in the self object):

df = spark.read.format("org.elasticsearch.spark.sql") \
            .option("es.net.proxy.http.host", self.server) \
            .option("es.net.proxy.http.port", self.port) \
            .option("es.net.http.auth.user", self.username) \
            .option("es.net.http.auth.pass", self.password) \
            .option("es.net.proxy.http.user", self.username) \
            .option("es.net.proxy.http.pass", self.password) \
            .option("query", qparam) \
            .option("pushdown", "true") \
            .option("es.read.field.exclude",ES_READ_FIELD_EXCLUDE) \
            .option("es.read.field.as.array.include",ES_READ_FIELD_AS_ARRAY_INCLUDE) \
            .load(self.index) \
            .limit(limit) \
            .select(*fields) \
            .withColumn("id", monotonically_increasing_id())

I'm getting this ClassCastException error (from Double to Long):

WARN scheduler.TaskSetManager: Lost task 42.0 in stage ...: java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:105) ...

The strange thing is that sometimes it works and sometimes it doesn't. I suspect that reading documents with NULL values, or documents that have no content for some fields, causes the problem, but that's only a hypothesis; I may be wrong.

Is there a way to better trace the error? I don't know where to look.

Patrick
  • Can you try specifying the schema before the load? If it's easy to reproduce (for a specific query param) then you could also attach the sources and run it in debug mode. You also have the option of enabling debug/trace level logging (see the sketch after these comments). – Traian Aug 14 '17 at 17:05
  • Is that the whole stack trace? To debug this, shouldn't you be looking at the code that calls `scala.runtime.BoxesRunTime.unboxToLong()`? – Hendrik Aug 14 '17 at 17:40
  • @jarrod-roberson This looks like Python code, so the problem is likely to be outside the OP's control and a generic Java answer is unlikely to help. I'll reopen this. – zero323 Aug 14 '17 at 17:47
  • @zero323 - regardless, the cause is identical and the fix is identical –  Aug 14 '17 at 18:00
  • *The strange thing is that sometimes it works, sometimes not.* That is not strange at all; it is because sometimes the data is correct and sometimes it is the wrong type. Nothing *strange* about **garbage in/garbage out**. –  Aug 14 '17 at 18:02
  • Ok, maybe strange was not the right word to use. @zero323: you're right. The code is run in pyspark and the problem seems to be outside of my control. Does it help if I edit my question to include the whole stack trace? – Patrick Aug 14 '17 at 18:13
  • That's for sure, but to be honest it looks like a bug in the connector. Probably the best thing you can do is a) isolate the problem (minimal data and schema which can be used to reproduce it), b) create a ticket in the upstream repo if not present (https://github.com/elastic/elasticsearch-hadoop), and c) include it as an edit or answer so it documents the problem. Maybe @eliasah will have a better suggestion. – zero323 Aug 14 '17 at 21:24
  • @zero323 I found my problem. See answer below. – Patrick Aug 16 '17 at 19:37
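
To expand on the debug/trace logging suggestion in the comments above, here is a minimal driver-side sketch. It assumes Spark 2.1's bundled log4j 1.x and that org.elasticsearch.hadoop.rest is the connector logger of interest (both are assumptions); executors would need a matching entry in their own log4j.properties:

# Driver-side only: raise the ES-Hadoop connector's log level through the JVM's log4j,
# so the REST calls and the mapping the connector resolves are printed in the driver log.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("org.elasticsearch.hadoop.rest").setLevel(log4j.Level.TRACE)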

1 Answer


I found my problem. First, I used the latest dev build of the Spark Elasticsearch connector (6.0.0-beta1), hoping that it might solve the problem. It did not, but this time the error message was more informative:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException:
Incompatible types found in multi-mapping: 
Field [my_problematic_field] has conflicting types of [LONG] and [DOUBLE].

Now I understand the ClassCastException from Double to Long that I got at the beginning. It was caused by a field that is defined as a long in one index and as a double in another (I use one index alias in ES that points to a series of indexes). The problem is that the field was dynamically mapped by ES when it was first inserted into each index, so in some indexes it was mapped as long (because the first value was, for example, 123) and in others as double (because the first value was, for example, 123.0).
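
To confirm this, one can compare the field's mapping across the concrete indexes behind the alias with the get-field-mapping endpoint. A rough sketch (host, credentials and the alias name are placeholders):

import requests

# Ask every index behind the alias how it mapped the field; the response is keyed by
# concrete index name, so conflicting "long" vs "double" mappings show up side by side.
resp = requests.get(
    "http://es-host:9200/my_alias/_mapping/field/my_problematic_field",
    auth=("username", "password"))
for index_name, field_mapping in resp.json().items():
    print(index_name, field_mapping)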

I don't know if there is a way to get around this problem without having to reindex all my data (billions of documents!).
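
One possible stop-gap, if the field is not needed downstream, would be to drop it at read time with the same es.read.field.exclude option used in the question (just a sketch, not verified against this data):

# Append the conflicting field to the existing exclude list so the connector
# never has to reconcile its long/double mappings into a single Spark type.
excluded_fields = ES_READ_FIELD_EXCLUDE + ",my_problematic_field"

df = spark.read.format("org.elasticsearch.spark.sql") \
            .option("es.read.field.exclude", excluded_fields) \
            .option("es.read.field.as.array.include", ES_READ_FIELD_AS_ARRAY_INCLUDE) \
            .load(self.index)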

Patrick
  • Don't forget to accept your answer, and take a look at this: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html – eliasah Aug 18 '17 at 08:19
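
For reference, a rough sketch of what such a reindex could look like with the Reindex API (host, index names and the mapping type are placeholders; the key point is creating the destination index with an explicit double mapping before copying the data over):

import requests

host = "http://es-host:9200"  # placeholder

# 1. Create the destination index with the field explicitly mapped as double,
#    so dynamic mapping can no longer pick "long" based on the first document.
requests.put(host + "/my_index_v2", json={
    "mappings": {
        "my_type": {
            "properties": {
                "my_problematic_field": {"type": "double"}
            }
        }
    }
})

# 2. Copy the documents across; integer values such as 123 are then stored as 123.0.
requests.post(host + "/_reindex", json={
    "source": {"index": "my_index_v1"},
    "dest": {"index": "my_index_v2"}
})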