
I have a problem with reading data from Elasticsearch into a Spark cluster (I'm using the Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).

First, I tried to read it with PySpark:

%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# if the 'tags' field is not dropped, PySpark cannot map the Scala field and throws an exception
# if the limit is not set, PySpark will probably try to fetch the whole index at once
# if "a.b" is not dropped, the dot in the field name causes a mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853

df = df.cache()
z.show(df)

Unfortunately, in this case I run into many mapping issues. Because I have a lot of fields containing dots in the dataset, I decided to give Scala a try for reading the data (in order to process it in PySpark later):

%spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark
import org.elasticsearch.spark.sql
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder

val conf = new SparkConf()

conf.set("spark.es.mapping.date.rich", "false");
conf.set("spark.serializer", classOf[KryoSerializer].getName)

val EsReadRDD = sc.esRDD("index")
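A side note on the configuration above: in Zeppelin the SparkContext already exists when this paragraph runs, so a SparkConf created inside the paragraph is never applied to it. As far as I know, elasticsearch-hadoop settings can also be passed per call as a configuration map; a minimal sketch (whether es.read.field.exclude actually avoids the dotted-field mapping problem is an assumption on my part):

%spark
// Sketch only: pass connector settings directly to esRDD instead of building
// a new SparkConf (which the already-running SparkContext never sees).
import org.elasticsearch.spark._

val cfg = Map(
  "es.mapping.date.rich"  -> "false",   // keep dates as plain strings
  "es.read.field.exclude" -> "tags,a.b" // skip the fields that break the mapping (assumption)
)

val EsReadRDD = sc.esRDD("index", cfg)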

However, even with Scala I can only retrieve a small number of records, for example:

EsReadRDD.take(10).foreach(println)

For some reason, collect() does not work:

val esdf = EsReadRDD.collect() // does not work, probably because the data is too large

The error is:

Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
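
I understand that collect() could be avoided by keeping the work on the executors, along the lines of the sketch below (the field name is just a placeholder), but in the end I need the data as a DataFrame in PySpark, not aggregates in Scala:

%spark
// Sketch only: esRDD yields (documentId, Map[String, AnyRef]) pairs, so
// transformations run on the executors and only small results reach the driver.
// "some_field" is a placeholder, not a real field from my index.
val nonEmpty = EsReadRDD
  .map { case (_, doc) => doc.get("some_field") }
  .filter(_.isDefined)
  .count()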

I have also tried converting the RDD to a DataFrame, but I get an error:

val esdf = EsReadRDD.toDF()

java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
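
The Map[String, AnyRef] values seem to be what Spark cannot find an Encoder for. As far as I understand, the connector can also read straight into a DataFrame via esDF from org.elasticsearch.spark.sql, which would avoid the Encoder issue; a rough sketch (whether it copes with my dotted field names is an assumption):

%spark
// Sketch only: esDF builds a DataFrame with a schema inferred from the index
// mapping, so no Encoder for Map[String, AnyRef] is needed.
import org.elasticsearch.spark.sql._

val esdf = spark.sqlContext.esDF("index", Map("es.read.field.exclude" -> "tags,a.b"))
esdf.printSchema()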

Do you have any idea how to deal with this?

  • Why do you need to collect the data to the driver? You can process your records by applying transformations to the RDD. The data will be processed in the executors, and you can store the results without collecting the data. – Emiliano Martinez Jan 29 '20 at 14:09
  • I need to process the data in PySpark, not Scala. So if there are mapping issues, I thought that I could read the data using Scala and forward it to PySpark. – Andrey Sapegin Jan 29 '20 at 14:10
  • I don't understand your point. What do you have in mind when you say "forward to PySpark"? If you need to apply some Python library to your code, you can use PySpark directly. Edit your question with more details for a better understanding. – Emiliano Martinez Jan 29 '20 at 16:55
  • As was mentioned in the question, I use Zeppelin. There it is possible to have a notebook with paragraphs written in both Scala and PySpark. PySpark cannot read the data, because native RDD support is only implemented in Scala, see https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html. Reading the data without native RDD support is also not possible, because I have issues due to the elasticsearch-hadoop bug (it cannot properly map field names containing dots). So my last try was to read a DataFrame with Scala and forward it to PySpark using z.put() from Zeppelin. In the end, none of these worked (a sketch of the intended hand-off is below). – Andrey Sapegin Jan 31 '20 at 10:27
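
A rough sketch of the hand-off I have in mind; it assumes Zeppelin's default mode, where %spark and %pyspark paragraphs share one SparkSession, and uses a temp view rather than z.put() (the view name es_data is a placeholder):

%spark
// Sketch only: register the DataFrame read in Scala as a temp view, so a
// %pyspark paragraph in the same interpreter group can pick it up with
// spark.table("es_data"). Relies on the shared SparkSession; not verified here.
import org.elasticsearch.spark.sql._

val esdf = spark.sqlContext.esDF("index")
esdf.createOrReplaceTempView("es_data")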
