
I'm very new to ElasticSearch: I am trying to read data from an index using Spark in Java.

I have a working piece of code, but it returns the documents as a Dataset whose columns are only the two "root" elements of each doc, while all the remaining data stay nested inside those columns as structs.

This is my code:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

SparkConf sparkConf = new SparkConf(true);
sparkConf.setAppName(Test.class.getName());

SparkSession spark = null;
try {
  spark = SparkSession.builder().config(sparkConf).getOrCreate();
} catch (Exception e) {
  // No cluster available: fall back to local mode and configure the
  // Elasticsearch connector explicitly.
  sparkConf.setMaster("local[*]");
  sparkConf.set("spark.cleaner.ttl", "3600");
  sparkConf.set("es.nodes", "1.1.1.1");
  sparkConf.set("es.port", "9999");
  sparkConf.set("es.nodes.discovery", "false");
  sparkConf.set("es.nodes.wan.only", "true");
  spark = SparkSession.builder().config(sparkConf).getOrCreate();
  Logger rootLogger = Logger.getRootLogger();
  rootLogger.setLevel(Level.ERROR);
}

SQLContext sqlContext = spark.sqlContext();

// Load the index/type as a DataFrame with a schema inferred from the mapping.
Dataset<Row> df1 = JavaEsSparkSQL.esDF(sqlContext, "index/video");

df1.printSchema();
df1.show(5, false);

A very simplified version of the schema inferred by Spark is:

root
 |-- aaaa: struct (nullable = true)
 |    |-- bbbb: array (nullable = true)
 |    |    |-- cccc: struct (containsNull = true)
 |    |    |    |-- dddd: string (nullable = true)
 |    |    |    |-- eeee: string (nullable = true)
 |    |-- xxxx: string (nullable = true)
 |-- ffff: struct (nullable = true)
 |    |-- gggg: long (nullable = true)
 |    |-- hhhh: boolean (nullable = true)
 |    |-- iiii: struct (nullable = true)
 |    |    |-- vvvv: string (nullable = true)
 |    |    |-- llll: array (nullable = true)
 |    |    |    |-- oooo: struct (containsNull = true)
 |    |    |    |    |-- wwww: long (nullable = true)
 |    |    |    |    |-- rrrr: string (nullable = true)
 |    |    |    |    |-- tttt: long (nullable = true)
 |    |    |-- pppp: string (nullable = true)

All I can get from Spark using show() is something like

+-------------------+-------------------+
|aaaa               |ffff               |
+-------------------+-------------------+
|[bbbb,cccc]        |[1,false,null]     |
|[bbbb,dddd]        |[1,false,null]     |
|[bbbb]             |[1,false,null]     |
|[bbbb]             |[1,false,null]     |
|[null,eeee]        |[1,false,null]     |
+-------------------+-------------------+
only showing top 5 rows

Is there a way to get the data inside each row (e.g. bbbb) without post-processing it in Spark? (i.e. is there a way to get that data directly from Elasticsearch?)

ercaran

1 Answer


Solved.

It was simpler than I expected: you can access nested fields using dot notation. To get the values of the xxxx field, just:

df1.select("aaaa.xxxx").show(5, false);

Result

+--------+
|xxxx    |
+--------+
|35992783|
|35994342|
|35973981|
|35984563|
|35979054|
+--------+
only showing top 5 rows
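Dot notation works for struct fields, but an array column such as aaaa.bbbb still comes back as a whole array per row. If you need one row per array element, a sketch using Spark's explode function (reusing df1 from above; "item" is an illustrative alias, and the inner path may need adjusting to your actual schema):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

// Each element of the aaaa.bbbb array becomes its own row.
Dataset<Row> items = df1.select(explode(col("aaaa.bbbb")).as("item"));

// Dot notation then reaches into the exploded struct.
items.select("item.dddd").show(5, false);
```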