Questions tagged [elasticsearch-hadoop]

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Requirements

An Elasticsearch cluster (0.9x series, or 1.0.0 or higher, which is highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, dependency-free, self-contained jar that can be downloaded and put to use right away: simply make it available on your job's classpath and you're set. For requirements specific to each library, see its dedicated chapter in the documentation.
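
As a concrete illustration of putting the jar on the job classpath: with Spark it is commonly passed via spark-submit --jars or listed in the Spark configuration. A minimal sketch of the latter, assuming Spark 2.x; the local path, connector version, and node address are placeholders:

    import org.apache.spark.sql.SparkSession

    // Hypothetical path/version of the self-contained elasticsearch-hadoop jar.
    val spark = SparkSession.builder()
      .appName("es-hadoop-demo")
      .config("spark.jars", "/opt/jars/elasticsearch-hadoop-6.8.0.jar")
      .config("es.nodes", "localhost")   // REST endpoint of the Elasticsearch cluster
      .config("es.port", "9200")
      .getOrCreate()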

109 questions
1 vote · 0 answers

pyspark read from s3 and write to elasticsearch

I'm trying to read from s3 and write to Elasticsearch, using a Jupyter install on the Spark master machine. I have this configuration: import pyspark import os #os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3…
YanivK · 11
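
For the flow described in this question (read from S3, write to Elasticsearch), a minimal Scala sketch using the connector's Spark SQL support; the bucket path, index name and es.nodes value are placeholders, and the s3a filesystem classes (hadoop-aws) are assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

    val spark = SparkSession.builder().appName("s3-to-es").getOrCreate()

    // Read JSON records from S3 (placeholder bucket/prefix).
    val df = spark.read.json("s3a://my-bucket/input/")

    // Index the rows; es.nodes points at the cluster's REST endpoint.
    df.saveToEs("my-index/doc", Map("es.nodes" -> "es-host", "es.port" -> "9200"))
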
1 vote · 1 answer

How to set dynamic doc id in elasticsearch sink using spark structured streaming

In the Elasticsearch write sink, how should I add a doc id with dynamic values taken from a dataset field? In my case I need to set the doc id based on a particular field of the formatted dataset. I came across "es.mapping.id", but how would I get values from my…
Gokulraj · 450
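
The setting the question found, es.mapping.id, names the dataset field whose value is used as the document _id. A sketch of an Elasticsearch structured-streaming sink using it, assuming ES-Hadoop 6.x or later; the source, the field name "orderId", the checkpoint path and the index are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stream-to-es").getOrCreate()

    // Placeholder streaming source with a column usable as the document id.
    val orders = spark.readStream
      .format("rate").option("rowsPerSecond", "5").load()
      .withColumnRenamed("value", "orderId")

    val query = orders.writeStream
      .format("es")                           // Elasticsearch sink from ES-Hadoop
      .option("checkpointLocation", "/tmp/es-checkpoint")
      .option("es.mapping.id", "orderId")     // field whose value becomes the _id
      .start("orders-index/doc")

    query.awaitTermination()
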
1 vote · 1 answer

pyspark, elasticsearch input, can't show dataframe

I can do df.head() fine after loading Elasticsearch data, but after I do withColumn I can't do df.head() or df.show(). I can't figure out what's going on; the same withColumn code works fine if I create df2 = sqlContext.createDataFrame( [(1, "a",…
eugene · 39,839
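
A Scala sketch of the flow the question describes, with all names as placeholders: load an index through the connector, check head(), then derive a column and trigger another action. It only reproduces the steps; the reported failure would surface on the last action.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.upper

    val spark = SparkSession.builder().appName("es-read-then-withcolumn").getOrCreate()

    // Load an index through the connector (index and host are placeholders).
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .load("my-index/doc")

    df.head()   // reported to work right after the load

    // Derive a new column ("name" is an assumed field) and act on it again.
    val df2 = df.withColumn("name_upper", upper(df("name")))
    df2.show(5)
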
1 vote · 0 answers

How does Spark write a compressed Parquet file?

Using Apache Spark 1.6.4 with the elasticsearch-hadoop plugin, I am exporting an Elasticsearch index (100M documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7. I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB…
Thomas Decaux · 21,738
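
For reference, a sketch of the export described above using the Spark 1.6-era API the question mentions; the gzip codec is selected through spark.sql.parquet.compression.codec, and the index name, host and HDFS path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._   // adds esDF to SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("es-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Ask Spark SQL to gzip-compress Parquet output.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    // Read the index through the connector and write it out as Parquet on HDFS.
    val df = sqlContext.esDF("my-index/doc", Map("es.nodes" -> "es-host"))
    df.write.parquet("hdfs:///exports/my-index.parquet")
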
1 vote · 0 answers

Elasticsearch-Hadoop multi-resource write formatting issue

I am interfacing Elasticsearch with Spark using the Elasticsearch-Hadoop plugin, and I am having difficulty writing a dataframe with a timestamp-type column to Elasticsearch. The problem occurs when I try to write using dynamic/multi resource…
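
For context, the dynamic/multi-resource write the question refers to routes each document to an index derived from one of its fields, and per the connector documentation date-typed fields can also carry a format. A sketch with placeholder names, assuming a timestamp column called ts:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("multi-resource-write").getOrCreate()
    import spark.implicits._

    // Placeholder dataframe with an event kind and a timestamp column.
    val events = Seq(("click", java.sql.Timestamp.valueOf("2019-01-15 10:00:00")))
      .toDF("kind", "ts")

    // Route each document to an index named after the day of its "ts" value;
    // {field|format} is the connector's multi-resource write pattern.
    events.saveToEs("events-{ts|yyyy.MM.dd}/doc", Map("es.nodes" -> "es-host"))
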
1 vote · 0 answers

Write to Elasticsearch from Spark, wrong timestamp

I have a one-column Spark dataframe: StructType(List(StructField(updateDate,TimestampType,true))). When writing to Elasticsearch with Spark, the updateDate field is not seen as a date and is written as a…
Chargaff · 1,562
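
One commonly suggested workaround for this symptom is to hand Elasticsearch a value its dynamic mapping recognizes as a date, either by defining the index mapping up front or by writing the timestamp as an ISO-8601 string. A sketch of the latter; the column, index and host names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.date_format
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("es-date-write").getOrCreate()
    import spark.implicits._

    val df = Seq(("doc-1", java.sql.Timestamp.valueOf("2019-01-15 10:00:00")))
      .toDF("id", "updateDate")

    // Format the timestamp as ISO-8601 so dynamic mapping detects it as a date.
    val out = df.withColumn("updateDate",
      date_format($"updateDate", "yyyy-MM-dd'T'HH:mm:ss"))

    out.saveToEs("my-index/doc", Map("es.nodes" -> "es-host"))
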
1 vote · 5 answers

Elasticsearch + Spark: write json with custom document _id

I am trying to write a collection of objects to Elasticsearch from Spark. I have to meet two requirements: the document is already serialized as JSON and should be written as is, and the Elasticsearch document _id should be provided. Here's what I tried so…
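
Both requirements map onto documented connector features: saveJsonToEs writes pre-serialized JSON strings as-is, and es.mapping.id points at a field inside the document to use as the _id. A minimal sketch; the index, host and id field name are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark._   // adds saveJsonToEs to RDD[String]

    val spark = SparkSession.builder().appName("json-with-id").getOrCreate()

    // Documents already serialized as JSON, each carrying its own id field.
    val docs = spark.sparkContext.parallelize(Seq(
      """{"id": "1", "title": "first"}""",
      """{"id": "2", "title": "second"}"""
    ))

    // Write the JSON verbatim and take the document _id from the "id" field.
    docs.saveJsonToEs("my-index/doc",
      Map("es.nodes" -> "es-host", "es.mapping.id" -> "id"))
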
1 vote · 1 answer

Spark driver memory for rdd.saveAsNewAPIHadoopFile and workarounds

I'm having issues with a particular Spark method, saveAsNewAPIHadoopFile. The context is that I'm using pyspark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into Elasticsearch (ES). For a variety of reasons, the Spark context is quite…
ghukill · 1,136
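
The question goes through pyspark's saveAsNewAPIHadoopFile, presumably with the connector's EsOutputFormat; in Scala the same indexing is normally expressed directly on the RDD, where each task sends its own bulk requests to Elasticsearch rather than gathering data on the driver. A sketch with placeholder names and data:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    val spark = SparkSession.builder().appName("rdd-index-demo").getOrCreate()

    // Placeholder documents as Scala maps.
    val docs = spark.sparkContext.parallelize(1 to 1000).map { i =>
      Map("id" -> i, "body" -> s"record $i")
    }

    // Bulk sizing is applied per task, not per job.
    docs.saveToEs("records/doc", Map(
      "es.nodes" -> "es-host",
      "es.batch.size.entries" -> "1000"
    ))
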
1 vote · 0 answers

ES batch size is not reflected in Spark + Elasticsearch

I am trying to read 9 GB of JSON data (in multiple files) and load it into ES using the Spark Elasticsearch connector. It took more time than expected: I got 288 tasks, each writing 32 MB and taking around 19 s to complete. One of the documents suggested to…
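
The settings usually meant here are es.batch.size.bytes and es.batch.size.entries; they are applied per task instance, so the bulk load actually hitting the cluster scales with the number of concurrent tasks. A sketch of passing them on a write; the input path, index and host are placeholders, and the values shown are only examples:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("bulk-tuning").getOrCreate()

    val df = spark.read.json("hdfs:///data/events/")   // placeholder input

    df.saveToEs("events/doc", Map(
      "es.nodes" -> "es-host",
      "es.batch.size.bytes" -> "4mb",     // default 1mb, per task
      "es.batch.size.entries" -> "5000"   // default 1000, per task
    ))
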
1 vote · 1 answer

Spark + Elasticsearch write performance issue

I am seeing a low number of writes to Elasticsearch using Spark (Java). Here are the configurations: 13.xlarge machines for the ES cluster, 4 instances, each with 4 processors. I set the refresh interval to -1 and replications to '0', plus other basic configurations…
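
Beyond index-level settings such as the refresh interval and replica count mentioned above, write throughput on the Spark side is mostly governed by how many partitions write in parallel and by the per-task bulk behaviour. A sketch of the knobs involved; the partition count, paths and names are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("write-throughput").getOrCreate()

    val df = spark.read.parquet("hdfs:///staging/docs/")   // placeholder input

    // One writing task per partition; match the count to what the cluster absorbs.
    df.repartition(16)
      .saveToEs("docs/doc", Map(
        "es.nodes" -> "es-host",
        "es.batch.write.refresh" -> "false"   // skip the refresh after each bulk
      ))
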
1 vote · 0 answers

Spark ES-Hadoop Plugin JSON Data

val ordersDF = spark.read.schema(revenue_schema).format("csv").load("s3://xxxx/fifa/pocs/smallMetrics.csv")
val product_df = spark.read.json("s3://xxxx/fifa/pocs/smallCatalogue.json").toDF("id", "product", "style_id")
val product_json_df…
Rajiv · 392
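
The excerpt above is cut off; the usual next step after building such dataframes is writing them to Elasticsearch through the DataFrame writer. A sketch of that step alone, with a placeholder input path, index and host:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-to-es").getOrCreate()

    // Stand-in for the dataframe assembled from the CSV/JSON inputs in the question.
    val productDF = spark.read.json("s3a://my-bucket/catalogue.json")

    productDF.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .mode("append")
      .save("products/doc")
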
1 vote · 1 answer

Zeppelin and Spark Configuration

I'm working with Zeppelin (0.7.1) on Spark (2.1.1) on my localhost, and trying to add some configuration values to the jobs I run. Specifically, I'm trying to set the es.nodes value for elasticsearch-hadoop. I tried adding the key and value to the…
Oren · 1,796
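
If the interpreter-level property does not reach the job, the value can also be supplied per read or write, which bypasses the interpreter configuration entirely; the connector documentation also describes prefixing its settings with spark. (e.g. spark.es.nodes) so that Spark keeps them in its configuration. A sketch of a Zeppelin Spark paragraph passing es.nodes directly; the host and index are placeholders:

    // Inside a %spark paragraph; `spark` is provided by Zeppelin.
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "10.0.0.5")   // placeholder Elasticsearch host
      .option("es.port", "9200")
      .load("my-index/doc")

    df.printSchema()
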
1 vote · 2 answers

Writing data to Elasticsearch: EsHadoopSerializationException

I am using Elasticsearch 5.4 and Hadoop 2.7.3 and want to write data from HDFS to Elasticsearch. My data in…
Yao Pan · 524
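
EsHadoopSerializationException during a write is often triggered by records the connector cannot serialize, frequently malformed input. One defensive pattern is to let Spark parse the HDFS files first and drop lines that fail to parse before handing rows to the connector; a sketch with placeholder paths and names:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("hdfs-to-es").getOrCreate()

    // DROPMALFORMED discards unparseable lines instead of letting them
    // fail later inside the connector.
    val df = spark.read
      .option("mode", "DROPMALFORMED")
      .json("hdfs:///data/input/")   // placeholder path

    df.saveToEs("my-index/doc", Map("es.nodes" -> "es-host"))
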
1 vote · 1 answer

How to read a few columns of Elasticsearch with Spark?

The ES cluster holds large-scale data, and we use Spark to process it through elasticsearch-hadoop, following https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html. Currently we have to read all columns of an index. Is…
oaksharks · 33
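
The connector exposes es.read.field.include (and es.read.field.exclude) to restrict which fields are fetched from each document, and its Spark SQL integration also prunes columns when you select() on the resulting DataFrame. A sketch of the option-based form; the host, index and field names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("es-projection").getOrCreate()

    // Fetch only the listed fields instead of whole documents.
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .option("es.read.field.include", "user_id,amount")
      .load("my-index/doc")

    df.show(5)
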
1 vote · 2 answers

Scala SBT elasticsearch-hadoop unresolved dependency

When adding the dependency libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "5.1.1" and refreshing the project, I get many unresolved dependencies (cascading, org.pentaho, ...). However, if I add another dependency, like…
Romain · 799
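
One commonly suggested fix for this resolution failure is to depend on the Spark-specific artifact instead of the uber elasticsearch-hadoop one, since the latter references Cascading/Pig/Hive/Storm pieces hosted in repositories sbt does not know by default; the alternative is to keep elasticsearch-hadoop and add those extra resolvers. A build.sbt sketch of the first route, assuming Scala 2.11 and the 5.1.1 version from the question:

    // build.sbt
    scalaVersion := "2.11.8"

    // Spark-only connector artifact; resolves to elasticsearch-spark-20_2.11.
    libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "5.1.1"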