Questions tagged [elasticsearch-hadoop]

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Requirements

An Elasticsearch cluster (0.9x series, or 1.0.0 or higher, which is highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, dependency-free, self-contained jar that can be downloaded and put to use right away: simply make it available on your job's classpath and you're set. For requirements specific to each library, see its dedicated chapter in the documentation.
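
As a concrete illustration of putting the jar on the job classpath: with Spark it is commonly passed via spark-submit --jars or listed in the Spark configuration. A minimal sketch of the latter, assuming Spark 2.x; the local path, connector version, and node address are placeholders:

    import org.apache.spark.sql.SparkSession

    // Hypothetical path/version of the self-contained elasticsearch-hadoop jar.
    val spark = SparkSession.builder()
      .appName("es-hadoop-demo")
      .config("spark.jars", "/opt/jars/elasticsearch-hadoop-6.8.0.jar")
      .config("es.nodes", "localhost")   // REST endpoint of the Elasticsearch cluster
      .config("es.port", "9200")
      .getOrCreate()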

109 questions
1 vote · 0 answers

pyspark read from s3 and write to elasticsearch

I'm trying to read from s3 and write to Elasticsearch, using a Jupyter install on the Spark master machine. I have this configuration: import pyspark import os #os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3…
YanivK · 11
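
For the flow described in this question (read from S3, write to Elasticsearch), a minimal Scala sketch using the connector's Spark SQL support; the bucket path, index name and es.nodes value are placeholders, and the s3a filesystem classes (hadoop-aws) are assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

    val spark = SparkSession.builder().appName("s3-to-es").getOrCreate()

    // Read JSON records from S3 (placeholder bucket/prefix).
    val df = spark.read.json("s3a://my-bucket/input/")

    // Index the rows; es.nodes points at the cluster's REST endpoint.
    df.saveToEs("my-index/doc", Map("es.nodes" -> "es-host", "es.port" -> "9200"))
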
1 vote · 1 answer

How to set dynamic doc id in elasticsearch sink using spark structured streaming

In the Elasticsearch write sink, how should I add a doc id with dynamic values taken from a dataset field? In my case I need to set the doc id based on a particular field of the formatted dataset. I came across "es.mapping.id", but how would I get values from my…
Gokulraj · 450
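
The setting the question found, es.mapping.id, names the dataset field whose value is used as the document _id. A sketch of an Elasticsearch structured-streaming sink using it, assuming ES-Hadoop 6.x or later; the source, the field name "orderId", the checkpoint path and the index are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stream-to-es").getOrCreate()

    // Placeholder streaming source with a column usable as the document id.
    val orders = spark.readStream
      .format("rate").option("rowsPerSecond", "5").load()
      .withColumnRenamed("value", "orderId")

    val query = orders.writeStream
      .format("es")                           // Elasticsearch sink from ES-Hadoop
      .option("checkpointLocation", "/tmp/es-checkpoint")
      .option("es.mapping.id", "orderId")     // field whose value becomes the _id
      .start("orders-index/doc")

    query.awaitTermination()
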
1 vote · 1 answer

pyspark, elasticsearch input, can't show dataframe

I can do df.head() fine after loading Elasticsearch data, but after I do withColumn I can't do df.head() or df.show(). I can't figure out what's going on; the same withColumn code works fine if I create df2 = sqlContext.createDataFrame( [(1, "a",…
eugene · 39,839
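
A Scala sketch of the flow the question describes, with all names as placeholders: load an index through the connector, check head(), then derive a column and trigger another action. It only reproduces the steps; the reported failure would surface on the last action.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.upper

    val spark = SparkSession.builder().appName("es-read-then-withcolumn").getOrCreate()

    // Load an index through the connector (index and host are placeholders).
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .load("my-index/doc")

    df.head()   // reported to work right after the load

    // Derive a new column ("name" is an assumed field) and act on it again.
    val df2 = df.withColumn("name_upper", upper(df("name")))
    df2.show(5)
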
1 vote · 0 answers

How does Spark write a compressed Parquet file?

Using Apache Spark 1.6.4 with the elasticsearch-hadoop plugin, I am exporting an Elasticsearch index (100M documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7. I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB…
Thomas Decaux · 21,738
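
For reference, a sketch of the export described above using the Spark 1.6-era API the question mentions; the gzip codec is selected through spark.sql.parquet.compression.codec, and the index name, host and HDFS path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._   // adds esDF to SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("es-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Ask Spark SQL to gzip-compress Parquet output.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    // Read the index through the connector and write it out as Parquet on HDFS.
    val df = sqlContext.esDF("my-index/doc", Map("es.nodes" -> "es-host"))
    df.write.parquet("hdfs:///exports/my-index.parquet")
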
1 vote · 0 answers

Elasticsearch-Hadoop multi-resource write formatting issue

I am interfacing Elasticsearch with Spark using the Elasticsearch-Hadoop plugin, and I am having difficulty writing a dataframe with a timestamp-type column to Elasticsearch. The problem occurs when I try to write using dynamic/multi resource…
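
For context, the dynamic/multi-resource write the question refers to routes each document to an index derived from one of its fields, and per the connector documentation date-typed fields can also carry a format. A sketch with placeholder names, assuming a timestamp column called ts:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("multi-resource-write").getOrCreate()
    import spark.implicits._

    // Placeholder dataframe with an event kind and a timestamp column.
    val events = Seq(("click", java.sql.Timestamp.valueOf("2019-01-15 10:00:00")))
      .toDF("kind", "ts")

    // Route each document to an index named after the day of its "ts" value;
    // {field|format} is the connector's multi-resource write pattern.
    events.saveToEs("events-{ts|yyyy.MM.dd}/doc", Map("es.nodes" -> "es-host"))
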
1 vote · 0 answers

Write to Elasticsearch from Spark, wrong timestamp

I have a one-column Spark dataframe: StructType(List(StructField(updateDate,TimestampType,true))). When writing to Elasticsearch with Spark, the updateDate field is not seen as a date and is written as a…
Chargaff · 1,562
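
One commonly suggested workaround for this symptom is to hand Elasticsearch a value its dynamic mapping recognizes as a date, either by defining the index mapping up front or by writing the timestamp as an ISO-8601 string. A sketch of the latter; the column, index and host names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.date_format
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("es-date-write").getOrCreate()
    import spark.implicits._

    val df = Seq(("doc-1", java.sql.Timestamp.valueOf("2019-01-15 10:00:00")))
      .toDF("id", "updateDate")

    // Format the timestamp as ISO-8601 so dynamic mapping detects it as a date.
    val out = df.withColumn("updateDate",
      date_format($"updateDate", "yyyy-MM-dd'T'HH:mm:ss"))

    out.saveToEs("my-index/doc", Map("es.nodes" -> "es-host"))
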
1 vote · 5 answers

Elasticsearch + Spark: write json with custom document _id

I am trying to write a collection of objects to Elasticsearch from Spark. I have to meet two requirements: the document is already serialized as JSON and should be written as is, and the Elasticsearch document _id should be provided. Here's what I tried so…
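
Both requirements map onto documented connector features: saveJsonToEs writes pre-serialized JSON strings as-is, and es.mapping.id points at a field inside the document to use as the _id. A minimal sketch; the index, host and id field name are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark._   // adds saveJsonToEs to RDD[String]

    val spark = SparkSession.builder().appName("json-with-id").getOrCreate()

    // Documents already serialized as JSON, each carrying its own id field.
    val docs = spark.sparkContext.parallelize(Seq(
      """{"id": "1", "title": "first"}""",
      """{"id": "2", "title": "second"}"""
    ))

    // Write the JSON verbatim and take the document _id from the "id" field.
    docs.saveJsonToEs("my-index/doc",
      Map("es.nodes" -> "es-host", "es.mapping.id" -> "id"))
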
1 vote · 1 answer

Spark driver memory for rdd.saveAsNewAPIHadoopFile and workarounds

I'm having issues with a particular Spark method, saveAsNewAPIHadoopFile. The context is that I'm using pyspark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into Elasticsearch (ES). For a variety of reasons, the Spark context is quite…
ghukill · 1,136
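
The question goes through pyspark's saveAsNewAPIHadoopFile, presumably with the connector's EsOutputFormat; in Scala the same indexing is normally expressed directly on the RDD, where each task sends its own bulk requests to Elasticsearch rather than gathering data on the driver. A sketch with placeholder names and data:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    val spark = SparkSession.builder().appName("rdd-index-demo").getOrCreate()

    // Placeholder documents as Scala maps.
    val docs = spark.sparkContext.parallelize(1 to 1000).map { i =>
      Map("id" -> i, "body" -> s"record $i")
    }

    // Bulk sizing is applied per task, not per job.
    docs.saveToEs("records/doc", Map(
      "es.nodes" -> "es-host",
      "es.batch.size.entries" -> "1000"
    ))
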
1 vote · 0 answers

ES batch size is not reflected in Spark + Elasticsearch

I am trying to read 9 GB of JSON data (in multiple files) and load it into ES using the Spark Elasticsearch connector. It took more time than expected: I got 288 tasks, each writing 32 MB and taking around 19 s to complete. One of the documents suggested to…
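
The settings usually meant here are es.batch.size.bytes and es.batch.size.entries; they are applied per task instance, so the bulk load actually hitting the cluster scales with the number of concurrent tasks. A sketch of passing them on a write; the input path, index and host are placeholders, and the values shown are only examples:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("bulk-tuning").getOrCreate()

    val df = spark.read.json("hdfs:///data/events/")   // placeholder input

    df.saveToEs("events/doc", Map(
      "es.nodes" -> "es-host",
      "es.batch.size.bytes" -> "4mb",     // default 1mb, per task
      "es.batch.size.entries" -> "5000"   // default 1000, per task
    ))
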
1 vote · 1 answer

Spark + Elasticsearch write performance issue

I am seeing a low number of writes to Elasticsearch using Spark (Java). Here are the configurations: 13.xlarge machines for the ES cluster, 4 instances, each with 4 processors. I set the refresh interval to -1 and replications to '0', plus other basic configurations…
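
Beyond index-level settings such as the refresh interval and replica count mentioned above, write throughput on the Spark side is mostly governed by how many partitions write in parallel and by the per-task bulk behaviour. A sketch of the knobs involved; the partition count, paths and names are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("write-throughput").getOrCreate()

    val df = spark.read.parquet("hdfs:///staging/docs/")   // placeholder input

    // One writing task per partition; match the count to what the cluster absorbs.
    df.repartition(16)
      .saveToEs("docs/doc", Map(
        "es.nodes" -> "es-host",
        "es.batch.write.refresh" -> "false"   // skip the refresh after each bulk
      ))
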
1 vote · 0 answers

Spark ES-Hadoop Plugin JSON Data

val ordersDF = spark.read.schema(revenue_schema).format("csv").load("s3://xxxx/fifa/pocs/smallMetrics.csv")
val product_df = spark.read.json("s3://xxxx/fifa/pocs/smallCatalogue.json").toDF("id", "product", "style_id")
val product_json_df…
Rajiv · 392
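
The excerpt above is cut off; the usual next step after building such dataframes is writing them to Elasticsearch through the DataFrame writer. A sketch of that step alone, with a placeholder input path, index and host:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-to-es").getOrCreate()

    // Stand-in for the dataframe assembled from the CSV/JSON inputs in the question.
    val productDF = spark.read.json("s3a://my-bucket/catalogue.json")

    productDF.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .mode("append")
      .save("products/doc")
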
1 vote · 1 answer

Zeppelin and Spark Configuration

I'm working with Zeppelin (0.7.1) on Spark (2.1.1) on my localhost, and trying to add some configuration values to the jobs I run. Specifically, I'm trying to set the es.nodes value for elasticsearch-hadoop. I tried adding the key and value to the…
Oren · 1,796
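
If the interpreter-level property does not reach the job, the value can also be supplied per read or write, which bypasses the interpreter configuration entirely; the connector documentation also describes prefixing its settings with spark. (e.g. spark.es.nodes) so that Spark keeps them in its configuration. A sketch of a Zeppelin Spark paragraph passing es.nodes directly; the host and index are placeholders:

    // Inside a %spark paragraph; `spark` is provided by Zeppelin.
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "10.0.0.5")   // placeholder Elasticsearch host
      .option("es.port", "9200")
      .load("my-index/doc")

    df.printSchema()
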
1 vote · 2 answers

Writing data to Elasticsearch: EsHadoopSerializationException

I am using Elasticsearch 5.4 and Hadoop 2.7.3 and want to write data from HDFS to Elasticsearch. My data in…
Yao Pan · 524
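
EsHadoopSerializationException during a write is often triggered by records the connector cannot serialize, frequently malformed input. One defensive pattern is to let Spark parse the HDFS files first and drop lines that fail to parse before handing rows to the connector; a sketch with placeholder paths and names:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark.sql._

    val spark = SparkSession.builder().appName("hdfs-to-es").getOrCreate()

    // DROPMALFORMED discards unparseable lines instead of letting them
    // fail later inside the connector.
    val df = spark.read
      .option("mode", "DROPMALFORMED")
      .json("hdfs:///data/input/")   // placeholder path

    df.saveToEs("my-index/doc", Map("es.nodes" -> "es-host"))
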
1 vote · 1 answer

How to read a few columns of Elasticsearch with Spark?

The ES cluster holds large-scale data, and we use Spark to process it through elasticsearch-hadoop, following https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html. Currently we have to read all columns of an index. Is…
oaksharks · 33
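
The connector exposes es.read.field.include (and es.read.field.exclude) to restrict which fields are fetched from each document, and its Spark SQL integration also prunes columns when you select() on the resulting DataFrame. A sketch of the option-based form; the host, index and field names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("es-projection").getOrCreate()

    // Fetch only the listed fields instead of whole documents.
    val df = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .option("es.read.field.include", "user_id,amount")
      .load("my-index/doc")

    df.show(5)
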
1 vote · 2 answers

Scala SBT elasticsearch-hadoop unresolved dependency

When adding the dependency libraryDependencies += "org.elasticsearch" % "elasticsearch-hadoop" % "5.1.1" and refreshing the project, I get many unresolved dependencies (cascading, org.pentaho, ...). However, if I add another dependency, like…
Romain · 799
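
One commonly suggested fix for this resolution failure is to depend on the Spark-specific artifact instead of the uber elasticsearch-hadoop one, since the latter references Cascading/Pig/Hive/Storm pieces hosted in repositories sbt does not know by default; the alternative is to keep elasticsearch-hadoop and add those extra resolvers. A build.sbt sketch of the first route, assuming Scala 2.11 and the 5.1.1 version from the question:

    // build.sbt
    scalaVersion := "2.11.8"

    // Spark-only connector artifact; resolves to elasticsearch-spark-20_2.11.
    libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "5.1.1"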