Questions tagged [elasticsearch-hadoop]

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Requirements

An Elasticsearch cluster (0.9X series, or 1.0.0 or higher, which is highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, self-contained jar that can be downloaded and put to use without any dependencies. Simply make it available on your job classpath and you're set. For requirements specific to a particular library, see its dedicated chapter.

109 questions
0 votes, 1 answer

Retrieve metrics from elasticsearch-spark

At the end of an ETL Cascading job, I am extracting metrics about the Elasticsearch ingestion using the metrics that elasticsearch-hadoop exposes through Hadoop counters. I want to do the same using Spark, but I can't find documentation related to…
0 votes, 1 answer

Is it possible to write to a dynamically created Elasticsearch index with a formatted date using elasticsearch-hadoop/spark?

In standalone Spark, I'm trying to write a DataFrame to Elasticsearch. While I can get that to work, what I can't figure out is how to write to a dynamically named index that is formatted like 'index_name-{ts_col:{YYYY-mm-dd}}', where…
Jim
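For context on the question above: the es-hadoop configuration reference describes dynamic (multi-resource) writes using a `{field|date-format}` placeholder in `es.resource.write`, with a Java-style date pattern in which `MM` is the month and `mm` is minutes. The following is a minimal sketch of how such a pattern string could be assembled. The helper name `dated_resource` is hypothetical, and whether this resolves the asker's exact case is an assumption, since the question text is truncated.

```python
def dated_resource(prefix: str, field: str, date_format: str) -> str:
    """Build a write pattern of the form prefix-{field|date_format}.

    Hypothetical helper: it only assembles the string that would be
    passed as the es.resource / es.resource.write setting.
    """
    return f"{prefix}-{{{field}|{date_format}}}"


# 'yyyy-MM-dd' uses MM (month); 'mm' would mean minutes in Java date patterns.
resource = dated_resource("index_name", "ts_col", "yyyy-MM-dd")
print(resource)

# Assumed usage with PySpark (not executed here, requires the connector jar):
# df.write.format("org.elasticsearch.spark.sql") \
#     .option("es.resource", resource) \
#     .save()
```

The field referenced in the placeholder has to exist in every document being written, since its value is what determines the target index.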
0 votes, 1 answer

Inserting arrays in Elasticsearch via PySpark

I have a case much like this one: Example DataFrame: from pyspark.sql.types import * schema = StructType([ # schema StructField("id", StringType(), True), StructField("email", ArrayType(StringType()), True)]) df =…
dtj
0 votes, 1 answer

Ingesting data into Elasticsearch from HDFS: cluster setup and usage

I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances. The current setup is 1 master (Spark and HDFS) and 6 Spark workers and HDFS data nodes. All instances are the same: 16 GB, dual core (unfortunately). I have 3…
0 votes, 1 answer

Insert geographic data into Elasticsearch from Spark

I am trying to upload an RDD with latitude and longitude fields to my ES. I would like to use the geo_point type to plot them on a map. I tried to create a "location" field for each document containing either a string like "12.25, -5.2" or an array of…
Benjamin
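As background for the question above, Elasticsearch accepts several shapes for a `geo_point` value, including an object with `lat` and `lon` keys and an array ordered as `[lon, lat]`. The field generally has to be mapped as `geo_point` before indexing, because the type is not inferred from the data. The sketch below shows two of those shapes; the helper names are hypothetical and the range checks are an added assumption, not something the question mentions.

```python
def geo_point_object(lat: float, lon: float) -> dict:
    """Return the object form of a geo_point: {"lat": ..., "lon": ...}."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return {"lat": lat, "lon": lon}


def geo_point_array(lat: float, lon: float) -> list:
    """Return the array form of a geo_point. Note the order: [lon, lat]."""
    point = geo_point_object(lat, lon)  # reuse the range validation
    return [point["lon"], point["lat"]]


# A document with a "location" field, as described in the question.
doc = {"name": "example", "location": geo_point_object(12.25, -5.2)}
print(doc)
print(geo_point_array(12.25, -5.2))
```

The array form swaps the order relative to the string form (`"lat,lon"`), which is a common source of points landing in the wrong place on a map.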
0 votes, 1 answer

Elasticsearch hadoop configure bulk batch size

I read, possibly on Stack Overflow, that the es-hadoop / es-spark projects use bulk indexing. If so, is the default batch size the same as BulkProcessor's (5 MB)? Is there any configuration to change this? I am using…
rohit
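For reference on the question above, the es-hadoop configuration reference lists `es.batch.size.bytes` (default `1mb`) and `es.batch.size.entries` (default `1000`) as the settings that bound each bulk request, applied per task. The sketch below assembles such settings into an options dictionary. The helper name, the node address, and the resource name are hypothetical placeholders.

```python
def bulk_write_options(nodes: str, resource: str,
                       batch_bytes: str = "1mb",
                       batch_entries: int = 1000) -> dict:
    """Collect connector write settings as string key/value pairs.

    Hypothetical helper; the defaults mirror the documented connector
    defaults, and a bulk request is flushed when either limit is reached.
    """
    return {
        "es.nodes": nodes,
        "es.resource": resource,
        "es.batch.size.bytes": batch_bytes,
        "es.batch.size.entries": str(batch_entries),
    }


options = bulk_write_options("localhost", "my_index/my_type",
                             batch_bytes="5mb", batch_entries=5000)
print(options)

# Assumed usage with PySpark (not executed here, requires the connector jar):
# df.write.format("org.elasticsearch.spark.sql").options(**options).save()
```

Because the limits apply per task, the total load on the cluster scales with the number of concurrent Spark tasks writing at the same time.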
0 votes, 1 answer

Elasticsearch 5.0 and Elasticsearch-Spark connector: what is the correct Maven artefact?

When writing an application to run on Apache Spark 1.6 using the Elasticsearch-Spark connector, the documentation at (https://www.elastic.co/guide/en/elasticsearch/hadoop/5.0/install.html#_minimalistic_binaries) says to use maven artefact
Vladimir Kroz
0 votes, 1 answer

Upgrading to Spark 2.0 dataframe.map

I'm updating some Spark 1.6 code to 2.0.1 and I'm running into some issues using map. I see other questions on SO like encoder-error-while-trying-to-map-dataframe-row-to-updated-row but I have not been able to get these techniques to…
jspooner
0 votes, 1 answer

How to reindex Elasticsearch in parallel

I'm trying to reindex Elasticsearch. I used the Scan and Bulk APIs, but it's very slow. How can I parallelize the process to make it faster? My Python code is as follows: actions=[] for hit in helpers.scan(es,scroll='20m',index=INDEX,doc_type=TYPE,params= …
Jack
0 votes, 1 answer

How to get term vectors using Elasticsearch Hadoop

I'm using the Elasticsearch-Hadoop API, and I was trying to get _mtermvectors using the following Spark code: val query= """_mtermvectors { "ids" : ["1256"], "parameters": { "fields": [ "tname" …
Jack
0 votes, 1 answer

How can elasticsearch-hadoop create two RDDs from different ES clusters?

I need to join two RDDs from two different ES clusters, but I found I can only create one SparkConf and SparkContext based on one ES cluster. For example, the code is as follows: var sparkConf: SparkConf = new SparkConf() sparkConf.set("es.nodes",…
Jack
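One approach that may apply to the question above is to pass connection settings per read rather than on the shared SparkConf, since the connector accepts a configuration map for each RDD. The sketch below builds two independent configuration dictionaries in Python; the helper name, host names, and index names are hypothetical, and the commented PySpark call is an assumed usage that is not executed here.

```python
def es_read_conf(nodes: str, resource: str, port: str = "9200") -> dict:
    """Build a per-read configuration for one Elasticsearch cluster.

    Hypothetical helper: each dictionary is self-contained, so two reads
    can target different clusters from the same SparkContext.
    """
    return {"es.nodes": nodes, "es.port": port, "es.resource": resource}


conf_a = es_read_conf("cluster-a.example.com", "index_a/type_a")
conf_b = es_read_conf("cluster-b.example.com", "index_b/type_b")
print(conf_a)
print(conf_b)

# Assumed usage with PySpark (requires the elasticsearch-hadoop jar):
# rdd_a = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=conf_a)
# rdd_b would be created the same way with conf=conf_b, then joined with rdd_a.
```

In Scala, the equivalent idea would be to pass a settings map to each read call instead of setting `es.nodes` once on the SparkConf.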
0 votes, 0 answers

Elasticsearch count is less than indexed while using elasticsearch-hadoop-2.2

I created an index and indexed data into it using elasticsearch-hadoop-2.2. The HQL looks like this: CREATE EXTERNAL TABLE es_external_table ( field1 type1, field2 type2 ) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES…
Longxing Wei
0 votes, 1 answer

ResourceManager gets stuck in Accepted state

I am trying to integrate my ES 2.2.0 with Hadoop HDFS. In my environment, I have 1 master node and 1 data node. ES is installed on my master node. But while integrating it with HDFS, my ResourceManager application jobs get stuck in…
krishna kumar
0 votes, 2 answers

Extracting data from documents stored in HDFS to index in Elasticsearch

I have an HDFS archive that stores a variety of documents such as PDF, MS Word, PPT, CSV, etc. I would like to build a platform using Elasticsearch to search the files or their text contents. I know I can use the es-hadoop plugin to index data from HDFS to ES.…
Sachin
0 votes, 1 answer

mvn package elasticsearch-spark error

I have a Maven project that uses es-spark to read from Elasticsearch. My pom.xml looks like: com.jzdata.logv es-spark 0.0.1-SNAPSHOT jar