I'm trying to build an application that indexes a bunch of documents in Elasticsearch and retrieves the documents through Boolean queries into Spark for machine learning. I'm trying to do this all in Python, using pySpark and elasticsearch-py.
For the machine learning part, I need to create the features using tokens from each of the text documents. To do this, I need to process/analyze each of the documents with the typical steps: lowercasing, stemming, stopword removal, etc.
So basically I need to turn "Quickly the brown fox is getting away."
into something like "quick brown fox get away"
or ["quick", "brown", "fox", "get", "away"].
I know you can do this pretty easily through various Python packages and functions, but I want to do it using the Elasticsearch analyzers. Furthermore, I need to do it in a way that is efficient for big datasets.
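For concreteness, this is the kind of per-document analysis I mean, done through the ES _analyze API with elasticsearch-py (the index name and analyzer here are placeholders, and depending on the ES version the analyzer may need to be passed as a parameter rather than in the body):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# Ask ES to analyze one document's text; index/analyzer names are hypothetical
result = es.indices.analyze(
    index='my_index',
    body={
        'analyzer': 'english',
        'text': 'Quickly the brown fox is getting away.'
    })

tokens = [t['token'] for t in result['tokens']]
# With the english analyzer this comes back lowercased, stopword-free and
# stemmed, e.g. something like ['quickli', 'brown', 'fox', 'get', 'awai']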
Basically, I want to pull the analyzed versions of the text or the analyzed tokens directly from Elasticsearch and do it within the Spark framework in an efficient manner. Being a relative ES newcomer, I've figured out how to query documents directly from Spark by adapting the elasticsearch-hadoop plugin, following this:
http://blog.qbox.io/elasticsearch-in-apache-spark-python
Basically something like this:
# Configuration for elasticsearch-hadoop's EsInputFormat
read_conf = {
    'es.nodes': 'localhost',
    'es.port': '9200',
    'es.resource': index_name + '/' + index_type,     # index/type to read
    'es.query': '{ "query" : { "match_all" : {} }}',  # pull every document
}

# Read the index into an RDD of (doc_id, document-fields-dict) pairs
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass='org.elasticsearch.hadoop.mr.EsInputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf=read_conf)
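For reference, each record comes back as a (doc_id, field-dict) pair, so pulling out the raw stored text looks roughly like this (the field name 'text' is just a placeholder for whatever the mapping actually uses):

# Each record is (doc_id, {field: value, ...}); grab the stored text field
raw_text_rdd = es_rdd.map(lambda doc: doc[1].get('text'))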
This more or less retrieves the unanalyzed, originally stored version of the text from ES. What I haven't figured out is how to retrieve the analyzed text/tokens in an efficient manner. So far I've come up with two possible ways:
- Map the es.termvector() function provided by elasticsearch-py onto each record of the RDD to retrieve the analyzed tokens.
- Map the es.indices.analyze() function provided by elasticsearch-py onto each record of the RDD to analyze each record (a rough sketch of this follows below).
See related: Elasticsearch analyze() not compatible with Spark in Python?
From my understanding, both of these methods are extremely inefficient for big datasets because they involve a REST call to ES for each record in the RDD.
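To be explicit, here is roughly what I mean by the second option. This is just a sketch with placeholder index/analyzer/field names; I use mapPartitions so the client is created once per partition, but there is still one REST call per record:

from elasticsearch import Elasticsearch

def analyze_partition(records):
    # One client per partition (the client itself can't be serialized to workers)
    es = Elasticsearch(['localhost:9200'])
    for doc_id, doc in records:
        resp = es.indices.analyze(
            index='my_index',                 # hypothetical index name
            body={'analyzer': 'english',      # hypothetical analyzer
                  'text': doc.get('text')})   # hypothetical text field
        yield (doc_id, [t['token'] for t in resp['tokens']])

tokens_rdd = es_rdd.mapPartitions(analyze_partition)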
Thus, my questions are:
- Is there an alternative, efficient way to pull the analyzed text/tokens from ES without making a REST call for each record? Perhaps an ES setting that stores the analyzed text in a field along with the original text (a rough sketch of the kind of mapping I have in mind is below)? Or the ability to request the analyzed tokens/text within the query itself, so that I can just include it in the elasticsearch-hadoop configuration?
- Is there an alternative or better solution to my problem that can leverage Spark's parallel machine learning capabilities with an ES-like query/storage/analysis capability?
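Regarding the first question, for illustration, this is the kind of mapping setting I imagine might store the analyzed terms alongside the source so they could later be fetched through the term vectors API (names are hypothetical, and I'm not sure this is the right mechanism, hence the question):

# Hypothetical mapping that stores term vectors for the text field
es.indices.create(index='my_index', body={
    'mappings': {
        'document': {                          # hypothetical type name
            'properties': {
                'text': {
                    'type': 'string',
                    'analyzer': 'english',
                    'term_vector': 'yes'       # store analyzed terms per doc
                }
            }
        }
    }
})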