I'm trying to build an application that indexes a bunch of documents in Elasticsearch and retrieves the documents through Boolean queries into Spark for machine learning. I'm trying to do this all in Python, using pySpark and elasticsearch-py.
For the machine learning part, I need to create the features using tokens from each of the text documents. To do this, I need to process/analyze each of the documents with the typical steps: lowercasing, stemming, stopword removal, etc.
So basically I need to turn "Quickly the brown fox is getting away."
into something like "quick brown fox get away"
or ["quick", "brown", "fox", "get", "away"].
I know you can do this pretty easily through various Python packages and functions, but I want to do it using the Elasticsearch analyzers. Furthermore, I need to do it in a way that is efficient for big datasets.
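For concreteness, this is the kind of per-document analysis I mean, done through the ES _analyze API with elasticsearch-py (the index name and analyzer here are placeholders, and depending on the ES version the analyzer may need to be passed as a parameter rather than in the body):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# Ask ES to analyze one document's text; index/analyzer names are hypothetical
result = es.indices.analyze(
    index='my_index',
    body={
        'analyzer': 'english',
        'text': 'Quickly the brown fox is getting away.'
    })

tokens = [t['token'] for t in result['tokens']]
# With the english analyzer this comes back lowercased, stopword-free and
# stemmed, e.g. something like ['quickli', 'brown', 'fox', 'get', 'awai']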
Basically, I want to pull the analyzed versions of the text or the analyzed tokens directly from Elasticsearch and do it within the Spark framework in an efficient manner. Being a relative ES newcomer, I've figured out how to query documents directly from Spark by adapting the elasticsearch-hadoop plugin, following this:
http://blog.qbox.io/elasticsearch-in-apache-spark-python
Basically something like this:
# Configuration for elasticsearch-hadoop's EsInputFormat
read_conf = {
    'es.nodes': 'localhost',
    'es.port': '9200',
    'es.resource': index_name + '/' + index_type,     # index/type to read
    'es.query': '{ "query" : { "match_all" : {} }}',  # pull every document
}

# Read the index into an RDD of (doc_id, document-fields-dict) pairs
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass='org.elasticsearch.hadoop.mr.EsInputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf=read_conf)
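For reference, each record comes back as a (doc_id, field-dict) pair, so pulling out the raw stored text looks roughly like this (the field name 'text' is just a placeholder for whatever the mapping actually uses):

# Each record is (doc_id, {field: value, ...}); grab the stored text field
raw_text_rdd = es_rdd.map(lambda doc: doc[1].get('text'))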
This more or less retrieves the unanalyzed, originally stored version of the text from ES. What I haven't figured out is how to retrieve the analyzed text/tokens in an efficient manner. So far I've come up with two possible ways:
- Map the es.termvector() function provided by elasticsearch-py onto each record of the RDD to retrieve the analyzed tokens.
- Map the es.indices.analyze() function provided by elasticsearch-py onto each record of the RDD to analyze each record (a rough sketch of this follows below).
See related: Elasticsearch analyze() not compatible with Spark in Python?
From my understanding, both of these methods are extremely inefficient for big datasets because they involve a REST call to ES for each record in the RDD.
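To be explicit, here is roughly what I mean by the second option. This is just a sketch with placeholder index/analyzer/field names; I use mapPartitions so the client is created once per partition, but there is still one REST call per record:

from elasticsearch import Elasticsearch

def analyze_partition(records):
    # One client per partition (the client itself can't be serialized to workers)
    es = Elasticsearch(['localhost:9200'])
    for doc_id, doc in records:
        resp = es.indices.analyze(
            index='my_index',                 # hypothetical index name
            body={'analyzer': 'english',      # hypothetical analyzer
                  'text': doc.get('text')})   # hypothetical text field
        yield (doc_id, [t['token'] for t in resp['tokens']])

tokens_rdd = es_rdd.mapPartitions(analyze_partition)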
Thus, my questions are:
- Is there an alternative, efficient way to pull the analyzed text/tokens from ES without making a REST call for each record? Perhaps an ES setting that stores the analyzed text in a field along with the original text (a rough sketch of the kind of mapping I have in mind is below)? Or the ability to request the analyzed tokens/text within the query itself, so that I can just include it in the elasticsearch-hadoop configuration?
- Is there an alternative or better solution to my problem that can leverage Spark's parallel machine learning capabilities with an ES-like query/storage/analysis capability?
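Regarding the first question, for illustration, this is the kind of mapping setting I imagine might store the analyzed terms alongside the source so they could later be fetched through the term vectors API (names are hypothetical, and I'm not sure this is the right mechanism, hence the question):

# Hypothetical mapping that stores term vectors for the text field
es.indices.create(index='my_index', body={
    'mappings': {
        'document': {                          # hypothetical type name
            'properties': {
                'text': {
                    'type': 'string',
                    'analyzer': 'english',
                    'term_vector': 'yes'       # store analyzed terms per doc
                }
            }
        }
    }
})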