I'm creating an index using Whoosh on ~100 documents that are about 8MB each.
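For context, the index is built along these lines (the schema, field names, and paths are simplified placeholders):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT

# A minimal schema: a stored doc id plus the searchable body text.
schema = Schema(path=ID(stored=True), content=TEXT)

os.makedirs('whoosh_index', exist_ok=True)
ix = create_in('whoosh_index', schema)

writer = ix.writer()
for fname in os.listdir('docs'):
    with open(os.path.join('docs', fname), encoding='utf-8') as f:
        writer.add_document(path=fname, content=f.read())
writer.commit()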
I have a list of 17 million phrases that I need to search for in these docs, so I'm considering using PySpark on EMR to distribute the load. Is there a way to distribute the entire Whoosh index to each node and have Spark use the local copy each time I call the function? Here's what I'm thinking:
def searchFunc(searchphrase):
    # ... open the local index, run the query, collect results ...
    return results

phrasesRDD = sc.textFile('searchTermsFile.txt').map(lambda x: (x, searchFunc(x)))
Here the first element of the tuple is the search phrase and the second is an object with the relevant results that I'll do more work with later on.
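For reference, here's roughly how I picture searchFunc being filled in, assuming the index lives at a known node-local path and the schema has a 'content' field (both are placeholders):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

def searchFunc(searchphrase, index_dir='/path/to/local/index'):
    # Open the node-local index and parse the phrase into a query.
    ix = open_dir(index_dir)
    parser = QueryParser('content', schema=ix.schema)  # placeholder field name
    with ix.searcher() as searcher:
        hits = searcher.search(parser.parse(searchphrase), limit=10)
        # Copy the stored fields out before the searcher closes.
        return [hit.fields() for hit in hits]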
How do I send the entire index to each node and make sure each task reads its local copy instead of relying on I/O back to the master?
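One idea I've been toying with is shipping the index directory to the executors with sc.addFile and resolving the node-local copy with SparkFiles.get on the workers. The directory name below is a placeholder, and searchFunc is the sketch from above:

from pyspark import SparkFiles

# Ship the whole index directory to every executor along with the job.
sc.addFile('/home/hadoop/whoosh_index', recursive=True)

def search_local(phrase):
    # SparkFiles.get resolves to the executor-local copy when called on a
    # worker, so this reads the node's own files instead of hitting the master.
    return searchFunc(phrase, index_dir=SparkFiles.get('whoosh_index'))

phrasesRDD = sc.textFile('searchTermsFile.txt').map(lambda x: (x, search_local(x)))

Since opening the index once per phrase looks wasteful with 17 million phrases, I'd probably swap map for mapPartitions so each partition opens the index just once, but I'm not sure addFile is even the right mechanism for a multi-file Whoosh index directory in the first place.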