I'm creating an index using Whoosh on ~100 documents that are about 8MB each.
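For context, the index is built along these lines (the schema, field names, and paths are simplified placeholders):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT

# A minimal schema: a stored doc id plus the searchable body text.
schema = Schema(path=ID(stored=True), content=TEXT)

os.makedirs('whoosh_index', exist_ok=True)
ix = create_in('whoosh_index', schema)

writer = ix.writer()
for fname in os.listdir('docs'):
    with open(os.path.join('docs', fname), encoding='utf-8') as f:
        writer.add_document(path=fname, content=f.read())
writer.commit()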
I have a list of 17 million phrases that I need to search for in these docs, so I'm considering using PySpark on EMR to distribute the load. Is there a way to distribute the entire Whoosh index to each node and have Spark use the local copy each time I call the function? Here's what I'm thinking:
def searchFunc(searchphrase):
    # ... open the local index, run the query, collect results ...
    return results

phrasesRDD = sc.textFile('searchTermsFile.txt').map(lambda x: (x, searchFunc(x)))
Here the first element of the tuple is the search phrase and the second is an object with the relevant results that I'll do more work with later on.
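For reference, here's roughly how I picture searchFunc being filled in, assuming the index lives at a known node-local path and the schema has a 'content' field (both are placeholders):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

def searchFunc(searchphrase, index_dir='/path/to/local/index'):
    # Open the node-local index and parse the phrase into a query.
    ix = open_dir(index_dir)
    parser = QueryParser('content', schema=ix.schema)  # placeholder field name
    with ix.searcher() as searcher:
        hits = searcher.search(parser.parse(searchphrase), limit=10)
        # Copy the stored fields out before the searcher closes.
        return [hit.fields() for hit in hits]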
How do I send the entire index to each node and make sure each task reads its local copy instead of relying on I/O back to the master?
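One idea I've been toying with is shipping the index directory to the executors with sc.addFile and resolving the node-local copy with SparkFiles.get on the workers. The directory name below is a placeholder, and searchFunc is the sketch from above:

from pyspark import SparkFiles

# Ship the whole index directory to every executor along with the job.
sc.addFile('/home/hadoop/whoosh_index', recursive=True)

def search_local(phrase):
    # SparkFiles.get resolves to the executor-local copy when called on a
    # worker, so this reads the node's own files instead of hitting the master.
    return searchFunc(phrase, index_dir=SparkFiles.get('whoosh_index'))

phrasesRDD = sc.textFile('searchTermsFile.txt').map(lambda x: (x, search_local(x)))

Since opening the index once per phrase looks wasteful with 17 million phrases, I'd probably swap map for mapPartitions so each partition opens the index just once, but I'm not sure addFile is even the right mechanism for a multi-file Whoosh index directory in the first place.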