I'm creating an index using Whoosh on ~100 documents that are about 8MB each.

I have a list of 17 million phrases that I need to search for in these docs, so I'm considering using PySpark on EMR to distribute the load. Is there a way to distribute the entire Whoosh index to each node and have Spark use the local version each time I call the function? Here's what I'm thinking:

def searchFunc(searchphrase):
    ...  # open the local copy of the Whoosh index and run the query
    return results

phrasesRDD = sc.textFile('searchTermsFile.txt').map(lambda x: (x, searchFunc(x)))

Here, the first element of each tuple is the search phrase and the second is an object holding the relevant results, which I'll do more work with later on.

How do I send the entire index to each node and ensure it reads its local copy instead of relying on I/O to the master?
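For reference, here's a rough sketch of the kind of thing I'm imagining, in case it helps clarify the question. It assumes I zip the index directory myself and ship it with sc.addFile, then open it per partition with mapPartitions instead of map; searchPartition and whoosh_index.zip are just placeholder names, and 'content' stands in for my actual schema field. I don't know if this is the right approach:

import tempfile
import zipfile

from pyspark import SparkFiles
from whoosh import index
from whoosh.qparser import QueryParser

# On the driver: ship a zipped copy of the index directory to every executor.
# 'whoosh_index.zip' is a placeholder for a zip of my index directory.
sc.addFile('whoosh_index.zip')

def searchPartition(phrases):
    # Runs once per partition: unzip and open the local copy of the index,
    # then reuse the same searcher for every phrase in the partition.
    local_zip = SparkFiles.get('whoosh_index.zip')
    index_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(index_dir)
    ix = index.open_dir(index_dir)
    parser = QueryParser('content', ix.schema)  # 'content' = my field name
    with ix.searcher() as searcher:
        for phrase in phrases:
            results = searcher.search(parser.parse(phrase))
            # Materialize the hits, since a Results object is only valid
            # while its searcher is open.
            yield (phrase, [hit.fields() for hit in results])

phrasesRDD = sc.textFile('searchTermsFile.txt').mapPartitions(searchPartition)

My thinking with mapPartitions was that the index gets opened once per partition rather than once per phrase, which seems important with 17 million phrases, but I'm not sure it actually reads the local copy the way I want.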
