I am writing a grep tool in PySpark that takes a word on the command line, searches a text file, and returns every line that contains that word. However, my search returns results that do not match the search word.
#!/usr/bin/python
import sys
from pyspark import SparkContext

def search_word(word):
    if (word) != -1:
        print('%s\t%s' % (word, word.strip()))

# assign search word given on command line
if len(sys.argv) > 1:
    word = sys.argv[1]

sc = SparkContext()
textRDD = sc.textFile("input.txt")
textRDD = textRDD.map(lambda word: word.replace(',', ' ').replace('.', ' ').lower())
textRDD = textRDD.flatMap(lambda word: word.split())
textRDD = textRDD.filter(lambda word: search_word(word))
firstten = textRDD.take(10)
print(firstten)
command line example: spark-submit yourself
example text file:
Ere quitting, for the nonce, the Sperm Whale's head, I would have
you, as a sensible physiologist, simply--particularly remark its front
aspect, in all its compacted collectedness. I would have you investigate
it now with the sole view of forming to yourself some unexaggerated,
intelligent estimate of whatever battering-ram power may be lodged
there. Here is a vital point; for you must either satisfactorily settle
this matter with yourself, or for ever remain an infidel as to one of
the most appalling, but not the less true events, perhaps anywhere to be found in all recorded history.
expected result:
yourself -- it now with the sole view of forming to yourself some unexaggerated
The code above returns this:
produce produce
our our
new new
ebooks ebooks
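
One possible reading of the behaviour, offered as a sketch rather than a confirmed fix: in the script, each lambda names its parameter `word`, which shadows the command-line `word`, so the search term is never compared against anything. `search_word` tests `word != -1`, which is always true for a string, prints the word twice, and returns `None`, so `filter` keeps nothing and the only visible output is the printed words. The `flatMap` step also splits each line into words, so the surrounding line is lost before filtering. The version below keeps lines whole and filters with a predicate that returns a boolean. The names `line_contains_word` and `grep_file` are hypothetical helpers introduced for illustration, and whole-word, case-insensitive matching is an assumption about the intended behaviour.

```python
import re
import sys


def line_contains_word(line, word):
    """Return True if `word` appears in `line` as a whole word (case-insensitive)."""
    # re.escape keeps characters in the search term from acting as regex syntax.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


def grep_file(path, word, limit=10):
    """Return up to `limit` lines of the file at `path` that contain `word`."""
    # Imported here so line_contains_word can be exercised without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    # textFile yields one RDD element per line; skipping flatMap keeps lines whole.
    lines = sc.textFile(path)
    matches = lines.filter(lambda line: line_contains_word(line, word))
    return matches.take(limit)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: input.txt is in the working directory, as in the original script.
    search_term = sys.argv[1]
    for match in grep_file("input.txt", search_term):
        print("%s -- %s" % (search_term, match))
```

The key difference is that the function passed to `filter` returns `True` or `False` instead of printing, and that it closes over the search term under a name that the lambda parameter does not shadow.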