
I am writing a grep tool in PySpark that takes a word on the command line, searches a text file, and returns any line that contains the given word. My search returns individual words rather than the lines that contain the search word.

    #!/usr/bin/python

    import sys
    from pyspark import SparkContext

    def search_word(word):
        if (word) != -1:
            print('%s\t%s' % (word, word.strip()))

    # assign search word given on command line
    if len(sys.argv) > 1:
        word = sys.argv[1]

    sc = SparkContext()
    textRDD = sc.textFile("input.txt")
    textRDD = textRDD.map(lambda word: word.replace(',', ' ').replace('.', ' ').lower())
    textRDD = textRDD.flatMap(lambda word: word.split())
    textRDD = textRDD.filter(lambda word: search_word(word))
    firstten = textRDD.take(10)
    print(firstten)

command line example: spark-submit yourself

example text file:

Ere quitting, for the nonce, the Sperm Whale's head, I would have
you, as a sensible physiologist, simply--particularly remark its front
aspect, in all its compacted collectedness. I would have you investigate
it now with the sole view of forming to yourself some unexaggerated,
intelligent estimate of whatever battering-ram power may be lodged
there. Here is a vital point; for you must either satisfactorily settle
this matter with yourself, or for ever remain an infidel as to one of
the most appalling, but not the less true events, perhaps anywhere to be found in all recorded history.

expected result:

    yourself --  it now with the sole view of forming to yourself some unexaggerated

The code above returns this:

    produce produce
    our our
    new new
    ebooks  ebooks

1 Answer


I was not really sure about your example in terms of data and results, but as far as I can see there is no need for flatMap or splitting: flatMap breaks each line into individual words, so your filter runs on words rather than whole lines.
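
For reference, a minimal RDD-based fix of your script might look like the sketch below. It keeps whole lines and filters on them directly; the script name grep.py is just a placeholder for however you invoke it:

    import sys
    from pyspark import SparkContext

    # search word from the command line, e.g. spark-submit grep.py yourself
    word = sys.argv[1].lower()

    sc = SparkContext()
    textRDD = sc.textFile("input.txt")

    # no flatMap: the filter sees whole lines, not individual words
    matches = textRDD.filter(lambda line: word in line.lower())
    print(matches.take(10))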

Alternatively, here is the DataFrame approach with a single grep value, in only a few lines of code:

    import pyspark.sql.functions as f

    # 'spark' is the SparkSession, predefined in a notebook or spark-shell
    df = spark.read.text("/FileStore/tables/sample_text.txt").toDF("text_string")
    df.show(100, truncate=False)

    # keep only the lines that contain the search value
    grep_val = 'ZZZ'
    df.where(df.text_string.contains(grep_val)).show(100, truncate=False)

returns:

    +-------------------------+
    |text_string              |
    +-------------------------+
    |Hi how are you today ZZZ |
    |I am fine                |
    |I am also tired          |
    |You look good            |
    |Can I stay with you?     |
    |Bob will pop in later ZZZ|
    |Oh really? Nice, cool    |
    +-------------------------+

    +-------------------------+
    |text_string              |
    +-------------------------+
    |Hi how are you today ZZZ |
    |Bob will pop in later ZZZ|
    +-------------------------+

For multiple search terms you would be better off following the standard approach with a grep_list, rlike and a JOIN; see PySpark: Search For substrings in text and subset dataframe for more flexible general guidance.
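
Continuing from the df above, a sketch of that rlike variant (the grep_list terms here are just placeholders):

    # hypothetical list of search terms; rlike takes a regular expression,
    # so join the terms with | to match any of them
    grep_list = ['ZZZ', 'fine']
    pattern = '|'.join(grep_list)

    df.where(df.text_string.rlike(pattern)).show(100, truncate=False)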
