
I have some data in which column 'X' contains strings. I am writing a function, using pyspark, that takes a search_word and filters out all rows whose column 'X' string does not contain search_word as a substring. The function must also allow for misspellings of the word, i.e. fuzzy matching. I have loaded the data into a pyspark dataframe and written a function, using the NLTK and fuzzywuzzy python libraries, that returns True if a string contains the search_word and False otherwise.

My problem is that I cannot map the function to the dataframe correctly. Am I approaching this problem incorrectly? Should I be trying to do the fuzzy match through some kind of SQL query, or using an RDD perhaps?

I am new to pyspark so I feel like this question must have been answered before but I cannot find the answer anywhere. I have never done any NLP with SQL and I have never heard of SQL being capable of fuzzy matching a substring.

Update #1

The function looks like:

wf = WordFinder(search_word='some_substring')
result1 = wf.find_word_in_string(string_to_search='string containing some_substring or misspelled some_sibstrung')
result2 = wf.find_word_in_string(string_to_search='string not containing the substring')

result1 is True

result2 is False

  • What kind of answer do you expect without [your code](https://stackoverflow.com/help/mcve)? – Mr. T Jan 03 '18 at 10:59
  • @Piinthesky I have added the method above. There is no point in adding the actual code for the class and method because it is just some nltk tokenizers and lemmatizers and a fuzzywuzzy partial_ratio. It is completely irrelevant to the question. The question is about how to apply a function to a pyspark dataframe, whether you can filter rows with a boolean function, and whether sql or python is the best approach to the problem. – Dónal Flanagan Jan 03 '18 at 11:52

1 Answer


An easy way is to use the built-in levenshtein function, which computes the edit distance between two string columns. For example,

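from pyspark.sql import functions as func

(
    spark.createDataFrame([("apple",), ("aple",), ("orange",), ("pear",)], ["fruit"])
    .withColumn("substring", func.lit("apple"))
    # Edit distance between each fruit and the search word
    .withColumn("levenstein", func.levenshtein("fruit", "substring"))
    # Keep only rows within one edit of the search word
    .filter("levenstein <= 1")
    .toPandas()
)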

returns

   fruit substring  levenstein
0  apple     apple           0
1   aple     apple           1

If you want to use a vanilla Python function, like something from an NLTK package, you'll have to define a UDF that takes a string and returns a boolean.
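As a rough sketch of the UDF route, the following wraps a plain Python matcher in a boolean UDF and uses it as a filter. The names `contains_fuzzy` and `filter_fuzzy`, the 0.8 threshold, and the use of the standard library's `difflib` are assumptions made for illustration; they stand in for the asker's `WordFinder` class, whose NLTK and fuzzywuzzy code is not shown.

```python
from difflib import SequenceMatcher


def contains_fuzzy(text, search_word, threshold=0.8):
    """Return True if some window of `text` is close enough to `search_word`.

    Hypothetical stand-in for the asker's WordFinder.find_word_in_string.
    """
    if text is None:  # Spark passes NULL column values as None
        return False
    text, word = text.lower(), search_word.lower()
    n = len(word)
    if n == 0:
        return True
    if len(text) <= n:
        return SequenceMatcher(None, text, word).ratio() >= threshold
    # Slide a window the length of the search word across the text.
    return any(
        SequenceMatcher(None, text[i:i + n], word).ratio() >= threshold
        for i in range(len(text) - n + 1)
    )


def filter_fuzzy(df, column, search_word, threshold=0.8):
    """Keep rows of a Spark DataFrame whose `column` fuzzily contains `search_word`."""
    # Imported here so contains_fuzzy can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    # A UDF that takes a string and returns a boolean can be used directly in filter().
    matcher = F.udf(lambda s: contains_fuzzy(s, search_word, threshold), BooleanType())
    return df.filter(matcher(F.col(column)))
```

With a dataframe `df` like the one in the question, `filter_fuzzy(df, "X", "some_substring")` would return only the matching rows. Python UDFs are evaluated row by row in Python worker processes, so they are generally slower than built-in functions such as levenshtein.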

santon