
To find out whether a given keyword exists in a huge text file, I came up with the two approaches below.

Approach 1:

def keywordExists(line):
    # 1 if the keyword occurs in this line, 0 otherwise
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)  # add up the per-line flags
print("Found" if total > 0 else "Not Found")

Approach 2:

val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count()
println(if (count > 0) "Found" else "Not Found")

The main difference is that the first approach maps each line to a flag and then reduces, whereas the second filters the matching lines and counts them.

Could anyone suggest which one is more efficient?
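Since Approach 1 is Python and Approach 2 is Scala, a Python rendering of Approach 2 may make the two easier to compare directly. This is only a sketch, assuming the same sparkContext and test_file.txt as above:

keyword = "my_keyword"
rdd = sparkContext.textFile("test_file.txt")
# count() is an action, so it evaluates every partition of the file
count = rdd.filter(lambda line: keyword in line).count()
print("Found" if count > 0 else "Not Found")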

Lavanya varma

1 Answer


I would suggest:

val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()

Benefit: the search can stop as soon as one occurrence of the keyword is found.

See also: Spark: Efficient way to test if an RDD is empty
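In PySpark, matching the style of Approach 1, the same idea might look like this. A minimal sketch, assuming the sparkContext and file path from the question:

keyword = "my_keyword"
rdd = sparkContext.textFile("test_file.txt")
# isEmpty() only needs to see one element, so Spark can stop scanning
# as soon as a single matching line is found
word_found = not rdd.filter(lambda line: keyword in line).isEmpty()
print("Found" if word_found else "Not Found")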

Raphael Roth