
To find out whether a given keyword exists in a huge text file, I came up with the two approaches below.

Approach 1:

def keywordExists(line):
    # 1 if the keyword occurs in this line, 0 otherwise
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)  # add up the per-line flags
print("Found" if total > 0 else "Not Found")

Approach 2:

val keyword = "my_keyword"
val rdd = sparkContext.textFile("test_file.txt")
val count = rdd.filter(line => line.contains(keyword)).count()
println(if (count > 0) "Found" else "Not Found")

The main difference is that the first approach maps each line to a flag and then reduces, whereas the second filters the matching lines and counts them.

Could anyone suggest which one is more efficient?
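Since Approach 1 is Python and Approach 2 is Scala, a Python rendering of Approach 2 may make the two easier to compare directly. This is only a sketch, assuming the same sparkContext and test_file.txt as above:

keyword = "my_keyword"
rdd = sparkContext.textFile("test_file.txt")
# count() is an action, so it evaluates every partition of the file
count = rdd.filter(lambda line: keyword in line).count()
print("Found" if count > 0 else "Not Found")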

Lavanya varma

1 Answer


I would suggest:

val wordFound = !rdd.filter(line => line.contains(keyword)).isEmpty()

Benefit: the search can stop as soon as one occurrence of the keyword is found.

See also: Spark: Efficient way to test if an RDD is empty
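In PySpark, matching the style of Approach 1, the same idea might look like this. A minimal sketch, assuming the sparkContext and file path from the question:

keyword = "my_keyword"
rdd = sparkContext.textFile("test_file.txt")
# isEmpty() only needs to see one element, so Spark can stop scanning
# as soon as a single matching line is found
word_found = not rdd.filter(lambda line: keyword in line).isEmpty()
print("Found" if word_found else "Not Found")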

Raphael Roth