
I have about 60k files stored in HDFS, each between 4 KB and 70 KB in size. I am trying to process them by running a regex search on specific files I already know, yet the processing takes far too long, and that doesn't seem right ...

The Spark job runs on YARN.

Hardware specs: 3 nodes, each with 4 cores and 15 GB RAM.

import ntpath

targeted_files = sc.broadcast(sc.textFile(doc).collect())  # list of the 3 targeted file names

# hdfs://hadoop.localdomain/path/to/directory/ contains ~60K files
df = sc.wholeTextFiles(
    "hdfs://hadoop.localdomain/path/to/directory/").filter(
    lambda pairRDD: ntpath.basename(pairRDD[0]) in targeted_files.value)

print('Result : ', df.collect())  # when I run this step alone, it takes ~15 minutes to finish

df = df.map(filterMatchRegex).toDF(['file_name', 'result'])  # this takes ~an hour and still doesn't finish

Is using HDFS and Spark the right choice for this task? I also thought that, in the worst case, the processing time would be comparable to a multi-threaded approach in Java ... what am I doing wrong?

I came across this link, which addresses the same problem, but I am not sure how to handle it in PySpark. It seems most of the time is spent reading the files from HDFS. Is there a better way to read/store small files and process them with Spark?
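Would it help to read only the targeted files directly instead of listing the whole directory? A rough sketch of what I have in mind (the base path is a placeholder; wholeTextFiles accepts a comma-separated list of paths):

base = "hdfs://hadoop.localdomain/path/to/directory/"
paths = ",".join(base + name for name in targeted_files.value)

# Only the 3 targeted (path, content) pairs should be read;
# the other ~60K files are never scanned.
df = sc.wholeTextFiles(paths)
df = df.map(filterMatchRegex).toDF(['file_name', 'result'])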

Exorcismus

3 Answers


Honestly, this doesn't seem like the right use case for Spark. Your dataset is pretty small: even at a generous 100 KB per file, 60k × 100 KB ≈ 6,000 MB = 6 GB, which can reasonably be processed on a single machine. Spark and HDFS add material overhead to processing, so the "worst case" will be clearly slower than a multi-threaded approach on a single machine. In general, at this scale, parallelizing on a single machine (multi-threading) will be faster than parallelizing over a cluster of nodes (Spark).
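To make that concrete, here is a minimal single-machine sketch, assuming the files have been copied to a local directory; the directory, pattern, and file names below are placeholders, not your actual values:

import re
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

pattern = re.compile(r"some_regex")        # placeholder pattern
targeted = {"a.txt", "b.txt", "c.txt"}     # placeholder file names
local_dir = Path("/data/files")            # assumes the files were copied locally

def scan(path):
    # Return (file name, all regex matches) for a single file.
    return path.name, pattern.findall(path.read_text(errors="ignore"))

files = [p for p in local_dir.iterdir() if p.name in targeted]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scan, files))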

information_interchange

In general, the best tool for search in a Hadoop setting is Solr. It is optimized for searching, so while a tool like Spark can get the job done, you should not expect comparable performance from it.
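For illustration, a minimal sketch of querying an existing Solr collection from Python via the pysolr client; the URL, collection name, and field are placeholders, and this assumes the files have already been indexed into Solr:

import pysolr

# Assumes the documents were already indexed into a collection called "docs";
# the URL and field name are placeholders.
solr = pysolr.Solr("http://solr-host:8983/solr/docs", timeout=10)

# Solr query syntax rather than a raw regex search over file contents.
results = solr.search("content:searchterm", rows=10)
for doc in results:
    print(doc["id"])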

Dennis Jaheruddin
  • will SOLR provide parallel processing for multiple files? – Exorcismus Jul 31 '19 at 15:28
  • @Exorcismus SOLR is indeed built to scale massively, so it can certainly work in parallel. However, upon reviewing the other answers, I must also say that if your total dataset is only a few GB, it feels a bit odd to use a multi-node setup. – Dennis Jaheruddin Aug 01 '19 at 08:11
  • @Exorcismus Also note that Solr will not re-read the raw files from HDFS every time you run a query, which may well be the bottleneck that you are hitting now. – Dennis Jaheruddin Aug 01 '19 at 08:12

Try df.coalesce(20) after loading to reduce the number of partitions and keep each partition around ~128 MB. Perform the transformations and actions afterwards.
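A minimal sketch of where that call would sit in the pipeline from the question (the partition count of 20 is illustrative and should be tuned to the data size):

rdd = sc.wholeTextFiles("hdfs://hadoop.localdomain/path/to/directory/")
rdd = rdd.coalesce(20)  # fewer, larger partitions before the heavier work

df = rdd.map(filterMatchRegex).toDF(['file_name', 'result'])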