
I have about 60k files stored in HDFS, each between 4 KB and 70 KB in size. I am trying to process them by running a regex search on specific files I already know, yet the processing takes far too long, and that doesn't seem right ...

The Spark job runs on YARN.

Hardware specs: 3 nodes, each with 4 cores and 15 GB RAM.

import ntpath

targeted_files = sc.broadcast(sc.textFile(doc).collect())  # list of the 3 targeted file names

# hdfs://hadoop.localdomain/path/to/directory/ contains ~60K files
df = sc.wholeTextFiles(
    "hdfs://hadoop.localdomain/path/to/directory/").filter(
    lambda pairRDD: ntpath.basename(pairRDD[0]) in targeted_files.value)

print('Result : ', df.collect())  # when I run this step alone, it takes ~15 minutes to finish

df = df.map(filterMatchRegex).toDF(['file_name', 'result'])  # this takes ~an hour and still doesn't finish

Is using HDFS and Spark the right choice for this task? I also thought that, in the worst case, the processing time would be comparable to a multi-threaded approach in Java ... what am I doing wrong?

I came across this link, which addresses the same problem, but I am not sure how to handle it in PySpark. It seems most of the time is spent reading the files from HDFS. Is there a better way to read/store small files and process them with Spark?
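Would it help to read only the targeted files directly instead of listing the whole directory? A rough sketch of what I have in mind (the base path is a placeholder; wholeTextFiles accepts a comma-separated list of paths):

base = "hdfs://hadoop.localdomain/path/to/directory/"
paths = ",".join(base + name for name in targeted_files.value)

# Only the 3 targeted (path, content) pairs should be read;
# the other ~60K files are never scanned.
df = sc.wholeTextFiles(paths)
df = df.map(filterMatchRegex).toDF(['file_name', 'result'])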

Exorcismus

3 Answers


Honestly, this doesn't seem like the right use case for Spark. Your dataset is pretty small: even at a generous 100 KB per file, 60k × 100 KB ≈ 6,000 MB = 6 GB, which can reasonably be processed on a single machine. Spark and HDFS add material overhead to processing, so the "worst case" will be clearly slower than a multi-threaded approach on a single machine. In general, at this scale, parallelizing on a single machine (multi-threading) will be faster than parallelizing over a cluster of nodes (Spark).
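To make that concrete, here is a minimal single-machine sketch, assuming the files have been copied to a local directory; the directory, pattern, and file names below are placeholders, not your actual values:

import re
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

pattern = re.compile(r"some_regex")        # placeholder pattern
targeted = {"a.txt", "b.txt", "c.txt"}     # placeholder file names
local_dir = Path("/data/files")            # assumes the files were copied locally

def scan(path):
    # Return (file name, all regex matches) for a single file.
    return path.name, pattern.findall(path.read_text(errors="ignore"))

files = [p for p in local_dir.iterdir() if p.name in targeted]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scan, files))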

information_interchange

In general, the best tool for search in a Hadoop setting is Solr. It is optimized for searching, so while a tool like Spark can get the job done, you should not expect comparable performance from it.
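For illustration, a minimal sketch of querying an existing Solr collection from Python via the pysolr client; the URL, collection name, and field are placeholders, and this assumes the files have already been indexed into Solr:

import pysolr

# Assumes the documents were already indexed into a collection called "docs";
# the URL and field name are placeholders.
solr = pysolr.Solr("http://solr-host:8983/solr/docs", timeout=10)

# Solr query syntax rather than a raw regex search over file contents.
results = solr.search("content:searchterm", rows=10)
for doc in results:
    print(doc["id"])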

Dennis Jaheruddin
  • will SOLR provide parallel processing for multiple files? – Exorcismus Jul 31 '19 at 15:28
  • @Exorcismus SOLR is indeed built to scale massively, so it can certainly work in parallel. However, upon reviewing the other answers, I must also say that if your total dataset is only a few GB, it feels a bit odd to use a multi-node setup. – Dennis Jaheruddin Aug 01 '19 at 08:11
  • @Exorcismus Also note that Solr will not re-read the raw files from HDFS every time you run a query, which may well be the bottleneck that you are hitting now. – Dennis Jaheruddin Aug 01 '19 at 08:12

Try df.coalesce(20) after loading to reduce the number of partitions and keep each partition around ~128 MB. Perform the transformations and actions afterwards.
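A minimal sketch of where that call would sit in the pipeline from the question (the partition count of 20 is illustrative and should be tuned to the data size):

rdd = sc.wholeTextFiles("hdfs://hadoop.localdomain/path/to/directory/")
rdd = rdd.coalesce(20)  # fewer, larger partitions before the heavier work

df = rdd.map(filterMatchRegex).toDF(['file_name', 'result'])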