We have a large number of files, ranging from 1 KB to 5 MB, on our servers; their total size is about 7 TB. The processing algorithm is: read a file and make some decisions based on it. The files come in several formats (doc, txt, png, bmp, etc.), so I can't merge them into bigger files. How can I effectively store and process these files? What technology fits this task well?
-
What kind of processing are you wanting to do with the files? – TomDunning Nov 16 '17 at 18:52
-
@TomDunning, I want to extract data from those files. For example, if the file is an image, I will extract text from it using an image-processing library. If it is a text file, I will just parse it. Sometimes I will load the whole file into memory, sometimes only part of it. – Rustam Fatkullin Nov 17 '17 at 06:10
-
If that’s the case, what’s the problem with parsing the files as you receive them and saving the text to a database? Your largest files are most likely going to be the images, so not huge quantities of text. Is there anything else you need from the files? You can of course retain the files after the data is extracted. – TomDunning Nov 17 '17 at 08:47
3 Answers
You can use various technologies to store and process these files; the main options are listed below.
1 Apache Kafka: create a separate topic for each format and push your data into those topics (a producer sketch follows this list). Advantage:
- Based on your load, you can easily increase consumption speed by adding consumers.
2 Hadoop: store your data in HDFS and design MapReduce jobs to process it.
3 You can use any document-oriented NoSQL database to store your data.
Note: all the above solutions store your data in a distributed fashion and can run on commodity machines.
- Store your data in the cloud (AWS, Google, Azure) and use their APIs to fetch and process it (useful if the data must also be shared with other applications).
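If you go the Kafka route, a minimal producer sketch might look like the following. The broker address, topic names (files.images, files.text), and extension-based routing are assumptions, and since your files go up to 5 MB the producer's max.request.size must be raised above its ~1 MB default (the broker's message.max.bytes must allow it too).

```java
// Hypothetical sketch: one topic per file type, file bytes as the record value.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class FileProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");              // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("max.request.size", "10485760");                   // allow records up to 10 MB

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            Path file = Paths.get(args[0]);
            // Route by extension: images to one topic, everything else to another.
            String name = file.getFileName().toString();
            String topic = (name.endsWith(".png") || name.endsWith(".bmp"))
                    ? "files.images" : "files.text";
            byte[] body = Files.readAllBytes(file);                  // files are at most ~5 MB
            producer.send(new ProducerRecord<>(topic, name, body));
        }
    }
}
```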

- Start by segregating the files into different directories based on type. You can even partition within the individual directories, e.g. /data/images/YYYY-MM-DD, /data/text/YYYY-MM-DD.
- Use MultipleInputs with an appropriate InputFormat for each path (see the sketch after this list).
- Normalize the data into a generic format before sending it to the reducer if needed.
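A minimal driver sketch for the MultipleInputs approach, assuming the directory layout above. TextFileMapper is a hypothetical mapper; the image directory would need its own whole-file InputFormat and OCR mapper, which are omitted here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MixedFormatDriver {

    // Hypothetical mapper for plain-text inputs: tags every line with its source type.
    public static class TextFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("text"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-mixed-files");
        job.setJarByClass(MixedFormatDriver.class);

        // Register one InputFormat/Mapper pair per directory. An image directory
        // would register its own pair (custom whole-file InputFormat plus an OCR
        // mapper), omitted here to keep the sketch compilable.
        MultipleInputs.addInputPath(job, new Path("/data/text/2017-11-17"),
                TextInputFormat.class, TextFileMapper.class);

        job.setReducerClass(Reducer.class);              // identity reducer as a placeholder
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```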
There are several ways to ingest the data:
- Use Kafka to store data under different topics based on type (image, text), then copy it from Kafka to HDFS.
- Use Flume
Since you have a huge amount of data:
- Roll up the data in HDFS on a weekly basis. You can use Oozie or Falcon to automate the weekly roll-up process.
- Use CombineFileInputFormat (e.g. CombineTextInputFormat) in your Spark or MapReduce code so that many small files share a split (sketched below).
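A minimal MapReduce sketch of the combine approach; the 128 MB split cap and the input/output paths are assumptions, and the mapper/reducer classes are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSplitsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombinedSplitsDriver.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // assumed 128 MB cap

        FileInputFormat.addInputPath(job, new Path("/data/text"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        // Mapper and reducer classes are set as usual; defaults (identity) are used here.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```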
Last but not least, map the data as a Hive table to expose it to external clients.

Hadoop Archives (HAR) are the usual way to address this. More details are available at: https://hadoop.apache.org/docs/r2.7.0/hadoop-archives/HadoopArchives.html
You also have the option to use SequenceFile or HBase, as described in: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/ (a SequenceFile packing sketch is shown below).
But looking at your use case, HAR fits the bill.
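If you instead go the SequenceFile route from the Cloudera post, a minimal packing sketch could look like this; the local source directory and HDFS target path are assumptions, and each small file becomes one record (filename as key, raw bytes as value).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.nio.file.Files;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One SequenceFile holding many small files: filename -> file bytes.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed/files.seq")),   // assumed HDFS target
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : new File("/local/input").listFiles()) {               // assumed local source dir
                if (!f.isFile()) continue;
                byte[] body = Files.readAllBytes(f.toPath());                    // files are at most ~5 MB
                writer.append(new Text(f.getName()), new BytesWritable(body));
            }
        }
    }
}
```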
