We have a large number of files, ranging from 1 KB to 5 MB, on our servers; their total size is about 7 TB. The processing algorithm is: read a file and make some decisions based on it. The files come in several formats (doc, txt, png, bmp, etc.), so I can't merge them into bigger files. How can I effectively store and process these files? What technology fits this task well?
-
What kind of processing are you wanting to do with the files? – TomDunning Nov 16 '17 at 18:52
-
@TomDunning, I want to extract data from those files. For example, if the file is an image, I will extract text from it using an image-processing library. If it is a text file, I will just parse it. Sometimes I will load the whole file into memory, sometimes only part of it. – Rustam Fatkullin Nov 17 '17 at 06:10
-
If that’s the case, what’s the problem with parsing the files as you receive them and saving the text to a database? Your largest files are most likely going to be the images, so not huge quantities of text. Is there anything else you need from the files? You can of course retain the files after the data is extracted. – TomDunning Nov 17 '17 at 08:47
3 Answers
You can use various technologies to store and process these files; the main options are listed below.
1 Apache Kafka: create a separate topic for each format and push your data into those topics (a producer sketch follows this list). Advantage:
- Based on your load, you can easily increase consumption speed by adding consumers.
2 Hadoop: store your data in HDFS and design MapReduce jobs to process it.
3 You can use any document-oriented NoSQL database to store your data.
Note: all the above solutions store your data in a distributed fashion and can run on commodity machines.
- Store your data in the cloud (AWS, Google, Azure) and use their APIs to fetch and process it (useful if the data must also be shared with other applications).
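If you go the Kafka route, a minimal producer sketch might look like the following. The broker address, topic names (files.images, files.text), and extension-based routing are assumptions, and since your files go up to 5 MB the producer's max.request.size must be raised above its ~1 MB default (the broker's message.max.bytes must allow it too).

```java
// Hypothetical sketch: one topic per file type, file bytes as the record value.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class FileProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");              // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("max.request.size", "10485760");                   // allow records up to 10 MB

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            Path file = Paths.get(args[0]);
            // Route by extension: images to one topic, everything else to another.
            String name = file.getFileName().toString();
            String topic = (name.endsWith(".png") || name.endsWith(".bmp"))
                    ? "files.images" : "files.text";
            byte[] body = Files.readAllBytes(file);                  // files are at most ~5 MB
            producer.send(new ProducerRecord<>(topic, name, body));
        }
    }
}
```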

- Start by segregating the files into different directories based on type. You can even partition within the individual directories, e.g. /data/images/YYYY-MM-DD, /data/text/YYYY-MM-DD.
- Use MultipleInputs with an appropriate InputFormat for each path (see the sketch after this list).
- Normalize the data into a generic format before sending it to the reducer if needed.
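A minimal driver sketch for the MultipleInputs approach, assuming the directory layout above. TextFileMapper is a hypothetical mapper; the image directory would need its own whole-file InputFormat and OCR mapper, which are omitted here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MixedFormatDriver {

    // Hypothetical mapper for plain-text inputs: tags every line with its source type.
    public static class TextFileMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("text"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-mixed-files");
        job.setJarByClass(MixedFormatDriver.class);

        // Register one InputFormat/Mapper pair per directory. An image directory
        // would register its own pair (custom whole-file InputFormat plus an OCR
        // mapper), omitted here to keep the sketch compilable.
        MultipleInputs.addInputPath(job, new Path("/data/text/2017-11-17"),
                TextInputFormat.class, TextFileMapper.class);

        job.setReducerClass(Reducer.class);              // identity reducer as a placeholder
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```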
There are several ways to ingest the data:
- Use Kafka to store data under different topics based on type (image, text), then copy it from Kafka to HDFS.
- Use Flume
Since you have a huge amount of data:
- Roll up the data in HDFS on a weekly basis. You can use Oozie or Falcon to automate the weekly roll-up process.
- Use CombineFileInputFormat (e.g. CombineTextInputFormat) in your Spark or MapReduce code so that many small files share a split (sketched below).
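A minimal MapReduce sketch of the combine approach; the 128 MB split cap and the input/output paths are assumptions, and the mapper/reducer classes are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSplitsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombinedSplitsDriver.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // assumed 128 MB cap

        FileInputFormat.addInputPath(job, new Path("/data/text"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        // Mapper and reducer classes are set as usual; defaults (identity) are used here.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```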
Last but not least, map the data as a Hive table to expose it to external clients.

Hadoop Archives (HAR) are the usual way to address this. More details are available at: https://hadoop.apache.org/docs/r2.7.0/hadoop-archives/HadoopArchives.html
You also have the option to use SequenceFile or HBase, as described in: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/ (a SequenceFile packing sketch is shown below).
But looking at your use case, HAR fits the bill.
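If you instead go the SequenceFile route from the Cloudera post, a minimal packing sketch could look like this; the local source directory and HDFS target path are assumptions, and each small file becomes one record (filename as key, raw bytes as value).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.nio.file.Files;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One SequenceFile holding many small files: filename -> file bytes.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed/files.seq")),   // assumed HDFS target
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : new File("/local/input").listFiles()) {               // assumed local source dir
                if (!f.isFile()) continue;
                byte[] body = Files.readAllBytes(f.toPath());                    // files are at most ~5 MB
                writer.append(new Text(f.getName()), new BytesWritable(body));
            }
        }
    }
}
```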
