We are working on a project to decode real-time message files, which are transmitted to us as text files. The content is unstructured text, but we have a spec that describes how to decode it. There are several subjects, and every subject receives at least 800 message files per hour, with an average file size of 1 KB. The requirement is to decode every file in real time as it arrives and store the decoded data in structured form in a database, from which the front-end application pulls it. Once a file is received, its decoded data must appear in the front end in less than a minute.
This is the data flow I am proposing:
Message Files(.txt) --> Decode --> Store in DB --> Web App
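To make the flow concrete, here is a minimal sketch of what I have in mind, not something we have built. Everything in it is an assumption for illustration: the `decode` function is only a placeholder for the real spec, SQLite stands in for whatever database we end up using, and the `messages` table and the inbox directory are hypothetical names.

```python
import sqlite3
from pathlib import Path


def decode(text: str) -> dict:
    """Placeholder decoder; the real rules would come from the spec."""
    # Assumption for illustration only: first line is the subject, the rest is the body.
    subject, _, body = text.partition("\n")
    return {"subject": subject.strip(), "body": body.strip()}


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that the web app would read from."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "filename TEXT PRIMARY KEY, subject TEXT, body TEXT)"
    )


def process_new_files(inbox: Path, conn: sqlite3.Connection, seen: set) -> int:
    """Decode and store every .txt file not handled yet; return how many were stored."""
    stored = 0
    for path in sorted(inbox.glob("*.txt")):
        if path.name in seen:
            continue  # already decoded in an earlier pass
        record = decode(path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?)",
            (path.name, record["subject"], record["body"]),
        )
        seen.add(path.name)
        stored += 1
    conn.commit()
    return stored
```

The idea would be to call `process_new_files` in a loop with a short sleep (for example one second) between passes, so a new file is picked up well inside the one-minute window.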
Can someone help me with the questions below?
- Can I use any streaming tool/tech to process the message files in real time?
- Is it possible to use a Big Data stack like Cloudera to process these files in real time? Since every file is only about 1 KB, won't that hurt the storage and performance of the NameNode in HDFS? I am referring to the Hadoop small files problem.
- If I cannot use a Big Data stack, is there an alternative processing strategy that would let me meet this ETA?
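
For a sense of scale, here is a rough back-of-the-envelope estimate of the load. The per-subject rate and file size come from the requirement above and are lower bounds ("at least 800"); the number of subjects is a made-up figure, since it varies.

```python
FILES_PER_HOUR_PER_SUBJECT = 800  # from the requirement (a lower bound)
AVG_FILE_SIZE_KB = 1              # from the requirement
SUBJECTS = 10                     # hypothetical figure, for illustration only


def load_estimate(subjects: int = SUBJECTS) -> dict:
    """Return approximate arrival rate and data volume for the given number of subjects."""
    files_per_hour = subjects * FILES_PER_HOUR_PER_SUBJECT
    return {
        "files_per_second": files_per_hour / 3600,
        "mb_per_hour": files_per_hour * AVG_FILE_SIZE_KB / 1024,
        "files_per_day": files_per_hour * 24,
    }


if __name__ == "__main__":
    for name, value in load_estimate().items():
        print(f"{name}: {value:.2f}")
```

So the volume in bytes looks small, but the file count adds up quickly, which is why I am worried about the small files issue rather than about raw throughput.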