
We are working on a project to decode real-time message files that are transmitted to us as text files. Each file is unstructured text, but we have a spec to decode it. There are different subjects, and every subject receives at least 800 message files per hour with an average file size of 1 KB. The requirement is to decode all the files in real time as they arrive and store the decoded data in structured form in a database, which is then pulled by the front-end application. Once a file is received, the ETA for it to appear in the front end is less than a minute.

This is the data flow I am thinking of:

Message Files(.txt) --> Decode --> Store in DB --> Web App
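
For illustration, here is a minimal Python sketch of what I imagine the Decode --> Store in DB steps to look like; the spec layout, field names, and the sqlite sink are all made up just to show the shape of the flow:

```python
import sqlite3  # stand-in example sink; could equally be SQL Server or Hive

# Hypothetical spec: (attribute name, start offset, end offset) in the raw text
SPEC = [("subject", 0, 8), ("msg_type", 8, 12), ("payload", 12, None)]

def decode_file(text: str) -> dict:
    """Slice the raw message text into named attributes according to the spec."""
    return {name: text[start:end].strip() for name, start, end in SPEC}

def store_record(record: dict, db_path: str = "messages.db") -> None:
    """Persist one decoded record (sqlite used only as an example)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages (subject TEXT, msg_type TEXT, payload TEXT)"
        )
        conn.execute(
            "INSERT INTO messages VALUES (:subject, :msg_type, :payload)", record
        )

# Example: decode one incoming message and store it
sample = "SUBJ0001TYPEsome decoded payload text"
store_record(decode_file(sample))
```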

Can someone help me with the questions below?

  1. Can I use any streaming tool/technology to process the message files in real time?
  2. Is it possible to use a Big Data stack like Cloudera to process these files in real time? Since every file is only 1 KB, won't it impact the storage and performance of the NameNode in HDFS? I am referring to the small-files problem in HDFS.
  3. If I cannot use Big Data, is there an alternative processing strategy I can use to achieve this ETA?
AngiSen
  • Hello @AngiSen, I have some questions regarding your project: 1) what kind of processing do you want to apply to each file once it is received? 2) is it a prerequisite to have the complete file content before processing starts, or could it be done row by row, for instance? 3) what databases/storage systems is your environment already using? – abiratsis Mar 30 '19 at 09:01
  • Hello Alex, 1) I want to decode each and every file using a spec and generate attributes from it. I cannot read them row by row because they are unstructured text and each file has to be treated as a whole. Also, there are cases where I have to stitch multiple files (pages) together into one file and send it for further decoding. 2) I don't have any DB at this moment, but once I decode each file I would get a structured schema to store in some DB like SQL Server or Hive. – AngiSen Mar 30 '19 at 13:21
  • OK @Angi, if I understood well you have 2 file types, **FT1: big files** and **FT2: chunks of text**, and you want to transform the 2nd type into FT1 after applying some kind of processing. Is that correct? – abiratsis Mar 30 '19 at 14:57
  • Yes, kind of. None of my files exceeds 1 MB even after stitching them together. I want to apply the same kind of decode processing to both file types. The only thing is that for certain files I have to read them, identify whether any other pages remain, and wait until all the pages of the file are received before stitching them. Others are direct, as they are just one-page files. – AngiSen Mar 30 '19 at 15:16

1 Answer


Your task has some unknowns.

What is the expected total load? 10 subjects x 800 messages x 1 KB of text per hour doesn't require anything special, and you can just use something simple like a Spring Boot app or a Go app (see the Python sketch below). You are talking about a Big Data stack, so I assume you will have a lot of subjects.
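
For the simple-app route, a minimal Python sketch could look like the one below, assuming the files land in a local directory and using the watchdog library; decode_file and store_record are placeholders you would replace with your own spec decoder and DB insert:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

def decode_file(raw: str) -> dict:
    # placeholder for the spec-based decoder
    return {"raw": raw}

def store_record(record: dict) -> None:
    # placeholder for the insert into SQL Server / Hive / HBase
    print("stored", record)

class MessageFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".txt"):
            return
        # in practice you may need to wait until the sender has finished writing the file
        with open(event.src_path, encoding="utf-8") as f:
            store_record(decode_file(f.read()))

observer = Observer()
observer.schedule(MessageFileHandler(), path="/data/incoming", recursive=False)  # assumed landing dir
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```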

A Big Data stack like Cloudera has at least two good tools for high-scale streaming processing: Kafka and Spark Streaming. Kafka is a message broker that can handle a really high load, with support for replication, high availability, etc. Spark Streaming is a framework that lets you process data on the fly, which is especially useful if you have some complex processing logic.
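
As a rough illustration of that combination, assuming each incoming .txt file is published as one Kafka message on a topic named message-files (the broker address, topic, sink, and decode logic are all placeholders), a PySpark Structured Streaming job could look like this:

```python
# Requires the spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("message-decoder").getOrCreate()

# Each incoming file is assumed to arrive as a single Kafka message.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "message-files")               # assumed topic
       .load())

@udf(StringType())
def decode_message(text):
    # placeholder for the spec-based decoding of one whole file
    return text.upper()

decoded = raw.select(decode_message(col("value").cast("string")).alias("decoded"))

# Console sink for the sketch; in practice use foreachBatch to write to your DB.
query = (decoded.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/decoder-checkpoint")
         .start())
query.awaitTermination()
```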

Regarding small files, it really depends on your case. Why and how do you need to store them?

  1. You can simply not store these files in HDFS at all and put the already decoded data in HBase (or another DB, whatever you want). HBase will deal with the files and regions by itself (see the HBase sketch below the list).

  2. If you want to keep these undecoded files as some kind of raw master data set, you can put the files in temporary storage, compact several files into one big file, and write that big file to HDFS (see the compaction sketch below). There are a lot of options for doing this with Kafka, Spark Streaming, or another similar framework.
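
For option 1, here is a small sketch of writing decoded records straight into HBase with the happybase client; the Thrift host, table name, and column family are assumptions:

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift server
table = connection.table("decoded_messages")            # assumed table with column family 'd'

def store_record(message_id: str, record: dict) -> None:
    """Write one decoded message; each attribute becomes a column in family 'd'."""
    table.put(
        message_id.encode("utf-8"),
        {f"d:{k}".encode("utf-8"): str(v).encode("utf-8") for k, v in record.items()},
    )

store_record("subject1-0001", {"subject": "subject1", "payload": "decoded text"})
```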
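
For option 2, a rough sketch of the compact-then-archive idea, assuming the raw files accumulate in a local staging directory and the compacted batch is pushed to HDFS with the hdfs dfs CLI; the paths and the 128 MB threshold are illustrative:

```python
import glob
import os
import subprocess
import time

STAGING = "/data/staging"          # where the raw 1 KB files accumulate (assumed)
BATCH_LIMIT = 128 * 1024 * 1024    # aim for roughly one HDFS block per batch

def compact_and_upload():
    files = sorted(glob.glob(os.path.join(STAGING, "*.txt")))
    if not files:
        return
    batch_path = os.path.join(STAGING, f"raw-batch-{int(time.time())}.bin")
    size = 0
    with open(batch_path, "wb") as out:
        for path in files:
            with open(path, "rb") as f:
                data = f.read()
            out.write(data + b"\n")
            os.remove(path)
            size += len(data)
            if size >= BATCH_LIMIT:
                break
    # push the compacted batch to HDFS and drop the local copy
    subprocess.run(["hdfs", "dfs", "-put", batch_path, "/raw/messages/"], check=True)
    os.remove(batch_path)
```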

Also, there are plenty of other streaming frameworks such as Apache Storm, Apache Flink, Apache Beam, and Kafka Streams. Each of them has its own pros and cons.

Serge Harnyk
  • Thanks Serge for your response. All my files have to be decoded in parallel and are not supposed to be passed as a stream. I cannot merge them into one big file, as each file is atomic for decoding. Also, I might receive parts of a file in different files, which need to be stitched together. All of this should happen in real time as the files land in our environment. I am not sure whether Kafka or other streaming tools support this, as this is full-file delivery. Can I achieve this real-time decoding of a large volume of files using Python? – AngiSen Mar 29 '19 at 02:38