
In DNA informatics the files are massive (300 GB each, and biobanks hold hundreds of thousands of files), and each file needs to go through roughly 6 lengthy downstream pipelines (hours to weeks each). Because I do not work at the company that manufactures the sequencing machines, I have no access to the data as it is being generated, nor do I write assembly language.

What I would like to do is transform the lines of text from one of those 300 GB files into stream events, then pass those messages through the 6 pipelines, with Kafka brokers handing off to Spark Streaming between each pipeline.
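
Roughly what I have in mind is sketched below (a minimal sketch, assuming a kafka-python producer and a single Spark Structured Streaming stage; the broker address, topic names, file path, and transformation are placeholders, and the Spark job would need the spark-sql-kafka connector on its classpath):

```python
# Producer side: turn each line of the large file into a Kafka event.
# Broker address, topic name, and file path are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")

with open("/data/sample_300gb.txt", "rb") as f:
    for line in f:
        producer.send("reads-raw", value=line.rstrip(b"\n"))

producer.flush()
```

```python
# One pipeline stage: read events from one topic, transform them,
# and write the results to the next topic for the next pipeline.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipe1").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "reads-raw")
       .load())

# Placeholder transformation: upper-case the payload; a real stage
# would do its own per-record work here.
transformed = (raw.selectExpr("CAST(value AS STRING) AS value")
               .withColumn("value", F.upper(F.col("value"))))

query = (transformed.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "reads-pipe1")
         .option("checkpointLocation", "/tmp/checkpoints/pipe1")
         .start())

query.awaitTermination()
```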

Is this possible? Is this the wrong use case? It would be nice to rerun single events as opposed to entire failed batches.

Desired Workflow:
------pipe1------
_------pipe2------
__------pipe3------
___------pipe4------


Current Workflow:
------pipe1------
_________________------pipe2------
__________________________________------pipe3------
___________________________________________________------pipe4------
Kermit
  • I'm not sure I see the need for Kafka here. Why not write the large files to storage (S3, HDFS, etc) and consume them in your processing from there? – Robin Moffatt Jan 24 '19 at 09:59
  • @RobinMoffatt That is what our industry does now (see "Current Workflow" above). We process all rows and write them to a new file before moving on to the next pipeline. – Kermit Jan 24 '19 at 15:07

1 Answer


Kafka is not meant for sending files, only relatively small events. Even if you did send a file line by line, you would need to know how to put the file back together to process it, so you would effectively be doing the same thing as streaming the file through a raw TCP socket.

Kafka has a default maximum message size of 1 MB, and while you can increase it, I wouldn't recommend pushing it much past the low double-digit megabytes.

How can I send large messages with Kafka (over 15MB)?
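
For reference, here is a minimal sketch of the client-side settings involved if you do raise the limit (kafka-python option names; the broker-side message.max.bytes and per-topic max.message.bytes have to be raised separately in the broker/topic configuration, and the topic name below is a placeholder):

```python
# Client-side settings in kafka-python for larger messages.
# The broker must also allow them via message.max.bytes (and topics
# via max.message.bytes) in its own configuration.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    max_request_size=15 * 1024 * 1024,   # allow ~15 MB producer requests
)

consumer = KafkaConsumer(
    "large-events",                       # placeholder topic name
    bootstrap_servers="broker:9092",
    fetch_max_bytes=15 * 1024 * 1024,     # allow larger total fetches
    max_partition_fetch_bytes=15 * 1024 * 1024,
)
```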

If you really needed to get data like that through Kafka, the recommended pattern is to put your large files on external storage (HDFS, S3, whatever), then put the URI of the file in the Kafka event, and let consumers deal with reading from that data source.
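
A minimal sketch of that pattern (kafka-python plus boto3; the broker address, topic name, bucket, and key are placeholders):

```python
# Producer: publish a pointer to the file, not the file itself.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send("file-events", {"uri": "s3://my-bucket/runs/sample-001.fastq"})
producer.flush()
```

```python
# Consumer: read the pointer, then fetch the object from S3 itself.
import json
import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer(
    "file-events",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    uri = msg.value["uri"]                 # e.g. s3://my-bucket/runs/sample-001.fastq
    bucket, key = uri[len("s3://"):].split("/", 1)
    obj = s3.get_object(Bucket=bucket, Key=key)
    for line in obj["Body"].iter_lines():  # stream the object without loading it all
        pass                               # process each line here
```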

If the files have any structure at all (pages, for example), then you could use Spark and a custom Hadoop InputFormat to read them, and process the data in parallel that way. It doesn't necessarily have to go through Kafka, though. You could try Apache NiFi, which I hear handles larger files better (maybe not gigabyte-sized ones, though).
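
A rough sketch of what that could look like in PySpark; the InputFormat class named here is hypothetical, something you would implement yourself and ship on the Spark classpath:

```python
# Read a structured large file in parallel with a custom Hadoop InputFormat.
# "com.example.PageInputFormat" is a hypothetical class you would implement;
# key/value classes here are plain Hadoop writables.
from pyspark import SparkContext

sc = SparkContext(appName="parallel-file-read")

pages = sc.newAPIHadoopFile(
    "hdfs:///data/sample_300gb.txt",
    inputFormatClass="com.example.PageInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
)

# Each record is one "page" (or whatever unit the InputFormat emits),
# processed in parallel across the cluster.
counts = pages.map(lambda kv: (len(kv[1]), 1)).countByKey()
```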

OneCricketeer
  • Thanks cricket. With respect to serialization, are you saying that these files would be more manageable in streaming systems using a format like Parquet/Avro? – Kermit Jan 24 '19 at 15:04
  • 1
    Avro for Kafka is better than plaintext, and more usable than your own custom binary format. Parquet isn't "row or record oriented", so therefore doesn't work well in streaming systems, and it's meant for SQL-like columnar access – OneCricketeer Jan 24 '19 at 15:42