I'm trying to build a pipeline using Apache Beam 2.16.0 to process a large number of XML files. The average volume is seventy million files per 24 hours, and at peak load it can go up to half a billion. File sizes vary from ~1 KB to 200 KB (sometimes even bigger, for example 30 MB).
Each file goes through various transformations, and the final destination is a BigQuery table for further analysis. So, first I read the XML file, then deserialize it into a POJO (with the help of Jackson), and then apply all the required transformations. The transformations work pretty fast; on my machine I was able to get about 40,000 transformations per second, depending on file size.
My main concern is file reading speed. I have a feeling that all reading is done by only one worker, and I don't understand how it can be parallelized. I tested on a dataset of 10k files.
A batch job on my local machine (MacBook Pro 2018: SSD, 16 GB RAM, 6-core i7 CPU) can parse about 750 files/sec. If I run the same job on Dataflow using n1-standard-4 machines, I get only about 75 files/sec. It usually doesn't scale up, but even when it does (sometimes up to 15 workers), I get only about 350 files/sec.
The streaming job is more interesting. It immediately starts with 6-7 workers, and in the UI I can see 1200-1500 elements/sec, but usually it doesn't show the throughput at all, and if I select the last item on the page, it shows that it has already processed 10,000 elements.
The only difference between the batch and streaming jobs is this option for FileIO:
.continuously(Duration.standardSeconds(10), Watch.Growth.never()))
Why does this make such a big difference in processing speed?
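For completeness, the batch variant of the read is identical except that it omits that call:

FileIO.match()
    .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
    .filepattern(options.getInputFilePattern())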
Run parameters:
--runner=DataflowRunner
--project=<...>
--inputFilePattern=gs://java/log_entry/*.xml
--workerMachineType=n1-standard-4
--tempLocation=gs://java/temp
--maxNumWorkers=100
Run region and bucket region are the same.
Pipeline:
pipeline.apply(
    FileIO.match()
        .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
        .filepattern(options.getInputFilePattern())
        .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
    .apply("xml to POJO", ParDo.of(new XmlToPojoDoFn()));
Example of an XML file:
<LogEntry><EntryId>0</EntryId>
<LogValue>Test</LogValue>
<LogTime>12-12-2019</LogTime>
<LogProperty>1</LogProperty>
<LogProperty>2</LogProperty>
<LogProperty>3</LogProperty>
<LogProperty>4</LogProperty>
<LogProperty>5</LogProperty>
</LogEntry>
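A POJO for that example would look roughly like this; the class name and field names below are my assumptions derived from the XML above:

import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlElementWrapper;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlRootElement;
import java.io.Serializable;
import java.util.List;

// Sketch of the POJO for the example above (names assumed from the XML);
// Serializable so Beam's SerializableCoder can encode it between stages.
@JacksonXmlRootElement(localName = "LogEntry")
public class LogEntry implements Serializable {
  @JacksonXmlProperty(localName = "EntryId")
  public int entryId;

  @JacksonXmlProperty(localName = "LogValue")
  public String logValue;

  @JacksonXmlProperty(localName = "LogTime")
  public String logTime;

  // Repeated <LogProperty> elements without a wrapper element.
  @JacksonXmlElementWrapper(useWrapping = false)
  @JacksonXmlProperty(localName = "LogProperty")
  public List<String> logProperties;
}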
Real-life files and the real project are much more complex, with lots of nested nodes and a huge number of transformation rules.
Simplified code is on GitHub: https://github.com/costello-art/dataflow-file-io. It contains only the "bottleneck" part: reading files and deserializing them into POJOs.
If I can process about 750 files/sec on my machine (which is one powerful worker), then I would expect about 7,500 files/sec on 10 similar workers in Dataflow.