
I'm trying to build a pipeline using Apache Beam 2.16.0 to process a large number of XML files. The average volume is seventy million files per 24 hours, and at peak load it can reach half a billion. File sizes vary from ~1 KB to 200 KB (sometimes even bigger, for example 30 MB).

Each file goes through a series of transformations, and the final destination is a BigQuery table for further analysis. So first I read the XML file, then deserialize it into a POJO (with the help of Jackson), and then apply all the required transformations. The transformations are pretty fast; on my machine I was able to get about 40,000 transformations per second, depending on file size.

My main concern is file reading speed. I have a feeling that all reading is done by only one worker, and I don't understand how it can be parallelized. I tested on a dataset of 10k files.

A batch job on my local machine (2018 MacBook Pro: SSD, 16 GB RAM, 6-core i7 CPU) can parse about 750 files/sec. If I run the same job on Dataflow with an n1-standard-4 machine, I get only about 75 files/sec. The job usually doesn't scale up, but even when it does (sometimes up to 15 workers), I get only about 350 files/sec.

The streaming job is more interesting. It immediately starts with 6-7 workers, and in the UI I can see 1200-1500 elements/sec, but usually it doesn't show the throughput at all; if I select the last item on the page, it shows that it has already processed 10,000 elements.

The only difference between the batch and streaming jobs is this FileIO option:

.continuously(Duration.standardSeconds(10), Watch.Growth.never())
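
In other words, the two jobs only differ in whether this call is present (a rough comparison derived from the pipeline shown further down):

// Batch: a bounded match; the job finishes once the matched files are processed.
FileIO.match()
    .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
    .filepattern(options.getInputFilePattern())

// Streaming: the same match, but polled every 10 seconds, which produces an
// unbounded PCollection and therefore runs the job in streaming mode.
FileIO.match()
    .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
    .filepattern(options.getInputFilePattern())
    .continuously(Duration.standardSeconds(10), Watch.Growth.never())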

Why does this make such a big difference in processing speed?

Run parameters:

--runner=DataflowRunner
--project=<...>
--inputFilePattern=gs://java/log_entry/*.xml
--workerMachineType=n1-standard-4
--tempLocation=gs://java/temp
--maxNumWorkers=100

The job region and the bucket region are the same.

Pipeline:

pipeline.apply(
  FileIO.match()
    .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
    .filepattern(options.getInputFilePattern())
    .continuously(Duration.standardSeconds(10), Watch.Growth.never()))
  .apply("xml to POJO", ParDo.of(new XmlToPojoDoFn()));

Example of an XML file:

<LogEntry>
    <EntryId>0</EntryId>
    <LogValue>Test</LogValue>
    <LogTime>12-12-2019</LogTime>
    <LogProperty>1</LogProperty>
    <LogProperty>2</LogProperty>
    <LogProperty>3</LogProperty>
    <LogProperty>4</LogProperty>
    <LogProperty>5</LogProperty>
</LogEntry>
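
For illustration, a Jackson POJO for this example file could look roughly like this (field names and the Serializable choice are assumptions, not the real project classes):

import java.io.Serializable;
import java.util.List;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlElementWrapper;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;
import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlRootElement;

@JacksonXmlRootElement(localName = "LogEntry")
public class LogEntry implements Serializable {

  @JacksonXmlProperty(localName = "EntryId")
  private long entryId;

  @JacksonXmlProperty(localName = "LogValue")
  private String logValue;

  @JacksonXmlProperty(localName = "LogTime")
  private String logTime;

  // Repeated <LogProperty> elements without a wrapping parent element.
  @JacksonXmlElementWrapper(useWrapping = false)
  @JacksonXmlProperty(localName = "LogProperty")
  private List<String> logProperty;

  // getters and setters omitted for brevity (Jackson needs them, or visible fields)
}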

The real-life files and project are much more complex, with lots of nested nodes and a huge number of transformation rules.

Simplified code is on GitHub (https://github.com/costello-art/dataflow-file-io). It contains only the "bottleneck" part: reading the files and deserializing them into POJOs.

If I can process about 750 files/sec on my machine (a single powerful worker), then I would expect about 7,500 files/sec from 10 similar workers in Dataflow.

costello
  • Interesting... Bundle sizes are in general larger in batch vs streaming mode, but I'm not sure if this is related. Could you try making use of this (it's deprecated, but this is just to check something): Reshuffle.viaRandomKey() between FileIO.match() and your XmlToPojoDoFn()? – Reza Rokni Dec 19 '19 at 10:10
  • @RezaRokni I didn't notice any difference. But during testing I changed the read operation: FileIO.Match -> FileIO.ReadMatches -> read the file as bytes -> convert bytes to POJO (see the sketch after these comments). I also tested my production code on larger datasets (144k and 1M files) in batch mode: on n1-standard-2 I was able to get about 1,000 files/sec with 17 workers. This is much better, but I'm still not close to the 850 files/sec I get on my local machine. I need to do more testing. I updated the GitHub code with this approach. – costello Dec 19 '19 at 17:28
  • By the way, when you are testing on your local machine, where are the files sitting: on your local disk or in a cloud bucket? – Reza Rokni Dec 20 '19 at 07:59
  • @RezaRokni Locally (SSD). I wonder if I can achieve similar read performance in the cloud. – costello Dec 20 '19 at 09:08
  • Your files are sitting in Cloud Storage, I assume; they will be pulled onto the workers. – Reza Rokni Dec 20 '19 at 13:10
  • Hi @costello, did you experience any timeouts in your code when trying to read 1M files? I'm trying to read 10 million files from a bucket and my FileIO.match() step times out every time. I'd expect Beam to be able to do parallel reads from GCS and scale as necessary? I have maxNumWorkers set to 50 and machineType n1-standard-4. – bitnahian Mar 22 '21 at 06:02
  • @bitnahian Hi. I don't think so; as far as I remember we tested only on 1M files. But we went a harder way: compressing the files into archives and implementing a custom reader. This resulted in a ~100-1000x performance increase. I would recommend creating a very simple job with FileIO.match() and a ParDo that counts the number of matched files. That should not time out. If it still fails, it's a subject for a separate question, as it might be a bug in the SDK. – costello Apr 06 '21 at 14:11
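
A rough sketch of the revised read path discussed in these comments (FileIO.match -> Reshuffle.viaRandomKey -> FileIO.readMatches -> read bytes -> POJO); LogEntry and XmlBytesToPojoFn are placeholder names, and placing the Reshuffle right after the match is an assumption:

pipeline
    .apply(FileIO.match()
        .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
        .filepattern(options.getInputFilePattern()))
    .apply(Reshuffle.viaRandomKey())  // break fusion so the file reads can spread across workers
    .apply(FileIO.readMatches())
    .apply("xml bytes to POJO", ParDo.of(new XmlBytesToPojoFn()));

// where XmlBytesToPojoFn could look roughly like this:
class XmlBytesToPojoFn extends DoFn<FileIO.ReadableFile, LogEntry> {

  private transient XmlMapper xmlMapper;

  @Setup
  public void setup() {
    xmlMapper = new XmlMapper();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // readFullyAsBytes() loads the whole file into memory, which is fine for small files
    byte[] bytes = c.element().readFullyAsBytes();
    c.output(xmlMapper.readValue(bytes, LogEntry.class));
  }
}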

1 Answer


I tried to put together some test code to check the behavior of FileIO.match and the number of workers [1].

In this code I set numWorkers to 50, but you can set whatever value you need. What I could see is that FileIO.match finds all the files that match the pattern, but after that you must deal with the content of each file separately.

For example, in my case I created a method that receives each file and splits its content on the newline (\n) character (but you can handle it however you want; it also depends on the type of file: CSV, XML, ...).

Then I transformed each line into a TableRow, the format that BigQuery understands, and emitted each one separately (out.output(tab)). This way Dataflow can distribute the lines across different workers depending on the workload of the pipeline, for example 3,000 lines across 3 workers, each one handling 1,000 lines.

At the end, since it is a batch job, Dataflow waits until all the lines are processed and then inserts them into BigQuery.
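
As a rough sketch (the DoFn name and the column name are placeholders, and I assume here that the files arrive via FileIO.readMatches()), that per-file method can look like this:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: read each matched file, split it on newlines, emit one TableRow per line.
class FileToTableRowsFn extends DoFn<FileIO.ReadableFile, TableRow> {

  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<TableRow> out)
      throws Exception {
    String contents = file.readFullyAsUTF8String();
    for (String line : contents.split("\n")) {
      if (line.isEmpty()) {
        continue;
      }
      TableRow tab = new TableRow().set("line", line);  // "line" is a placeholder column name
      out.output(tab);  // each row can then be handled by any worker downstream
    }
  }
}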

I hope this test code helps you with yours.

[1] https://github.com/GonzaloPF/dataflow-pipeline/blob/master/java/randomDataToBQ/src/main/fromListFilestoBQ.java

  • My main concern is file read performance; it is the slowest part of the pipeline. So I wonder whether it is possible to speed this up, or whether reading lots of small text files is a bad idea in the first place. Is it better to read one big file, or several bigger files? What performance did you get? Also, I see that you work with CSV files, while my case is XML, so a file cannot be split across different workers (I need to read the entire file at once to be able to parse it). – costello Jan 15 '20 at 10:23
  • It is better to read from different files because that way you force the execution to parallelize across different workers. For example, in this part I was able to see how the workers increased in order to read all the files: I used a bucket with 10,000 files and the workers scaled up to 69, around 144 files per worker. I know you are using XML; I did it with CSV because it was easier to replicate the behavior. Perhaps an idea you could try is to specify the number of workers, `numWorkers`, to see if the worker count increases. Let me know how this change goes. – Gonzalo Pérez Fernández Jan 15 '20 at 12:33
  • Compressing the files into archives plus a custom reader helped a lot, like 100-1000 times better. I somehow missed your answer :| – costello Apr 06 '21 at 14:13
  • If you think the answer helped you, please upvote it. – Gonzalo Pérez Fernández Apr 10 '21 at 10:29