
I'm using Spotify Scio to read logs that are exported from Stackdriver to Google Cloud Storage. They are JSON files where every line is a single entry. Looking at the worker logs, it seems the file is split into chunks, which are then read in an arbitrary order. I've already limited the job to exactly 1 worker in this case. Is there a way to force these chunks to be read and processed in order?

As an example (textFile is basically a TextIO.Read):

import com.spotify.scio._

val sc = ScioContext(myOptions)
sc.textFile(myFile).map(line => logger.info(line))

Would produce output similar to this based on the worker logs:

line 5
line 6
line 7
line 8
<Some other work>
line 1
line 2
line 3
line 4
<Some other work>
line 9
line 10
line 11
line 12

What I want to know is if there's a way to force it to read lines 1-12 in order. I've found that gzipping the file and reading it with the CompressionType specified is a workaround, but I'm wondering whether there are ways to do this that don't involve zipping or otherwise changing the original file.
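For reference, here is a minimal sketch of that gzip workaround. It assumes a Scio version where textFile accepts Beam's Compression enum; the bucket path and logger are placeholders:

import com.spotify.scio._
import org.apache.beam.sdk.io.Compression

val sc = ScioContext(myOptions)
// A gzipped file cannot be split, so a single reader consumes it sequentially.
// Note: this is observed behavior, not something the model guarantees (see the comments below).
sc.textFile("gs://my-bucket/logs.json.gz", compression = Compression.GZIP)
  .map(line => logger.info(line))
sc.close()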

Idrees Khan
  • I had a similar issue/question recently and the feedback was basically 'No'. It seems that even when you run locally, Dataflow still reads in random order. One workaround I implemented, which isn't great, is to read the file in order outside of Dataflow and publish each line to Pub/Sub, with Dataflow subscribing to the topic instead of reading the file (a publisher sketch follows these comments). When Dataflow finishes with each message, it sends a message back saying it's done, so the publisher sends the next one. It's a bit overboard, so it would be great to hear better/built-in options... – VS_FF Feb 03 '17 at 16:01
  • That's unfortunate. I've been thinking of doing something similar, but I may just pre-zip everything, since that at least seems reliable. If I understand correctly, the splitting won't happen when the files are zipped and the reading proceeds in order. I agree that there should be a simpler way, thanks! – Idrees Khan Feb 03 '17 at 16:48
  • 1
    Could you elaborate on your use case? Dataflow is intended for data-parallel processing, and it sounds like you are looking for a serial tool. – Sam McVeety Feb 03 '17 at 17:09
  • 1
    +1 to Sam's question. Please do not rely on observed in-order processing with zip files (or any other observed behavior which is not explicitly guaranteed by the Beam programming model) - this can change at any time without notice; it can change even if you use the same code and the same SDK version as we implement new optimizations backend-side. – jkff Feb 05 '17 at 06:50
  • I am trying to consume exported Stackdriver logs, run them through the transformation logic I have in Dataflow, and then forward them in order to other GC services. Thanks for the input Sam and jkff, I think I can alter the design a bit to have the sequential pieces outside of Dataflow. – Idrees Khan Feb 06 '17 at 16:48
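
For context, a minimal sketch of the publisher half of the workaround VS_FF describes, using the Google Cloud Pub/Sub Java client from Scala. The project, topic, and file names are placeholders, and the done-message handshake back from Dataflow is omitted:

import scala.io.Source
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{ProjectTopicName, PubsubMessage}

val topic = ProjectTopicName.of("my-project", "ordered-log-lines") // placeholder names
val publisher = Publisher.newBuilder(topic).build()

// Publish each line and block on the ack before sending the next, so
// messages enter the topic in file order. Note that Pub/Sub itself does
// not guarantee in-order delivery to subscribers.
for (line <- Source.fromFile("exported-log.json").getLines()) {
  val msg = PubsubMessage.newBuilder()
    .setData(ByteString.copyFromUtf8(line))
    .build()
  publisher.publish(msg).get() // ApiFuture: .get() waits for the publish to complete
}
publisher.shutdown()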

1 Answer


Google Cloud Dataflow / Apache Beam currently do not support sorting or preservation of order in processing pipelines. The drawback of allowing for sorted output is that producing such a result eventually bottlenecks on a single machine, which does not scale to large datasets.
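To see why, consider a hypothetical Scio sketch that tries to recover a total order inside the pipeline. It has to funnel every element through a single key, and therefore through a single worker; extractTimestamp is a placeholder for whatever sortable field the log entries carry:

import com.spotify.scio._

val sc = ScioContext(myOptions)
sc.textFile("gs://my-bucket/logs.json")        // placeholder path
  .map(line => (extractTimestamp(line), line)) // extractTimestamp is hypothetical
  .groupBy(_ => ())                            // single key: all data lands on one worker
  .flatMap { case (_, entries) =>
    entries.toSeq.sortBy(_._1).map(_._2)       // in-memory sort of the entire dataset
  }
sc.close()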

Charles Chen