Questions tagged [spotify-scio]

Scio is a Scala API for Google Cloud Dataflow and Apache Beam, inspired by Spark and Scalding.

80 questions
0 votes, 1 answer

Apache Beam Saving to BigQuery using Scio and explicitly specifying TriggeringFrequency

I'm using Spotify Scio to create a Scala Dataflow pipeline which is triggered by a Pub/Sub message. It reads from our private DB and then inserts information into BigQuery. The problem is: I need to delete the previous data. For this, I need to use…
MaxG • 1,079 • 2 • 13 • 26
0 votes, 1 answer

Can I set/unset a default Coder in Scio?

I would like to consistently apply a custom RicherIndicatorCoder for my case class RicherIndicator. Moreover, if I fail to provide a new Coder for Tuples or KVs containing RicherIndicator, then I would like to obtain a compile-time or runtime error…
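A custom coder boils down to an invertible encode/decode pair, and the property it must uphold is that decoding an encoded value returns the original. Below is a minimal stdlib-only sketch of that round-trip contract; the `RicherIndicator` fields and the pipe-delimited codec are hypothetical, and a real Scio coder would be registered as an implicit `Coder` instance rather than a plain object:

```scala
// Hypothetical sketch: the invariant any custom coder must satisfy is
// decode(encode(x)) == x. A real Scio Coder would wrap this pair as an
// implicit Coder[RicherIndicator] instead of a standalone object.
case class RicherIndicator(name: String, value: Double)

object RicherIndicatorCodec {
  // Encode to a delimited string (assumes name contains no '|').
  def encode(r: RicherIndicator): String = s"${r.name}|${r.value}"

  // Invert the encoding; must reconstruct the exact original value.
  def decode(s: String): RicherIndicator = {
    val Array(name, value) = s.split('|')
    RicherIndicator(name, value.toDouble)
  }
}
```

Making the implicit coder visible wherever Tuples or KVs of `RicherIndicator` are built is what lets Scio pick it up instead of a fallback serializer.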
0 votes, 1 answer

Apache Beam - Unable to run Scio g8 starter project

I'm trying to get started with Scio and I've used their giter8 starter project: https://github.com/spotify/scio.g8 I'm using Java 8 on macOS and I'm getting this error when trying to run the project either with target/pack/bin/word-count --output=wc…
MichelDelpech • 853 • 8 • 36
0 votes, 1 answer

Is it possible to control input processing time with scio JobTest?

We are using com.spotify.scio.testing.JobTest for end to end testing of our scio pipeline. The pipeline includes a DoFn that is sensitive to data sequencing, on a stream of configuration data which arrives infrequently. We are passing an ordered…
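A sequence-sensitive DoFn only behaves deterministically in a test if its input arrives in event-time order. One way to approximate that with a plain test fixture is to sort the elements by timestamp before feeding them in; the sketch below is stdlib-only, with a hypothetical `ConfigEvent` type, and stands in for what Beam's `TestStream` achieves with explicit watermark advancement:

```scala
import java.time.Instant

// Hypothetical sketch: sort a test fixture into event-time order so a
// sequence-sensitive step sees elements in the order they "occurred".
case class ConfigEvent(ts: Instant, payload: String)

object OrderedFixture {
  def inEventTimeOrder(events: Seq[ConfigEvent]): Seq[ConfigEvent] =
    events.sortBy(_.ts.toEpochMilli)
}
```

Note this only controls element order, not processing time; controlling when each element is *seen* relative to watermarks needs a streaming test source rather than a sorted list.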
0 votes, 1 answer

Export pubsub data to object storage using SCIO

I am trying to export Cloud Pub/Sub streams to Cloud Storage as described in this post by Spotify, Reliable export of Cloud Pub/Sub streams to Cloud Storage, or this post by Google, Simple backup and replay of streaming events using Cloud Pub/Sub,…
Antoine Sauray • 241 • 1 • 3 • 15
0 votes, 1 answer

Dataflow TextIO.write issues with scaling

I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, I ran it on a topic that does about 2.4GB per minute and…
0 votes, 1 answer

How to deal with CoderException: cannot encode a null String with scio

I just started using Scio and Dataflow. Running my code on one input file worked fine. But when I added more files to the input, I got the following exception: java.lang.RuntimeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null…
yang • 498 • 5 • 22
0 votes, 2 answers

Put SCollection from textFile to BigQuery with Scio

I have read some documents with textFile, and did a flatMap of the single words, adding some extra information for each word: val col = sc.textFile(args.getOrElse("input","documents/*")) .flatMap(_.split("\\s+").filter(_.nonEmpty)) val mapped =…
telex-wap • 832 • 1 • 9 • 30
0 votes, 0 answers

Why would I get a lot of None.get exceptions while draining a streaming pipeline?

I am running into issues where I have a streaming Scio pipeline running on Dataflow that is deduplicating messages and performing some counting by key. When I try to drain the pipeline I get a large number of None.get exceptions supposedly thrown in…
grindlemire • 135 • 10
0 votes, 1 answer

Update BigTable row in Apache Beam (Scio)

I have the following use case: There is a PubSub topic with data I want to aggregate using Scio and then save those aggregates into BigTable. In my pipeline there is a CountByKey aggregation. What I would like to do is to be able to increment value…
0 votes, 2 answers

Scio all saveAs txt file methods output a txt file with part prefix

If I want to output an SCollection of TableRow or String to Google Cloud Storage (GCS), I'm using saveAsTableRowJsonFile or saveAsTextFile, respectively. Both of these methods ultimately use private[scio] def pathWithShards(path: String) =…
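File-based Beam sinks always write numbered shards, which is where the part prefix comes from. A stdlib sketch of the shard-file naming convention those saves produce (the `.txt` suffix and `part` prefix here are illustrative, not Scio's exact internals); renaming the objects afterwards, e.g. with a follow-up GCS copy step, is how a different prefix is usually obtained:

```scala
// Hypothetical sketch of Beam-style shard naming: each shard gets a
// zero-padded index and the total shard count in its file name.
object ShardNames {
  def shardFile(prefix: String, shard: Int, numShards: Int): String =
    f"$prefix%s-$shard%05d-of-$numShards%05d.txt"
}
```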
0 votes, 2 answers

Why is my PCollection (SCollection) size so large compared to BigQuery Table input size?

The above image is the table schema for a BigQuery table which is the input into an Apache Beam Dataflow job that runs on Spotify's Scio. If you aren't familiar with Scio, it's a Scala wrapper around the Apache Beam Java SDK. In particular, a…
0 votes, 1 answer

Fixed window over Unbounded input (PubSub) stops firing after workers autoscale up

Using Scio version 0.4.7, I have a streaming job that's listening to a Pub/Sub topic. I'm using event-time processing here, with a 'timestamp' attribute present on the message properties in RFC3339: val rtEvents: SCollection[RTEvent] =…
ASingh • 53 • 5
0 votes, 1 answer

How to dedupe across over-lapping sliding windows in apache beam / dataflow

I have the following requirement: read events from a Pub/Sub topic; take a window of duration 30 mins and period 1 minute; in that window, if 3 events for a given id all match some predicate, then I need to raise an event in a different Pub/Sub…
Luke De Feo • 2,025 • 3 • 22 • 40
0 votes, 1 answer

How to configure the third-party Scio Scala API lib in GCP

There is a third-party Scio client library which provides a Scala API for Cloud Dataflow in order to access Cloud Bigtable. I am unable to configure this Scala API in GCP. Please help. Link:…