Scio is a Scala API for Google Cloud Dataflow and Apache Beam inspired by Spark and Scalding.
Questions tagged [spotify-scio]
80 questions
0
votes
1 answer
Apache Beam Saving to BigQuery using Scio and explicitly specifying TriggeringFrequency
I'm using Spotify Scio to create a Scala Dataflow pipeline which is triggered by a Pub/Sub message. It reads from our private DB and then inserts information into BigQuery.
The problem is:
I need to delete the previous data
For this, I need to use…

MaxG
- 1,079
- 2
- 13
- 26
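The Scio-level BigQuery save methods don't expose a triggering frequency directly, but you can drop down to Beam's `BigQueryIO` through `saveAsCustomOutput`. A sketch of that approach (the transform name, table spec, and shard counts are placeholders; `withTriggeringFrequency` requires the `FILE_LOADS` method and an explicit `withNumFileShards`):

```scala
import com.google.api.services.bigquery.model.TableRow
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.{CreateDisposition, Method, WriteDisposition}
import org.joda.time.Duration

// Hand the raw Beam sink to Scio so we can set the triggering frequency.
def saveWithTriggering(rows: SCollection[TableRow], table: String): Unit =
  rows.saveAsCustomOutput(
    "write-to-bq",
    BigQueryIO
      .writeTableRows()
      .to(table)
      .withMethod(Method.FILE_LOADS)                        // batch loads on a timer
      .withTriggeringFrequency(Duration.standardMinutes(5)) // only valid with FILE_LOADS
      .withNumFileShards(10)                                // required alongside triggering frequency
      .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
      .withWriteDisposition(WriteDisposition.WRITE_APPEND)
  )
```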
0
votes
1 answer
Can I set/unset a default Coder in Scio?
I would like to consistently apply a custom RicherIndicatorCoder for my case class RicherIndicator. Moreover if I fail to provide a new Coder for Tuples or KVs containing RicherIndicator then I would like to obtain a compile-time or runtime error…

Ralph Gonzalez
- 69
- 7
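Scio resolves coders implicitly, so one way to apply a custom coder consistently is to put it in the case class's companion object, where implicit search finds it for the class itself and for tuples and KVs built from it. A minimal sketch using `Coder.xmap` to derive the coder from an existing one (field names are illustrative; `Coder.beam(...)` can wrap a hand-written Beam coder instead):

```scala
import com.spotify.scio.coders.Coder

final case class RicherIndicator(id: String, value: Double)

object RicherIndicator {
  // Lives in the companion object, so it wins implicit resolution everywhere
  // RicherIndicator appears, including nested in tuples and KVs.
  implicit val richerIndicatorCoder: Coder[RicherIndicator] =
    Coder.xmap(Coder[(String, Double)])(
      { case (id, v) => RicherIndicator(id, v) },
      r => (r.id, r.value)
    )
}
```

To surface missing coders rather than silently falling back to Kryo, Scio also supports a compile-time warning via the `-Xmacro-settings:show-coder-fallback=true` scalac option.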
0
votes
1 answer
Apache Beam - Unable to run Scio g8 starter project
I'm trying to get started with Scio and I've used their giter8 starter project. https://github.com/spotify/scio.g8
I'm using Java 8 on macOS and I'm getting this error when trying to run the project either with target/pack/bin/word-count --output=wc…

MichelDelpech
- 853
- 8
- 36
0
votes
1 answer
Is it possible to control input processing time with scio JobTest?
We are using com.spotify.scio.testing.JobTest for end to end testing of our scio pipeline. The pipeline includes a DoFn that is sensitive to data sequencing, on a stream of configuration data which arrives infrequently.
We are passing an ordered…

Ralph Gonzalez
- 69
- 7
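For sequencing-sensitive streams, `JobTest` can take a Beam `TestStream` via `inputStream`, which lets the test control element order and watermark advancement explicitly. A sketch, assuming a hypothetical job object `MyJob` wired to a `CustomIO` input (the IO ids, records, and expected output are placeholders):

```scala
import com.spotify.scio.io.CustomIO
import com.spotify.scio.testing.{JobTest, PipelineSpec, testStreamOf}
import org.joda.time.Instant

class SequencedJobTest extends PipelineSpec {
  "the pipeline" should "process configuration before data" in {
    JobTest[MyJob.type]
      .args("--output=out")
      .inputStream(
        CustomIO[String]("events"),
        testStreamOf[String]
          // Elements and watermark moves are replayed in exactly this order.
          .addElements("config-record")
          .advanceWatermarkTo(Instant.parse("2020-01-01T00:00:00Z"))
          .addElements("data-record")
          .advanceWatermarkToInfinity()
      )
      .output(CustomIO[String]("out"))(_ should containInAnyOrder(Seq("expected")))
      .run()
  }
}
```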
0
votes
1 answer
Export pubsub data to object storage using SCIO
I am trying to export Cloud Pub/Sub streams to Cloud Storage, as described by this post by Spotify, Reliable export of Cloud Pub/Sub streams to Cloud Storage, or this post by Google, Simple backup and replay of streaming events using Cloud Pub/Sub,…

Antoine Sauray
- 241
- 1
- 3
- 15
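The basic shape of such an export in Scio is a Pub/Sub read, a fixed window so each output file covers a bounded slice of the stream, and a windowed text write. A minimal sketch (topic and output paths come from args; exact save behavior for windowed streaming writes varies by Scio version):

```scala
import com.spotify.scio.ContextAndArgs
import org.joda.time.Duration

object PubsubToGcs {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.pubsubTopic[String](args("topic"))
      // Each ten-minute window becomes one batch of output files.
      .withFixedWindows(Duration.standardMinutes(10))
      // Explicit numShards matters in streaming, and high-volume topics
      // may need more shards to avoid write bottlenecks.
      .saveAsTextFile(args("output"), numShards = 10)

    sc.run()
  }
}
```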
0
votes
1 answer
Dataflow TextIO.write issues with scaling
I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes to a text file in GCS. I found that with lower-traffic topics this worked perfectly, however I ran it on a topic that does about 2.4GB per minute and…

Tuubeee
- 105
- 6
0
votes
1 answer
How to deal with CoderException: cannot encode a null String with scio
I just started using Scio and Dataflow. I tried my code on one input file and it worked fine. But when I add more files to the input, I get the following exception:
java.lang.RuntimeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null…

yang
- 498
- 5
- 22
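The usual fix is to keep `null` out of the pipeline entirely: wrap nullable fields in `Option` at the point where records are parsed, before any coder has to serialize them. A small pure-Scala sketch (field names are illustrative), relying on `Option(x)` mapping `null` to `None`:

```scala
// A record whose label may be absent in some input files.
final case class Record(id: String, label: Option[String])

// Null-safe construction: Option(null) == None, so downstream coders
// never see a raw null String.
def parse(id: String, rawLabel: String): Record =
  Record(id, Option(rawLabel))
```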
0
votes
2 answers
Put SCollection from textFile to BigQuery with Scio
I have read some documents with textFile, and did a flatMap of the single words, adding some extra information for each word:
val col = sc.textFile(args.getOrElse("input","documents/*"))
.flatMap(_.split("\\s+").filter(_.nonEmpty))
val mapped =…

telex-wap
- 832
- 1
- 9
- 30
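One idiomatic way to land such an SCollection in BigQuery is Scio's type-safe API: annotate a case class with `@BigQueryType.toTable` so the schema is derived for you, then use `saveAsTypedBigQuery`. A sketch continuing the excerpt's word-count shape (the case class, field names, and table spec are placeholders):

```scala
import com.spotify.scio.bigquery._

// Schema is derived from the case class at compile time.
@BigQueryType.toTable
case class WordInfo(word: String, count: Long)

// Inside the pipeline, after building `col` as in the question:
val col = sc.textFile(args.getOrElse("input", "documents/*"))
  .flatMap(_.split("\\s+").filter(_.nonEmpty))

col
  .countByValue
  .map { case (word, count) => WordInfo(word, count) }
  .saveAsTypedBigQuery("my_project:my_dataset.word_info")
```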
0
votes
0 answers
Why would I get a lot of None.get exceptions while draining a streaming pipeline?
I am running into issues where I have a streaming Scio pipeline running on Dataflow that is deduplicating messages and performing some counting by key. When I try to drain the pipeline I get a large number of None.get exceptions supposedly thrown in…

grindlemire
- 135
- 10
0
votes
1 answer
Update BigTable row in Apache Beam (Scio)
I have the following use case:
There is a PubSub topic with data I want to aggregate using Scio and then save those aggregates into BigTable.
In my pipeline there is a CountByKey aggregation. What I would like to do is to be able to increment value…

Marcin Zablocki
- 10,171
- 1
- 37
- 47
0
votes
2 answers
Scio all saveAs txt file methods output a txt file with part prefix
If I want to output a SCollection of TableRow or String to google cloud storage (GCS) I'm using saveAsTableRowJsonFile or saveAsTextFile, respectively. Both of these methods ultimately use
private[scio] def pathWithShards(path: String) =…

Andrew Cassidy
- 2,940
- 1
- 22
- 46
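Since the "part" prefix is baked into Scio's internal `pathWithShards`, one workaround is to bypass it and hand Beam's `TextIO` the naming directly through `saveAsCustomOutput`. A sketch (the transform name, file prefix, and shard-name template are illustrative):

```scala
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.TextIO

// Control the output file names instead of accepting "part-*".
def saveWithCustomNames(lines: SCollection[String], path: String): Unit =
  lines.saveAsCustomOutput(
    "write-text",
    TextIO
      .write()
      .to(s"$path/output")                   // file prefix replaces "part"
      .withSuffix(".txt")
      .withShardNameTemplate("-SSSSS-of-NNNNN")
  )
```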
0
votes
2 answers
Why is my PCollection (SCollection) size so large compared to the BigQuery table input size?
The above image is the table schema for a BigQuery table which is the input into an Apache Beam Dataflow job that runs on Spotify's Scio. If you aren't familiar with Scio, it's a Scala wrapper around the Apache Beam Java SDK. In particular, a…

Andrew Cassidy
- 2,940
- 1
- 22
- 46
0
votes
1 answer
Fixed window over Unbounded input (PubSub) stops firing after workers autoscale up
Using Scio version 0.4.7, I have a streaming job that's listening to a Pub/Sub topic. I'm using event processing here, with a 'timestamp' attribute present on the message properties in RFC3339
val rtEvents: SCollection[RTEvent] =…

ASingh
- 53
- 5
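When windows depend on event time, the Pub/Sub read has to be told which message attribute carries the timestamp, so the watermark tracks event time rather than publish time. A fragment sketching that wiring inside a pipeline (the topic path is a placeholder; the attribute name comes from the question):

```scala
// Read with the event-time attribute so Beam's watermark advances
// according to the 'timestamp' message property.
val rtEvents = sc.pubsubTopic[String](
  "projects/my-project/topics/events",
  timestampAttribute = "timestamp"
)
```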
0
votes
1 answer
How to dedupe across over-lapping sliding windows in apache beam / dataflow
I have the following requirement:
read events from a pub sub topic
take a window of duration 30 mins and period 1 minute
in that window if 3 events for a given id all match some predicate then I need to raise an event in a different pub sub…

Luke De Feo
- 2,025
- 3
- 22
- 40
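The per-window check described above — "did at least 3 events for a given id match the predicate?" — can be expressed as pure logic before worrying about the Beam windowing. A sketch of that core (the `Event` shape is hypothetical); deduping across overlapping windows then reduces to making the emitted alert key deterministic so a downstream distinct-by-key step can drop repeats:

```scala
final case class Event(id: String, value: Double)

// Within one window pane: keep only matching events, group by id,
// and flag ids that matched at least three times.
def idsToAlert(events: Seq[Event], predicate: Event => Boolean): Set[String] =
  events
    .filter(predicate)
    .groupBy(_.id)
    .collect { case (id, hits) if hits.size >= 3 => id }
    .toSet
```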
0
votes
1 answer
How to configure the third-party Scio Scala API in GCP
There is a third-party Scio client library that provides a Scala API for Cloud Dataflow to access Cloud Bigtable. However, I am unable to configure the Scala API in GCP. Please help.
Link:…