Questions tagged [spotify-scio]

Scio is a Scala API for Apache Beam and Google Cloud Dataflow, inspired by Spark and Scalding.

80 questions
2
votes
2 answers

PubSub watermark not advancing

I've written an Apache Beam job using Scio that generates session IDs for incoming data records and then enriches them before outputting them to BigQuery. Here's the code: val measurements =…
user9441243
2
votes
1 answer

Does Scio type-safe BigQuery support WITH clauses?

val query = s"""#standardsql | WITH A AS (SELECT * FROM `prefix.andrews_test_table` LIMIT 1000) | select * from A""" @BigQueryType.fromQuery(query) class Test is consistently giving me :40: error: Missing query. This query runs fine in…
Andrew Cassidy
  • 2,940
  • 1
  • 22
  • 46
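One likely cause of the "Missing query" error (a guess, since the full macro context is truncated above) is that the `|` margin characters are never stripped, so BigQuery receives an invalid query string. A minimal pure-Scala sketch of the string handling only; whether `@BigQueryType.fromQuery` accepts a `.stripMargin`-ed literal depends on the Scio version, so this does not exercise the macro itself:

```scala
// Build the multi-line standard-SQL query and strip the '|' margins.
val query =
  """#standardsql
    |WITH A AS (SELECT * FROM `prefix.andrews_test_table` LIMIT 1000)
    |SELECT * FROM A""".stripMargin

// Without .stripMargin the '|' characters stay in the string,
// which BigQuery cannot parse.
```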
2
votes
1 answer

"GC overhead limit exceeded" for long running streaming dataflow job

Running my streaming Dataflow job for a longer period of time tends to end in a "GC overhead limit exceeded" error, which brings the job to a halt. How can I best proceed to debug this? java.lang.OutOfMemoryError: GC overhead limit exceeded at…
Brodin
  • 197
  • 1
  • 1
  • 14
2
votes
1 answer

Dataflow job halts with "Processing lull"

Running a streaming Dataflow pipeline with quite an advanced group-by using session windows, I run into problems after a couple of hours of running. The job scales up in workers, but later starts producing loads of logs with the following Processing lull…
Brodin
  • 197
  • 1
  • 1
  • 14
2
votes
1 answer

How to set labels on Google Dataflow jobs using Scio?

I want to set labels on Google Dataflow jobs for cost allocation purposes. Here is an example of working Java code: private DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptionsImpl.class);…
Pradeep
  • 6,303
  • 9
  • 36
  • 60
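The Java snippet above is truncated, but the general shape carries over: Dataflow labels are a plain `Map<String, String>` attached to the pipeline options. A minimal sketch of building that map from Scala; the label keys and values here are made up, and the commented-out Beam calls assume `DataflowPipelineOptions` from the Dataflow runner, which is not on the classpath in this sketch:

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

// Hypothetical labels for cost allocation.
val labels: JMap[String, String] = new JHashMap[String, String]()
labels.put("team", "data-platform")
labels.put("env", "prod")

// With the Dataflow runner on the classpath, the map would be attached to
// the options before running the pipeline, e.g. (untested sketch):
// val options = PipelineOptionsFactory
//   .fromArgs(args: _*)
//   .as(classOf[DataflowPipelineOptions])
// options.setLabels(labels)
// In Scio, the same options object should be reachable via
// sc.optionsAs[DataflowPipelineOptions].
```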
2
votes
1 answer

Scio / apache beam java.lang.IllegalArgumentException: unable to serialize method

I am trying to use Dataflow to move some data from Pub/Sub to Cloud Storage. I need to provide a timestamp to Scio/Beam so it can group the data into windows. I have a simple case class that models my event. It looks like this (some fields…
Luke De Feo
  • 2,025
  • 3
  • 22
  • 40
2
votes
2 answers

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the Shakespeare file's data to Pub/Sub, which did get fetched properly, but none of the transformations after .groupByKey seem to…
Kakaji
  • 1,421
  • 2
  • 15
  • 23
1
vote
1 answer

How to flatten SCollection[SCollection[SomeType]] into SCollection[SomeType]

I'm using Beam (and Scio, though feel free to answer this question for PCollections too) to read from multiple tables in BigQuery. Because I'm reading multiple datasets from a dynamically generated list (it is itself an SCollection[String], where…
pavelmk
  • 53
  • 4
1
vote
2 answers

Apache Beam Stateful DoFn Periodically Output All K/V Pairs

I'm trying to aggregate (per key) a streaming data source in Apache Beam (via Scio) using a stateful DoFn (using @ProcessElement with @StateId ValueState elements). I thought this would be most appropriate for the problem I'm trying to solve. The…
iralls
  • 376
  • 4
  • 16
1
vote
1 answer

Debugging SCollection contents when running tests

Is there any way to view the contents of an SCollection when running a unit test (PipelineSpec)? When running something in production on many machines there would be no way to see the entire collection on one machine, but I wonder whether there is a way to…
lf215
  • 1,185
  • 7
  • 41
  • 83
1
vote
2 answers

Parameterized tests SCIO (JobTest) and Scala test (forAll)

I want to do parameterized tests with Scio's JobTest and ScalaTest. I use TableDrivenPropertyChecks, which allows parameterized tests via a forAll. import org.scalatest.prop.TableDrivenPropertyChecks.{forAll => forAllParams, _} val jobArgs =…
1
vote
1 answer

Apache beam wildcard recursive search for files

I am using Spotify's Scio library for writing Apache Beam pipelines in Scala. I want to search recursively for files under a directory on a filesystem, which can be HDFS, Alluxio, or GCS. For example, *.jar should find all the files under the provided…
Zahid Adeel
  • 288
  • 1
  • 9
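For the local-filesystem case, `java.nio` path matchers already understand the recursive `**` glob; a self-contained sketch below (the helper name is made up). HDFS, Alluxio, and GCS would go through Beam's `FileSystems` abstraction instead, which this sketch does not cover:

```scala
import java.nio.file.{FileSystems, Files, Path}
import scala.jdk.CollectionConverters._

// Walk `root` recursively and keep regular files whose path, relative to
// root, matches the glob. In java.nio glob syntax, "**" crosses directory
// boundaries, so "**.jar" matches jars at any depth.
def findRecursive(root: Path, glob: String): List[Path] = {
  val matcher = FileSystems.getDefault.getPathMatcher(s"glob:$glob")
  Files.walk(root).iterator().asScala
    .filter(p => Files.isRegularFile(p) && matcher.matches(root.relativize(p)))
    .toList
}
```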
1
vote
2 answers

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow. The code in question can be found here. However, I do not think that is too important. My first experiment was to read some local CSV files and write another local CSV file, using the…
1
vote
2 answers

Strange Google Dataflow job log entries

Recently my job logs in the job details view are full of entries such as: "Worker configuration: [machine-type] in [zone]." The jobs themselves seem to work fine, but these entries didn't show up before, and I am worried I won't be able to spot…
przemod
  • 459
  • 3
  • 10
1
vote
1 answer

How can I extract date from a .txt file which name contains date? (Scala)

I have a .txt file as input for my Beam programming project, written in Scala with Spotify's Scio. input = args.getOrElse("input", "/home/user/Downloads/trade-20181001.txt") How can I extract the date 2018-10-01 (1st October) from the file name? Thank you!
atjw94
  • 529
  • 1
  • 6
  • 22
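For a fixed name pattern like `trade-20181001.txt`, a plain regex plus `java.time` parsing is enough; a minimal sketch assuming the date is always eight digits just before the extension (the helper name is hypothetical):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Capture the eight digits before ".txt" and parse them as yyyyMMdd.
val DatedFile = """.*-(\d{8})\.txt""".r

def extractDate(path: String): Option[LocalDate] = path match {
  case DatedFile(digits) =>
    Some(LocalDate.parse(digits, DateTimeFormatter.BASIC_ISO_DATE))
  case _ => None
}
```

For the example above, `extractDate("/home/user/Downloads/trade-20181001.txt")` yields `Some(2018-10-01)`; a file name without the `-yyyyMMdd` suffix yields `None`.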