0

If I want to output a SCollection of TableRow or String to google cloud storage (GCS) I'm using saveAsTableRowJsonFile or saveAsTextFile, respectively. Both of these methods ultimately use

private[scio] def pathWithShards(path: String) = path.replaceAll("\\/+$", "") + "/part" 

which enforces that file names start with "part". Is the only way to output a custom sharded file via to use saveAsCustomOutput?

Andrew Cassidy
  • 2,940
  • 1
  • 22
  • 46

2 Answers2

3

I had to do it in beam code via saveAsCustomOutput

import org.apache.beam.sdk.util.Transport
val jsonFactory: JsonFactory = Transport.getJsonFactory
val outputPath = "gs://foo/bar_" // file prefix will be bar_
@BigQueryType.toTable()
case class Clazz(foo: String, bar: String)
val collection: SCollection[Clazz] = ....
collection.map(Clazz.toTableRow).
          map(jsonFactory.toString).
          saveAsCustomOutput(name = "CustomWrite", io.TextIO.write()
            .to(outputPath)
            .withSuffix("")
            .withWritableByteChannelFactory(FileBasedSink.CompressionType.GZIP))
Andrew Cassidy
  • 2,940
  • 1
  • 22
  • 46
0

Scio's SCollection#saveAs* APIs are opinionated wrappers for common sinks that mimic behavior of other popular systems, and in this case, how Hadoop Map/Reduce prefix output files. So SCollection#saveAsCustomOutput is the right way to go if you want to access the lower level Beam API directly.

Neville Li
  • 420
  • 3
  • 10