
I'm migrating a custom sink that extends FileBasedSink from Beam 2.0.0 to 2.2.0. The class has changed and now has two extra type parameters, UserT and DestinationT:

@Experimental(value=FILESYSTEM)
public abstract class FileBasedSink<UserT,DestinationT,OutputT>
extends java.lang.Object
implements java.io.Serializable, HasDisplayData

I've checked the FileBasedSink documentation but cannot find what these two parameters are for.

Of all the type parameters, only OutputT is documented:

* @param <OutputT> the type of values written to the sink.
Paweł Szczur

1 Answer


Note that this API is being redesigned and will be deprecated in the next version of Beam. In the meantime:

  • UserT is the type of PCollection elements to be written - the WriteFiles transform will be applicable to PCollection<UserT>.
  • OutputT is the low-level type of records that will be directly passed to your sink's Writer. It differs from UserT because some sinks have a "format function", e.g. Avro can convert any record to a GenericRecord. UserT is mapped to OutputT via DynamicDestinations.formatRecord.
  • DestinationT is a logical type that supports writing to multiple destinations at the same time, e.g. writing events of different types to Avro files with different schemas in different directories. DestinationT acts as a kind of grouping key for the records to be written: records with the same DestinationT are written using the same configuration. See FileBasedSink.DynamicDestinations: getDestination extracts the destination from a UserT record, and several other methods produce the configuration for a given destination, e.g. DynamicAvroDestinations.getSchema.
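
To make the three roles concrete, here is a minimal plain-Java sketch. It does not use Beam at all: the Destinations interface, the Event class, and the write helper are hypothetical stand-ins that only mirror the getDestination and formatRecord methods mentioned above, and an in-memory map stands in for the files. Here UserT is Event, DestinationT is a directory name, and OutputT is a formatted line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TypeParameterSketch {

    // Hypothetical stand-in for the role DynamicDestinations plays; NOT the Beam class.
    public interface Destinations<UserT, DestinationT, OutputT> {
        // Picks the grouping key (destination) for a user-level element.
        DestinationT getDestination(UserT element);

        // Converts a user-level element into the low-level record the writer receives.
        OutputT formatRecord(UserT element);
    }

    // UserT in this sketch: the element type of the collection being written.
    public static final class Event {
        final String type;
        final String payload;

        public Event(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // DestinationT = String (a directory), OutputT = String (a formatted line).
    public static final class EventDestinations
            implements Destinations<Event, String, String> {
        @Override
        public String getDestination(Event e) {
            return "events/" + e.type;
        }

        @Override
        public String formatRecord(Event e) {
            return e.type + ":" + e.payload;
        }
    }

    // Groups formatted records by destination; the map stands in for output files.
    public static <UserT, DestinationT, OutputT> Map<DestinationT, List<OutputT>> write(
            List<UserT> elements,
            Destinations<UserT, DestinationT, OutputT> destinations) {
        Map<DestinationT, List<OutputT>> files = new LinkedHashMap<>();
        for (UserT element : elements) {
            DestinationT key = destinations.getDestination(element);
            files.computeIfAbsent(key, k -> new ArrayList<>())
                 .add(destinations.formatRecord(element));
        }
        return files;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("click", "a"),
                new Event("purchase", "b"),
                new Event("click", "c"));
        // Each destination collects only the records that mapped to it.
        System.out.println(write(events, new EventDestinations()));
    }
}
```

In the real API the per-destination configuration (filename policy, schema and so on) would also be derived from the DestinationT value; the sketch leaves that out.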

This API is not optimal - e.g. it introduces these high-level concepts (user type and destination) into code specific to file formats (e.g. writing to Avro files). That's why it is being redesigned. Stay tuned for the PR https://github.com/apache/beam/pull/3817 implementing the new API.

jkff
  • Thanks for the explanation. From an API user's perspective, it helps to have this documented even if the design is not optimal, because it is what users see. Would you recommend staying on version 2.0.0 until the redesign is released? – Paweł Szczur Dec 05 '17 at 12:53