
I am trying to write GenericRecords to a GCS bucket as Avro. On writing, I observe two things:

  • The file type in GCS is application/octet-stream instead of Avro
  • Some of the data are missing, i.e. are null, especially the nested fields

Here is the code sample (this runs inside a DoFn):

String datetimeSuffix = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());

String fileSuffix = String.join("_", "realtime_data", datetimeSuffix);


dirGCSBucket = String.join("/", bucketURI, "year=" + year, "month=" + month, "day=" + day, "hour=" + hour);


Storage storage = StorageOptions.newBuilder().setProjectId(projectID).build().getService();
BlobId blobId = BlobId.of(gcsBucket, dirGCSBucket + "/" + fileSuffix + ".avro");
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();

Blob blob = storage.create(blobInfo);

// try-with-resources so the writer (and the underlying GCS write channel)
// is closed even when create() or append() throws
try (DataFileWriter<GenericRecord> dataFileWriter =
        new DataFileWriter<>(new GenericDatumWriter<>(genericRecord.getSchema()))) {
    dataFileWriter.create(genericRecord.getSchema(), Channels.newOutputStream(blob.writer()));
    dataFileWriter.append(genericRecord);
} catch (IOException e) {
    LOG.warn(String.format("Failed writing for: %s in file: %s", key.toString(), fileName));
    LOG.error("input Error", e);
}

Does anyone know why I am having these issues, and/or is there a better way to do this?

My main requirement is to make sure I don't have data loss when I write to the GCS bucket.
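One thing I wondered about: since the object name only varies by a millisecond timestamp, two parallel DoFn instances firing in the same millisecond would produce the same name and silently overwrite each other's object, which could look like missing data. A sketch of a collision-proof suffix I'm considering (stdlib only; the `realtime_data` prefix matches my naming above):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

public class FileSuffix {
    // Sketch: make the object name collision-proof. A timestamp with
    // millisecond precision is not unique across parallel workers, so
    // append a random UUID to keep every write distinct.
    public static String uniqueSuffix() {
        String ts = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());
        return String.join("_", "realtime_data", ts, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        String a = uniqueSuffix();
        String b = uniqueSuffix();
        // Even if both calls land in the same millisecond, the names differ.
        System.out.println(a.equals(b)); // always false
    }
}
```

I'm not sure whether this is the right fix for the nulls, but it would at least rule out overwrites as a source of loss.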

jeks