I am trying to write GenericRecords to a GCS bucket as Avro. On writing I can observe two things:
- The file type in GCS is `application/octet-stream` instead of Avro
- Some of the data are missing, i.e. are null, especially in nested fields

Here is the code sample (this runs inside a DoFn):
```java
String datetimeSuffix = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());
String fileSuffix = String.join("_", "realtime_data", datetimeSuffix);
dirGCSBucket = String.join("/", bucketURI, "year=" + year, "month=" + month, "day=" + day, "hour=" + hour);
Storage storage = StorageOptions.newBuilder().setProjectId(projectID).build().getService();
BlobId blobId = BlobId.of(gcsBucket, dirGCSBucket + "/" + fileSuffix + ".avro");
BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
Blob blob = storage.create(blobInfo);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(new GenericDatumWriter<>(genericRecord.getSchema()));
try {
    dataFileWriter.create(genericRecord.getSchema(), Channels.newOutputStream(blob.writer()));
    dataFileWriter.append(genericRecord);
    dataFileWriter.close();
} catch (IOException e) {
    LOG.warn(String.format("Failed writing for: %s in file: %s", key.toString(), fileSuffix));
    LOG.error("input Error", e);
}
```
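For reference, based on my reading of the client docs, I think the content type could be set explicitly on the `BlobInfo`, and the writer could be closed via try-with-resources so the file is always finalized even on an exception. This is an untested sketch: the `"application/avro"` MIME type is my guess (Avro has no officially registered content type), and `storage.writer(blobInfo)` streams directly without first creating an empty blob with `storage.create(...)`:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.IOException;
import java.nio.channels.Channels;

// ... inside the DoFn, with blobId and genericRecord as above
BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
        .setContentType("application/avro")  // guessed MIME type; GCS defaults to octet-stream otherwise
        .build();

// try-with-resources closes (and flushes) the DataFileWriter even if append throws
try (DataFileWriter<GenericRecord> dataFileWriter =
        new DataFileWriter<>(new GenericDatumWriter<>(genericRecord.getSchema()))) {
    // storage.writer(blobInfo) returns a WriteChannel that streams to GCS directly,
    // replacing the storage.create(blobInfo) + blob.writer() pair
    dataFileWriter.create(genericRecord.getSchema(),
            Channels.newOutputStream(storage.writer(blobInfo)));
    dataFileWriter.append(genericRecord);
} catch (IOException e) {
    LOG.error("Failed writing record", e);
}
```

I'm not sure whether this also explains the null nested fields, so any pointers there would be appreciated.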
Does anyone know why I am having these issues, and/or is there a better way to do this?
My main requirement is to make sure I don't lose data when writing to the GCS bucket.