
Apache Hudi names each Parquet file it writes like this:

0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet

I'm trying to understand what each section of the file name represents. Here is my current understanding, but I would appreciate confirmation or correction from anyone who knows.

0743209d-51cb-4233-a7cd-5bb712fba1ff = file group/file name

-0 = file chunk

20211117172738 = timestamp of the batch

I'm unsure what the below section represents:

21-64-5300=?
cauthon

1 Answer


Here's what I discovered:

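Hudi file name format: 0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet
The parts are separated by underscores.
The first part (0743209d-51cb-4233-a7cd-5bb712fba1ff-0) is the file ID, a unique identifier of the file group.
Next is the write token (21-64-5300).
Last is the commit time (20211117172738), followed by the file extension.
The write token helps detect failed Spark writes.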

// Joins the parts as <fileId>_<writeToken>_<instantTime><fileExtension>.
public static String makeDataFileName(String instantTime, String writeToken, String fileId, String fileExtension) {
    return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, fileExtension);
}
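
To check that layout against the example name, the format string above can be inverted with a small regex. This is only a sketch: the class `HudiFileNameParts` and its `split` method are hypothetical helpers written for this answer, not part of Hudi, and the pattern assumes that neither the file ID nor the write token contains an underscore.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HudiFileNameParts {
    // Assumed layout, mirroring makeDataFileName:
    // <fileId>_<writeToken>_<instantTime><fileExtension>
    private static final Pattern NAME =
            Pattern.compile("^([^_]+)_([^_]+)_(\\d+)(\\.\\w+)$");

    /** Returns {fileId, writeToken, instantTime, fileExtension}. */
    public static String[] split(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        String[] parts = split(
                "0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet");
        System.out.println("fileId        = " + parts[0]);
        System.out.println("writeToken    = " + parts[1]);
        System.out.println("instantTime   = " + parts[2]);
        System.out.println("fileExtension = " + parts[3]);
    }
}
```

For the name in the question this prints 0743209d-51cb-4233-a7cd-5bb712fba1ff-0 as the file ID, 21-64-5300 as the write token, 20211117172738 as the instant time, and .parquet as the extension. Note that the trailing -0 belongs to the file ID rather than being a separate field.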
cauthon