I have a general question about keeping track of schemas in a data lake. Across various logs, some fields exist in every log, while other fields differ by log type. My team has a consensus to only add fields, never delete existing ones.
We first land all the logs in AWS S3 in JSON format, then transform them into Parquet, and that is where the schema becomes important. For the fields that exist in every log, we enforce the original data types, for example id or date. The remaining fields, which differ by log type, are converted to a JSON string and saved as a single column.
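To make the transform concrete, here is a minimal sketch of what we do per record (the field names `id`, `date`, and the `payload` column name are just examples, not our real schema):

```python
import json

# Fields that exist in every log, with the types we enforce (illustrative).
COMMON_FIELDS = {"id": int, "date": str}

def split_record(record: dict) -> dict:
    """Split one JSON log record into typed common columns plus a
    single JSON-string column holding every other field."""
    row = {name: cast(record[name]) for name, cast in COMMON_FIELDS.items()}
    extra = {k: v for k, v in record.items() if k not in COMMON_FIELDS}
    # sort_keys keeps the string stable, which helps when comparing payloads later
    row["payload"] = json.dumps(extra, sort_keys=True)
    return row

logs = [
    {"id": 1, "date": "2024-01-01", "user_agent": "curl"},
    {"id": 2, "date": "2024-01-02", "status": 200, "latency_ms": 12},
]
rows = [split_record(r) for r in logs]
```

The resulting rows are then loaded into a DataFrame and written out as Parquet, so the Parquet schema only ever sees the fixed columns plus one string column.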
Given this setup, are there any tools that can discover the exact schema of the data? AWS Glue doesn't seem to offer a way to catalog this kind of data. Alternatively, please feel free to suggest an appropriate way of keeping track of schema evolution. Thanks much in advance!
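For context, this is roughly the kind of scan I am imagining a tool would do over the JSON-string column: union the field names seen across records and note every type observed per field (the payloads below are made up):

```python
import json
from collections import defaultdict

def infer_payload_schema(payload_strings):
    """Union the fields seen across payload JSON strings, recording
    every JSON value type observed for each field name."""
    schema = defaultdict(set)
    for s in payload_strings:
        for key, value in json.loads(s).items():
            schema[key].add(type(value).__name__)
    return dict(schema)

payloads = [
    '{"user_agent": "curl"}',
    '{"latency_ms": 12, "status": 200}',
]
inferred = infer_payload_schema(payloads)
# inferred maps each field name to the set of observed type names
```

Writing this by hand works for small samples, but I would rather use something maintained that can also track how this inferred schema drifts over time.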