I'm using AWS EMR 5.2.1 as my data processing environment. When dealing with a huge JSON file that has a complex schema with many deeply nested fields, Hive can't process it and fails because the column type definition exceeds the current 4000-character length limit:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidObjectException(message:Invalid column type name is too long: [...]
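For illustration only, a minimal sketch of the kind of DDL that triggers this (table name, S3 location and SerDe are made up; the real payload struct has far more nested fields, which is what pushes the serialized STRUCT<...> type name past 4000 characters):

    -- Minimal sketch; the real struct is much larger, so the serialized
    -- type name stored in the Metastore exceeds the 4000-character limit.
    CREATE EXTERNAL TABLE events (
      payload STRUCT<
        account:STRUCT<id:BIGINT, profile:STRUCT<name:STRING, email:STRING>>,
        device:STRUCT<ip:STRING, agent:STRING>
      >
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://my-bucket/events/';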
Looking at the documentation, there are already many issues regarding this problem or similar ones, though all unresolved [1, 2]. In those it is recommended to alter several fields of the Metastore to a different type in order to allow a greater length for struct definitions (see the SQL sketch after the list):
COLUMNS_V2.TYPE_NAME
TABLE_PARAMS.PARAM_VALUE
SERDE_PARAMS.PARAM_VALUE
SD_PARAMS.PARAM_VALUE
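On MySQL that recommendation boils down to statements along these lines (a sketch, assuming the default Metastore schema and a Metastore database named hive; adjust names as needed):

    -- Run against the Metastore database (named 'hive' here).
    USE hive;
    -- Widen the columns that store serialized type names and parameter values.
    ALTER TABLE COLUMNS_V2   MODIFY TYPE_NAME   MEDIUMTEXT;
    ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
    ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
    ALTER TABLE SD_PARAMS    MODIFY PARAM_VALUE MEDIUMTEXT;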
As stated in the first issue, the proposed solution mentions:
"[...] after setting the values, the Metastore must also be configured and restarted."
Yet it is nowhere specified what else must be configured besides the DB values.
Thus, after updating the fields in the current local Metastore (MySQL in this case) from string to mediumtext and restarting the Metastore process, I still can't make any progress: attempting to load the JSON continues to fail with the same error.
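For the record, a query along these lines against the Metastore DB (again assuming the database is named hive) can be used to verify that the columns were actually widened:

    -- Confirm the altered columns now report MEDIUMTEXT.
    SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
    FROM   information_schema.COLUMNS
    WHERE  TABLE_SCHEMA = 'hive'
      AND  TABLE_NAME  IN ('COLUMNS_V2', 'TABLE_PARAMS', 'SERDE_PARAMS', 'SD_PARAMS')
      AND  COLUMN_NAME IN ('TYPE_NAME', 'PARAM_VALUE');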
Am I missing something, or has anyone found an alternative workaround for this problem?