
Using AWS EMR 5.2.1 as the data processing environment, when dealing with a huge JSON file that has a complex schema with many nested fields, Hive cannot process it and fails as it reaches the current limit of 4000 characters for the column type length:

Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidObjectException(message:Invalid column type name is too long: [...]
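For context, a stripped-down sketch in HiveQL of the kind of table definition that runs into this; the table name, field names and S3 location below are made up for the example, and the real schema has many more nested fields:

CREATE EXTERNAL TABLE nested_json_example (
  payload struct<
    id:string,
    attributes:struct<
      name:string,
      tags:array<struct<tag_key:string,tag_value:string>>
    >
  >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://some-bucket/some-prefix/';

With enough nesting, the struct type name generated for the column grows past the limit and the DDL fails with the error above.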

Looking at the documentation and existing issue reports, there are already many issues regarding this or a similar problem, though all unresolved [1, 2]. There it is recommended to alter several fields of the Metastore to a different type in order to allow a greater length for struct definitions (a sketch of the corresponding SQL follows the list):

  • COLUMNS_V2.TYPE_NAME
  • TABLE_PARAMS.PARAM_VALUE
  • SERDE_PARAMS.PARAM_VALUE
  • SD_PARAMS.PARAM_VALUE
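
A minimal sketch of the corresponding changes against a MySQL Metastore, assuming the Metastore database is named hive (the exact column definitions may differ between Hive versions):

ALTER TABLE hive.COLUMNS_V2   MODIFY TYPE_NAME   MEDIUMTEXT;
ALTER TABLE hive.TABLE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
ALTER TABLE hive.SERDE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
ALTER TABLE hive.SD_PARAMS    MODIFY PARAM_VALUE MEDIUMTEXT;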

As stated in the first issue, the proposed solution mentions:

"[...] after setting the values, the Metastore must also be configured and restarted."

Yet it is nowhere specified what else must be configured besides the DB values.

Thus, after updating the mentioned fields of the current local Metastore (MySQL in this case) from string to mediumtext and restarting the Metastore process, there is still no progress: attempting to load the JSON continues to fail with the same error.
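
A quick way to double-check in MySQL that the columns were actually widened (the Metastore database name hive is again an assumption):

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'hive'
  AND COLUMN_NAME IN ('TYPE_NAME', 'PARAM_VALUE');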

Am I missing something or has anyone found an alternative workaround to overcome this problem?

blamblam
  • did you find any solution to this? Having exactly the same issue – Gaurav Shah Feb 14 '17 at 10:32
  • I think it is related to https://issues.apache.org/jira/browse/HIVE-15249 – Gaurav Shah Feb 14 '17 at 10:36
  • @GauravShah, still no luck; I already tried it locally with Hadoop & Hive (_local mysql Metastore_). I'm gonna try with the _embedded_ Derby database instead. – blamblam Feb 14 '17 at 11:01
  • did you look at the issue? It seems very relevant – Gaurav Shah Feb 14 '17 at 11:14
  • it was added in hive 2.1, so very likely the comments from the other issues have not covered it – Gaurav Shah Feb 14 '17 at 11:17
  • Yes! It seems that the issue you mention is indeed related. From what I understand, regardless of what you set in the Metastore columns, there is also a check for column type name set to 2000 characters that leads to the error we encounter. – blamblam Feb 14 '17 at 11:47
  • you can go back to hive 1.x, and update column types. works for me – Gaurav Shah Feb 14 '17 at 18:05
  • Just came across this as well. Using it on aws emr-5.4.0, which has hive-2.1.1. The issue https://issues.apache.org/jira/browse/HIVE-15249 is resolved and the fix is in hive-2.3.0 and 3.0.0. So this could work with emr-5.10.0, which has Hive-2.3.1 (can't confirm it yet, as I haven't tested it). – huch Dec 12 '17 at 13:41
  • Just want to confirm: Yes, emr-5.10.0 can do it! Tested with an avro schema with a bit more than 28000 chars. – huch Dec 12 '17 at 13:57

1 Answer


Looking at the MetaStore source code, as the Hive issue HIVE-15249 points out, there is an additional check that applies besides the parameter constraints set in the MetaStore tables.

In particular, in the HiveMetaStoreUtils.java file, the following declaration seems to be the cause of the execution error:

public static final int MAX_MS_TYPENAME_LENGTH = 2000; // 4000/2, for an unlikely unicode case

Thus, changing the values of the mentioned columns in the MetaStore is not enough, as this code-level constraint will still cause the exception to be thrown.

blamblam