
I am trying to use Spark 2.0.2 to convert a JSON file into Parquet.

  • The JSON file comes from an external source, and therefore the schema can't be changed before it arrives.
  • The file contains a map of attributes. The attribute names aren't known before I receive the file.
  • The attribute names contain characters that can't be used in Parquet. For example:
{
    "id" : 1,
    "name" : "test",
    "attributes" : {
        "name=attribute" : 10,
        "name=attribute with space" : 100,
        "name=something else" : 10
    }
}

Neither the space nor the equals character can be used in Parquet column names, so I get the following error:

org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

When I try to rename the columns instead, I get a stack overflow:

java.lang.StackOverflowError

at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65) 
at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) 
at scala.collection.immutable.List.foreach(List.scala:381) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578) 
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576) 
at scala.collection.immutable.List.foreach(List.scala:381) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576) 
... (the frames above repeat until the stack overflows)
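
For reference, the AnalysisException can be reproduced with something like the following sketch (paths are placeholders):

// Reading the JSON infers "attributes" as a struct whose field names
// contain '=' and spaces; writing that schema out as Parquet then fails.
val df = spark.read.json("/path/to/input.json")
df.write.parquet("/path/to/output.parquet")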

I want to do one of the following:

  • Strip invalid characters from the field names as I load the data into Spark
  • Change the column names in the schema without causing stack overflows
  • Somehow change the schema to load the original data but use the following internally:
{
    "id" : 1,
    "name" : "test",
    "attributes" : [
        {"key":"name=attribute", "value" : 10},
        {"key":"name=attribute with space", "value"  : 100},
        {"key":"name=something else", "value" : 10}
    ]
}
roblovelock

3 Answers


I solved the problem this way:

// Rename every top-level column, replacing each run of characters that
// Parquet rejects with "_"
df.toDF(df
    .schema
    .fieldNames
    .map(name => "[ ,;{}()\\n\\t=]+".r.replaceAllIn(name, "_")): _*)

where I replaced all invalid characters with "_".

Eugene Lopatkin
  • This only works for a schema without nested fields, but the OP's question is about nested fields. – moon Jun 26 '19 at 19:00
  • You could rework any schema you want. You just need to implement a descent of the schema tree and apply the regexp pattern to all names. How to implement the tree descent is out of scope of this question. – Eugene Lopatkin Jul 09 '19 at 08:50
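
A minimal sketch of such a descent (my own illustration, not part of either answer) might look like this; it renames fields recursively through structs, array elements, and map values:

import org.apache.spark.sql.types._

// Recursively sanitize every field name in a schema, descending into
// nested structs, array elements, and map values.
def sanitize(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(
        name = "[ ,;{}()\\n\\t=]+".r.replaceAllIn(f.name, "_"),
        dataType = sanitize(f.dataType))))
  case at: ArrayType => at.copy(elementType = sanitize(at.elementType))
  case mt: MapType   => mt.copy(valueType = sanitize(mt.valueType))
  case other         => other
}

Since only the names change and not the structure, one way to apply it to an already-loaded DataFrame is spark.createDataFrame(df.rdd, sanitize(df.schema).asInstanceOf[StructType]).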

The only solution I have found to work, so far, is to reload the data with a modified schema. The new schema loads the attributes into a map.

Dataset<Row> newData = sql.read().json(path);
StructType newSchema = (StructType) toMapType(newData.schema(), null, "attributes");
newData = sql.read().schema(newSchema).json(path);

private DataType toMapType(DataType dataType, String fullColName, String col) {
    if (dataType instanceof StructType) {
        StructType structType = (StructType) dataType;

        List<StructField> renamed = Arrays.stream(structType.fields()).map(
            f -> toMapType(f, fullColName == null ? f.name() : fullColName + "." + f.name(), col)).collect(Collectors.toList());
        return new StructType(renamed.toArray(new StructField[renamed.size()]));
    }
    return dataType;
}

private StructField toMapType(StructField structField, String fullColName, String col) {
    if (fullColName.equals(col)) {
        return new StructField(col, new MapType(DataTypes.StringType, DataTypes.LongType, true), true, Metadata.empty());
    } else if (col.startsWith(fullColName)) {
        return new StructField(structField.name(), toMapType(structField.dataType(), fullColName, col), structField.nullable(), structField.metadata());
    }
    return structField;

}
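
With attributes loaded as a map of string to long, the problematic names become map keys, i.e. data rather than column names, so the Parquet write succeeds and each value can still be looked up by its original key.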
roblovelock

I have the same problem with the characters @ and :.

In our case, we solved it by flattening the DataFrame.

import scala.util.matching.Regex

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Replace runs of problem characters with "_", then drop the leading "_"
// left behind when a name starts with one of them (e.g. "@type").
val ALIAS_RE: Regex = "[_.:@]+".r
val FIRST_AT_RE: Regex = "^_".r

def getFieldAlias(field_name: String): String = {
  FIRST_AT_RE.replaceAllIn(ALIAS_RE.replaceAllIn(field_name, "_"), "")
}

// Select each (possibly nested) field under its sanitized alias,
// flattening the DataFrame in the process.
def selectFields(df: DataFrame, fields: List[String]): DataFrame = {
  var fields_to_select = List[Column]()
  for (field <- fields) {
    val alias = getFieldAlias(field)
    fields_to_select +:= col(field).alias(alias)
  }

  df.select(fields_to_select: _*)
}

So the following JSON:

{
  "object": "blabla",
  "schema": {
    "@type": "blabla",
    "name@id": "blabla"
  }
}

That will be transformed into [object, schema.@type, schema.name@id]. The @ and the dots (in your case, =) will create problems for Spark SQL.

So after our selectFields you end up with [object, schema_type, schema_name_id]: a flattened DataFrame.
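
For example, a hypothetical call against a DataFrame loaded from the JSON above:

// Nested fields are referenced by their dotted path and come back under
// the sanitized aliases. Note that +:= prepends, so the selected columns
// end up in reverse order: schema_name_id, schema_type, object.
val flat = selectFields(df, List("object", "schema.@type", "schema.name@id"))
flat.write.parquet("/path/to/output.parquet")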

Franzi