I am trying to use spark 2.0.2 to convert a JSON file into parquet.
- The JSON file comes from an external source and therefor the schema can't be changed before it arrives.
- The file contains a map of attributes. The attribute names arn't known before I receive the file.
- The attribute names contain characters that can't be used in parquet.
{
"id" : 1,
"name" : "test",
"attributes" : {
"name=attribute" : 10,
"name=attribute with space" : 100,
"name=something else" : 10
}
}
Both the space and equals character can't be used in parquet, I get the following error:
org.apache.spark.sql.AnalysisException: Attribute name "name=attribute" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
- As these are nested fields I can't rename them using an alias, is this true?
- I have tried renaming the fields within the schema as suggested here: How to rename fields in an DataFrame corresponding to nested JSON. This works for some files, However, I now get the following stackoverflow:
java.lang.StackOverflowError at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:65) at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:258) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1563) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1576) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1579) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1578) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1578) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1576) at scala.collection.immutable.List.foreach(List.scala:381) ... repeat ...
I want to do one of the following:
- Strip invalid characters from the field names as I load the data into spark
- Change the column names in the schema without causing stack overflows
- Somehow change the schema to load the original data but use the following internally:
{
"id" : 1,
"name" : "test",
"attributes" : [
{"key":"name=attribute", "value" : 10},
{"key":"name=attribute with space", "value" : 100},
{"key":"name=something else", "value" : 10}
]
}