21

I'm processing events using DataFrames converted from a stream of JSON events, which is eventually written out in Parquet format.

However, some of the JSON events contain spaces in their keys. I want to log such events and filter/drop them from the DataFrame before converting it to Parquet, because space and ,;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter), as shown in [1] below, and thus are not allowed in column names.

How can I perform such validation on the DataFrame's column names and drop such events altogether, without failing the Spark Streaming job?

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
             |Please use alias to rename it.
           """.stripMargin.split("\n").mkString(" ").trim
  )
}
blackbishop
codehammer

7 Answers

14

For everyone experiencing this in PySpark: this even happened to me after renaming the columns. One way I was able to get it to work, after some iterations, is this:

file = "/opt/myfile.parquet"
df = spark.read.parquet(file)

# Strip spaces from every column name
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))

# Re-read the same file, this time with the sanitized schema
df = spark.read.schema(df.schema).parquet(file)

Jan C. Schäfer
  • Holy shit! It works! I think this should be a bug that renaming doesn't work at all. – Sraw Jul 05 '19 at 03:39
  • this does not work in scala: error: value columns is not a member of Array[org.apache.spark.sql.Row] – Dima Lituiev Oct 22 '19 at 23:14
  • This works for me, but the contents of the column are null. .... – NigelLegg Dec 03 '20 at 17:22
  • this worked on col names, but the content is all nan now!! – Kai Feb 23 '21 at 19:23
  • As others have said, this doesn't work. If you inspect `df.schema` you see it has no reference to the original column names, so when reading it fails to find the columns, and hence all values are null. The only solution I have found so far is to read with pandas, rename the columns, and then write it back (sketched below). Koalas may be an option for those whose data is too large for pandas. – s_pike Jul 22 '21 at 10:13
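
A minimal sketch of that pandas round-trip (the file paths are made up, and it assumes pandas with a Parquet engine such as pyarrow plus an active SparkSession named spark):

import pandas as pd

# pandas tolerates spaces in Parquet column names, so read the file with it first
pdf = pd.read_parquet("/opt/myfile.parquet")

# Sanitize the column names and write the data back out
pdf.columns = [c.replace(" ", "_") for c in pdf.columns]
pdf.to_parquet("/opt/myfile_renamed.parquet", index=False)

# Spark can now read the rewritten file without tripping the special-character check
df = spark.read.parquet("/opt/myfile_renamed.parquet")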
14

You can use a regex to replace all invalid characters with an underscore before you write to Parquet, and additionally strip accents from the column names.

Here's a normalize function that does this, in both Scala and Python:

Scala

/**
  * Normalize column name by replacing invalid characters with underscore
  * and stripping accents
  *
  * @param columns dataframe column names list
  * @return the list of normalized column names
  */
def normalize(columns: Seq[String]): Seq[String] = {
  columns.map { c =>
    org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
  }
}

// using the function
val df2 = df.toDF(normalize(df.columns):_*)

Python

import unicodedata
import re

def normalize(column: str) -> str:
    """
    Normalize column name by replacing invalid characters with underscore,
    stripping accents and making it lowercase
    :param column: column name
    :return: normalized column name
    """
    n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
    return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()


# using the function
df = df.toDF(*map(normalize, df.columns))

blackbishop
3

This is my solution, using a regex to rename all the DataFrame's columns to follow the Parquet convention:

df.columns.foldLeft(df){
  case (currentDf,  oldColumnName) => currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}

I hope it helps.

Shaido
mahmoud mehdi
2

I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backticks.
The second part of the solution was to replace the spaces with underscores.

Sorry, but I only have the PySpark code ready:

from pyspark.sql import functions as F

# Backtick-quote each original name, then alias it to a space-free name
df_tmp.select(*(F.col("`" + c + "`").alias(c.replace(' ', '_')) for c in df_tmp.columns))
Martin
1

Use alias to rename your fields so the names no longer contain those special characters.
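
For example, a minimal PySpark sketch (the column name "event time" is made up):

from pyspark.sql import functions as F

# Backticks let you reference a column whose name contains a space;
# alias gives it a Parquet-safe name before writing.
df2 = df.select(F.col("`event time`").alias("event_time"))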

Kent Yao
0

I encountered this error: "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."

The issue was that I used MAX(COLUM_NAME) when creating a table based on a Parquet/Delta table: because I forgot to use an alias, the new column was named "MAX(COLUM_NAME)", and Parquet doesn't support brackets '()' in column names.

Solved by using an alias (removing the brackets).
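
A sketch of that fix (the table names events and events_summary are made up): alias the aggregate so the resulting column name contains no brackets.

# CTAS with an explicit alias, so the new column is named max_colum_name
# instead of MAX(COLUM_NAME)
spark.sql("""
    CREATE TABLE events_summary AS
    SELECT MAX(COLUM_NAME) AS max_colum_name
    FROM events
""")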

0

This was fixed in the Spark 3.3.0 release, at least for Parquet files (I tested it); it might work with JSON as well.

Andrey