My answer provides a working snippet that illustrates the problems caused by dots in column names and shows how easy they are to remove.
Let's create a DataFrame with some sample data:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])

data = [
    ("charles", Row("chuck", 42)),
    ("lawrence", Row("larry", 73))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-----------+
|person.name|     person|
+-----------+-----------+
|    charles|[chuck, 42]|
|   lawrence|[larry, 73]|
+-----------+-----------+
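Printing the schema makes the ambiguity visible: there is a top-level string column literally named person.name, plus a struct column person with its own nested name field.
df.printSchema()
root
 |-- person.name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)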
Let's demonstrate that selecting person.name returns different results depending on whether backticks are used.
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-----+-----------+-----+-----------+
| name|     person| name|person.name|
+-----+-----------+-----+-----------+
|chuck|[chuck, 42]|chuck|    charles|
|larry|[larry, 73]|larry|   lawrence|
+-----+-----------+-----+-----------+
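The same ambiguity applies when columns are referenced with functions.col: a dotted string without backticks resolves to the nested struct field, while the backticked string resolves to the top-level column. A quick sketch (its result mirrors the table above):
from pyspark.sql import functions as F

# Without backticks: the nested struct field (chuck, larry).
# With backticks: the top-level column (charles, lawrence).
df.select(F.col("person.name"), F.col("`person.name`")).show()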
You definitely don't want to write or maintain code whose results change based on the presence of backticks. It's always better to replace all the dots with underscores at the start of the analysis.
# Rename every top-level column, replacing dots with underscores
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-----+---+
|person_name| name|age|
+-----------+-----+---+
|    charles|chuck| 42|
|   lawrence|larry| 73|
+-----------+-----+---+
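If you clean column names at the start of every analysis, it can be worth wrapping the pattern in a small helper. Here's a minimal sketch; with_clean_column_names is a made-up name, not a built-in:
def with_clean_column_names(df):
    # Only top-level columns are renamed; fields nested inside structs
    # (like name inside the person struct) keep their original names.
    return df.toDF(*(c.replace('.', '_') for c in df.columns))

clean_df = with_clean_column_names(df)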
This post explains how and why to avoid dots in PySpark column names in more detail.