
I have a DataFrame df which has this schema:

root
 |-- person.name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

When I do df.select("person.name") I obviously fetch the sub-field name from person. How could I select the column person.name?

zero323
belka

3 Answers


For a column name that contains a dot (.), you can enclose the name in backtick (`) characters:

df.select("`person.name`") 

This selects the top-level string column person.name: string (nullable = true)

And df.select("person.name") (without backticks) fetches the name sub-field from the person struct; the resulting schema is:

root
 |-- name: string (nullable = true)

If the column name is held in a variable, just prepend and append the ` character:

"`" + columnName + "`"

I hope this was helpful!

koiralo
  • How to do it when my column name is actually a string variable? – belka Feb 28 '18 at 15:10
  • Would you take a look at this? Thanks! https://stackoverflow.com/questions/48986102/dataframe-user-defined-function-not-applied-unless-i-change-column-name?noredirect=1#comment85050193_48986102 – belka Feb 28 '18 at 15:10
  • 1
    I am not sure the right wat but It works as "```" + columnName + "```" jus like this – koiralo Feb 28 '18 at 15:23
  • Yes, it works, thanks! Try this: df.select("`" + col_name + "`").show() – belka Feb 28 '18 at 15:23

My answer provides a working code snippet that illustrates the problem with dots in column names and shows how to remove the dots easily.

Let's create a DataFrame with some sample data:

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name|       person|
+-----------+-------------+
|    charles|  [chuck, 42]|
|      larry|[chipper, 48]|
+-----------+-------------+

Let's illustrate that selecting person.name will return different results depending on if backticks are used or not.

cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-------+-------------+-------+-----------+
|   name|       person|   name|person.name|
+-------+-------------+-------+-----------+
|  chuck|  [chuck, 42]|  chuck|    charles|
|chipper|[chipper, 48]|chipper|      larry|
+-------+-------------+-------+-----------+

You definitely don't want to write or maintain code whose results change based on the presence of backticks, so it's best to replace all the dots with underscores at the start of the analysis.

clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+

This post explains how and why to avoid dots in PySpark columns names in more detail.
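The rename step above only manipulates strings, so the mapping can be sketched without a running Spark session (the toDF call itself, of course, needs one):

```python
# Pure-Python view of the dot-to-underscore rename passed to toDF above.
columns = ["person.name", "person"]
renamed = [c.replace(".", "_") for c in columns]
print(renamed)  # ['person_name', 'person']
```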

Powers

You can also access the column through Spark SQL:

spark.sql("select person.name from person_table")

Note: person_table is a temporary view registered from df (registerTempTable in older Spark versions; createOrReplaceTempView since Spark 2.0). As in the DataFrame API, the unquoted person.name resolves to the name field of the person struct; wrap the name in backticks to select the top-level dotted column instead.
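Backtick quoting works in Spark SQL identifiers as well; a minimal sketch of the two query strings (executing them requires a SparkSession and the registered view):

```python
# Unquoted: resolves to the name field of the person struct.
struct_query = "select person.name from person_table"

# Backticked: selects the top-level column whose name contains a dot.
column_query = "select `person.name` from person_table"

print(column_query)  # select `person.name` from person_table
```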

itsols
Srini GL