PySpark DataFrame: when to use / not to use select

Question

According to the PySpark documentation:

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SQLContext

This suggests I can use select to show the values of a column. However, I sometimes see these two equivalent forms used instead:

# df is a sample DataFrame with column a
df.a
# or
df['a']

Sometimes select raises an error where these forms work, and sometimes it is the other way around and I have to use select.

For example, this is a DataFrame from a problem about finding a dog in a given picture:

joined_df.printSchema()
root
 |-- folder: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- width: string (nullable = true)
 |-- height: string (nullable = true)
 |-- dog_list: array (nullable = true)
 |    |-- element: string (containsNull = true)

If I want to select the dog details and show 10 rows, this code raises an error:

print(joined_df.dog_list.show(truncate=False))

Traceback (most recent call last):
 File "<stdin>", line 2, in <module>
    print(joined_df.dog_list.show(truncate=False))
TypeError: 'Column' object is not callable

And this one does not:

print(joined_df.select('dog_list').show(truncate=False))

Question 1: When do I have to use select, and when do I have to use df.a or df["a"]?

Question 2: What does the error above mean? 'Column' object is not callable

score 2 · Accepted Answer · answered Feb 16 '20 at 15:39

df.col_name returns a Column object, but df.select("col_name") returns another DataFrame.

See the PySpark documentation for details.

The key point is that these two forms return two different kinds of object, which is why print(joined_df.dog_list.show(truncate=False)) gives you the error: the Column object does not have a .show method, but the DataFrame does.

So when you call a function that takes a Column as input (for example filter or withColumn), use df.col_name or df["col_name"]. When you want to operate at the DataFrame level, use df.select("col_name").
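As for why the message says "not callable" rather than "has no attribute show": as far as I can tell, attribute access on a Column does not fail. Spark treats an unknown attribute as a nested field and hands back another Column, and it is the attempt to call that Column that raises the TypeError. Below is a minimal pure-Python sketch of that behaviour. FakeColumn is a hypothetical stand-in written for illustration, not the real pyspark.sql.Column, and the assumption that Column resolves unknown attributes this way is mine.

```python
# Hypothetical stand-in for pyspark.sql.Column; this is NOT the real class.
class FakeColumn:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Python calls this only when normal attribute lookup fails.
        # Assumption: like a PySpark Column, an unknown attribute is treated
        # as access to a nested field, so another column object is returned.
        if item.startswith("__"):
            raise AttributeError(item)
        return FakeColumn(f"{self.name}.{item}")


dog_list = FakeColumn("dog_list")

# Attribute access succeeds, but the result is another column, not a method.
show_attr = dog_list.show
print(type(show_attr).__name__, show_attr.name)

# Calling that object fails because the class defines no __call__.
try:
    show_attr(truncate=False)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

In the real case, joined_df.select('dog_list').show(truncate=False) works because select returns a DataFrame, and DataFrame does define show.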

E.ZY.

Thank you. Then why does it show the error 'Column' object is not callable? This error is very confusing, and we see it in many different situations. – G. Hak. Feb 16 '20 at 15:57
Mostly it is because the Column object does not have such a method, or because the Column object cannot be used in a function call. I agree it is not very intuitive. [Here](https://stackoverflow.com/questions/21324940/python-what-typeerror-xxx-object-is-not-callable-means) is a general explanation for Python objects. – E.ZY. Feb 17 '20 at 15:33
