
I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the DataFrame it came from but a Column object, and as such it cannot be collected.

Here is an example:

from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(array=[1, 2, 3])])
df['array'].collect()

This produces the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable

How can I use the collect() function on a single column?


1 Answer


Spark >= 2.0

Starting from Spark 2.0.0, you need to explicitly call `.rdd` in order to use `flatMap`:

df.select("array").rdd.flatMap(lambda x: x).collect()

Spark < 2.0

Just select and flatMap:

df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]] 
  • so using select instead of subsetting essentially turns this into a one-column dataframe instead of a Column – Michal Feb 19 '16 at 01:16
  • That's right. `Column` is just a SQL DSL expression, not a standalone data structure. – zero323 Feb 19 '16 at 13:41
  • What is the equivalent in spark 2.0? I can't see flatMap as a method on DataFrame – ThatDataGuy Nov 21 '16 at 16:40
  • @ThatDataGuy you need to explicitly pass `.rdd` now. It used to be wrapped in automatically. e.g. `df.select("array").rdd.flatMap(lambda x: x).collect()` – David Arenburg Jan 23 '17 at 07:24
  • Converting a DataFrame to an RDD creates overhead. Try avoiding it with something like `data = list(map(lambda x: x[0], df.select("array").collect()))`, then flatten the list using normal Python code – vinayb21 Oct 21 '20 at 14:09
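
The RDD-free approach from the last comment works because `collect()` on a single-column DataFrame returns a list of `Row` objects, each of which behaves like a 1-tuple. A minimal sketch of the flattening step, simulating the collected rows with plain tuples so it runs without a Spark session (the `rows` value is a stand-in for what `df.select("array").collect()` would return):

```python
# Stand-in for df.select("array").collect() on a single array column:
# a list of Row-like objects, each indexable like a 1-tuple.
rows = [([1, 2, 3],)]

# Flatten in plain Python instead of converting the DataFrame to an RDD.
flattened = [row[0] for row in rows]
print(flattened)  # [[1, 2, 3]]
```

With real Spark the same comprehension applies directly to the list returned by `collect()`.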