
I need to retrieve the first element of each DataFrame partition. I know that I need to use mapPartitions, but it is not clear to me how to use it.

Note: I am using Spark 2.0, and the dataframe is sorted.

syl

1 Answer


I believe it should look something like following:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
// An implicit Encoder[Row] is required: mapPartitions on a DataFrame
// (Dataset[Row]) needs to know how to serialize the resulting rows.
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))

This takes one element from each partition of the DataFrame. You can then collect all the data on the driver, i.e.:

newDf.collect()

This will return an array with one element per (non-empty) partition.
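The per-partition semantics can be sketched with plain Scala collections, no Spark required; each inner `Seq` stands in for one partition (an assumption for illustration only):

```scala
// Simulate a DataFrame with three partitions; each inner Seq is one "partition".
val partitions = Seq(Seq(10, 20, 30), Seq(40, 50), Seq(60))

// mapPartitions(iterator => iterator.take(1)) keeps at most the first
// element of every partition; flattening mimics the final collect().
val firstPerPartition = partitions.flatMap(p => p.iterator.take(1))

println(firstPerPartition)  // List(10, 40, 60)
```

Note that an empty partition contributes nothing, which is why the collected array can be smaller than the partition count.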

UPD: updated to support Spark 2.0.

Zyoma
  • I'm looking at the method signature here http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html and wonder: don't you need an Encoder as the second parameter in this method call? – MaxNevermind Sep 28 '16 at 10:10
  • Trying this solution returns "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._". Even using spark.implicits, the error persists. – syl Sep 28 '16 at 10:14
  • Replacing mapPartitions with foreachPartition works, but it returns an empty list (). – syl Sep 28 '16 at 10:20
  • Do you mean `collect`? It returns an Array of Rows. The `mapPartitions` operation without `collect` will return a new DataFrame. – Zyoma Sep 28 '16 at 11:19
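The distinction in the last comment can be sketched with plain-Scala stand-ins (not the Spark API itself): the mapPartitions step yields another partitioned structure, and only the collect-style step materializes data on the driver:

```scala
// Stand-in for a DataFrame with two partitions (illustration only).
val partitions = Seq(Seq("a", "b"), Seq("c"))

// mapPartitions analogue: still partition-structured, nothing on the driver yet.
val transformed = partitions.map(p => p.take(1))

// collect analogue: flatten into a single driver-side array of "rows".
val collected = transformed.flatten.toArray

println(collected.mkString(","))  // a,c
```

In real Spark, `transformed` would be a lazy DataFrame and `collected` the `Array[Row]` returned by `collect()`.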