
I need to retrieve the first element of each DataFrame partition. I know that I need to use mapPartitions, but it is not clear to me how to use it.

Note: I am using Spark 2.0, and the dataframe is sorted.

syl

1 Answer


I believe it should look something like following:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
// An implicit Encoder[Row] is required: mapPartitions on a DataFrame
// (Dataset[Row]) needs to know how to serialize the resulting rows.
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))

This takes one element from each partition of the DataFrame. You can then collect all the data on the driver, i.e.:

newDf.collect()

This will return an array with one element per (non-empty) partition.
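The per-partition semantics can be sketched with plain Scala collections, no Spark required; each inner `Seq` stands in for one partition (an assumption for illustration only):

```scala
// Simulate a DataFrame with three partitions; each inner Seq is one "partition".
val partitions = Seq(Seq(10, 20, 30), Seq(40, 50), Seq(60))

// mapPartitions(iterator => iterator.take(1)) keeps at most the first
// element of every partition; flattening mimics the final collect().
val firstPerPartition = partitions.flatMap(p => p.iterator.take(1))

println(firstPerPartition)  // List(10, 40, 60)
```

Note that an empty partition contributes nothing, which is why the collected array can be smaller than the partition count.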

UPD: updated to support Spark 2.0.

Zyoma
  • I'm looking at the method signature here http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html and wonder: don't you need an Encoder as the second parameter in this method call? – MaxNevermind Sep 28 '16 at 10:10
  • Trying this solution returns "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._". Even using spark.implicits, the error persists. – syl Sep 28 '16 at 10:14
  • Replacing mapPartitions with foreachPartition works, but it returns an empty list (). – syl Sep 28 '16 at 10:20
  • Do you mean `collect`? It returns an Array of Rows. The `mapPartitions` operation without `collect` will return a new DataFrame. – Zyoma Sep 28 '16 at 11:19
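The distinction in the last comment can be sketched with plain-Scala stand-ins (not the Spark API itself): the mapPartitions step yields another partitioned structure, and only the collect-style step materializes data on the driver:

```scala
// Stand-in for a DataFrame with two partitions (illustration only).
val partitions = Seq(Seq("a", "b"), Seq("c"))

// mapPartitions analogue: still partition-structured, nothing on the driver yet.
val transformed = partitions.map(p => p.take(1))

// collect analogue: flatten into a single driver-side array of "rows".
val collected = transformed.flatten.toArray

println(collected.mkString(","))  // a,c
```

In real Spark, `transformed` would be a lazy DataFrame and `collected` the `Array[Row]` returned by `collect()`.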