
I have a dataframe called 'df1' with X rows, say 1000. What I want to do is extract a specific slice of that dataframe and save it as another one. For example, I want to extract rows 400 to 700 from 'df1' and save them as 'df2'.

I know that one possible way is to get the content of 'df1' as a list with:

rows = df1.collect()            # pulls every row of 'df1' to the driver
subsample = rows[400:700]
df2 = sc.createDataFrame(subsample, attributes)

But my question is: is there any other way of getting the same result without loading the data into a list? I ask because with a huge dataset it may not be efficient to collect everything to the driver and build another dataframe from it.

Thanks.

jartymcfly

1 Answer


PySpark dataframes don't have indexes. You can create one, but note that any shuffle operation (group, join, ...) that took place before the index was created might have changed the order of your rows.

import pyspark.sql.functions as psf

start = 400
end = 700

# pair each row with its position, then promote that position to an "index" column
df2 = df1.rdd.zipWithIndex()\
    .map(lambda l: [l[1]] + list(l[0]))\
    .toDF(["index"] + df1.columns)\
    .filter(psf.col("index").between(start, end))
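
Note that between is inclusive on both ends, so this keeps rows 400 through 700; use end - 1 if you want to match Python's exclusive slice [400:700].

If you'd rather stay in the DataFrame API, a similar index can be built with row_number over a window. This is only a minimal sketch: it assumes your dataframe has a column "id" that defines the row order (replace it with whatever column does), and a window with no partitioning pulls all rows into a single partition, which can be slow on large data:

import pyspark.sql.functions as psf
from pyspark.sql import Window

# "id" is an assumed ordering column; without partitionBy, everything
# lands in one partition, so this is only suitable for moderate sizes
w = Window.orderBy("id")
df2 = df1.withColumn("index", psf.row_number().over(w) - 1)\
    .filter(psf.col("index").between(start, end - 1))\
    .drop("index")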

Another way is to collect only the first rows of your dataframe into a list:

df2 = spark.createDataFrame(df1.head(end)[start:], df1.columns)
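
head(end) still brings the first end rows to the driver, so this only stays cheap while end is small compared to the full dataframe; it just avoids collecting everything.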

or using Pandas:

df2 = spark.createDataFrame(df1.limit(end).toPandas().iloc[start:, :])
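
The same caveat applies here: limit(end).toPandas() also materializes the first end rows on the driver, and additionally requires pandas to be installed.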
MaFF