
I have a dataframe called 'df1' with X rows, say 1000. What I want to do is extract a specific slice of that dataframe and save it as another one. For example, I want to extract rows 400 to 700 from 'df1' and save them as 'df2'.

I know that one possible way is to get the content of 'df1' as a list with:

rows = df1.collect()            # pulls every row of 'df1' to the driver
subsample = rows[400:700]
df2 = sc.createDataFrame(subsample, attributes)

But my question is: is there any other way of getting the same result without loading the data into a list? I ask because with a huge dataset it may not be efficient to collect everything to the driver and build another dataframe from it.

Thanks.

jartymcfly

1 Answer


PySpark dataframes don't have indexes. You can create one, but note that any shuffle operation (group, join, ...) that took place before the index was created might have changed the order of your rows.

import pyspark.sql.functions as psf

start = 400
end = 700

# pair each row with its position, then promote that position to an "index" column
df2 = df1.rdd.zipWithIndex()\
    .map(lambda l: [l[1]] + list(l[0]))\
    .toDF(["index"] + df1.columns)\
    .filter(psf.col("index").between(start, end))
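
Note that between is inclusive on both ends, so this keeps rows 400 through 700; use end - 1 if you want to match Python's exclusive slice [400:700].

If you'd rather stay in the DataFrame API, a similar index can be built with row_number over a window. This is only a minimal sketch: it assumes your dataframe has a column "id" that defines the row order (replace it with whatever column does), and a window with no partitioning pulls all rows into a single partition, which can be slow on large data:

import pyspark.sql.functions as psf
from pyspark.sql import Window

# "id" is an assumed ordering column; without partitionBy, everything
# lands in one partition, so this is only suitable for moderate sizes
w = Window.orderBy("id")
df2 = df1.withColumn("index", psf.row_number().over(w) - 1)\
    .filter(psf.col("index").between(start, end - 1))\
    .drop("index")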

Another way is to collect only the first rows of your dataframe into a list:

df2 = spark.createDataFrame(df1.head(end)[start:], df1.columns)
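
head(end) still brings the first end rows to the driver, so this only stays cheap while end is small compared to the full dataframe; it just avoids collecting everything.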

or using Pandas:

df2 = spark.createDataFrame(df1.limit(end).toPandas().iloc[start:, :])
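
The same caveat applies here: limit(end).toPandas() also materializes the first end rows on the driver, and additionally requires pandas to be installed.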
MaFF