
I have a big DataFrame (20 million rows, 35 columns) in Koalas on a Databricks notebook. I have performed some transform and join (merge) operations on it in Python, such as:

mdf.path_info = mdf.path_info.transform(modify_path_info)
x = mdf[['providerid','domain_name']].groupby(['providerid']).apply(domain_features)
mdf = ks.merge(mdf, x[['domain_namex','domain_name_grouped']], left_index=True, right_index=True)
x = mdf.groupby(['providerid','uid']).apply(userspecificdetails)
mmdf = mdf.merge(x[['providerid','uid',"date_last_purch","lifetime_value","age"]], how="left", on=['providerid','uid'])

After these operations, I want to display some rows of the DataFrame to verify the result. I am trying to print/display as few as 1-5 rows of this big DataFrame, but because of Spark's lazy evaluation, each of the print commands starts 6-12 Spark jobs and runs forever, after which the cluster goes into an unusable state and nothing happens.

mdf.head() 

display(mdf)

mdf.take([1])

mdf.iloc[0]

I also tried converting it into a Spark DataFrame and then trying:

df = mdf.to_spark()

df.show(1)

df.rdd.takeSample(False, 1, seed=0)

df.first()

The cluster configuration I am using is 8worker_4core_8gb, meaning each worker and the driver node has 8.0 GB memory, 4 cores, and 0.5 DBU, on Databricks Runtime Version 7.0 (includes Apache Spark 3.0.0, Scala 2.12).

Can someone please suggest a fast way to get/print one row of the big DataFrame, one that does not wait to process all 20 million rows?

Pallavi Jain

2 Answers


As you write, because of lazy evaluation Spark will perform all of your transformations first and only then show the one line. What you can do is reduce the size of your input data and run the transformations on a much smaller dataset, e.g. with sample:

https://spark.apache.org/docs/3.0.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample

df.sample(False, 0.1, seed=0)
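
The same idea also works on the Koalas side before converting to Spark: sample mdf down once, then run the heavy groupby/apply and merge steps on the sample only. A minimal sketch, assuming the Koalas DataFrame.sample(frac=..., random_state=...) API; the 0.001 fraction is just an illustrative value:

small_mdf = mdf.sample(frac=0.001, random_state=0)  # roughly 20k of the 20M rows

# ... run the same transform/groupby/merge pipeline on small_mdf ...

small_mdf.head(5)  # cheap to display, since only the sample is processed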
Adam Dukkon

You could cache the computation result after you convert to a Spark DataFrame, and then call the action.

df = mdf.to_spark()

# cache() marks the DataFrame for caching; the first action after this
# computes the DAG once and stores the result, and later actions reuse
# the cached result instead of re-computing it
df.cache()

df.show(1)

You may want to free up the memory used for caching with:

df.unpersist()
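
Note that cache() itself is lazy: the first action still has to compute the full result once, and the benefit shows up when you look at the data more than once. A minimal sketch of that usage; the follow-up queries are only illustrative, and the providerid column is taken from the question's schema:

df = mdf.to_spark()
df.cache()

df.show(1)                        # first action: runs the DAG and fills the cache
df.count()                        # later actions reuse the cached result
df.select('providerid').show(5)

df.unpersist()                    # release the cached memory when done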
sujanay