
Is there a way in a Foundry Code Repository to print the shape of a DataFrame, like pandas' df.shape attribute? I am interested in getting the correct number of rows in the dataset.

I am using this function to print the shape, but it only reports the top 10,000 rows because of the Foundry preview limit. How can I find out the total number of rows from within the preview?

def spark_shape(self):
    # (row count, column count), like pandas' df.shape
    return (self.count(), len(self.columns))

pyspark.sql.dataframe.DataFrame.shape = spark_shape
domdomegg

1 Answer


Foundry preview works by sampling each input dataset (generally the first 10,000 rows, for datasets larger than that) and then running the transform on these preview inputs.

Therefore, when running in preview mode, the 'shape' of the dataset is correctly reported as 10,000 rows: that is what the transform is actually running on.

If you want the transform to run on all the data, you'll need to run it as a full build.

If there's a reason you want to run the transform on the preview sample (10,000 rows) but still know the size of the full dataset, you could add an upstream full build that calculates the shape of the dataset, then read that into preview. For example, a dataset that looks like:

rows     columns
123456   10

You could build this dataset with a transform like:

from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/your/dataset_shape"),
    source_df=Input("/path/to/your/dataset"),
)
def compute(source_df):
    # one-row output: total row count plus the (static) column count
    return (
        source_df
        .agg(F.count("*").alias("rows"))
        .withColumn("columns", F.lit(len(source_df.columns)))
    )
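Reading that shape dataset back downstream is then just a second Input. A sketch, with placeholder paths and a hypothetical output column (how you use the count is up to you):

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/path/to/your/downstream_output"),
    source_df=Input("/path/to/your/dataset"),
    shape_df=Input("/path/to/your/dataset_shape"),
)
def compute(source_df, shape_df):
    # shape_df has a single row, so first() is cheap even in preview
    total_rows = shape_df.first()["rows"]
    # e.g. tag each (possibly sampled) row with the full dataset's row count
    return source_df.withColumn("total_rows_in_full_dataset", F.lit(total_rows))
```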

Alternatively, you could use the Foundry Stats API, although this is likely to be significantly more complicated.

domdomegg