Foundry preview works by sampling each input dataset (generally the first 10,000 rows, for datasets larger than that) and then running the transform on those sampled inputs.
Therefore, when running in preview mode the 'shape' of the dataset is correctly reported as 10,000 rows, because that is what the transform actually ran on.
If you want the transform to run on all the data, you'll need to run it as a full build.
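As a quick illustration of the sampling behaviour (paths here are hypothetical), a transform that simply counts its input will report at most 10,000 rows in preview for a larger dataset, but the true count in a full build:

from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/path/to/row_count_check"),  # hypothetical output path
    source_df=Input("/path/to/your/dataset"),
)
def compute(source_df):
    # In preview this counts the sampled input (at most 10,000 rows);
    # in a full build it counts the entire dataset.
    return source_df.groupBy().count()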
If there's a reason you want to run the transform on the preview samples (10,000 rows) but still know the size of the full dataset, you could add an upstream full build that calculates the shape of the dataset, then read that into preview: a single-row dataset with a 'rows' column and a 'columns' column. You could build this dataset with a transform like:
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/path/to/your/dataset_shape"),
    source_df=Input("/path/to/your/dataset"),
)
def compute(source_df):
    # Produces a single row: the full row count plus the column count.
    return (
        source_df
        .agg(F.count("*").alias("rows"))
        .withColumn("columns", F.lit(len(source_df.columns)))
    )
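A downstream transform can then take this shape dataset as a second input. Since it is only one row, preview sampling never truncates it, so the true count is available even while the main input is sampled. A minimal sketch, with hypothetical paths and a placeholder use of the count:

from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/path/to/your/output"),  # hypothetical output path
    source_df=Input("/path/to/your/dataset"),
    shape_df=Input("/path/to/your/dataset_shape"),
)
def compute(source_df, shape_df):
    # The shape dataset is a single row, so preview sampling does not
    # truncate it: this is the full-build row count.
    total_rows = shape_df.first()["rows"]
    # Example use: tag each (possibly sampled) row with the true total.
    return source_df.withColumn("total_rows", F.lit(total_rows))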
Alternatively, you could use the Foundry Stats API, although that is likely to be significantly more complicated.