
If anyone has experimented with the pandas-profiling package, help me with any insights you might have on making it run faster. The output report from the package is very neat and detailed, but creating the report takes far too long even with a moderately sized dataset. About 10 columns and 400K rows from the Kaggle bulldozers dataset took 21 min (non-GPU). Wondering if it's worth investigating further.

import datetime
from pathlib import Path

df.shape
# (401125, 9)

start = datetime.datetime.now()
profile = df.profile_report(title="Exploring Dataset")
profile.to_file(output_file=Path("./data_report.html"))
end = datetime.datetime.now()

print(end - start)
# 0:21:23.976324
1 Answer

Because pandas-profiling is modular, you can disable the functionality that consumes most of the time, depending on what you are interested in. Together with sampling your dataset, this is currently the go-to way of speeding it up.

There are several related issues here:

In the long run, we plan to allow for better parallelization and more sensible defaults: https://github.com/pandas-profiling/pandas-profiling/issues/279

Edit:

Since v2.4 there is a minimal mode, which configures the package to use lower-cost computational settings by default: https://github.com/pandas-profiling/pandas-profiling#large-datasets
