I have a data pipeline operator that collects metrics on the data.
The data product I am collecting metrics for is called `foo`.
I have the following:
`foo.select(foo.id).count()` => 2M+
`foo.filter(foo.id.startswith("foobar")).count()` => 1M
I also run a bunch of other operations (count and collect).
The `count()` calls take a very long time :( (around 30 minutes).
How do folks usually solve problems of this nature?
Also, I do not care about the exact count. I only need an approximation (±50,000).
I have also tried `countApprox`, but there is no change in the amount of time taken.
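To put the tolerance in perspective, here is a rough back-of-the-envelope sketch in plain Python (no Spark involved). It assumes simple random sampling and a normal approximation, which may not match how the data is actually laid out. The helper `required_sample_size` is a hypothetical name I made up for illustration, not a Spark API.

```python
import math


def required_sample_size(total_rows, tolerance, p=0.5, z=1.96):
    """Rows to sample so that total_rows * p_hat lands within +/- tolerance.

    Assumptions (not from any library): simple random sampling, normal
    approximation, z=1.96 for roughly 95% confidence, and p=0.5 as the
    worst-case share of matching rows. No finite-population correction.
    """
    relative_margin = tolerance / total_rows
    return math.ceil((z ** 2) * p * (1 - p) / relative_margin ** 2)


total_rows = 2_000_000  # rough size of foo, from the counts above
tolerance = 50_000      # acceptable error on the count

print(tolerance / total_rows)                        # relative tolerance
print(required_sample_size(total_rows, tolerance))   # rows needed in the sample
```

Under those assumptions, ±50,000 on about 2M rows is a ±2.5% margin, and a sample on the order of 1,500 rows would be enough in theory. So the accuracy requirement itself is loose, and the time seems to be going somewhere other than the counting arithmetic.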
Config
Number of cores = 150
driver-memory = 15g
executor-memory = 15g