How to make Spark count() faster for huge DataFrames?

Asked Jul 30 '18 at 23:39

Active Jul 31 '18 at 01:13

I have a data pipeline operator that collects metrics about the data. The data product for which I am collecting metrics is called foo.

I have the following:

`foo.select(foo.id).count()` => 2M+
`foo.filter(foo.id.startswith("foobar")).count()` => 1M

I also run a bunch of other operations (count and collect). The count() calls take a very long time :( (around 30 minutes).

How do folks usually solve problems of this nature? Also, I do not care about the exact count; an approximate count (±50,000) is enough.

I have also tried countApprox, but it takes just as long.

Config

Number of cores = 150
driver-memory = 15g
executor-memory = 15g

edited Jul 31 '18 at 01:13

suprita shankar

@user6910411 - this is not a duplicate of that question. Am I missing something? – suprita shankar Jul 31 '18 at 00:42
If you want to count it multiple times without waiting each time, you can call `foo.cache()`. Only do this when you are testing. Do NOT do this on real data, because all the data is loaded into the driver. – Sailesh Kotha Jul 31 '18 at 01:45
It's essential that you provide the following info: (1) your Foo class's definition, and (2) where you read or construct the dataframe `foo` from. – tauitdnmd Jul 31 '18 at 05:01
@supritashankar You want to improve performance, right? Then, instead of using `.filter > .count` apply a single aggregation processing all conditions (expressed as `CASE ... WHEN` expressions) at once. – zero323 Jul 31 '18 at 15:13
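A minimal sketch of the single-aggregation idea from the comment above, assuming `foo` has a string column `id`, that `spark` is an active SparkSession, and that registering a temporary view named `foo` is acceptable. The names `SINGLE_PASS_SQL` and `count_in_one_pass` are made up for illustration. Both counts come out of one scan instead of one scan per `.filter(...).count()`:

```python
# Hypothetical names for illustration: SINGLE_PASS_SQL, count_in_one_pass,
# and the temp view name "foo". Assumes a string column called "id".
SINGLE_PASS_SQL = """
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN id LIKE 'foobar%' THEN 1 ELSE 0 END) AS foobar_total
FROM foo
"""


def count_in_one_pass(spark, df):
    """Return (total, foobar_total) computed in a single aggregation."""
    # Expose the DataFrame to Spark SQL under the name used in the query.
    df.createOrReplaceTempView("foo")
    # One job, one scan: every condition is a CASE WHEN inside the same SELECT.
    row = spark.sql(SINGLE_PASS_SQL).first()
    return row["total"], row["foobar_total"]
```

Usage would look like `total, foobar_total = count_in_one_pass(spark, foo)`. Further conditions can be added as extra `SUM(CASE WHEN ... END)` columns in the same query, so the number of scans stays at one. Whether this helps much depends on where the 30 minutes actually go, for example in reading the source versus in the repeated jobs.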

0 Answers