
I have a data pipeline operator that collects metrics on the data. The data product I am collecting metrics for is called `foo`.

I have the following

`foo.select(foo.id).count()` => 2M+
`foo.filter(foo.id.startswith("foobar")).count()` => 1M

I also run a bunch of other operations (`count` and `collect`). The `count()` calls take a very long time :( (around 30 minutes).

How do folks usually solve problems of this nature? Also, I do not need the exact count; an approximation (±50,000) is fine.

I have also tried `countApprox`, but it made no difference to the time taken.

Config

Number of cores = 150
driver-memory = 15g
executor-memory = 15g
suprita shankar
  • @user6910411 - this is not a duplicate of it. am I missing something? – suprita shankar Jul 31 '18 at 00:42
  • If you want to count it multiple times and you don't want to wait. You can do `foo.cache()`. This is only when you are testing. Do NOT do this on real data. Because, all the data is loaded into driver. – Sailesh Kotha Jul 31 '18 at 01:45
  • It's essential that you should provide these following info : + your Foo class's definition + where did you read/construct the dataframe: 'foo' from? – tauitdnmd Jul 31 '18 at 05:01
  • @supritashankar You want to improve performance, right? Then, instead of using `.filter > .count` apply a single aggregation processing all conditions (expressed as `CASE ... WHEN` expressions) at once. – zero323 Jul 31 '18 at 15:13
