0

In R, I have a large list of large dataframes consisting of two columns, value and count. The function which I am using in the previous step returns the value of the observation in value, the corresponding column count shows how many times this specific value has been observed. The following code produces one dataframe as an example - however all dataframes in the list do have different values resp. value ranges:

d <- as.data.frame(
  cbind(
    value = runif(n = 1856, min = 921, max = 4187),
    count = runif(n = 1856, min = 0, max = 20000)
  )
)

Now I would like to aggregate the data to be able to create viewable visualizations. This aggregation should be applied to all dataframes in a list, which do each have different value ranges. I am looking for a function, cutting the data into new values and counts, a little bit like a histogram function. So for example, for all data from a value of 0 to 100, the counts should be summated (and so on, in a defined interval, with a clean interval border starting point like 0).

My first try was to create a simple value vector, where each value is repeated in a number of times that is determined by the count field. Then, the next step would have been applying the hist() function without plotting to obtain the aggregated values and counts which can be defined in the hist()'s arguments. However, this produces too large vectors (some Gb for each) that R cannot handle anymore. I appreciate any solutions or hints!

Arne
  • 167
  • 7

1 Answers1

0

I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data-frames and then generate a new column containing the result of applying the binfunction to each dataframe by using mapfrom the purrr package.

library(tidyverse)

d1 <- d2 <- tibble(
  value = runif(n = 1856, min = 921, max = 4187),
  count = runif(n = 1856, min = 0, max = 20000)
)

d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))

binfunction <- function(data) {
  data %>% mutate(bin = value - (value %% 100)) %>% 
    group_by(bin) %>% 
    mutate(sum = sum(count)) %>% 
    select(bin, sum)
}

d_binned <- d %>% 
  mutate(binned = map(data, binfunction)) %>% 
  select(-data) %>% 
  unnest() %>% 
  group_by(name, bin) %>% 
  slice(1L)

d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#> 
#> # A tibble: 66 x 3
#>     name   bin      sum
#>    <chr> <dbl>    <dbl>
#>  1    d1   900 495123.8
#>  2    d1  1000 683108.6
#>  3    d1  1100 546524.4
#>  4    d1  1200 447077.5
#>  5    d1  1300 604759.2
#>  6    d1  1400 506225.4
#>  7    d1  1500 499666.5
#>  8    d1  1600 541305.9
#>  9    d1  1700 514080.9
#> 10    d1  1800 586892.9
#> # ... with 56 more rows

d_binned %>% 
  ggplot(aes(x = bin, y = sum, fill = name)) +
  geom_col() + 
  facet_wrap(~name)

See this comment for my inspiration for the binning. It bins the data in groups of 100, so e.g. bin 1100 represents 1100 to <1200 etc. I imagine you can adapt the binfunction to your needs.

emiltb
  • 391
  • 3
  • 13