1

I have a large data frame called data_frame with 3 columns PRE, STATUS, and CHR that look like this:

PRE         STATUS   CHR
1_752566    GAINED   1
1_776546    LOST     1
1_832918    NA       1
1_842013    LOST     1
1_846864    GAINED   1
11_8122943  NA       11
11_8188699  GAINED   11
11_8321128  NA       11
23_95137734 NA       23
23_95146814 GAINED   23

From here I'd like to group CHR by number and then find the sum of each group. If possible, I would like a new data table (let's call it TOTAL) showing the sums of each group number like this:

CHR      TOTAL_SUM
1        5
11       3
23       2

from here I would like to create another data table called BY_STATUS with 3 columns CHR, 'SUM _GAINED', 'SUM_LOST' where 'SUM_GAINED is the sum of CHR that matches with the 'STATUS' output 'GAINED' and 'SUM_LOST' is the sum of CHR that matches with the 'STATUS' output 'LOST' like this:

CHR      SUM _GAINED    SUM_LOST
1        2              2
11       1              0
23       1              0

I would then create two different plots: 1st plot would be for the data table TOTAL to visualize the sums of each number where my x-axis is NUM and my y-axis is SUM

2nd plot would be for the data table BY_STATUS to visualize the different frequencies of each number in CHR based on both SUM_GAINED and SUM_LOST where my x-axis is CHR and my y-axis is both SUM_GAINED and SUM_LOST. Maybe a side-by-side comparison of the two different y-axis?

1 Answers1

1

We can convert the column to logical and count (sum) the TRUE values for GAINED and LOST after grouping by 'CHR'

library(dplyr)
df %>%
   group_by(CHR) %>%
   summarise(SUM_GAINED = sum(STATUS == "GAINED", na.rm = TRUE),
       SUM_LOST = sum(STATUS == "LOST", na.rm =TRUE))

-output

# A tibble: 3 × 3
    CHR SUM_GAINED SUM_LOST
  <int>      <int>    <int>
1     1          2        2
2    11          1        0
3    23          1        0

Or use pivot_wider

library(tidyr)
df %>% 
 drop_na() %>% 
 pivot_wider(id_cols = CHR, names_from = STATUS, 
  values_from = STATUS, values_fn = length, values_fill = 0)
# A tibble: 3 × 3
    CHR GAINED  LOST
  <int>  <int> <int>
1     1      2     2
2    11      1     0
3    23      1     0

For plotting, it may be better to have it in long format with ggplot

library(ggplot2)
df %>%
  drop_na(STATUS) %>% 
  count(CHR, STATUS) %>%
  ggplot(aes(x = CHR, y = n, fill = STATUS)) + 
   geom_col(position="dodge")

With base R, this can be done using table and barplot

barplot(table(df[-1]), beside = TRUE, legend = TRUE)
akrun
  • 874,273
  • 37
  • 540
  • 662