I am trying to generate a stacked area graph in R using the command below:

ggplot(p_ash_r_100,aes(x=SMPL_TIME,y=SMPL_CNT,col=EVENT,group=1))+ geom_area()

Here EVENT is the third variable, which I want to chart against time and sample counts from the Oracle DB.

But the graph produced by the above command is empty.

My questions are:

  1. How do I fix the empty graph problem?

  2. How do I keep only the top 10 variables, ranked by the amount of data, either when displaying or earlier? This is easy to do in Excel, as shown in the image below.

My dataset looks like this:

> p_ash_r_100
          SMPL_TIME        SQL_ID                        MODULE                        EVENT SMPL_CNT
1   11-APR-17 09:00 03d5x9busf1d8                      SQL*Plus                          CPU        1
2   11-APR-17 09:00 2pb7bzzadj0pn OGG-RCASI004-OPEN_DATA_SOURCE      db file sequential read        1
3   11-APR-17 09:00        NO_SQL                    GoldenGate                          CPU        1
4   11-APR-17 09:00        NO_SQL                    MMON_SLAVE                          CPU        1
5   11-APR-17 09:00        NO_SQL                        NO_SQL              Log archive I/O        1
6   11-APR-17 09:00        NO_SQL                       XStream                          CPU        1
7   11-APR-17 09:00 acuzxh557cq81                    GoldenGate      db file sequential read        1
8   11-APR-17 09:00 cqtby4bsrmxzh                    GoldenGate                          CPU        1
9   11-APR-17 09:00 dgzp3at57cagd                    GoldenGate      db file sequential read        2
10  11-APR-17 09:00 fjp9t92a5yx1v                    GoldenGate      db file sequential read        1
11  11-APR-17 09:00 guh1sva39p9db                    GoldenGate      db file sequential read        1
12  11-APR-17 09:01 0hz0dhgwk12cd                    GoldenGate            direct path write        1
13  11-APR-17 09:01 2jafq5d4n0akv                    GoldenGate                          CPU        1
14  11-APR-17 09:01 37cspa0acgqxp                    GoldenGate      db file sequential read        2
15  11-APR-17 09:01 79rugrngrvpt1 OGG-RADDR025-OPEN_DATA_SOURCE      db file sequential read        1
16  11-APR-17 09:01 7k6zp92kbv28m                    GoldenGate                          CPU        1
17  11-APR-17 09:01 7nvtkfc0bt8vv                    GoldenGate      db file sequential read        1
18  11-APR-17 09:01 7pvpzvd1g769d                    GoldenGate                          CPU        1
19  11-APR-17 09:01 9gduk46rmt5jy                    GoldenGate      db file sequential read        1
20  11-APR-17 09:01        NO_SQL                    GoldenGate                          CPU        7

An image of the dataset is added below for ease of understanding.

The end result I want is something like this graph from Excel:

Value filters used in Excel to get the top 10 events:


1 Answer

I'll start with the second question, which is easier. With the dplyr package, you can use top_n to get the n largest rows for a given column. For example (p_ash_r_100a is the data set with SMPL_TIME converted to a date-time, as described further down):

> top_n(p_ash_r_100a, 3, SMPL_CNT) %>% arrange(desc(SMPL_CNT))
# A tibble: 3 × 5
            SMPL_TIME        SQL_ID     MODULE                   EVENT SMPL_CNT
               <dttm>         <chr>      <chr>                   <chr>    <int>
1 2017-04-11 09:01:00        NO_SQL GoldenGate                     CPU        7
2 2017-04-11 09:00:00 dgzp3at57cagd GoldenGate db file sequential read        2
3 2017-04-11 09:01:00 37cspa0acgqxp GoldenGate db file sequential read        2

Note that you will get more than n rows if there are ties for nth place. Thus top_n(p_ash_r_100, 10, SMPL_CNT) will return the entire sample data set because of the 17-way tie for 4th.
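Since the question asks for the top events by amount of data rather than the top rows, one possible approach is to total SMPL_CNT per EVENT first and then keep only the rows for the highest-ranked events. The following is a minimal sketch, not a definitive implementation. It assumes dplyr 1.0.0 or later (for slice_max), and the tibble `ash` and the names `n_keep`, `top_events` and `ash_top` are hypothetical stand-ins, with made-up values rather than the real data.

```r
library(dplyr)

# Toy stand-in for p_ash_r_100 (illustrative values, not the real ASH data)
ash <- tibble(
  EVENT    = c("CPU", "CPU", "db file sequential read",
               "Log archive I/O", "direct path write",
               "db file sequential read"),
  SMPL_CNT = c(1L, 7L, 2L, 1L, 1L, 2L)
)

n_keep <- 2  # hypothetical cutoff; use 10 for the real data

# Total the sample counts per event, then keep the n_keep largest totals.
# with_ties = FALSE guarantees at most n_keep events even when totals tie.
top_events <- ash %>%
  count(EVENT, wt = SMPL_CNT, name = "total") %>%
  slice_max(total, n = n_keep, with_ties = FALSE) %>%
  pull(EVENT)

# Keep only the rows that belong to the selected events
ash_top <- ash %>%
  filter(EVENT %in% top_events)
```

The filtered data frame can then go through the same summarising and plotting steps as the full data set.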

As for the first question, the documentation for geom_area provides a clue:

An area plot is the continuous analog of a stacked bar chart (see geom_bar), and can be used to show how composition of the whole varies over the range of x.

This suggests that geom_area expects the column mapped to x to be numeric. Based on the listing for p_ash_r_100, SMPL_TIME appears to be a character vector. With the lubridate package, we can convert SMPL_TIME to a date-time with dmy_hm:

library(dplyr)
library(lubridate)
p_ash_r_100a <- p_ash_r_100 %>%
  mutate_at(vars(SMPL_TIME), dmy_hm)
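To see what the conversion does in isolation, here is a small sketch that checks the class of the values before and after parsing. The vector `smpl_time` is a hypothetical stand-in holding two strings in the same format as the listing, and it assumes lubridate recognises the upper-case month abbreviation, which the converted output shown earlier suggests it does.

```r
library(lubridate)

# Two strings in the same format as the SMPL_TIME column (stand-in values)
smpl_time <- c("11-APR-17 09:00", "11-APR-17 09:01")

class(smpl_time)   # a character vector, so there is no continuous x axis

# dmy_hm reads day-month-year followed by hour:minute
parsed <- dmy_hm(smpl_time)

class(parsed)      # a date-time (POSIXct) vector, which ggplot2 treats as continuous

# The gap between the two samples, in minutes
gap_mins <- as.numeric(difftime(parsed[2], parsed[1], units = "mins"))
```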

However, this isn't enough to get the plot you want since there are multiple values of y for each combination of x and fill (which is the correct aesthetic for geom_area, not "col"). We need to summarise the data before plotting:

p_ash_r_100a %>%
  group_by(SMPL_TIME, EVENT) %>%
  summarise(total = sum(SMPL_CNT)) %>%
  ggplot(aes(SMPL_TIME, total, fill = EVENT)) +
  geom_area()

(Figure: first stacked area plot)

Yet the plot is still not correct. This is because not every combination of SMPL_TIME and EVENT is represented in the data set. We need to explicitly tell geom_area that y is equal to zero for those missing rows. One way is to use the handy fill argument in tidyr::spread.

group_by(p_ash_r_100a, SMPL_TIME, EVENT) %>%
  summarise(smpl_sum = sum(SMPL_CNT)) %>%
  spread(EVENT, smpl_sum, fill = 0) %>% 
  gather(EVENT, smpl_sum, CPU, `db file sequential read`, 
         `direct path write`,
         `Log archive I/O`) %>%
  ggplot(aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) +
  geom_area()
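An alternative to the spread/gather round trip, offered as a sketch rather than the method used above, is tidyr::complete, which adds the missing SMPL_TIME/EVENT combinations directly and does not require listing the event names by hand. The tibble `ash` and the result name `filled` are hypothetical, and the `.groups` argument assumes dplyr 1.0.0 or later. The grouping has to be dropped before calling complete, otherwise combinations are only filled in within each remaining group.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for p_ash_r_100a (illustrative values only)
ash <- tibble(
  SMPL_TIME = as.POSIXct(c("2017-04-11 09:00", "2017-04-11 09:00",
                           "2017-04-11 09:01"), tz = "UTC"),
  EVENT     = c("CPU", "Log archive I/O", "CPU"),
  SMPL_CNT  = c(1L, 1L, 7L)
)

filled <- ash %>%
  group_by(SMPL_TIME, EVENT) %>%
  # .groups = "drop" returns an ungrouped result so complete() sees all rows
  summarise(smpl_sum = sum(SMPL_CNT), .groups = "drop") %>%
  # Add a zero row for every SMPL_TIME/EVENT pair absent from the data
  complete(SMPL_TIME, EVENT, fill = list(smpl_sum = 0L))
```

The result can be passed to ggplot(aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) + geom_area() in the same way as the spread/gather version.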
