I have a data frame with the letters of the English alphabet and their frequency. Now it would be nice to also know the frequency of the vowels and the consonants and the total number of occurrences - and since I want to plot all of this information, I need it to be in one data frame.

So I often find myself in a situation like this:

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)))

df_vowels <- data.frame(letter = "vowels", freq = sum(df[df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_consonants <- data.frame(letter = "consonants", freq = sum(df[!df$letter %in% c("a", "e", "i", "o", "u"), ]$freq))
df_totals <- data.frame(letter = "totals", freq = sum(df$freq))

df <- rbind(df, df_vowels, df_consonants, df_totals)

Am I doing this the right way or is there a more elegant solution for this?

Looks like my description was terribly confusing:

Basically, I want to add new categories (rows) to the data frame. In this very simple example, it's simply summarized data.

(For time series plots I'm using the aggregate function.)


joran
not_a_number
  • In your example dataset, there is only a single element for each letter. So why do you need aggregate? – akrun Jul 04 '15 at 18:22
  • Sorry, I made a mistake. I will fix this immediately. – not_a_number Jul 04 '15 at 18:28
  • Your example still has only one row for each letter. You can check the example in my post. – akrun Jul 04 '15 at 18:43
  • The expected result you showed may not be necessary for the plot. If you have provided more info about the plot, it may be helpful. – akrun Jul 04 '15 at 18:57
  • Please see my update. – not_a_number Jul 04 '15 at 19:22
  • So, you need to group by 'letters' to get the `sum`. It is a totally different question now. – akrun Jul 04 '15 at 19:25
  • Kludging extra rows onto a df just to get things nice for ggplot tends to be a bad idea. See my updated example for how to dynamically use dplyr to append/rbind the two summary rows then ggplot plot the resulting df directly, without overwriting the df. – smci Jul 04 '15 at 19:31
  • I seriously didn't mean to waste your time. I'm sorry I led you into a wrong direction. – not_a_number Jul 04 '15 at 19:33
  • It's okay. I was trying in a different direction as the `aggregate` step was confusing – akrun Jul 04 '15 at 19:34
  • Third update of my answer. Works. Quite elegant if I say so myself. – smci Jul 04 '15 at 20:28

2 Answers

EDIT: here's a pretty elegant answer to the third version of your question:

library(dplyr)
library(ggplot2)

df <- data.frame(letter = letters, freq = sample(1:100, length(letters)),
                 stringsAsFactors = FALSE)

df <- df %>% group_by(letter) %>% summarize(freq = sum(freq))

df.tots <- df %>% group_by(is_vowel = letter %in% c('a','e','i','o','u')) %>%
                  summarize(freq = sum(freq))

# Now we just rbind your three summary rows onto the df, then pipe it into your ggplot.
# Each row is built with data.frame() rather than c(), which would coerce freq to character.
df %>%
  rbind(data.frame(letter = 'vowels',     freq = df.tots$freq[df.tots$is_vowel]))  %>%
  rbind(data.frame(letter = 'consonants', freq = df.tots$freq[!df.tots$is_vowel])) %>%
  rbind(data.frame(letter = 'total',      freq = sum(df.tots$freq)))               %>%
  ggplot( ... your_ggplot_command_goes_here ...)

  #qplot(data=..., x=letter, y=freq, stat='identity', geom='histogram')
  # To keep your x-axis in order, i.e. our summary rows at bottom,
  # you have to explicitly set order of factor levels:
  # df$letter = factor(df$letter, levels=df$letter)

Voila!

Notes:

  1. We needed data.frame(..., stringsAsFactors=F) so that we could later append the rows 'vowels', 'consonants' and 'total', because those values wouldn't occur in the factor levels of 'letter'.
  2. dplyr's group_by(is_vowel = ...) lets us simultaneously insert a new column (mutate) and split on that expression (group_by), all in one compact line. Neat. I never knew you could do that.
  3. You should be able to get bind_rows working at the end, though I couldn't.
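
The factor-level remark in the code comments above can be sketched in base R. This is only an illustration with made-up frequencies (the `toy` data frame is hypothetical, not the data from the question): using the row order as the level order keeps the summary rows at the end of the axis instead of letting them sort alphabetically.

```r
# Hypothetical toy data: three letters plus the appended summary rows
toy <- data.frame(letter = c("a", "b", "c", "vowels", "consonants", "total"),
                  freq   = c(5, 3, 2, 5, 5, 10),
                  stringsAsFactors = FALSE)

# Default factor conversion sorts the levels alphabetically
default_levels <- levels(factor(toy$letter))

# Using the row order as the level order keeps the summary rows last
# (this assumes the values in toy$letter are unique)
toy$letter <- factor(toy$letter, levels = toy$letter)
ordered_levels <- levels(toy$letter)
```

ggplot2 orders a discrete x-axis by factor levels, so a column prepared this way should plot in row order.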

EDIT: second version. You were saying you want to do an aggregation, so we take it that each letter has >1 record in df. You seem to be just splitting your df into vowels and consonants, then merging again, so I don't see that new columns are necessary, other than is_vowel. One way is with dplyr:

require(dplyr)
#  I don't see why you don't just overwrite df here with df2, the df of totals...
df2 = df %>% group_by(letter) %>% summarize(freq = sum(freq))
   letter     freq
1       a      150
2       b       33
3       c       54
4       d      258
5       e      285
6       f      300
7       g      198
8       h       27
9       i       36
10      j      189
..    ...      ...

# Now add a logical column, so we can split on it when aggregating
# df or df2 ....
df$is_vowel = df$letter %in% c('a','e','i','o','u')

# Then your total vowels are:
df %>% filter(is_vowel==T) %>% summarize(freq = sum(freq))
     freq
      312
# ... and total consonants ...
df %>% filter(is_vowel==F) %>% summarize(freq = sum(freq))
     freq
     1011
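
The two filter()/summarize() calls could also be collapsed into a single grouped sum. Here is a minimal base R sketch using aggregate, with hypothetical toy data (`toy` and `by_type` are illustrative names, not part of the code above):

```r
# Hypothetical toy data with repeated letters
toy <- data.frame(letter = c("a", "b", "a", "e", "c"),
                  freq   = c(10, 3, 5, 7, 2),
                  stringsAsFactors = FALSE)
toy$is_vowel <- toy$letter %in% c("a", "e", "i", "o", "u")

# One row per group: is_vowel FALSE = consonants, TRUE = vowels
by_type <- aggregate(freq ~ is_vowel, data = toy, FUN = sum)
```

This gives both totals in one data frame, which may be easier to bind back onto the per-letter rows than two separate results.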

Here's another way, a one-liner if you want to avoid dplyr (it splits df into consonant rows and vowel rows, which you can then sum separately):

split(df, df$letter %in% c("a", "e", "i", "o", "u") )
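
split() only separates the rows, so the sums still have to be taken per piece. A small sketch of that follow-up step, again on hypothetical toy data:

```r
# Hypothetical toy data
toy <- data.frame(letter = c("a", "b", "e", "c"),
                  freq   = c(10, 3, 7, 2),
                  stringsAsFactors = FALSE)

# split() returns a list with elements named "FALSE" (consonants) and "TRUE" (vowels)
pieces <- split(toy, toy$letter %in% c("a", "e", "i", "o", "u"))

# Sum the freq column of each piece
sums <- sapply(pieces, function(d) sum(d$freq))
```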

By the way, you can form the list(/set) of consonants more easily by just subtracting vowels from all letters:

setdiff(letters, c("a", "e", "i", "o", "u"))
# "b" "c" "d" "f" "g" "h" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "v" "w" "x" "y" "z"
smci
  • Anyway this gives you the idea, please update your example if you meant something different. – smci Jul 04 '15 at 18:48
  • I think I meant something different. Please see my update. – not_a_number Jul 04 '15 at 19:22
  • Third update. Not doing any more for you :) Dynamically creating an object then piping into ggplot with dplyr's `%>%` is pretty elegant. You could wrap a function around this. – smci Jul 04 '15 at 20:25
  • I think you've earned the right to be sarcastic. :-) Seriously, thanks a lot!! I didn't know about piping in R. – not_a_number Jul 04 '15 at 20:55
  • Sadly, the piping is messed up with the crappy necessity to set order on the row levels `df$letter = factor(df$letter, levels=df$letter)` – smci Jul 04 '15 at 21:02
  • Oops, if you look closely, the y-axis values on that 'histogram' are not in numerical order, they're factor order... sigh... you can take it from there. – smci Jul 04 '15 at 21:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82395/discussion-between-not-a-number-and-smci). – not_a_number Jul 05 '15 at 08:25

You can try

# df and v1 are defined in the "data" section below
v2 <- with(df, tapply(freq, c('consonants', 'vowels')[(letter %in% v1) + 1L],
                      FUN = sum))

df1 <- rbind(df, data.frame(letter = c(names(v2), "Total"),
                            freq = c(v2, sum(v2)), stringsAsFactors = FALSE))
library(ggplot2)
ggplot(df1, aes(x = letter, y = freq)) +
  geom_bar(stat = 'identity')
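
The grouping expression inside tapply relies on a logical vector being coerced to 0/1 when 1L is added, so the result can index into the two labels. A minimal sketch that unpacks this step, using hypothetical toy data (`toy`, `grp` and `totals` are illustrative names):

```r
v1 <- c("a", "e", "i", "o", "u")

# Hypothetical toy data
toy <- data.frame(letter = c("a", "b", "e", "c"),
                  freq   = c(10, 3, 7, 2),
                  stringsAsFactors = FALSE)

# FALSE + 1L is 1 ("consonants"), TRUE + 1L is 2 ("vowels")
grp <- c("consonants", "vowels")[(toy$letter %in% v1) + 1L]

# Sum freq within each label
totals <- tapply(toy$freq, grp, FUN = sum)
```

The parentheses around the `%in%` test are not strictly required, since `%in%` binds more tightly than `+` in R, but they make the intent easier to read.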

data

set.seed(24)
df <- data.frame(letter= sample(letters,200, replace=TRUE),
 freq = sample(1:100, 200, replace=TRUE), stringsAsFactors=FALSE)
v1 <- c("a", "e", "i", "o", "u")
akrun
  • FYI `dcast` requires 1.9.5 which is the devel version; only 1.9.4 is available on CRAN currently – smci Jul 04 '15 at 18:40
  • @smci Instructions to install the devel version are [here](https://github.com/Rdatatable/data.table/wiki/Installation) – akrun Jul 04 '15 at 18:41
  • Yes, just letting people know otherwise they'll see `Error: object 'dcast' not found` – smci Jul 04 '15 at 18:49
  • @smci Thanks for the comment. Yes, I added that info in the post – akrun Jul 04 '15 at 18:49