0

I want to apply the calculations (Occ_1+1)/(Totl_1+Unique_words), (Occ_2+1)/(Totl_2+Unique_words) and (Occ_3+1)/(Totl_3+Unique_words), and store the results in new columns Probability_1, Probability_2 and Probability_3.

Right now I am doing every calculation separately and combining the results.

Ex: for (Occ_1+1) I am doing sapply(df$Occ_1, function(x){x+1}).

I have almost 50 Occ_ and 50 Totl_ columns, so my code gets very lengthy if I do all the calculations separately.
Is there a way to do all the calculations at once?

Sample DF, up to Occ_3 and Totl_3 only:

 word        Occ_1  Occ_2  Occ_3  Totl_1 Totl_2 Totl_3 Unique_words
  <chr>      <int>  <int>  <int>  <int>  <int>  <int>        <int>
 1 car          0     1     0     11      9      7           17
 2 saturn       2     0     2     11      9      7           17
 3 survival     1     2     0     11      9      7           17
 4 baseball     1     1     0     11      9      7           17
 5 color        0     0     1     11      9      7           17
 6 muscle       0     1     0     11      9      7           17
james joyce

3 Answers

3

I would just gather all the Occ.. , Tot.. columns together and perform the required arithmetic

occ_cols <- grep("^Occ", names(df))
tot_cols <- grep("^Totl", names(df))

df[paste0("Probability_", seq_along(occ_cols))] <- 
      (df[occ_cols] + 1)/(df[tot_cols] + df$Unique_words)

df
#      word Occ_1 Occ_2 Occ_3 Totl_1 Totl_2 Totl_3 Unique_words Probability_1
#1      car     0     1     0     11      9      7           17    0.03571429
#2   saturn     2     0     2     11      9      7           17    0.10714286
#3 survival     1     2     0     11      9      7           17    0.07142857
#4 baseball     1     1     0     11      9      7           17    0.07142857
#5    color     0     0     1     11      9      7           17    0.03571429
#6   muscle     0     1     0     11      9      7           17    0.03571429

#  Probability_2 Probability_3
#1    0.07692308    0.04166667
#2    0.03846154    0.12500000
#3    0.11538462    0.04166667
#4    0.07692308    0.04166667
#5    0.03846154    0.08333333
#6    0.07692308    0.04166667

However, make sure all your Occ.. and Totl.. columns are in the same order, because the two sets are paired by position. For this example, we have Occ_1, Occ_2, Occ_3 followed by Totl_1, Totl_2 and Totl_3.
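If the column order cannot be guaranteed, one option is to pair the columns by suffix instead of by position. The following is a minimal sketch, not part of the original answer; the toy data frame and the `_personal`/`_company` suffixes are made up for illustration, with the Totl columns deliberately out of order.

```r
# Hypothetical toy data: Totl_ columns are NOT in the same order as Occ_ columns.
df <- data.frame(
  word          = c("car", "saturn"),
  Occ_personal  = c(0L, 2L),
  Occ_company   = c(1L, 0L),
  Totl_company  = c(9L, 9L),
  Totl_personal = c(11L, 11L),
  Unique_words  = c(17L, 17L)
)

# Read the suffixes off the Occ_ columns and build both name vectors from them,
# so each Occ_x is always paired with its own Totl_x.
suffixes <- sub("^Occ_", "", grep("^Occ_", names(df), value = TRUE))
occ_cols <- paste0("Occ_", suffixes)
tot_cols <- paste0("Totl_", suffixes)

# Fail early if some Occ_ column has no matching Totl_ column.
stopifnot(all(tot_cols %in% names(df)))

df[paste0("Probability_", suffixes)] <-
  (df[occ_cols] + 1) / (df[tot_cols] + df$Unique_words)

df
```

A side effect of this variant is that the new columns carry the suffix in their names (Probability_personal, Probability_company) rather than a running number.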

Ronak Shah
  • Thanks sir. Actually my column names are different: what if I have `Occ_personal`, `Occ_company`, `Occ_location` and `Totl_personal`, `Totl_company`, `Totl_location`? Will this still work? And yes, they are in this order, with all `Occ_` columns first and all `Totl_` columns last. – james joyce Jan 14 '19 at 11:36
  • @jamesjoyce yes, this will still work, since we want to group all `Occ..` and `Totl..` columns together (irrespective of suffix) even in this case. – Ronak Shah Jan 14 '19 at 11:40
2

I'm going to suggest a different approach from the other two answers. I think you're working in the wrong data format here, namely your data is wide when it should be long. If you are not familiar with those terms, there are lots of explanations online you should check out. The best one, in my opinion, would be this one.

Using the tidyr package, I'd solve your problem like this:

library(tidyverse)

Step one is to separate the Occ and Totl columns into 2 data frames, which we are going to merge together later. Using the gather function, I'm converting these columns to key-value pairs. We're extracting the numeric value out of the key so that we can match Occ_1 to Totl_1 later.

# "[0-9]+" captures the whole number, so Occ_10 becomes 10 rather than 1
df_occ <- df %>%
  gather(group, occ, contains("Occ")) %>%
  select(word, group, occ) %>%
  mutate(group = str_extract(group, "[0-9]+") %>% as.integer())

df_totl <- df %>%
  gather(group, totl, contains("Totl")) %>%
  select(word, group, totl) %>%
  mutate(group = str_extract(group, "[0-9]+") %>% as.integer())

Once we have these two data frames, we're merging them back together. We take the word and Unique_words columns from the original data frame, then add the Occ data frame and lastly the Totl data frame by group. Finally, we can do the calculation you want with one line of code.

df_merge <- df %>%
  select(word, Unique_words) %>%
  left_join(df_occ, by = 'word') %>%
  left_join(df_totl, by = c('word', 'group')) %>%
  mutate(prob = (occ + 1) / (totl + Unique_words))

If you want to convert this back to a wide format, you would use the inverse of the gather function, namely spread.

df_wide <- df_merge %>%
  select(word, group, prob) %>%
  mutate(group = paste0("Prob_", group)) %>%
  spread(group, prob)

The advantages of this approach:

  1. Your code is more legible, with every operation on its own line and avoiding square brackets (which often create code that is hard to read).
  2. Your code shows intermediate steps.
  3. The approach is more flexible, and hopefully also makes other processing steps easier.
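
As a side note, newer versions of tidyr (1.0.0 and later) offer pivot_longer and pivot_wider, which can do the reshaping without the intermediate join. This is only a sketch of that alternative, not the method described above; the two-row data frame is made up for illustration, and it assumes every column name has the form prefix_suffix with a single underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same wide layout as the question.
df <- data.frame(
  word = c("car", "saturn"),
  Occ_1 = c(0L, 2L), Occ_2 = c(1L, 0L),
  Totl_1 = c(11L, 11L), Totl_2 = c(9L, 9L),
  Unique_words = c(17L, 17L)
)

# ".value" tells pivot_longer that the part before "_" names the output column
# (Occ or Totl), while the part after "_" goes into the "group" column.
df_long <- df %>%
  pivot_longer(
    cols = matches("^(Occ|Totl)_"),
    names_to = c(".value", "group"),
    names_sep = "_"
  ) %>%
  mutate(prob = (Occ + 1) / (Totl + Unique_words))

# Back to wide format: one Probability_<group> column per group.
df_wide <- df_long %>%
  pivot_wider(
    id_cols = word,
    names_from = group,
    values_from = prob,
    names_prefix = "Probability_"
  )

df_wide
```

Because the group is kept as text here, the same code should also cope with non-numeric suffixes, as long as they contain no further underscore.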
A. Stam
  • I just tried this. A very clean way to handle this type of data, and it will help me learn R. Bookmarking the link. Thanks, sir. – james joyce Jan 14 '19 at 12:06
1

That's actually what's called vectorization of a function and it can increase the performance of your code significantly.

But first, just so you know for future questions, it's much easier to provide example data using dput:

dput(df)

Then someone who wants to answer the question can simply use the output:

df <- structure(list(word = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("baseball", 
"car", "color", "muscle", "saturn", "survival"), class = "factor"), 
    Occ_1 = c(0L, 2L, 1L, 1L, 0L, 0L), Occ_2 = c(1L, 0L, 2L, 
    1L, 0L, 1L), Occ_3 = c(0L, 2L, 0L, 0L, 1L, 0L), Totl_1 = c(11L, 
    11L, 11L, 11L, 11L, 11L), Totl_2 = c(9L, 9L, 9L, 9L, 9L, 
    9L), Totl_3 = c(7L, 7L, 7L, 7L, 7L, 7L), Unique_words = c(17L, 
    17L, 17L, 17L, 17L, 17L), Probability_1 = c(0.0357142857142857, 
    0.107142857142857, 0.0714285714285714, 0.0714285714285714, 
    0.0357142857142857, 0.0357142857142857), Probability_2 = c(0.0769230769230769, 
    0.0384615384615385, 0.115384615384615, 0.0769230769230769, 
    0.0384615384615385, 0.0769230769230769), Probability_3 = c(0.0416666666666667, 
    0.125, 0.0416666666666667, 0.0416666666666667, 0.0833333333333333, 
    0.0416666666666667)), row.names = c(NA, -6L), class = "data.frame")
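
To see why pasting the dput output works, here is a small round-trip sketch on a made-up two-row data frame (`small` is a hypothetical name). The text that dput prints is itself an R expression, so it can be assigned directly; dget is only needed when that text has been saved to a file.

```r
# Hypothetical toy data frame.
small <- data.frame(word = c("car", "saturn"), Occ_1 = c(0L, 2L))

# capture.output() collects the text that dput() would print to the console.
txt <- paste(capture.output(dput(small)), collapse = "\n")

# Evaluating that text rebuilds the object, just like pasting it after `<-`.
rebuilt <- eval(parse(text = txt))

# Compare the original with the rebuilt copy.
all.equal(small, rebuilt)
```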

Anyway, here is a way to do what you want:

df$Probability_1 <- (df$Occ_1 + 1) / (df$Totl_1 + df$Unique_words)
df$Probability_2 <- (df$Occ_2 + 1) / (df$Totl_2 + df$Unique_words)
df$Probability_3 <- (df$Occ_3 + 1) / (df$Totl_3 + df$Unique_words)

Or if you prefer dplyr:

library("dplyr")
df_new <- df %>% 
  mutate(
    Probability_1 = (Occ_1 + 1) / (Totl_1 + Unique_words),
    Probability_2 = (Occ_2 + 1) / (Totl_2 + Unique_words),
    Probability_3 = (Occ_3 + 1) / (Totl_3 + Unique_words)        
  )
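
To make the vectorization point concrete, this short sketch compares the sapply call from the question with the plain arithmetic used above. The vector `occ` simply copies the Occ_1 values from the sample data; the other names are made up for the example.

```r
occ <- c(0L, 2L, 1L, 1L, 0L, 0L)  # the Occ_1 column from the sample data

via_sapply <- sapply(occ, function(x) x + 1)  # one function call per element
via_vector <- occ + 1                          # a single vectorized operation

identical(via_sapply, via_vector)
# [1] TRUE
```

Both give the same numbers, so the sapply wrapper can be dropped wherever the operation is ordinary arithmetic on a column.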

Update

I missed the point of the question. It's actually about the number of Occ and Totl variables. I would solve this with a for loop, which should still be very efficient:

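# loop over the suffixes of the Occ_ columns (e.g. "1", "2", "3")
for (i in gsub("^Occ_", "", grep("^Occ_", colnames(df), value = TRUE))) {
  # [[ ]] extracts plain vectors, so the result is an ordinary column
  df[[paste0("Probability_", i)]] <- 
    (df[[paste0("Occ_", i)]] + 1) / (df[[paste0("Totl_", i)]] + df$Unique_words)
}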
JBGruber
  • Thanks for the response, sir. This is what I have been doing; I have `58` classes, so I would have to write this code `58` times, and that's what I want to avoid. Thanks – james joyce Jan 14 '19 at 11:31