0

I have a lot of old untidy data in excel files (50 sheets with 400-500 rows each). A part of my data looks like this:

Elements= c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega")

Units= c("NA", "3", "5", "6", "NA", "21", "19", "31", "24", "1", "5", "NA", "21", "21", "22")

df= data.frame(Elements, Units)

In my large data set each project has a very different number of rows.

I want to create new column "Group" where each line for every project is defined. For the above sample the results would be like this

Group =c(1,1,1,1,2,2,2,2,2,2,2,3,3,3,3)

df =c(Elements, Units, Group)

But I also would like to sum values from the column "Units" for all values below each "empty" cell into a new column "Sum".

Sum= c("14", "NA", "NA", "NA", "101", "NA", "NA", "NA", "NA", "NA", "NA", "9", "NA", "NA", "NA")

My final product would then look like this:

df =c(Elements, Units, Group, Sum)
deschen
  • 10,012
  • 3
  • 27
  • 50
Krag
  • 1
  • 1
  • Welcome to SO. Summing a set of values results in a single number. So it's not clear where you want to put your single value that is the sum of`Units` for all values below each empty cell. Please provide the output you expect based on the sample data you provide. And next time, please provide your input data in a form more ameanable to using directly in R. (The output from `dput()` would be ideal.) – Limey Feb 15 '22 at 15:36
  • Thank you @Limey - I have now edited my question. Hope it clarifies my problem – Krag Feb 15 '22 at 17:28

2 Answers2

0

You could do sth. like this:

Note that I changed your input example so that missings are not declared as strings of "NA", but are real missings (NA):

df <- data.frame(Elements = c("Project name ONE","John","Smith","Sara","Project name TWO","stardust","soil","sunflower","juice","doe","tobacco", "Project name THREE","phi","rho","omega"),
                 Units    = c(NA, "3", "5", "6", NA, "21", "19", "31", "24", "1", "5", NA, "21", "21", "22"))

library(tidyverse)
df %>%
  mutate(project = if_else(is.na(Units), Elements, NA_character_),
         Units   = as.numeric(Units)) %>%
  fill(project) %>%
  group_by(project) %>%
  filter(row_number() != 1) %>%
  mutate(Sum = if_else(row_number() == 1, sum(Units, na.rm = TRUE), NA_real_)) %>%
  ungroup()

# A tibble: 12 x 4
   Elements  Units project              Sum
   <chr>     <dbl> <chr>              <dbl>
 1 John          3 Project name ONE      14
 2 Smith         5 Project name ONE      NA
 3 Sara          6 Project name ONE      NA
 4 stardust     21 Project name TWO     101
 5 soil         19 Project name TWO      NA
 6 sunflower    31 Project name TWO      NA
 7 juice        24 Project name TWO      NA
 8 doe           1 Project name TWO      NA
 9 tobacco       5 Project name TWO      NA
10 phi          21 Project name THREE    64
11 rho          21 Project name THREE    NA
12 omega        22 Project name THREE    NA

So what are we doing?

  • We define the project (or group) by taking the "Elements" for which Units are NA and then fill it downwards.
  • We also turn your Units column to numeric (in your example it is a character variable).
  • We then group by the project.
  • We filter out the first row because it contains the project name which now has its own column.
  • And then we calculate the sum of Units for each project and put taht into the first row of each project.

If you don't want to cut teh first row that contains the project name in "Elements" you can simply delete the code line with filter(...).

deschen
  • 10,012
  • 3
  • 27
  • 50
  • Thank you @deschen, this solution solved my problem. The breaking down of the steps is very helpful for better understanding – Krag Feb 15 '22 at 19:14
  • If this or another solution solves your issue, please accept the answer so that others can benefit from it. – deschen Feb 15 '22 at 19:56
  • I do not seem to be able to vote for the answer, don't have enough credits? – Krag Mar 02 '22 at 14:30
  • You can accept an answer (click on the checkmark symbol), but you might not be able to upvote, but that‘s fine. Accepting would be enough. – deschen Mar 02 '22 at 15:25
0

You could also do:

df %>%
  type_convert() %>%
  group_by(grp = cumsum(is.na(Units))) %>%
  mutate(Sum = (NA^(row_number() != 1))* sum(Units, na.rm = TRUE))

# A tibble: 15 x 4
# Groups:   grp [3]
   Elements           Units   grp   Sum
   <chr>              <dbl> <int> <dbl>
 1 Project name ONE      NA     1    14
 2 John                   3     1    NA
 3 Smith                  5     1    NA
 4 Sara                   6     1    NA
 5 Project name TWO      NA     2   101
 6 stardust              21     2    NA
 7 soil                  19     2    NA
 8 sunflower             31     2    NA
 9 juice                 24     2    NA
10 doe                    1     2    NA
11 tobacco                5     2    NA
12 Project name THREE    NA     3    64
13 phi                   21     3    NA
14 rho                   21     3    NA
15 omega                 22     3    NA
Onyambu
  • 67,392
  • 3
  • 24
  • 53