0

I have data in the following format:

dataset <- data.frame(taxa = c("k__Archaea| p__Crenarchaeota", "k__Archaea| p__Euryarchaeota", "k__Bacteria| p__[Thermi]"),
                      "11908.MM.0008.Inf.6m.Stool" =c(0,1760,0),
                      "11908.MM.01115.Inf.6m.Stool" =c(0,1517,0),
                      "11908.MM.0044.Inf.6m.Stool" =c(0,10815,0),
                      "11908.MM.0125.Mom.6m.Stool" = c(0,4719,0))
view(dataset)

And I would like to convert it to the following format:

fix_dataset <- data.frame(study_id = c(0008, 0115, 0044, 0125),
individual = c("Inf", "Inf", "Inf", "Mom" ),
`k__Archaea| p__Crenarchaeota` = c(0,0,0,0),
 `k__Archaea| p__Euryarchaeota`= c(1760, 1517,10815, 4719),
`` = c(0,0,0,0),
timept1 = c("6m", "6m", "6m", "6m"))

view(fix_dataset)

I am trying to cut out the beginning number series 11908 and "Stool" from each column name, split out the other parts of the column names and convert from wide to long format.

I'm using the following code

library(tidyverse)
dataset %>%
  pivot_longer(cols = -taxa) %>%
  separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
  pivot_wider(names_from = taxa,
              values_from = value) %>%
  select(study_id, individual, starts_with("k_"), timept1)

And I'm getting the following error message when I apply this to my data

Error in select(., study_id, individual, timept1, starts_with("k_")) : 
  unused arguments (study_id, individual, timept1, starts_with("k_"))
In addition: Warning messages:
1: Expected 6 pieces. Additional pieces discarded in 44 rows [242, 243, 903, 904, 1564, 1565, 2225, 2226, 2886, 2887, 3547, 3548, 4208, 4209, 4869, 4870, 5530, 5531, 6191, 6192, ...]. 
2: Expected 6 pieces. Missing pieces filled with `NA` in 1012 rows [74, 93, 94, 223, 224, 225, 226, 227, 228, 229, 230, 469, 470, 532, 533, 535, 536, 540, 580, 593, ...]. 
3: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 

Does anyone have suggestions for these error messages?

Hailey
  • 5
  • 2
  • You can get rid of "11908" and "Stool" easily using gsub. For example `gsub("^11908|Stool$","",colnames(dataset))`. Then use reshape to convert to long format. – Michal J Figurski Nov 11 '20 at 13:48
  • Can you please clarify if the fix_dataset is your desired output or if you want the fix_dataset still be pivoted to a long format and if so, based on which columns? – deschen Nov 11 '20 at 13:54
  • the fix_dataset is the desired output, yes – Hailey Nov 11 '20 at 14:01

1 Answers1

0

You can achieve this with the following code:

library(tidyverse)
dataset %>%
  pivot_longer(cols = -taxa) %>%
  separate(col = name, into = c("info1", "info2", "study_id", "individual", "timept1", "info3"), sep = "[.]") %>%
  pivot_wider(names_from = taxa,
              values_from = value) %>%
  select(study_id, individual, starts_with("taxa"), timept1)

which gives:

# A tibble: 4 x 6
  study_id individual taxa1 taxa2 taxa3 timept1
  <chr>    <chr>      <dbl> <dbl> <dbl> <chr>  
1 0008     Inf            0  1760     0 6m     
2 01115    Inf            0  1517     0 6m     
3 0044     Inf            0 10815     0 6m     
4 0125     Mom            0  4719     0 6m 

Note that there is some inconsistency in your study id, i.e. in your original data set one of the ids is "01115" and in your preferred output it is "0115".

deschen
  • 10,012
  • 3
  • 27
  • 50
  • yes that was an error on my part in entering the sample dataset – Hailey Nov 11 '20 at 14:02
  • If the solution works for you you can choose the answer as the accepted answer by clicking on the little checkmark to the left. – deschen Nov 11 '20 at 14:04
  • Can you please provide a more exhaustive code example that also contains some of these other columns that don't fit into the current naming scheme? – deschen Nov 11 '20 at 14:14