Extract data after a string one each appearance R

Question

I have a data like this (named spectra):

#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219 
667.675231457819
686.325377466418 
##XYDATA=(X++(Y..Y))
723.625669483618 
742.275815492218 
760.925961500818 
##XYDATA=(X++(Y..Y))
872.826837552417 
891.476983561017 
910.127129569617
928.777275578216

In this data, each time the string ##XYDATA=(X++(Y..Y)), that is the data for each different animal. So, I want to have the code that can help extract this sample into 3 pieces of data.

Animal 1: 3 lines after 1st '  ##XYDATA=(X++(Y..Y))'
Animal 2: 3 lines after 2nd '  ##XYDATA=(X++(Y..Y))'

And so on.

I tried this line of code but it only help to extract line 1 of all times the string '##XYDATA=(X++(Y..Y))' appeared together. Thus, it did not meet my expect to have three lines and to have a separate pieces of data after each appearance of the string.

bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])

Please make a minimal example and also show exactly how you want the results from that example to look. I can say that I really don't understand your data frame based on this. — Elin, Nov 16 '17 at 05:24
Hi Elin, I have revised my question and the data sample based on your comment. Hopefully you (or others) might be able to help. Thanks in advance — hn.phuong, Nov 16 '17 at 05:38
So you have a data frame with 1 column of character variables? Please show head(spectra). Wouldn't it make more sense to read the data in a standard data frame format (so that there is one row per animal or that the animal ID is in each row while leaving your data as numeric)? — Elin, Nov 16 '17 at 05:44
Yes, I agree with you Elin, but this is the raw data we obtain from a company, and now I have to process it to the right format (as my question). The data frame is just one column with each animal's data being provided just right after each string "##XYDATA=(X++(Y..Y))" — hn.phuong, Nov 16 '17 at 05:51
It is a spectral data in a .JDX file but I could manage to read it by read.table — hn.phuong, Nov 16 '17 at 06:01

score 0 · Answer 1 · answered Nov 16 '17 at 06:49

Okay I think you could do something along these lines. I'm sure this could be much better and more efficient but read it in as a character vector. Then loop through to spread it out. However this assumes there are always the same number of measures and you have a way to identify the character values.

c_data<- c("split", 1, 2, 3, 
                   "split", 4, 5, 6)

y<- c_data == "split"

df_wide <- data.frame("animal"= character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
                      stringsAsFactors = FALSE)
names(df_wide)<- c("animal", "v1", "v2", "v3")
x <- 0
for (i in 1:length(c_data)){

  if (y[i] == TRUE){
    x <- x +1
    df_wide[x,] <- rbind(c(c_data[i], c_data[i+1], c_data[i+2], c_data[i+3]))
  } 
}

yields

  animal v1 v2 v3
1  split  1  2  3
2  split  4  5  6

If it is a one time thing, it may not be worth trying to write something nicer. If it is an ongoing thing then you may want to look at using an apply function that you could have to write a function for.

score 0 · Answer 2 · answered Nov 16 '17 at 19:15

You can do either of the following with split and map:

library(dplyr)
library(purrr)

df %>%
  mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
  split(.$Animal) %>%
  map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
  '['(-1)

This creates an indicator variable Animal, split by that indicator, remove the first row for each dataframe, convert V1 to numeric, and finally remove the first element of the list.

You can also do the following:

df %>%
  mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
  filter(!grepl("^#.*$", V1)) %>%
  mutate(V1 = as.numeric(V1)) %>%
  split(.$Animal)

This also creates the indicator Animal, but it intead, filters out all rows with # signs in it and converts V1 to numeric before splitting into separate dataframes.

Result:

$`1`
# A tibble: 3 x 2
        V1 Animal
     <dbl>  <int>
1 649.0251      1
2 667.6752      1
3 686.3254      1

$`2`
# A tibble: 3 x 2
        V1 Animal
     <dbl>  <int>
1 723.6257      2
2 742.2758      2
3 760.9260      2

$`3`
# A tibble: 4 x 2
        V1 Animal
     <dbl>  <int>
1 872.8268      3
2 891.4770      3
3 910.1271      3
4 928.7773      3

Note:

Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.

Data:

df = read.table(textConnection("'#Milk spectra: 1234'
                ##XYDATA=(X++(Y..Y))
                649.025085449219 
                667.675231457819
                686.325377466418 
                ##XYDATA=(X++(Y..Y))
                723.625669483618 
                742.275815492218 
                760.925961500818 
                ##XYDATA=(X++(Y..Y))
                872.826837552417 
                891.476983561017 
                910.127129569617
                928.777275578216"),comment.char = "", stringsAsFactors = FALSE)

Extract data after a string one each appearance R

2 Answers2