2

I am trying to split a dataframe column into multiple columns based on a delimiter. My dataframe has one column which looks like below-

A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG

I want a dataframe which has 6 columns namely Site, File, Variable, Timestamp, Value, and Comment, like below-

Site File Variable Timestamp Value Comment
A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG

I tried to do this by using tidyr package and using the 'separate' statement because each observation is delimited by a space. However, the problem is that the comments have spaces in between them and I don't want to split the comments. Is there a way to do this? Any help will be greatly appreciated. Thanks!

asmi
  • 449
  • 1
  • 6
  • 16
  • 1
    If everything except the comments is of fixed width, you can probably try https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.fwf.html – Xiongbing Jin Oct 10 '16 at 20:15

3 Answers3

3

Another tidyverse answer, this time using tidyr::separate.

We note that each row is space-separated, except for the last line (which can contain spaces). In that case, we can just split on space up to the number of columns we know we have.

tidyr::separate takes an extra argument that can handle this use case: extra = "merge".

library(tidyverse)

data.raw = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG"

data = read_csv(data.raw, col_names = "Col1")

data %>%
    separate(Col1, into = c("Site", "File", "Variable", "Timestamp", "Value", "Comment"), sep = "\\s", extra = "merge") %>%
    type_convert() %>%
    head()

#> # A tibble: 3 x 6
#>       Site  File Variable        Timestamp Value            Comment
#>      <chr> <chr>    <dbl>            <chr> <dbl>              <chr>
#> 1 A0017493    .A    11.86 23:59_10/10/2016  1.00           SURVEYED
#> 2 A0017493    .A    11.86 23:59_10/11/2016  1.15 DATALOGGER CHANGED
#> 3 A0017496    .A    11.82 23:59_11/12/2016  2.06   READING IS WRONG
Michael Griffiths
  • 1,399
  • 7
  • 14
  • Hi Michael- I used the statement newdata<-separate(mydata, col=comments, into = c("Site","File","Variable","Timestamp","Value","Comment"), sep= " ", extra="merge", remove=TRUE) but, the resulting dataframe has Site in the first column and everything else is in the last column. All other columns in between are empty. Basically it just separated out the Site. Am I doing something wrong? – asmi Oct 12 '16 at 12:21
3

Seems to be a a ragged fixed-width-formatted file, so

library(readr)
pos <- fwf_positions(start = c(1, 9, 13, 19, 36, 42), end = c(9, 13, 19, 36, 42, NA)-2) # if I counted correctly... 
df <- read_fwf(file = "A0017493 .A 11.86 23:59_10/10/2016 1.00 SURVEYED
A0017493 .A 11.86 23:59_10/11/2016 1.15 DATALOGGER CHANGED
A0017496 .A 11.82 23:59_11/12/2016 2.06 READING IS WRONG", col_positions = pos )
glimpse(df)
# Observations: 3
# Variables: 6
# $ X1 <chr> "A001749", "A001749", "A001749"
# $ X2 <chr> ".A", ".A", ".A"
# $ X3 <dbl> 11.86, 11.86, 11.82
# $ X4 <chr> "23:59_10/10/2016", "23:59_10/11/2016", "23:59_11/12/2016"
# $ X5 <chr> "1.00 SU", "1.15 DA", "2.06 RE"
# $ X6 <chr> "VEYED", "ALOGGER CHANGED", "DING IS WRONG"
lukeA
  • 53,097
  • 5
  • 97
  • 100
0

We can use the tidyverse library of packages to do what you want. The key is to split apart each row based on the ' ' character, and then unite those comment columns back together. This assumes that your raw data are contained in a data frame called df, which has a single column called V1.

library(tidyverse)

df.new <- strsplit(df$V1, split = ' ') %>% # split each row into a character vector contained in a list
    lapply(function(x) data.frame(rbind(x))) %>% # simplify each vector into a character array
    rbind.fill %>% # glue together the ragged rows
    unite('Comment', -X1:-X5, sep = ' ') %>% # recombine every column that is NOT one of the first 5 (i.e., combine comment columns)
    mutate(Comment = gsub(' NA', '', Comment)) %>% # get rid of 'NA' strings
    rename(Site = X1, File = X2, Variable = X3, Timestamp = X4, Value = X5) # relabel columns
    mutate_all(as.character) %>% type_convert # convert columns to appropriate formats

      Site File Variable        Timestamp Value            Comment
1 A0017493   .A    11.86 23:59_10/10/2016  1.00           SURVEYED
2 A0017493   .A    11.86 23:59_10/11/2016  1.15 DATALOGGER CHANGED
3 A0017496   .A    11.82 23:59_11/12/2016  2.06   READING IS WRONG
jdobres
  • 11,339
  • 1
  • 17
  • 37