1

The following data is read in with readLines(). It need to be converted into a data frame. The problem is the 3rd column is sometimes blank. Is there a way to split the column by fixed length like in Excel, or other way to handle the 3rd column? Thanks for any help.

   CSTE   DFHTACP             255                     N         0        0        0        0        0           0
   CSTP   DFHZCSTP            255                     N         0        0        0        0        0           0
   CSXM   DFHXMAB               0                     N         0        0        0        0        0           0
   CSZI   DFHSZRMP              1                     N         0        0        0        0        0           0
   CTIN   DFHZCT1   DFHCOMCL  254                     N         0        0        0        0        0           0
   CTSD   DFHTSDQ   DFHTSDEL  254                     N         0        0        0        0        0           0
   CVMI   DFHMIRS               1                     N         0        0        0        0        0           0
   CWBA   DFHWBA               15                     N   6124306        0        0        0        0           0
  • I can show you how with python if it's any help... – yuvi Jan 12 '14 at 14:00
  • Sorry yuvi, no python, but thanks anyway. I'll try [Read fixed width text file](http://stackoverflow.com/questions/14383710/read-fixed-width-text-file) later. – Shange Changzhi Jan 12 '14 at 14:35
  • Can you define "blank" and what the source file is (e.g. text only)? There are different approaches depending on what the column delimiters are -- and there's also `read.fwf` – Carl Witthoft Jan 12 '14 at 15:38
  • Yes, it's text only. The blank is just spaces. Grothendieck's solution is great, it works well, including the read.fwf method. – Shange Changzhi Jan 14 '14 at 10:33

1 Answers1

2

1) read.fwf This can be done using read.fwf as mentioned in the comments. This is straight-forward although it requires tedious determination of the field widths.

w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)
read.fwf("myfile.dat", w)

The remaining methods assume there is at least one space between each field. That is at least the case in the sample input.

2) read.pattern() This requires the development version of gsubfn package. This does require a long regular expression although its a reasonably straight forward one matching spaces (" +"), non-spaces ("\\S+") and optional non-spaces ("\\S*").

library(gsubfn)
library(devtools) # source_url supports https whereas source only supports http
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo

# the third capture group is "\\S*" whereas the others are "\\S+"
pat_ <- rep(c("(\\S+) +", "(\\S*) +", "(\\S+) +", "(\\S+)"), c(2, 1, 7, 1))
pat <- paste0( c( "^ *", pat_, " *$"), collapse = "")
read.pattern("myfile.dat", pattern = pat, as.is = TRUE)

3) gsubfn() With slightly more work we can use gsubfn() in the same package in which case we can use the CRAN version of the package. Use pat defined above:

Lines <- readLines("myfile.dat")
tmp <- gsubfn(pat, ... ~ paste(..., sep = ","), Lines)
read.table(text = tmp, sep = ",", as.is = TRUE)

4) count.fields We can explicitly count fields in each line to get another solution. This requires no addon packages. It avoids manual field width counting but does require more code.

Lines <- readLines("myfile.dat")

k <- count.fields("myfile.dat")
Lines2 <- gsub("^ *| *$", "", Lines) # trim whitespace from beginning and end
Lines3 <- ifelse(k == 10, sub("^(\\S+ *\\S+)", "\\1,", Lines2), Lines2) # insert extra ,

Lines4 <- gsub(" +", ",", Lines3) # replace each string of spaces with a ,
read.table(text = Lines4, sep = ",", as.is = TRUE)

5) sub Slightly shorter and still without addon packages is this one. We use sub to insert an extra comma into the short lines and then replace each string of spaces with a comma and read:

Lines <- readLines("myfile.dat")
pat2 <- "^( *(\\S+ +){2})((\\S+ +){7}\\S+) *$"
g <- gsub(" +", ",", sub(pat2, "\\1,\\3", Lines))

read.table(text = g, sep = ",", as.is = TRUE)

REVISED Additional solutions.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341