2

I've got a flatfile, fixed width with neither newline nor linefeed (dump from AS400).

How do I load this file into an R data.frame?

I've tried different combinations of textConnection and read.fwf, to no avail.

The code below crashes Rstudio, so I'm assuming I'm overloading the system.

len below is 24376400, which is tame as far as the files I usually load using read.table. Record length is 400.

Is there any RECLEN parameter I should set, similar to SAS? Is there an option to set EOL = "\n" or "\r\n" ? Thank you.

fname <- "AS400FILE.TXT"
len <- file.info(fname)$size
conn <- file(fname, 'r')
contents <- readChar(conn, len)
close(conn)

df <- read.fwf( textConnection(contents) , widths=layout$length , sep="")

> dput(layout)
structure(list(start = c(1L, 41L, 81L, 121L, 161L, 201L, 224L, 
226L, 231L, 235L, 237L, 238L, 240L, 280L, 290L, 300L, 305L, 308L, 
309L, 330L, 335L, 337L, 349L, 350L, 351L, 355L, 365L), end = c(40L, 
80L, 120L, 160L, 200L, 223L, 225L, 230L, 234L, 236L, 237L, 239L, 
279L, 289L, 299L, 304L, 307L, 308L, 329L, 334L, 336L, 348L, 349L, 
350L, 354L, 364L, 400L), length = c(40L, 40L, 40L, 40L, 40L, 
23L, 2L, 5L, 4L, 2L, 1L, 2L, 40L, 10L, 10L, 5L, 3L, 1L, 21L, 
5L, 2L, 12L, 1L, 1L, 4L, 10L, 36L), label = c("TITLE", "SUFFIX", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE", 
"ZIP", "ZIP+4", "DELIVERY", "CHECKD", "FILLER", "NAME", 
"SOURCECODE", "ID", "FILLER", "BATCH", "FILLER", "FILLER", 
"GRID", "LOT", "FILLER", "CONTROL", 
"ZIPIND", "TROUTE", "SOURCEA", "FILLER")), .Names = c("start", 
"end", "length", "label"), class = "data.frame", row.names = c(NA, 
-27L))
> dim(layout)
[1] 27  4
> 
M.Dimo
  • 421
  • 2
  • 4
  • 11
  • Can you give a small snippet of your text file? I notice you have `sep=""` - so it's (e.g.) raw data `08091011`, width 2 --> `08`,`09`,`10`,`11`? What's `layout$length`? A vector? integer? – mathematical.coffee Feb 27 '12 at 03:00
  • layout is a data frame with field name and field width: – M.Dimo Feb 27 '12 at 03:02
  • The text file is name & address data with routing info: total 27 fields. The layout variable is a data frame that includes field $length. – M.Dimo Feb 27 '12 at 03:09
  • So, to reiterate my question(s), can you give a small snippet of your text file (update your question with it)? (It doesn't matter how many fields you have so much as the separator & format of those fields.). Also, you say that there are no newlines or linefeeds in the entire file, and yet then ask for an option to set end of line equal to "\n" - which is it (that is why I ask for a snippet of the file). And is `layout$length` a single integer or a vector of integers, all the same or different (and what value does it have)? – mathematical.coffee Feb 27 '12 at 03:17
  • Name & address data: I can't share that. Trust me, it's pure ASCII and there are no separators of any kind: I sed'ed and awked the file to insert a \n after the 400th positions to move forward, but I'm sure R can handle this from within. The RECLEN and EOL questions are suggestions for the R solution, which I'm still researching. – M.Dimo Feb 27 '12 at 03:25

1 Answers1

3

You could use readChar for this.

First make up some sample data (I think the format is as you describe as far as I can tell from the question? i.e. wall of text with a specified width per column, no new lines in the entire file):

lengths <- c(2,3,4,2,3,4)
nFields <- length(lengths)
nRows   <- 10              # let's make 10 rows.
contents <- paste(letters[sample.int(26,size=sum(lengths)*nRows,replace=TRUE)],
                  collapse="")
#> contents
#[1] "lepajmcgcqooekmedjprkmmicm.......
cat(contents,file='test.txt')

I can think of 3 ways to do it, various differences between each:

If you know the number of rows in advance you can do:

# If you know #rows in advance..
conn <- file('test.txt','r')
data <- readChar( conn, rep(lengths,nRows) )
close(conn)
# reshape data to dataframe
df <- data.frame(matrix(data,ncol=nFields,byrow=T))

Otherwise you can use a loop (why read in the file once to work out the number of rows and then again to parse?)

# Otherwise use a loop
conn <- file('test.txt','r')
df <- data.frame(matrix(nrow=0,ncol=6)) # initialise 0-row data frame
while ( length(data <- readChar(conn, lengths)) > 0 ) {
    df[nrow(df)+1,] <- data
}
close(conn)

Or, since you already have all of contents in a string, you can just split the string using substring:

# have already read in contents so can calculate nRows
nRows <- floor(nchar(contents)/sum(lengths)) # 10 for my example
starts <- c(0,cumsum(lengths[-nFields]))
df3 <- data.frame(t(
                    vapply( seq(1,nRows*sum(lengths),sum(lengths)),
                    function(r) 
                        substring(contents,starts+r,starts+r+lengths-1),
                    rep("",nFields) )))

If you want to do it in as little file reads as possible, I suggest the second or third methods.

The third method "feels" most elegant to me, but requires you to read in the entire contents all at once, which, depending on file size, may not be viable.

If that's the case I'd go for the second, which only reads in one set of nFields fields at a time.

I don't recommend the first, unless you know the number of rows in advance - it was just my first attempt. I don't recommend it because you have to first read in the file to determine the number of rows, and then you close it and read it in again. If you want to go down that route then just use method 3! However, if you know by some other means the number of rows in advance, then you could use this method.

mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • 1
    Pointing out the correct usage of readChar (using rep(lengths,nRows) rather than the file length) really helped. I wasn't looking at this the right way. Thanks for your help. – M.Dimo Feb 27 '12 at 11:42
  • 1
    Update: the approaches suggested above do work. Processing time gets quickly unpractical as file size grows however. I'll be using >fold -400 AS400DATA.TXT | etc... prior to loading within R. Thanks. – M.Dimo Feb 27 '12 at 16:15