
I have a large text file that I want to import into R, with multimodal data encoded like this:

A=1,B=1,C=2,...
A=2,B=1,C=1,...
A=1,B=2,C=1,...

What I'd like to have is a data frame similar to this:

A    B    C
1    1    2
2    1    1
1    2    1

Because the column names are repeated on every row, I was wondering whether there is a way to import the text file with fscanf-like functionality, using a format such as "A=%d,B=%d,C=%d,..." to parse out the values for columns A, B, C.

Or maybe there's a simpler way using read.table or scan? I couldn't figure out how.

Thanks for any tips.

accpnt

1 Answer


1) read.pattern. The read.pattern function in the gsubfn package is very close to what you are asking for. Instead of %d, use (\\d+) when specifying the pattern; each parenthesized group becomes one column. If the column names are not important, the col.names argument can be omitted.

library(gsubfn)    
L <- c("A=1,B=1,C=2", "A=1,B=1,C=2", "A=1,B=1,C=2") # test input

pat <- "A=(\\d+),B=(\\d+),C=(\\d+)"
read.pattern(text = L, pattern = pat, col.names = unlist(strsplit(pat, "=.*?(,|$)")))

giving:

  A B C
1 1 1 2
2 1 1 2
3 1 1 2

1a) percent format. Just for fun, we could implement it using exactly the format given in the question, by translating each %d into the capture group (\\d+):

fmt <- "A=%d,B=%d,C=%d"
pat <- gsub("%d", "(\\\\d+)", fmt)

Now run the read.pattern statement above.
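For readers without gsubfn installed, the same format-to-pattern idea can be sketched in base R with regexec and regmatches. This is an alternative to read.pattern, not what it does internally, and it rests on some assumptions: the format contains only %d placeholders and no regex metacharacters, and every line matches the format. The sample lines are the ones from the question; keys, m and vals are illustrative names.

```r
fmt <- "A=%d,B=%d,C=%d"
L <- c("A=1,B=1,C=2", "A=2,B=1,C=1", "A=1,B=2,C=1")  # sample input from the question

# Turn each %d into a capture group for an (optionally negative) integer.
# Assumption: fmt contains no regex metacharacters besides the placeholders.
pat <- paste0("^", gsub("%d", "(-?[0-9]+)", fmt, fixed = TRUE), "$")

# regexec/regmatches return, per line, the full match followed by the groups
m <- regmatches(L, regexec(pat, L))

# Drop the full match, convert the groups to integers, stack one row per line.
# Assumption: every line matches; a non-matching line would yield an empty row.
vals <- do.call(rbind, lapply(m, function(x) as.integer(x[-1])))

# Column names: the text before each "=" in the format
keys <- sub("=.*", "", strsplit(fmt, ",", fixed = TRUE)[[1]])

DF <- as.data.frame(vals)
names(DF) <- keys
print(DF)
```

Anchoring the pattern with ^ and $ means a line with extra or missing fields fails to match rather than being partially parsed.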

2) strapply. Using the same input and the gsubfn package again, an alternative is to pull out every run of digits. This removes the need for the pat shown in (1) and reduces the pattern to just "\\d+".

# simplify = rbind stacks the values from each input line as one row
DF <- as.data.frame(strapply(L, "\\d+", as.numeric, simplify = rbind))
names(DF) <- unlist(strsplit(L[1], "=.*?(,|$)"))

3) read.csv. Even simpler is this base-R-only solution, which deletes the "name=" prefixes and reads what is left as ordinary CSV, setting the column names as in the prior solution. Again, omit the col.names argument if the column names are not important.

read.csv(text = gsub("\\w*=", "", L), header = FALSE,
  col.names = unlist(strsplit(L[1], "=.*?(,|$)")))
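As a minimal end-to-end sketch of (3) for data that lives in a file, the lines can be read with readLines first. The temporary file below is a hypothetical stand-in for the real data file, filled with the three rows from the question; keys is an illustrative name, and the column names are taken from the text before each "=" on the first line, which assumes every line lists the same columns in the same order.

```r
# Hypothetical sample file standing in for the real data
path <- tempfile(fileext = ".txt")
writeLines(c("A=1,B=1,C=2", "A=2,B=1,C=1", "A=1,B=2,C=1"), path)

L <- readLines(path)

# Column names: the text before each "=" on the first line
keys <- sub("=.*", "", strsplit(L[1], ",", fixed = TRUE)[[1]])

# Strip every "name=" prefix, then parse what is left as plain CSV
DF <- read.csv(text = gsub("\\w+=", "", L), header = FALSE, col.names = keys)
print(DF)
```

If the names derived from the first line do not line up with the number of fields actually parsed, dropping col.names and letting read.csv assign V1, V2, ... is a reasonable fallback.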
G. Grothendieck
  • Thanks a lot, I got it to work with 3). I used L = readLines('mydata.txt'). I had to disable col.names though. I'll sort it out later! – accpnt Mar 17 '18 at 17:18