
I have a file that looks like this:

a 1,2,3,5
b 4,5,6,7
c 5,6,7,8
...

The separator between the 1st and 2nd fields is '\t'; the other separators are commas. How can I read this kind of data set in as a data frame with 5 fields?

yliueagle

3 Answers


I'd probably do this.

read.table(text = gsub(",", "\t", readLines("file.txt")))
  V1 V2 V3 V4 V5
1  a  1  2  3  5
2  b  4  5  6  7
3  c  5  6  7  8

Unpacking that just a bit:

  • readLines() reads the file into R as a character vector with one element for each line.
  • gsub(",", "\t", ...) replaces every comma with a tab, so that now we've got lines with just one kind of separating character.
  • The text = argument to read.table() lets it know you are passing it a character vector to be read directly (rather than the name of a file containing your text data).
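As a self-contained sketch of those three steps, using an inline character vector in place of `readLines("file.txt")` so it runs without a file:

```r
# Stand-in for readLines("file.txt"): one element per line
lines <- c("a\t1,2,3,5", "b\t4,5,6,7", "c\t5,6,7,8")

# Replace every comma with a tab, leaving a single kind of separator
normalized <- gsub(",", "\t", lines)

# Parse the now uniformly tab-separated lines straight from the vector
df <- read.table(text = normalized)
dim(df)  # 3 rows, 5 columns
```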
Josh O'Brien
    Great answer, but I don't think `textConnection` is required here since `read.table` has a `text` argument that could be used instead. – A5C1D2H2I1M1N2O1R2T1 May 09 '14 at 16:27
  • @AnandaMahto - Very cool! I had always assumed that `text=` expected a single character string, but I just tested your suggestion and can confirm that it will indeed take a character vector of multiple elements. – Josh O'Brien May 09 '14 at 16:29
  • I had the same question and also my characters have "{" and "}" which I want to replace by blank. Could it be done in the same gsub()? – Abhishek Singh Oct 25 '18 at 16:48
  • @AbhishekSingh Yes, you can do that by using a pair of brackets (i.e. `[` and `]`) to specify a "character class", a list of characters to be matched. In your case, you would do something like this: `gsub("[,{}]", "\t", readLines("file.txt"))`. – Josh O'Brien Oct 25 '18 at 18:01
  • @JoshO'Brien I also have to remove colon at the same time. Can I concatenate colon and curly brackets in the same command? – Abhishek Singh Oct 25 '18 at 18:10
  • @AbhishekSingh I'd recommend **trying** it out on your own, by creating a small example (like the `"file.txt"` used in this example) and then, for instance, adding a colon to the list of characters between the brackets. For help on R's regular expression syntax, see `?regex`. Best of luck. – Josh O'Brien Oct 26 '18 at 15:30
  • Thanks Josh. I figured later that it is a JSON file which is supposed to be read differently. Thanks again :) – Abhishek Singh Oct 27 '18 at 16:22
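To make the character-class idea from that comment thread concrete, here is a minimal sketch (the sample string is made up for illustration) that replaces commas, curly braces, and a colon in a single `gsub()` call:

```r
line <- "a\t{1,2,3}:5"

# [,{}:] is a character class: it matches a comma, either curly brace, or a colon
gsub("[,{}:]", "\t", line)
# returns "a\t\t1\t2\t3\t\t5"
```

Note that replacing a brace with a tab leaves empty fields between consecutive tabs; to delete those characters outright, replace with `""` instead.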

"Balanced" data

Judging by the way you've phrased your question, it seems that you know that your data are "balanced" (rectangular).

Are you looking for speedier options? You might want to combine fread from "data.table" with my experimental concat.split.DT function.

The solution would look something like (replace " " with "\t" for a tab):

concat.split.DT(fread("yourfile.txt", sep = " ", header=FALSE), "V2", ",")

Let's make up some data:

x <- c("a\t1,2,3,5", "b\t4,5,6,7","c\t5,6,7,8")
X <- c(replicate(10000, x))
temp <- tempfile()
writeLines(X, temp, sep="\n") ## Write it to a temporary file

Josh's answer:

system.time(out1 <- read.table(text = gsub(",", "\t", readLines(temp))))
#    user  system elapsed 
#   0.679   0.000   0.676 
head(out1)
#   V1 V2 V3 V4 V5
# 1  a  1  2  3  5
# 2  b  4  5  6  7
# 3  c  5  6  7  8
# 4  a  1  2  3  5
# 5  b  4  5  6  7
# 6  c  5  6  7  8
dim(out1)
# [1] 30000     5

fread + concat.split.DT (which is like using fread twice, but is still super fast):

system.time(out2 <- concat.split.DT(fread(temp, sep = "\t", header=FALSE), "V2", ","))
#    user  system elapsed 
#   0.027   0.000   0.028 
head(out2)
#    V1 V2_1 V2_2 V2_3 V2_4
# 1:  a    1    2    3    5
# 2:  b    4    5    6    7
# 3:  c    5    6    7    8
# 4:  a    1    2    3    5
# 5:  b    4    5    6    7
# 6:  c    5    6    7    8
dim(out2)
# [1] 30000     5

"Unbalanced" data

Although it doesn't apply to your problem, I should mention this for the benefit of others who might need to solve a similar problem:

One limitation of the above is that concat.split.DT only handles "balanced" data. fread doesn't have a fill argument like read.table does (and I seem to remember reading somewhere that it most likely won't have such an argument).

Here's an example of what I mean by unbalanced:

x2 <- c("a\t1,2,3,5,6,7", "b\t4,5,6,7","c\t5,6,7,8,9,10,11,12,13")
X2 <- c(replicate(10000, x2))
temp2 <- tempfile()
writeLines(X2, temp2, sep="\n")

read.table can handle that with the fill = TRUE argument:

system.time(out1b <- read.table(text = gsub(",", "\t", readLines(temp2)), fill=TRUE))
#    user  system elapsed 
#   1.151   0.000   1.152 
head(out1b)
#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1  a  1  2  3  5  6  7 NA NA  NA
# 2  b  4  5  6  7 NA NA NA NA  NA
# 3  c  5  6  7  8  9 10 11 12  13
# 4  a  1  2  3  5  6  7 NA NA  NA
# 5  b  4  5  6  7 NA NA NA NA  NA
# 6  c  5  6  7  8  9 10 11 12  13

concat.split.DT will give you a nasty error in such cases, but you can try my cSplit function instead. It's not nearly as fast, but still performs decently:

system.time(out2b <- cSplit(fread(temp2, sep = "\t", header=FALSE), "V2", ","))
#    user  system elapsed 
#   0.393   0.004   0.399 
head(out2b)
#    V1 V2_1 V2_2 V2_3 V2_4 V2_5 V2_6 V2_7 V2_8 V2_9
# 1:  a    1    2    3    5    6    7   NA   NA   NA
# 2:  b    4    5    6    7   NA   NA   NA   NA   NA
# 3:  c    5    6    7    8    9   10   11   12   13
# 4:  a    1    2    3    5    6    7   NA   NA   NA
# 5:  b    4    5    6    7   NA   NA   NA   NA   NA
# 6:  c    5    6    7    8    9   10   11   12   13
A5C1D2H2I1M1N2O1R2T1
  • The option for unbalanced datasets scales *much better* than the `read.table` option, so the comparison on 30K rows doesn't really show it off very well. Change the replications to 100K and do a comparison with that. Also, `cSplit` has a fun feature that lets you create "long" datasets on the fly :-) – A5C1D2H2I1M1N2O1R2T1 May 09 '14 at 16:48
  • On my system, a 300K row "file" took 160.323 seconds with `read.table` and 2.769 seconds with my `cSplit` + `fread` approach. – A5C1D2H2I1M1N2O1R2T1 May 09 '14 at 16:52
import java.util.Scanner;

Scanner scan = new Scanner(file);   // may throw FileNotFoundException
while (scan.hasNextLine()) {
    // Use "\t" (a real tab), not "\\t": String.replace takes literal text, not a regex
    String[] a = scan.nextLine().replace("\t", ",").split(",");
    //do something with the array
}
scan.close();

This does the following:

  1. create a Scanner to process the file (Scanner scan)
  2. read in the next line (scan.nextLine()) for each line of the file, driven by hasNextLine()
  3. replace tabs with commas (.replace("\t", ",")), so the separators are all the same
  4. split into an array on commas; now you can process all the data alike, regardless of the length of each line
  5. close the scanner when you're done
La-comadreja