I have tried to read the following file in R using read.csv and it seems to me that whenever the first line of the file doesn't contain the largest number of columns, read.csv reads it incorrectly. Specifically, when I put the record "CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51 VPS52 VPS53 VPS54 VTC4" in the first line of my file then read.csv reads the file correctly into a 7-row table.

My file:

CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51 VPS52 VPS53 VPS54 VTC4
ERV14 HLJ1 ILM1 KRE1 PER1
BST1 ERV14 ERV25 HLJ1 KIN3 KRE1 LAS21 PER1 VPS38
ANP1 CWH43 ERV14 HLJ1 LAS21 PER1 SUR4 VPS51
CCW12 ERD1 ERV14 OST3 PER1 PMT2 SUM1 SUR4 TED1
ERV14 PER1 SEC66 SSH1 SUR4 VPS51
CCW12 PER1 PMT2 RPN4 SKY1 SUR4 TED1


y=read.csv("./file.txt", sep=" ", header=FALSE)
y
     V1    V2    V3   V4    V5    V6    V7    V8    V9   V10   V11   V12  V13
1 CCW12 ERV14  PER1 PTK2  RPN4 SEC66  SKY1  SUR4 VPS51 VPS52 VPS53 VPS54 VTC4
2 ERV14  HLJ1  ILM1 KRE1  PER1                                               
3  BST1 ERV14 ERV25 HLJ1  KIN3  KRE1 LAS21  PER1 VPS38                       
4  ANP1 CWH43 ERV14 HLJ1 LAS21  PER1  SUR4 VPS51                             
5 CCW12  ERD1 ERV14 OST3  PER1  PMT2  SUM1  SUR4  TED1                       
6 ERV14  PER1 SEC66 SSH1  SUR4 VPS51                                         
7 CCW12  PER1  PMT2 RPN4  SKY1  SUR4  TED1

But when I put that record anywhere else, read.csv breaks it into two rows, one of which contains the items {CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51} and the other {VPS52 VPS53 VPS54 VTC4}.

My file after I moved the first line to another place:

ERV14 HLJ1 ILM1 KRE1 PER1
BST1 ERV14 ERV25 HLJ1 KIN3 KRE1 LAS21 PER1 VPS38
ANP1 CWH43 ERV14 HLJ1 LAS21 PER1 SUR4 VPS51
CCW12 ERD1 ERV14 OST3 PER1 PMT2 SUM1 SUR4 TED1
ERV14 PER1 SEC66 SSH1 SUR4 VPS51
CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51 VPS52 VPS53 VPS54 VTC4
CCW12 PER1 PMT2 RPN4 SKY1 SUR4 TED1

y=read.csv("./file.txt", sep=" ", header=FALSE)
y
     V1    V2    V3   V4    V5    V6    V7    V8    V9
1 ERV14  HLJ1  ILM1 KRE1  PER1  
2  BST1 ERV14 ERV25 HLJ1  KIN3  KRE1 LAS21  PER1 VPS38
3  ANP1 CWH43 ERV14 HLJ1 LAS21  PER1  SUR4 VPS51      
4 CCW12  ERD1 ERV14 OST3  PER1  PMT2  SUM1  SUR4  TED1
5 ERV14  PER1 SEC66 SSH1  SUR4 VPS51                  
6 CCW12 ERV14  PER1 PTK2  RPN4 SEC66  SKY1  SUR4 VPS51
7 VPS52 VPS53 VPS54 VTC4                              
8 CCW12  PER1  PMT2 RPN4  SKY1  SUR4  TED1

I have checked with vim that there are no invisible or weird characters in my file other than the spaces between items in a record and the end-of-line characters at the end of each line. So am I doing something wrong, or is it an R problem?

I have seen one post that raises the same issue but couldn't find much help from there.

user2426277

1 Answer

First of all, it's a bit odd to use read.csv when you don't actually have comma-separated values. read.table is a more natural choice.

But the main problem is that you don't have rectangular data. read.table and read.csv both output a data.frame, so they assume each row has the same number of columns. R looks at the first five lines of a file to figure out how many columns that is. So if your longest line is after this "peeking" zone, then R won't expect that many columns. If you do know the maximum number of columns that your data has, you can specify a vector of that length to colClasses. So if the longest line has 30 values and they are all character, you can specify colClasses=rep("character",30).
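If you don't know the maximum in advance, one option is to count the fields on every line first. The sketch below is only an illustration, not necessarily the best fit for your data: it recreates the second file from the question in a temporary location (the `path` and `n_col` names are made up for the example), uses count.fields to find the widest line, and passes that many names through col.names, which the read.table help page describes as another way to set the number of columns.

```r
# Recreate the second file from the question in a temporary location.
lines <- c(
  "ERV14 HLJ1 ILM1 KRE1 PER1",
  "BST1 ERV14 ERV25 HLJ1 KIN3 KRE1 LAS21 PER1 VPS38",
  "ANP1 CWH43 ERV14 HLJ1 LAS21 PER1 SUR4 VPS51",
  "CCW12 ERD1 ERV14 OST3 PER1 PMT2 SUM1 SUR4 TED1",
  "ERV14 PER1 SEC66 SSH1 SUR4 VPS51",
  "CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51 VPS52 VPS53 VPS54 VTC4",
  "CCW12 PER1 PMT2 RPN4 SKY1 SUR4 TED1"
)
path <- tempfile(fileext = ".txt")  # hypothetical stand-in for ./file.txt
writeLines(lines, path)

# Count the fields on every line, not only the first five.
n_col <- max(count.fields(path, sep = " "))

# Supplying that many column names lets read.table allocate enough columns;
# fill = TRUE pads the shorter rows with blank fields.
y <- read.table(path, sep = " ", header = FALSE, fill = TRUE,
                col.names = paste0("V", seq_len(n_col)),
                stringsAsFactors = FALSE)
dim(y)  # expected: 7 rows, 13 columns
```

With this, the long record should stay on a single row (row 6) instead of being split in two.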

It sounds like you might want to consider alternative ways to read in your data and store it. Perhaps readLines or scan might be better choices. And you can keep your data in a list rather than a data.frame.
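As a rough sketch of the list approach, assuming the items are always separated by a single space: read the raw lines with readLines and split each one with strsplit. The three sample records and the `path` and `records` names are only for illustration.

```r
# Write a few ragged records to a temporary file for the demonstration.
lines <- c(
  "ERV14 HLJ1 ILM1 KRE1 PER1",
  "CCW12 ERV14 PER1 PTK2 RPN4 SEC66 SKY1 SUR4 VPS51 VPS52 VPS53 VPS54 VTC4",
  "CCW12 PER1 PMT2 RPN4 SKY1 SUR4 TED1"
)
path <- tempfile(fileext = ".txt")  # hypothetical stand-in for ./file.txt
writeLines(lines, path)

# One list element per line, each a character vector of that line's items.
records <- strsplit(readLines(path), " ", fixed = TRUE)

lengths(records)  # number of items per record: 5 13 7
records[[2]]      # the long record stays in one piece
```

Because each record keeps its own length, nothing has to be padded or split, and you can still use lapply or sapply over the list.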

MrFlick
  • read.table doesn't work: it says line x did not have y elements – user2426277 May 27 '14 at 14:54
  • @user2426277 Well, `read.csv` just happens to have `fill=T` set by default. But that doesn't mean it's correct. `read.table` is telling you the real problem. – MrFlick May 27 '14 at 14:55