
I am trying to import a series of custom data files into R.

The files are organized into blocks, which are marked by XML-like markup tags. I understand that these are not true XML files and that they contain no markup-language definition.

Each block may be a single line or a tab-delimited matrix. Comments tend to be marked by a %.

The files are ~10K lines long and I need around 2700 lines from each of them, so I would rather avoid loops. The file length and the number of lines I need also vary somewhat unpredictably.

I have tried a number of the methods from the XML package, but I always get a bunch of errors such as "StartTag: invalid element name" and "Premature end of data in tag MERGED-PUPIL-DATA line 5443".

Do you have any ideas? Are there any methods that accept custom markup tags?

A typical file may look something like this (dots indicate stuff I cut out):

<SESSION>
<VERSION>
2
<\VERSION>
<DATE>
2014-01-20 14:29:43
<\DATE>
<SUBJECT-ID>
SUB001
<\SUBJECT-ID>
<NOTE>
red300os
<\NOTE>
<MIN-MAX-PLOT>
0.100000 8707.554688
<\MIN-MAX-PLOT>
<STIMULUS-DEFINED>
redOS300
Default Human Relative Spectral Sensitivity
1   0
1   10.000000   20.000000   60.000000   1   3   2.000000    -100.000000 0.000000    0.000000    1
<\STIMULUS-DEFINED>
.
.
.
.
.
.
<MERGED-PUPIL-DATA>
% time is in sec; diameter is in mm; loci is in pixel; color code -> 100 = unknown, 0 = white, 1 = red, 2 = green, 3 = blue; intensity is in Lux or W/m2
% real time logical time    R. valid    R. diameter R. x loci    R. y loci  L. valid    L. diameter L. x loci    L. y loci  R. led color     R. led intensity   L. led color    L. led intensity
2703
-0.049000   -0.049000   1   5.483765    266.668732  268.837402  1   5.441666    272.687500  272.724976  100 0.000000    100 0.000000
-0.018000   -0.018000   1   5.478448    265.918732  267.837402  1   5.438361    270.687500  273.406219  100 0.000000    100 0.000000
.
.
.
.
89.932000   89.932000   1   5.604879    289.575165  273.574738  1   5.255306    301.056091  303.812744  3   0.000000    3   0.000000
89.964000   89.964000   1   5.650856    289.575165  269.574738  1   5.255306    301.056091  301.812744  3   0.000000    3   0.000000
<\MERGED-PUPIL-DATA>
.
.
.
<\SESSION>
  • I think it would be more like valid XML if the <\TAG>s were </TAG>s – Spacedman Jan 21 '14 at 17:33
  • If external tools are acceptable and you have a unix toolset I'd use `awk` to match the sections you want and cut them out to new files which R can read. – Spacedman Jan 21 '14 at 17:53
  • Can you post a link to the full dataset? I think one of your problems is the line `color code -> 100=unknown,`. The XML package doesn't like `<` or `>` in element text. – jlhoward Jan 21 '14 at 18:16

2 Answers


The wrong-way slashes are going to thwart any attempt to use XML processing unless you first do a search and replace. The other approach is to read the file in as lines and search for tags.
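For completeness, here is a minimal sketch of that search-and-replace route (my illustration, not tested on your full files): rewrite the backslash closers and then hand the result to the XML package. It assumes the backslashes are the only deviation from well-formed XML, which may not hold if, say, a %-comment line contains a bare <.

library(XML)

# Sketch only: turn <\TAG> closers into </TAG>, then parse as XML.
# "dummy.txt" is a stand-in filename.
txt <- readLines("dummy.txt")
txt <- gsub("^<\\\\", "</", txt)
doc <- xmlParse(paste(txt, collapse = "\n"), asText = TRUE)
xmlValue(getNodeSet(doc, "//DATE")[[1]])

The rest of this answer takes the second, line-based approach.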

Read the data file:

txt = readLines("dummy.txt")

Here's a function that returns text between matching tags, as a list in case there's more than one section:

getSection <- function(txt, tag){
    # anchored patterns for the opening <TAG> and closing <\TAG> lines
    start <- paste0("^<", tag, ">$")
    end <- paste0("^<\\\\", tag, ">$")
    startLines <- grep(start, txt)
    endLines <- grep(end, txt)
    # return the lines between each matching pair of tags
    lapply(seq_along(startLines), function(i){
        txt[(startLines[i] + 1):(endLines[i] - 1)]
    })
}

So for example with a test file that has:

<DATE>
2014-01-20 14:29:43
<\DATE>
<DATE>
Never!
<\DATE>

I get:

> getSection(txt,"DATE")
[[1]]
[1] "2014-01-20 14:29:43"

[[2]]
[1] "Never!"

I suggest you write functions that wrap this for the various sections you want to parse. For example, I've slightly edited your file to give this section a bit more regularity:

<STIMULUS-DEFINED>
redOS300
Default Human Relative Spectral Sensitivity
1   10.000000   20.000000   60.000000   1   
3   2.000000    -100.000000 0.000000    0.000000 
<\STIMULUS-DEFINED>

and then written:

getStimulusDefined <- function(lines){
    section <- getSection(lines, "STIMULUS-DEFINED")[[1]] # only one of these
    # skip the stimulus name; the next line becomes the column header
    read.table(textConnection(section), skip = 1, header = TRUE)
}

So I can then do:

> getStimulusDefined(txt)
  Default Human Relative Spectral Sensitivity
1       1    10       20       60           1
2       3     2     -100        0           0

and I get a data frame back (you'll need to rewrite this based on your understanding of that section).

It'll do odd things if tags are nested, but I doubt this file format will have that.
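If you do want to guard against that, a small check (my addition, not in the original function) placed just before the lapply() in getSection will fail loudly when the opening and closing tags don't pair up cleanly:

# my addition: sanity check that open/close tags pair up one-to-one
stopifnot(length(startLines) == length(endLines),
          all(startLines < endLines))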

Is it fast enough/efficient enough? We won't know until you try it on your data, but it is at least a solution.

Spacedman
  • Thank you for the answer. Your methods proved quite efficient. I had assumed (wrongly!) that any method involving readLines would be too slow. There is a slight error in your function, though, as it uses the arguments "lines" and "tag" in the definition, but "txt" and "tag" in the statements. I assume "lines" should be "txt" in both places. – Almighty Shintru Jan 22 '14 at 08:47
  • Yes, of course: when I was developing this I had read the text into `txt`, and the function was matching on that instead of on the named `lines` argument. Edited. Good spot. – Spacedman Jan 22 '14 at 09:23

Sorry, I'm making a complete mess here, but I am new to Stack Overflow. I wanted to expand a bit on Spacedman's excellent answer but couldn't fit my code in a comment.

I've altered Spacedman's function to make a more generic function to read data frames.

The startSkip and endSkip arguments can be used to ignore lines at the start and end of each block.

It seems to work pretty fast, on my files at least.

getSection <- function(file, tag, startSkip = 0, endSkip = 0){
  txt <- readLines(file)
  start <- paste0("^<", tag, ">$")
  end <- paste0("^<\\\\", tag, ">$")
  startLines <- grep(start, txt)
  endLines <- grep(end, txt)
  # number of data rows between the tags, after dropping skipped lines
  noLines <- endLines - startLines - startSkip - endSkip - 1
  # re-read just that block of the file as a data frame
  read.table(file, skip = startLines + startSkip, nrows = noLines)
}
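For example, on a file like the one in the question, the pupil data could be pulled out like this (a hypothetical call; startSkip = 3 jumps over the two %-comment lines and the row count at the top of the block):

pupil <- getSection("dummy.txt", "MERGED-PUPIL-DATA", startSkip = 3)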
  • I separated the reading of the file from the extracting of the parts because then you don't need to read the whole file in every time to extract each tagged part. – Spacedman Jan 22 '14 at 09:21
  • That's a good point. Although it seems to run pretty fast even if I read the file each time, it would be better to avoid that. I just don't quite know how to read a data frame without running read.table() over the whole file. – Almighty Shintru Jan 22 '14 at 10:22
  • Sorry, I see you already have a solution for that. I guess I should have read your answer better! – Almighty Shintru Jan 22 '14 at 10:27
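For reference, read.table() can also parse lines that are already in memory via its text argument, which gives one way to combine the two answers: read the file once, then build data frames from the extracted sections. A sketch (my naming, using the list-returning getSection(txt, tag) from the first answer):

# Sketch: read the file once, then turn any tagged block into a data frame.
getSectionDF <- function(txt, tag, startSkip = 0, endSkip = 0){
  section <- getSection(txt, tag)[[1]]
  n <- length(section)
  read.table(text = section[(1 + startSkip):(n - endSkip)])
}

txt <- readLines("dummy.txt")
pupil <- getSectionDF(txt, "MERGED-PUPIL-DATA", startSkip = 3)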