Skipping to a certain position in a large .txt file

Question

I have over 100 .txt files on which I would like to do calculations. The files contain gaze data collected with an eye tracker.

The first part of each file is the calibration data. It contains only a limited number of variables. Every line looks like this (about 20,000 rows):

Event: Data - startTime 1563518990 endTime 1563619015 Gaze 885.638118989 316.57751978

The second part of each file contains the actual gaze data collected during a gaze test. It contains more variables, and these are the ones I'm interested in. It looks like this:

Gaze Data - IviewTimestamp 649261961 OpenSesameTimeStamp 55191.0 GazeLeft 0.0 0.0 GazeRight 0.0 0.0 DistanceRight 530.630058679 DiameterLeft 4.89342033646 DiamaterRight 4.44607910548

However, when I use the function read_table2, it only finds the variables gathered during the calibration process. This is because read_table2 looks only at the first 1000 rows of the .txt file to determine the variables. I would like it to skip to the first line that contains "IviewTimestamp", so that it imports only the relevant part of the .txt file and automatically finds the right variables. Since the calibration length differs between subjects, it's not possible to skip a fixed number of lines.

How would I do this?

Ben Bolker · Answer 1 · 2018-04-09T12:39:14.713

Approximately: use grep() to find the first location of the desired string, then use read_table2's skip argument.

firstline <- grep("IviewTimestamp",readLines("file.txt"))[1]

readLines() reads the entire text of the file, as a character vector (one element per line of the file); grep returns the indices of the lines that contain the specified character string (or regular expression); [1] extracts the index of the first line containing the string. Now you can use this to find the right position to start reading:

read_table2("file.txt", skip=firstline-1)
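Since there are 100+ files, the two steps can be wrapped in one function and applied to each file. The following is only a minimal sketch: read_gaze and its marker argument are names made up for illustration, and base R's read.table stands in for read_table2 so that the example runs without extra packages.

```r
# Hypothetical helper (not part of any package): find the first gaze row,
# then read the file from that row onwards.
read_gaze <- function(path, marker = "IviewTimestamp") {
  lines <- readLines(path)
  # fixed = TRUE matches the marker literally; [1] keeps only the first hit
  first <- grep(marker, lines, fixed = TRUE)[1]
  if (is.na(first)) stop("marker not found in ", path)
  # skip = first - 1 drops every line before the first gaze row
  read.table(path, skip = first - 1, header = FALSE, stringsAsFactors = FALSE)
}

# Possible usage over a folder ("data" is a placeholder directory name):
# gaze <- lapply(list.files("data", pattern = "\\.txt$", full.names = TRUE), read_gaze)
```

The explicit is.na() check is there because grep(...)[1] returns NA when a file has no gaze section, and failing early seems easier to debug than passing NA on as skip.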

This is inefficient (since you need to read the file twice), but I guess it would cost you less than a second per file. A clunkier but more efficient solution that would work on a Unix or Unix-alike OS would be to use system() to run an external (more efficient) grep command.
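The external-grep variant might look roughly like the sketch below. It assumes a grep on the PATH that supports -n, -m and -F (GNU and BSD grep both do), and first_match_line is a made-up helper name. I use system2() here, a close relative of system(), because it takes the arguments as a vector.

```r
# Hypothetical helper: ask the external grep for the first matching line number.
first_match_line <- function(path, pattern = "IviewTimestamp") {
  # -n prefixes each match with its line number, -m 1 stops after the first
  # match, -F treats the pattern as a fixed string rather than a regex.
  args <- c("-n", "-m", "1", "-F", shQuote(pattern), shQuote(path))
  # grep exits with status 1 when nothing matches, which R reports as a warning
  out <- suppressWarnings(system2("grep", args, stdout = TRUE))
  if (length(out) == 0) return(NA_integer_)
  # Output looks like "20001:Gaze Data - ..."; keep the part before the colon
  as.integer(sub(":.*$", "", out[1]))
}
```

The result can then be used in the same way as firstline above, i.e. skip = first_match_line("file.txt") - 1.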

Thanks @Ben Bolker, it works! did not know about the readLines function. What does the [1] do? — Bart R, Apr 09 '18 at 12:28

score 0 · Accepted Answer · answered Apr 09 '18 at 12:30

I'd suggest that you import the data and tidy it afterwards, rather than reading it twice.

First, import all the files in your directory with:

library(readr)   # provides read_table2
library(dplyr)
library(purrr)
# path is the directory that holds the .txt files; pattern is a regular expression
df <- map_df(list.files(path = path, pattern = "\\.txt$", full.names = TRUE), read_table2)

Note that you can pass optional arguments such as col_names to read_table2 by adding them after it in the map_df call.

Once all of your text files have been imported they can be filtered:

# timeStampColumnName is a placeholder for the column that holds the "IviewTimestamp" label
filter(df, timeStampColumnName == "IviewTimestamp")
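For a single file, the import-then-filter idea could be sketched in base R as follows. This rests on assumptions: read_then_filter is a made-up name, and the label is taken to be the 4th whitespace-separated field, as in the sample lines in the question. The column count is taken from the widest row so that the shorter calibration rows are padded rather than the longer gaze rows being wrapped.

```r
# Hypothetical helper: read every row, then keep only the gaze rows.
read_then_filter <- function(path, marker = "IviewTimestamp") {
  # Number of fields in the widest row of the file
  n_cols <- max(count.fields(path), na.rm = TRUE)
  # fill = TRUE pads rows that have fewer fields than n_cols
  all_rows <- read.table(path, fill = TRUE,
                         col.names = paste0("V", seq_len(n_cols)),
                         stringsAsFactors = FALSE)
  # Assumption: the label sits in the 4th field ("startTime" vs "IviewTimestamp")
  all_rows[all_rows$V4 == marker, , drop = FALSE]
}
```

Note that count.fields() also scans the whole file, so this still touches the data twice.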
