
ANSWERED: Thank you so much Bob, ffs the issue was not specifying comment='#'. Why this works, when 'skip' should've skipped the offending lines remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
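For anyone skimming, here is the fix described above, sketched against the pipeline from the question (same directory layout assumed; this is a sketch, not verbatim code from the thread):

```r
library(tidyverse)  # for %>%, map_df(), and read_tsv()

# Let readr treat the '#'-prefixed metadata lines as comments,
# rather than skipping a fixed number of lines.
dat <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    ) %>%
    map_df(~read_tsv(., comment = '#'))
```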

Hey folks,

this has been a demon on my back for ages.

The data I work with is always a collection of tab-delimited .txt files, so my analysis always begins with gathering the file paths to each and feeding those into read.csv(), then binding the results into a df.

library(tidyverse)  # for %>% and map_df()

dat <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    ) %>%
    map_df( ~read.csv( ., sep='\t', skip=16) )  # actual data begins at line 16

This does exactly what I want, but I've been transitioning to tidyverse over the last few years.

I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But for consistency's sake I'd rather use readr.

When I do the same, but sub readr::read_tsv(), i.e.,

dat <- 
    .... same call to list.files()
    %>%
    map_df( ~read_tsv( ., skip=16 ))

I always get an empty (0x0) tibble. But it seems to be 'reading' the data, because I get a 'Parsed with column specification: cols()' message printed for every file.

Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.

So... what am I doing wrong here?

Thanks in advance!

edit: an example snippet of (one of) my data files was requested; hope this formats well!

# KLIBS INFO
#  > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
#  > Trials Per Block: 72
#  > Blocks Per Experiment: 8
#
# SYSTEM INFO
#  > Operating System: macOS 10.13.4
#  > Python Version: 2.7.15
#
# DISPLAY INFO
#  > Screen Size: 21.5" diagonal
#  > Resolution: 1920x1080 @ 60Hz
#  > View Distance: 57 cm

PID search_type stimulus_type   present_absent  response    rt  error
3   time    COLOUR  present absent  5457.863881 TRUE
3   time    COLOUR  absent  absent  5357.009108 FALSE
3   time    COLOUR  present present 2870.76412  FALSE
3   time    COLOUR  absent  absent  5391.404728 FALSE
3   time    COLOUR  present present 2686.6131   FALSE
3   time    COLOUR  absent  absent  5306.652878 FALSE
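Since pasted snippets tend to lose their tabs, the sample above can be reproduced with explicit \t characters; readr treats a string containing a newline as literal data, so a sketch of the eventual comment = '#' fix on this sample (abbreviated to two metadata lines and one data row) would be:

```r
library(readr)

# Rebuild the snippet with explicit tabs; a string with an embedded
# newline is read by readr as literal data rather than a file path.
snippet <- paste(
  "# KLIBS INFO",
  "#  > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d",
  "",
  "PID\tsearch_type\tstimulus_type\tpresent_absent\tresponse\trt\terror",
  "3\ttime\tCOLOUR\tpresent\tabsent\t5457.863881\tTRUE",
  sep = "\n"
)
read_tsv(snippet, comment = "#")  # a 1 x 7 tibble with the intended header
```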

edit: Using Jakub's suggestion

files <- list.files(
    path = 'data',
    pattern = '*.txt',
    full.names = TRUE,
    recursive = TRUE
    )

for (i in 1:length(files)) {
    print(read_tsv(files[i], skip=16))
}

prints:

Parsed with column specification:
cols()
# A tibble: 0 x 0

... for each file

If I print files, I do get the correct list of file paths. If I remove skip=16 I get:

Parsed with column specification:
cols(
  `# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col  expected     actual                                     file
 15  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 16  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 17  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 18  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
 19  -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.

... for each file
  • I must ask the obvious question; why are you saving those files as tsv? Try this: copy one of the tsv files and rename it with a csv extension. Then, using just one line of code, try reading it: df <- read.csv("path/to/file"). If you want a tibble, it's one line of code too: dt <- read_csv("path/to/file") – Gray Aug 10 '20 at 17:42
  • They're being saved as .txt (not .tsv, assuming that wasn't a typo), where I didn't write the program I'm stuck with its output. I tried changing the extensions like you suggested, then tried importing it using my second example (_tsv() ), as well as using read_csv(), both result in the same error. – Brett Aug 10 '20 at 17:53
  • Could you copy & paste and provide a sample of the txt file data? Did you use parentheses within your read code, e.g. read.csv("path/to/code()")? – Gray Aug 10 '20 at 17:58
  • Added some example data to my post. No, no paren's within read code, my working code is exactly as contained in my first example. – Brett Aug 10 '20 at 18:12
  • Thanks for the code sample, it helps. By any chance, do you have MS Excel on your PC? And what names do you want for the columns? – Gray Aug 10 '20 at 18:23
  • Sure do (have excel). I would like to retain the first row as headers (my data always contains a header) – Brett Aug 10 '20 at 18:29
  • Sometimes Excel comes in handy because it has a useful feature under the Data tab called Text to Columns. I easily converted your data into something usable. I pasted your data into Excel and removed the first row, because that first row is not properly representing your column names. Then I highlighted the first column (with all the data) and used the Text to Columns feature: choose the Fixed Width option, then Next, Next, Finish. Add the column names, then Save As with the .csv option. The file now opens easily with: read.csv("path/to/file.csv") – Gray Aug 10 '20 at 18:42
  • Oh interesting, that's unbelievably helpful thank you! I'm going to pass this along to the guy who wrote the framework we use, possibly there could be improvements to how it writes data. – Brett Aug 10 '20 at 18:44
  • I only use this trick when there is troublesome data that needs to be read as .csv. – Gray Aug 10 '20 at 18:45
  • Meant to ask: what do you mean by the first row not properly representing my data... is that in reference to the initial 16 lines starting w/ #, or is there also an issue with the (intended) header row? – Brett Aug 10 '20 at 18:47
  • Whenever there is some tabular data that is not being properly read as a csv, I'll paste the data into Excel. First do some cleanup, then use Text to Columns. When the data is really ugly, it might require using Excel's Find and Replace a time or two. Sometimes the data needs to be enclosed in quotes, then using Excel's concat under Formulas. Etc... – Gray Aug 10 '20 at 18:51
  • I got the same output using your snippet, but that was because there are no tabs in the snippet. If you were to open your txt file in Notepad++ (if on Windows) or another text editor, could you check whether the separators are recognized as tabs or spaces? Also, `data.table::fread()` works perfectly fine! – Ben Toh Aug 10 '20 at 19:20
  • Ben: took a look in bbedit, and they are being recognized as tabs. It's looking like data.table::fread() is the way to go like you suggested. Really it's just pride that keeps me from using fread(), not being able to seamlessly transition from read.csv to read_csv... well it just bothers me. – Brett Aug 10 '20 at 19:34
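A sketch of the data.table route mentioned above, assuming every file's header line starts with "PID" (fread() auto-detects the separator, and skip = "string" starts reading at the first line containing that string):

```r
library(data.table)

# fread() detects the tab separator itself; skip = "PID" jumps straight
# to the header row, so the '#' metadata lines never get parsed.
files <- list.files('data', pattern = '\\.txt$', full.names = TRUE, recursive = TRUE)
dat <- rbindlist(lapply(files, fread, skip = "PID"))
```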

2 Answers


Could you try the following code? The value of i may give you some idea of which file has the problem.

files <- list.files(path = "path", full.names = TRUE, pattern = "*.txt")
for (i in seq_along(files)) {
  print(read_tsv(files[i], skip = 16))
}
Jakub.Novotny
  • Doing this results in: Error in if (length(x) > 1 || grepl("\n", x)) { (01_preprocess_2020_data.R#22): missing value where TRUE/FALSE needed Which is a new error! So that's... progress? – Brett Aug 10 '20 at 17:56
  • It seems like there is a problem with reading in a file/some of the files. I updated my answer to show what purrr-like syntax works for me. – Jakub.Novotny Aug 10 '20 at 18:03
  • Your second example is exactly what I always imagine the answer should be here, but still nothing. You're likely right in that it's an oddity w/ the files themselves. I'll keep digging and hopefully update this post one day with a solution. Thanks mate! – Brett Aug 10 '20 at 18:19
  • I changed my answer. It now contains a loop which should hopefully identify the file that is causing trouble. – Jakub.Novotny Aug 10 '20 at 18:22
  • As unhelpful as this is, the error is with all of them. If I print files, I get a complete & accurate list of paths to each file. When I print the result of read_tsv() each line reads: Parsed with column specification: cols() # A tibble: 0 x 0 – Brett Aug 10 '20 at 18:37
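One purrr-flavoured way to isolate a failing file (a hypothetical sketch, not necessarily the updated answer's exact code) is to wrap the reader in safely(), so one bad file doesn't abort the whole map:

```r
library(purrr)
library(readr)

# safely() returns a function that yields list(result =, error =);
# files whose error slot is non-NULL are the troublemakers.
safe_read <- safely(~ read_tsv(.x, skip = 16))
results   <- map(files, safe_read)
failed    <- files[map_lgl(results, ~ !is.null(.x$error))]
```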

FWIW I was able to solve the problem using your snippet by doing something along the following lines:

# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#") 

# This works for my case
read_table2("dat.tsv", comment = "#")

Didn't even need to specify skip argument!

But also, no idea why using skip instead of comment fails... :(

Ben Toh