readr::read_tsv() parsing failures due to trailing tabs

Question

Issue/Question

I have a tab-delimited file, my-data.txt, with 51 columns. The column-name row has no trailing tab, and readr::read_tsv() correctly detects 51 columns. However, every data row ends with a trailing tab, so readr::read_tsv() incorrectly interprets each row as having 52 columns. The code runs, but I get a warning that I would like to get rid of. Are there any read_tsv() arguments that can help handle this? Should I use a different readr function instead?

my-data.txt

PT  AU  BA  CA  GP  RI  OI  BE  Z2  TI  X1  Y1  Z1  FT  PN  AE  Z3  SO  S1  SE  BS  VL  IS  SI  MA  BP  EP  AR  DI  D2  SU  PD  PY  AB  X4  Y4  Z4  AK  CT  CY  SP  CL  TC  Z8  ZB  ZS  Z9  SN  BN  UT  PM
J   Jacquelin, Sebastien; Straube, Jasmin; Cooper, Leanne; Vu, Therese; Song, Axia; Bywater, Megan; Baxter, Eva; Heidecker, Matthew; Wackrow, Brad; Porter, Amy; Ling, Victoria; Green, Joanne; Austin, Rebecca; Kazakoff, Stephen; Waddell, Nicola; Hesson, Luke B.; Pimanda, John E.; Stegelmann, Frank; Bullinger, Lars; Doehner, Konstanze; Rampal, Raajit K.; Heckl, Dirk; Hill, Geoffrey R.; Lane, Steven W.                              Jak2V617F and Dnmt3a loss cooperate to induce myelofibrosis through activated enhancer-driven inflammation                              BLOOD               132 26          2707    2721        10.1182/blood-2018-04-846220            DEC 27 2018 2018                                        10                          WOS:000454429300003     
J   Renne, Julius; Gutberlet, Marcel; Voskrebenzev, Andreas; Kern, Agilo; Kaireit, Till; Hinrichs, Jan; Zardo, Patrick; Warnecke, Gregor; Krueger, Marcus; Braubach, Peter; Jonigk, Danny; Haverich, Axel; Wacker, Frank; Vogel-Claussen, Jens; Zinne, Norman                               Multiparametric MRI for organ quality assessment in a porcine Ex-Vivo lung perfusion system                             PLOS ONE                13  12                  e0209103    10.1371/journal.pone.0209103            DEC 27 2018 2018                                        1                           WOS:000454418200015     
J   Lau, Skadi; Eicke, Dorothee; Oliveira, Marco Carvalho; Wiegmann, Bettina; Schrimpf, Claudia; Haverich, Axel; Blasczyk, Rainer; Wilhelmi, Mathias; Figueiredo, Constanca; Boeer, Ulrike                              Low Immunogenic Endothelial Cells Maintain Morphological and Functional Properties Required for Vascular Tissue Engineering                             TISSUE ENGINEERING PART A               24  5-6         432 447     10.1089/ten.tea.2016.0541           MAR 2018    2018                                        4                           WOS:000418327100001

Reprex

Note that I did some manual editing of the reprex: I needed to read in the .txt file to reproduce the issue, but this causes errors in reprex without my computer-specific path. See RStudio Community Topic 8773.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(readr)

my_data <- read_tsv("my-data.txt", quote = "")
#> Parsed with column specification:
#> cols(
#>   .default = col_logical(),
#>   PT = col_character(),
#>   AU = col_character(),
#>   TI = col_character(),
#>   SO = col_character(),
#>   VL = col_double(),
#>   IS = col_character(),
#>   BP = col_double(),
#>   EP = col_double(),
#>   AR = col_character(),
#>   DI = col_character(),
#>   PD = col_character(),
#>   PY = col_double(),
#>   TC = col_double(),
#>   UT = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col   expected     actual                                   file
#>   1  -- 51 columns 52 columns 'my-data.txt'
#>   2  -- 51 columns 52 columns 'my-data.txt'
#>   3  -- 51 columns 52 columns 'my-data.txt'

problems(my_data)
#> # A tibble: 3 x 5
#>     row col   expected   actual     file                                  
#>   <int> <chr> <chr>      <chr>      <chr>                                 
#> 1     1 <NA>  51 columns 52 columns 'my-data.txt'
#> 2     2 <NA>  51 columns 52 columns 'my-data.txt'
#> 3     3 <NA>  51 columns 52 columns 'my-data.txt'

Created on 2020-04-01 by the reprex package (v0.3.0)

Thank you for taking the time to help me.

score 0 · Answer 1 · answered Apr 01 '20 at 13:46

My favorite .tsv file reader is fread from data.table. It often works right out of the box. It might be worth a try.

library(data.table)
my_data <- fread("my-data.txt")

Ian Campbell

Thank you for the suggestion, Ian. I gave it a try (see reprex below). Unfortunately, `data.table::fread` handles the extra tab by adding an extra column, which misaligns the data with the column names, e.g., the DOI is no longer in the "DI" column but is shifted one column left into the "AR" column. I would appreciate any other ideas you may have for handling trailing tabs! – maia-sh May 22 '20 at 09:56
`my_data <- data.table::fread("my-data.txt", quote = "")` `#> Warning in data.table::fread("/Users/maia/Repositories/responsible-metrics/my-data.txt", : Detected 51 column names but the data has 52 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.` Created on 2020-05-22 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0) – maia-sh May 22 '20 at 09:56
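Two possible workarounds that stay within readr are sketched below. This is only an illustration under stated assumptions, not a confirmed fix for the real file: it uses a tiny made-up 3-column file in place of my-data.txt, it assumes every data row ends in exactly one extra tab, and the `I()` wrapper for literal data assumes readr 2.0 or later. The names `path`, `data_a`, `data_b`, and `trailing_tab` are hypothetical.

```r
library(readr)

# Hypothetical stand-in for my-data.txt: a 3-column header without a
# trailing tab, and data rows that each end in one extra tab.
path <- tempfile(fileext = ".txt")
write_lines(c("PT\tAU\tPM", "J\tSmith\t\t", "J\tJones\t123\t"), path)

# Option A: strip exactly one trailing tab from each data row.
# Removing only one tab keeps genuinely empty trailing fields intact.
lines <- read_lines(path)
lines[-1] <- sub("\t$", "", lines[-1])
# I() marks the string as literal data rather than a file path.
data_a <- read_tsv(I(paste(lines, collapse = "\n")), quote = "")

# Option B: read the header separately, give the extra field a
# placeholder name (`trailing_tab` is a made-up name), then drop it.
header <- strsplit(read_lines(path, n_max = 1), "\t", fixed = TRUE)[[1]]
data_b <- read_tsv(path, skip = 1,
                   col_names = c(header, "trailing_tab"), quote = "")
data_b$trailing_tab <- NULL
```

With the real file, `path` would be "my-data.txt". Option A strips a single tab rather than all trailing tabs (`"\t$"`, not `"\t+$"`) because several of the last columns are empty in the sample rows, and removing every trailing tab would drop those fields too. Option B leaves the file contents untouched, but it only lines up if every data row really has the extra field.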
