
I have data from a logger that inserts timestamps as rows within the comma separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).

The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr is defaulting to reading in only 3 columns, despite my manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping that there's some option for "correcting" back to the old behaviour, or some work-around solution I haven't thought of (editing the logger files is not an option).

Example code:

library(tidyverse)

# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# throws warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
    NA,   NA, "Logger Start 12:34", NA, NA, NA,
  -112,   53, "N=1", 9, 15, ".25",
  -112,   53, "N=2",12, 17, ".17"
)
JMDR
  • I think you should reconsider as to whether you really want that output. Storing two different types of information (timestamps and counts) in the same column is not good practice and will only cause more issues later. – neilfws Apr 28 '22 at 04:23
    I wrangle the timestamp into its own column and convert to posixct as the next step (see link in my post), but need to read it in first in the format it's in. – JMDR Apr 28 '22 at 04:38

2 Answers


Use the base read.csv, then convert to a tibble if need be:

read.csv(text=txt1, header = FALSE,
     col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
   lon lat                  n red nir NDVI
1   NA  NA Logger Start 12:34  NA  NA   NA
2 -112  53                N=1   9  15 0.25
3 -112  53                N=2  12  17 0.17
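To sketch the "convert to tibble" step the answer mentions: base read.csv pads the short timestamp row with NAs (its fill argument defaults to TRUE), so the result can be passed straight to as_tibble. This is a minimal example assuming the data is supplied as a text string, as in the question:

```r
library(tibble)

txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# read.csv(fill = TRUE by default) pads the 3-field timestamp row to 6 columns
df <- read.csv(text = txt1, header = FALSE,
               col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))

tbl <- as_tibble(df)
```

From here the timestamp can be wrangled into its own column as in the question's linked approach.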
Onyambu
  • Thanks! I read the documentation for `read.csv` and missed the argument to specify col.names, which is mentioned in the more general `read.table` section instead. – JMDR Apr 28 '22 at 18:31

I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.

Assuming you have a file test.csv like this:

,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17

Try something like this:

library(readr)
library(dplyr)
library(tidyr)

read_lines("test.csv") %>% 
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (last column) is always present and ends with a digit
  # you'll need to alter the regex if not the case 
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>% 
  write_lines("test_out.csv")

test_out.csv now looks like this:

12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17

So we now have 7 columns; the first is the timestamp.

This code reads the new file, fills in the missing timestamp values and removes rows where n is NA. You may not want to do that; I've assumed that n is only missing because of the original row with the timestamp.

mydata <- read_csv("test_out.csv", 
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>% 
  fill(ts) %>% 
  filter(!is.na(n))

The final mydata:

# A tibble: 2 x 7
  ts       lon   lat n       red   nir  NDVI
  <time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34   -112    53 N=1       9    15  0.25
2 12:34   -112    53 N=2      12    17  0.17
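If you'd rather avoid the intermediate file, the same rewrite can be done in memory: readr (2.0+) accepts literal data wrapped in I(), so the corrected lines can be pasted back together and fed straight to read_csv. A sketch, assuming the same input and the same regex assumptions as above:

```r
library(readr)
library(dplyr)
library(tidyr)

txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

lines <- read_lines(I(txt1))
lines <- lines[lines != ""]  # drop blank lines from the literal string
# move the timestamp into a leading field, pad to 7 columns
lines <- gsub(",,Logger Start (.*?)$", "\\1,,,,,,", lines, perl = TRUE)
# prepend an empty timestamp field to the data rows (which end in a digit)
lines <- gsub("^(.*?\\d)$", ",\\1", lines, perl = TRUE)

mydata <- read_csv(I(paste(lines, collapse = "\n")),
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
  fill(ts) %>%
  filter(!is.na(n))
```

The result should match the mydata tibble shown above, without touching the disk.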
neilfws