1

TLDR: How do I set Rstudio to import a CSV as a tibble exactly as Microsoft Excel (Rstudio for mac version: Version 1.3.959, Excel for mac: version 16.33 if that helps)? If this is not possible or it should already behave the same, how do I set it to read in a CSV file with no more than 8 columns and fill in blank values in rows so I can tidy it?

Long version:

I have a dozen CSV files (collected from archival animal tags) that messy (inconsistent width, multiple blocks of data on one file) and need to be read in. For workflow reasons, I would like to take the raw data and bring it straight into R. The data has a consistent structure between files: a metadata block, a summary by day that is 6 columns wide, and 2 blocks of constant logging that are 2 columns wide. If you were to count the blank cells in each section, it would be:

Section Width Length
Metadata 8 37
Summary Block 7 N days
Block 1 2 N*72
Block 2 2 N*72

The last three blocks of data can be thousands of entries long. I am unable to get this data to load into R as anything other than a single 1x500,000+ dataframe. Using tag1 = read_csv('file', skip = 37) to just start with the data I want crashes R. It works with read.csv(), but that removes the metadata block that I would like to keep.

Attempting to read the file into Excel shows the correct format (width, length, etc) but will not load all of the data. It cuts off a good chunk of the last block of data. reading in the data in a tabular format like read_xl() presents the same issue.

Ultimately, I like to either import the data as a nested tibble with these different sections, or better yet, automate this process so it can read in an entire folder's worth of csv files, automatically assign them to variables, and split them into sections. However, for now I just want to get this data into a workable format intact, and I would appreciate any help you can give me.

1 Answers1

0

Get the number of lines in the file, n, and from that derive N. Then read the blocks one by one. Use the same connection so each read starts off from where the prior one ended.

n <- length(count.fields("myfile", sep = ""))
N = (n - 37) / (1 + 2 * 72)

con <- file("myfile", open = "r")
meta <- readLines(con, 37)
summary_block <- read.csv(con, header = FALSE, nrow = N)
block1 <- read.csv(con, header = FALSE, nrow = N * 37)
block2 <- read.csv(con, header = FALSE, nrow = N * 37)
close(con)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341