Merging text files to csv under the same header in R

Question

I have a folder with over 1000 text files showing pollutant levels from particular air quality stations.

I'm looking to combine all these text files into one csv in R, so i can analyse the data temporally in one space.

Each text file is organised as below, with the unit name, the start time of the particular set of observations and then the columns for the data.

Example of the headers of my text file 1:

Unit 12345678                                           
Start time: Wed Jan 29 10:57:58 2020

**dd/mm/yyyy hh:mm:ss,  PM1,    PM2.5,  PM10,   TSP,    RHpre,  Tpre,   DPpre,  RHpost, Tpost,  DPpost**

29/01/2020 10:59:00,    1.39,   4.70,   17.11,  172.64, 36.10,  23.11,  7.17,   12.49,  41.26,  7.09
29/01/2020 11:00:00,    1.21,   3.64,   15.68,  26.39,  36.59,  23.12,  7.32,   12.41,  41.52,  7.17
29/01/2020 11:01:00,    1.20,   3.65,   15.12,  93.69,  36.51,  23.18,  7.43,   12.39,  41.68,  7.31
29/01/2020 11:02:00,    1.29,   4.09,   11.93,  15.31,  36.19,  23.22,  7.42,   12.30,  41.79,  7.37
29/01/2020 11:03:00,    1.30,   3.74,   9.06,   11.90,  36.04,  23.26,  7.33,   12.27,  41.88,  7.27
29/01/2020 11:04:00,    1.33,   4.31,   18.62,  44.38,  35.98,  23.28,  7.33,   12.21,  41.97,  7.34

Example of text file 2

Unit 12345678          
                                 
Start time: Wed Jan 29 11:14:46 2020

**dd/mm/yyyy hh:mm:ss,  PM1,    PM2.5,  PM10,   TSP,    RHpre,  Tpre,   DPpre,  RHpost, Tpost,  DPpost**

29/01/2020 11:16:00,    1.29,   4.80,   12.68,  14.96,  36.77,  23.15,  7.69,   14.41,  38.14,  6.58
29/01/2020 11:17:00,    1.24,   3.97,   13.30,  18.04,  37.51,  23.13,  7.58,   14.23,  38.57,  6.76
29/01/2020 11:18:00,    1.13,   3.50,   16.80,  60.72,  37.09,  23.16,  7.80,   14.11,  38.89,  6.84
29/01/2020 11:19:00,    1.33,   4.56,   14.23,  71.32,  38.96,  23.22,  8.25,   14.24,  39.15,  7.04
29/01/2020 11:20:00,    1.23,   3.72,   16.87,  22.36,  38.13,  23.29,  8.47,   14.00,  39.39,  7.27
29/01/2020 11:21:00,    1.17,   4.47,   12.30,  15.60,  37.00,  23.34,  8.36,   13.86,  39.62,  7.24
29/01/2020 11:22:00,    1.28,   4.18,   12.80,  229.03, 36.27,  23.36,  7.54,   13.70,  39.85,  7.37
29/01/2020 11:23:00,    1.34,   4.28,   17.27,  96.94,  36.19,  23.37,  7.50,   13.54,  40.05,  7.30

So for each text file, the first (station ID) and the third (column names) will remain the same for the particular station but the second line will change with each output produced by the monitor.

As mentioned above I am looking to combine all these text files together, but under the uniform header of the column names (dd/mm/yyyy hh:mm:ss, PM1, PM2.5, PM10, TSP, RHpre, Tpre, DPpre, RHpost, Tpost, DPpost), because this will be consistent in every monitor i have access too and therefore the code can be easily reproduced.

I've attempted:

mypath = "C:/Desktop/mytxtfolder/"

txt_files_ls = list.files(path=mypath, pattern="*.txt") 

txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x,skip =3, header = T, sep =",")})

combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))

And get consistent errors of

Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

I assume this is because the second row values (time of uploads) do not match and i am using the function incorrectly to skip the first two rows and only combine on the third row.

alex_jwb90 · Accepted Answer · 2020-08-31T12:08:10.980

First of all, I think that do.call(dplyr::bind_rows, txt_files_df) would already solve the error you are seeing with base::rbind, because bind_rows doesn't crash when the columns of its inputs aren't aligned. It just adds new columns to the result in that case.
Secondly, you can also make your code a little cleaner, using purrr's map_dfr, which applies a function over the elements of a list and row-binds the results robustly using dplyr. Like this:

library(dplyr)
library(purrr)
library(readr)

combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  readr::read_csv(x, skip = 3, trim_ws = T)
})

However, with the error you are getting, I would guess that either the headers are not always the same or it is not a constant of 3 rows you need to skip.
You can test for that by going through your list and test all loaded dataframes on whether the colnames are the same as in the first one. For example:

test <- txt_files_df %>%
    purrr::discard(~identical(colnames(.), colnames(txt_files_df[[1]])))

I'm using purrr::discardto throw out any entry where the colnames are as expected, so your final result should be empty - but if it isn't, you know you need to check your data or adjust your code to be more robust if that's not possible.

I'd suggest adding the filename to the dataframes you read, so that you can identify which file has provided you odd inputs. Also, in case the leading lines are the culprit, let's explicitly check where the header is and skip lines accordingly:

combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = T)))
  
  df <- readr::read_csv(x, skip = header_line - 1, trim_ws = T)
  df$file_name <- x # allowing you to know what file this data came from
  df
})

// update, responding to OP's issue with non matching column types:

I am receiving errors Error: Can't combine PM1 <double> and PM1 <character>

There's two ways to attack this:

If you're 100% sure that the data is always numeric, then you can declare that in the csv parser itself. However, if character data had managed to sneak in, it will be treated as <NA>'s and thus be "lost" (you will receive a warning though):

combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = T)))

  df <- readr::read_csv(
    x,
    skip = header_line - 1,
    trim_ws = T,
    col_types = cols(
      `**dd/mm/yyyy hh:mm:ss` = col_datetime(format = "%d/%m/%Y %H:%M:%S"),
      .default = col_double()
    )
  )
  df$file_name <- x # allowing you to know what file this data came from
  df
})

If you'd rather not lose anything on loading the files, you can just read all columns in as character-vectors and let readr::type_convert guess the types later, after rowbinding.

combined_df <- purrr::map_dfr(txt_files_ls, function(x) {
  first_10_lines <- readLines(x, 10L)
  header_line <- min(which(grepl('**dd/mm/yyyy hh:mm:ss', first_10_lines, fixed = T)))

  df <- readr::read_csv(
    x,
    skip = header_line - 1,
    trim_ws = T,
    col_types = cols(
      `**dd/mm/yyyy hh:mm:ss` = col_datetime(format = "%d/%m/%Y %H:%M:%S"),
      .default = col_character()
    )
  )
  df$file_name <- x # allowing you to know what file this data came from
  df
}) %>%
readr::type_convert()

Hi @alex_jwb90, this has worked for the most part and its such an efficient and concise way of doing it so thank you. However for one of my folders it seems to crash, I am receiving errors Error: Can't combine `PM1` and `PM1` . I imagine it has something to do with one of the files but since the function didn't finish i am unable to see which one is causing the problem/ Any tips? — Adam_Eire_2020, Aug 31 '20 at 11:29
that's because in one of your files, the `read_csv`parser didn't guess the type to be *double* and instead read it in as a *character*-vector. I've added some detail in my answer how you can counter this — alex_jwb90, Aug 31 '20 at 12:03
blank data shouldn't be an issue, really. you could filter out rows with blank data right within the `map_dfr` function. again, `dplyr` to the rescue: `df <- df %>% dplyr::filter(!is.na(PM1))` excludes the rows where "PM1" has no value. — alex_jwb90, Aug 31 '20 at 12:14
happy to hear this works for you. can only encourage you to begin working with the tidyverse tools, especially [readr](https://readr.tidyverse.org/), [dplyr](https://dplyr.tidyverse.org/) and [purrr](https://purrr.tidyverse.org/). they will make your life with R infinitely more enjoyable :) — alex_jwb90, Aug 31 '20 at 12:20
Absolutely, its all a steep learning curve at the moment, I'm more intune with just how to use packages without really understanding whats going on in the background but certaintly trying to improve my use and understanding of writing functions! — Adam_Eire_2020, Aug 31 '20 at 12:23

Merging text files to csv under the same header in R

1 Answers1