R: Read text files with blanks and unequal number of columns

Question

I am trying to read many text files into R using read.table. Most of the time we have clean text files which have defined columns.

The data that I am trying to read comes from ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt

You can see that the blanks and length of text files varies by report. ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/100917_livecattle.txt

My objective is to read many of these text files and combine them into a dataset.

If I can read one of the them then compiling should not be an issue. However, I am running into several issues because of the format of the text file:

1) the number of FIRMS vary from report to report. For example, sometimes there will be 3 rows (i.e. 3 firms that did business on that data) of data to import and sometimes there may be 10.

2) Blanks are being recognized. For example, under the FIRM section there should be a column for Deliveries (DEL) and Receipts (REC). The data when it is read in THIS section should look like:

df <- data.frame("FIRM_#" = c(407, 685, 800, 905), 
  "FIRM_NAME" = c("STRAITS FIN LLC", "R.J.O'BRIEN ASSOC", "ROSENTHAL COLLINS LL", "ADM INVESTOR SERVICE"),
  "DEL" = c(1,1,15,1), "REC"= c(NA,18,NA,NA))

however when I read this in the fomatting is all messed up and does not put NA for the blank values

3) The above issues apply for "YARDS" and "FUTURE DELIVERIES SCHEDULED" section of the text file.

I have tried to read in sections of the text file and then format it accordingly but since the the number of firms change day to day the code does not generalize.

Any help would greatly be appreciated.

score 1 · Accepted Answer · answered Nov 08 '17 at 21:59

Here an answer which starts from the scratch via rvest for downloading data and includes lots of formatting. The general idea is to identify fixed widths that may be used to separate columns - I used a little help from SO for this purpose link.

You could then use read.fwf() in combination with cat()and tempfile(). In my first attempt this did not work, due to some formatting issues, so I added some additional lines to get the final table format.

Maybe there are some more elegant options and shortcuts I have overseen, but at least, my answer should get you started. Of course, you will have to adapt the selection of lines, identification of widths for spliting tables depending on what parts of the data you need. Once this is settled, you may loop through all the websites to gather data. I hope this helps...

library(rvest)
library(dplyr)

page <- read_html("ftp://ftp.cmegroup.com/delivery_reports/live_cattle_delivery/102317_livecattle.txt")

table <- page %>%
  html_text("pre") %>%
  #reformat by splitting on line breakes
  { unlist(strsplit(., "\n")) } %>%
  #select range based on strings in specific lines
  "["(.,(grep("FIRM #", .):(grep("        DELIVERIES SCHEDULED", .)-1))) %>%
  #exclude empty rows
  "["(., !grepl("^\\s+$", .)) %>%
  #fix width of table to the right
  { substring(., 1, nchar(gsub("\\s+$", "" , .[1]))) } %>%
  #strip white space on the left
  { gsub("^\\s+", "", .) }


headline <- unlist(strsplit(table[1], "\\s{2,}"))

get_split_position <- function(substring, string) {

   nchar(string)-nchar(gsub(paste0("(^.*)(?=", substring, ")"), "", string , perl=T))

}

#exclude first element, no split before this element
split_positions <- sapply(headline[-1], function(x) {

   get_split_position(x, table[1])

})


#exclude headline from split
table <- lapply(table[-1], function(x) {

  substring(x,  c(1, split_positions + 1),  c(split_positions, nchar(x)))

})

table <- do.call(rbind, table)
colnames(table) <- headline

#strip whitespace
table <- gsub("\\s+", "", table)

table <- as.data.frame(table, stringsAsFactors = FALSE)
#assign NA values
table[ table == "" ] <- NA
#change column type
table[ , c("FIRM #", "DEL", "REC")] <- apply(table[ , c("FIRM #", "DEL", "REC")], 2,  as.numeric)

table
# FIRM #          FIRM NAME DEL REC
# 1    407      STRAITSFINLLC   1  NA
# 2    685   R.J.O'BRIENASSOC   1  18
# 3    800 ROSENTHALCOLLINSLL  15  NA
# 4    905 ADMINVESTORSERVICE   1  NA

Thank you for the reply. This has helped a great deal. Everything thing works smoothly expect for pulling the data in the last section. The numbers 0100 change from report to report and thus are not genearlizable across text files. Is there a way to tell the grabber function to go to the end of the file? Thank you again for the help — EDennnis, Nov 11 '17 at 01:39
For getting the final line of a character vector, hence your text lines, you may simply use `vector[length(vector)]`. I think, alternatively, `tail(vector, 1)` should work. — Manuel Bickel, Nov 11 '17 at 09:00
If you still have open issues please specify what exactly did not work and provide the code you have tried that did not work, this will make it easier to help (parsing data is a case specific task, therefore, it is important to specify the critical points in the code as exactly as possible). Otherwise please consider accspting the answer. — Manuel Bickel, Nov 11 '17 at 09:05

R: Read text files with blanks and unequal number of columns

1 Answers1