I have many years of data to read from tab-delimited .txt files into data.frame or data.table format for work in R. For each year, the quarterly files need to be appended. My searching turned up some nice code (credit to @Maiasaura) that finds all quarterly files and, using `fread` and `bind_rows`, creates one annual file.

One oddity I've found: using `fread` instead of `read.table` leads to different classes for some columns. pat_age is meant to be alphanumeric ("00", "01", "02"). `read.table` seems to handle this as expected, whereas `fread` creates an integer column. So I've added `colClasses` to control the class of pat_age.
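A minimal sketch of the class difference, assuming two made-up rows in the same tab-delimited layout (the `los` column and its values are invented for illustration; only pat_age comes from the files described above):

```r
library(data.table)

# Made-up sample: a header line plus two data rows, tab-delimited
txt <- "pat_age\tlos\n00\t3\n01\t5\n"

# Let fread guess the column types
default_dt <- fread(text = txt)

# Force pat_age to stay text so the leading zeros survive
forced_dt <- fread(text = txt, colClasses = c(pat_age = "character"))

class(default_dt$pat_age)
class(forced_dt$pat_age)
forced_dt$pat_age
```

Recent versions of data.table also give `fread` a `keepLeadingZeros` argument, which may be an alternative when many columns are affected; whether it is available depends on the installed version.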

Unfortunately, column names across the quarterly files are sometimes upper case and sometimes lower case (PAT_AGE vs. pat_age). Is there any way to control that as I read in the .txt files? Using `colClasses` with `tolower` didn't work for me.

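library(data.table)
library(dplyr)

# Each %>% must end a line; a line that starts with %>% does not parse in R
tabtest <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(fread, header = TRUE, colClasses = c(pat_age = "character")) %>%
  dplyr::bind_rows()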

I expect messy data, and I may need to adjust other column names and classes as I move from year to year.

NOTE: Am I correct that if I can't change the case within the lapply call, I'd need to change it in the .txt files themselves? The `colClasses` argument as written requires "pat_age" to be lower case in every file.

NOTE: Came across this question:
fread (data.table) select columns, throw error if column not found

Could it be modified to read and modify the header - and then read the entire .txt file with corrected headers?
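A rough sketch of that idea, under stated assumptions: the helper name `read_lower` and its `char_cols` parameter are hypothetical, and the approach is to read only the header, work out how pat_age is spelled in that particular file, pass that spelling to `colClasses`, and lower-case the names afterwards.

```r
library(data.table)

# Hypothetical helper: read one file, keep selected columns as character
# whatever the header case is, then lower-case every column name.
read_lower <- function(path, char_cols = "pat_age") {
  # Read only the header row to get the names exactly as this file spells them
  hdr <- names(fread(path, nrows = 0, header = TRUE))

  # Columns whose lower-cased name matches one of char_cols
  as_char <- hdr[tolower(hdr) %in% tolower(char_cols)]

  # Only pass colClasses when at least one matching column exists
  cc <- if (length(as_char) > 0) list(character = as_char) else NULL

  dt <- fread(path, header = TRUE, colClasses = cc)
  setnames(dt, tolower(names(dt)))  # rename by reference
  dt
}
```

This reads each header twice, which is cheap compared with reading the data, and it leaves the .txt files themselves untouched.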

Latest attempt - think it might work okay. Lots of effort/syntax just to change the case of column names!

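read_cols <- function(x) {
  # Read only the header row and lower-case the column names
  titles <- fread(x, nrows = 0, header = TRUE, stringsAsFactors = FALSE)
  var.names <- tolower(colnames(titles))
  # Read the data rows, skipping the header line; header = FALSE stops fread
  # from treating the first data row as a header
  rest <- fread(x, skip = 1, header = FALSE)
  names(rest) <- var.names
  return(rest)
}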


tabtest2 <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(read_cols) %>%
  dplyr::bind_rows()
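For the appending step, data.table's own `rbindlist` is a possible alternative to `bind_rows`. The sketch below uses two small made-up tables standing in for quarterly files (the `los` and `extra` columns and the `q1`/`q2` labels are invented); it lower-cases the names first, then stacks by name and fills columns missing from a quarter with NA.

```r
library(data.table)

# Made-up stand-ins for two quarterly files with inconsistent header case
q1 <- data.table(PAT_AGE = c("00", "01"), LOS = c(3L, 5L))
q2 <- data.table(pat_age = c("02", "10"), los = c(1L, 2L), extra = c("a", "b"))

quarters <- list(q1 = q1, q2 = q2)

# Lower-case the names on a copy so the original tables stay unchanged
quarters <- lapply(quarters, function(dt) setnames(copy(dt), tolower(names(dt))))

# Match columns by name, fill missing ones with NA, and record the source quarter
annual <- rbindlist(quarters, use.names = TRUE, fill = TRUE, idcol = "quarter")
annual
```

The `idcol` column makes it easy to trace a row back to its quarter when other names or classes turn out to differ from year to year.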

Thank you.

Anjeg
  • Your best bet may be a command line tool to loop through your files first and make all column names lowercase. – Gregor Thomas Aug 22 '16 at 20:49
  • Well, you could match your parentheses to start. Right now, you have something like `c(pat_age="character") %>% dplyr::bind_rows()` which is probably not desired. Fwiw, this seems to work: `c(pat_age="character") %>% setNames(., toupper(names(.)))` – Frank Aug 22 '16 at 21:34
  • @Frank . I have updated the parens - an issue as I created the question - not in practice. I'll try the setNames tomorrow - am guessing I may need to do that first - as pat_age takes different cases across files. – Anjeg Aug 22 '16 at 23:39
  • @Gregor - would you be able to point me to some examples? – Anjeg Aug 23 '16 at 14:19
  • [maybe this](http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/) – Gregor Thomas Aug 23 '16 at 16:22
  • Thx. Not working in UNIX yet. – Anjeg Aug 23 '16 at 17:37
  • You can install Unix command line tools even on windows. – Gregor Thomas Aug 23 '16 at 17:39
