Read and parse an irregular and mixed ASCII file into R

Question

I'm still very new to R and I apologize if I'm not using the proper terminology. I'm interested in pulling a large amount of Unemployment Insurance Trust Fund data from the Treasury Direct online report query system (http://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm) and I've successfully pulled the information using readLines.

ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')

which gives me the chart as a string vector.

Is there a way to then parse the lines and turn it into a data frame so I can at least put it excel and easily get the important information out of it? I'm sure there is another way to do this as well but the reports will always vary in what accounting code sections are included and how many individual transactions are included so I'm not even sure where to begin with that.

The items I need are the date, the share/par (dollar transaction amount), the transaction code, and the transaction description. The totals would be useful but are by no means necessary.

When you look at it using Excel it looks like

score 0 · Accepted Answer · answered Oct 21 '16 at 16:31

This will help you parse the information:

ESAA_OCT15 <- readLines('http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt')
# Select lines with /
z = grepl(pattern = "/",x = ESAA_OCT15)
d = trimws(ESAA_OCT15[z])

dates = substr(d,0,10)
sharesPar = substr(d,11,41)

What this does is first select all lines that contain a / character. This will even return the column titles. These are stored in d.

If you examine d:

[1] "Effective Date                 Shares/Par  Description Code           Memo Number    Code      Account Number"
 [2] "10/01/2015                 2,313,000.0000  12-10 FUTA RECEIPTS         3305617                 ESAA"          
 [3] "10/01/2015                 3,663,000.0000  12-10 FUTA RECEIPTS         3305618                 ESAA"          
 [4] "10/02/2015                 4,314,000.0000  12-10 FUTA RECEIPTS         3305640                 ESAA"          
 [5] "10/05/2015                 3,512,000.0000  12-10 FUTA RECEIPTS         3305662                 ESAA"

The information is aligned neatly. This means that the data of each column ends at a precise position. To parse this you can use substr with start and stop as shown in my script.

Of course, I did not complete all parses, I'll let you finish the rest. Once each column is parsed, create a data.frame(dates, sharesPar, ...)

hrbrmstr · Answer 2 · 2016-10-21T18:17:33.557

It's a fixed-width format, so it should be treated as such:

library(dplyr)
library(readr)

readLines("http://www.treasurydirect.gov/govt/reports/tfmp/utf/es/dfiw01015tses.txt") %>% 
  grep("^\ +[[:digit:]]+/[[:digit:]]+", ., value=TRUE) %>% # grab only the lines with data
  textConnection() %>% 
read.fwf(widths=c(19, 26, 27, 15, 10, 27), skip=7) %>%     # read the columns
  mutate_all(trimws) %>%                                   # clean them up
  type_convert() %>%                                       # you still need to convert the date even with this type conversion
  setNames(c("effective_date", "shares_per",               # add decent colnames
             "trans_descr_code", "memo_num", "location_code", "acct_no"))

Read and parse an irregular and mixed ASCII file into R

2 Answers2