I'm trying to explore 2017 HMDA data. The flat file is about 9GB, available here. The CSV is too large to read into memory, so I tried using the ff
library. However, I am getting errors when I try to read the file.
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv')
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '"83.9800033569336"'`
When I just scan the first 1000 rows, the error disappears; but it starts somewhere between line 1000 and 10000:
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv', nrow = 1000)
I tried specifying all of my column classes too, but it returns an error:
> hmda.ff <- read.csv.ffdf(file = 'hmda_lar.csv',
nrow = 10000,
colClasses = c('real', 'real', 'integer', 'real', 'integer',
'integer', 'integer', 'integer', 'integer',
'factor', 'factor', 'character', 'character',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'character', 'integer', 'character',
'factor', 'factor', 'factor', 'factor', 'factor',
'factor', 'factor', 'factor', 'factor', 'factor'))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'a real', got '"63.5"'
When I convert all of the integers and reals to characters, I still get an error:
... Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,
: vmode 'character' not implemented
The only solution that works is to specify colClasses = 'factor'
, turning all of the columns into factors.
EDIT: The issue appears to be related to the raw CSV file. Some of the values are wrapped in quotes and some are not. If I export the first 10,000 lines to a CSV and use read.csv()
, it works as expected, with the data types as numeric. But on that same subset, if I use read.csv.ffdf()
I get the error scan() expected 'a real', got '"63.5"'
. It's part the CSV, but also ffdf doesn't read the CSV as expected.
Because read.csv()
works, I tried chunking the file into 15 different data frames, each containing 1,000,000 rows. However, it keeps freezing when getting to the 11th file, probably because it's loading it to memory just to find the 11,000,000th row.
So the question is, how can you get ff
to deal with real numbers that are inconsistently wrapped with quotes? Or how do you clean up the raw data to remove the quotes? Or how do you chunk the data in a way that efficiently uses your RAM?
FYI here is the data header:
> glimpse(hmda.ff[,])
Observations: 14,285,496
Variables: 47
$ tract_to_msamd_income <fct> 63.5, 238.1199951171875, 38.189998626708984, 132.32000732421875, 87.5, 138.16000366210938, 98.43000030517578, 93.04000091552...
$ rate_spread <fct> , , , , , , , , , , , , , , , , 01.85, , , , 03.92, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ population <fct> 7067, 5429, 6869, 3835, 1960, 7120, 1828, 4643, 16372, 2977, 9630, 3298, 4487, 3324, 4099, 3835, 6003, 5187, 4818, 5849, 422...
$ minority_population <fct> 72.08000183105469, 6.559999942779541, 30.719999313354492, 65.73999786376953, 55.459999084472656, 23.309999465942383, 13.5699...
$ number_of_owner_occupied_units <fct> 1201, 1611, 236, 1027, 407, 2037, 615, 854, 3292, 317, 3052, 1104, 617, 1099, 1409, 1027, 1122, 1638, 1495, 1508, 1187, 1700...
$ number_of_1_to_4_family_units <fct> 1303, 1807, 794, 1141, 601, 2431, 725, 1936, 5286, 1174, 3188, 1175, 1120, 1404, 1522, 1141, 1520, 2162, 1989, 2080, 1421, 2...
$ loan_amount_000s <fct> 400, 525, 225, 621, 181, 70, 123, 5, 100, 34, 302, 680, 108, 99, 100, 100, 171, 443, 420, 50, 75, 361, 179, 338, 300, 544, 3...
$ hud_median_family_income <fct> 107600, 77500, 61800, 75200, 50000, 68800, 79600, 75200, 58400, 70800, 79600, 107600, 79300, 83900, 108300, 75200, 63200, 72...
$ applicant_income_000s <fct> 90, 300, , 255, 109, 238, 84, 75, 44, 195, 62, 159, 50, 84, 70, 124, 80, 264, 177, 214, 181, 57, 86, 157, 64, 96, , 30, 50, ...
$ state_name <fct> Virginia, Illinois, Michigan, California, California, South Carolina, Michigan, California, Florida, Pennsylvania, Michigan,...
$ state_abbr <fct> VA, IL, MI, CA, CA, SC, MI, CA, FL, PA, MI, VA, CA, CO, CT, CA, CA, WI, NY, CA, CA, CA, NE, VA, NY, CA, CA, FL, SC, CA, VA, ...
$ sequence_number <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ respondent_id <fct> 7442300004, 0000852218, 0000146672, 0000852218, 86-0860478, 0000617677, 7197000003, 0000504713, 39-2001010, 3027509990, 0000...
$ purchaser_type_name <fct> Life insurance company, credit union, mortgage bank, or finance company, Loan was not originated or was not sold in calendar...
$ property_type_name <fct> One-to-four family dwelling (other than manufactured housing), One-to-four family dwelling (other than manufactured housing)...
$ preapproval_name <fct> Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicable, Not applicab...
$ owner_occupancy_name <fct> Owner-occupied as a principal dwelling, Owner-occupied as a principal dwelling, Not owner-occupied as a principal dwelling, ...
$ msamd_name <fct> Washington, Arlington, Alexandria - DC, VA, MD, WV, Chicago, Naperville, Arlington Heights - IL, Kalamazoo, Portage - MI, Sa...
$ loan_type_name <fct> Conventional, Conventional, Conventional, Conventional, FHA-insured, Conventional, FHA-insured, Conventional, Conventional, ...
$ loan_purpose_name <fct> Home purchase, Refinancing, Home purchase, Refinancing, Home purchase, Home improvement, Refinancing, Home improvement, Home...
$ lien_status_name <fct> Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien, Secured by a first lien,...
$ hoepa_status_name <fct> Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan, Not a HOEPA loan...
$ edit_status_name <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ denial_reason_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Insufficient cash (downpayment, closing costs), , , , , ...
$ denial_reason_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Debt-to-income ratio, , Debt-to-income ratio, , , , , , ...
$ denial_reason_name_1 <fct> , , , Debt-to-income ratio, , Credit history, , Credit history, , Credit application incomplete, , , , , , , , , , , , Credi...
$ county_name <fct> Fairfax County, Cook County, Kalamazoo County, Sacramento County, Fresno County, Charleston County, Macomb County, Sacrament...
$ co_applicant_sex_name <fct> Male, No co-applicant, No co-applicant, Female, Female, Information not provided by applicant in mail, Internet, or telephon...
$ co_applicant_race_name_5 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_4 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ co_applicant_race_name_1 <fct> White, No co-applicant, No co-applicant, White, Asian, Information not provided by applicant in mail, Internet, or telephone...
$ co_applicant_ethnicity_name <fct> Not Hispanic or Latino, No co-applicant, No co-applicant, Not Hispanic or Latino, Not Hispanic or Latino, Information not pr...
$ census_tract_number <fct> 4522.00, 8198.01, 0015.07, 0093.30, 0049.02, 0046.09, 2515.00, 0018.00, 0432.04, 0007.00, 2234.00, 4703.00, 0017.00, 0102.10...
$ as_of_year <fct> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ application_date_indicator <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_sex_name <fct> Female, Male, Male, Male, Male, Information not provided by applicant in mail, Internet, or telephone application, Female, F...
$ applicant_race_name_5 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_4 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_3 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
$ applicant_race_name_1 <fct> White, White, White, White, Asian, Information not provided by applicant in mail, Internet, or telephone application, White,...
$ applicant_ethnicity_name <fct> Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Not Hispanic or Latino, Info...
$ agency_name <fct> Department of Housing and Urban Development, Consumer Financial Protection Bureau, Consumer Financial Protection Bureau, Con...
$ agency_abbr <fct> HUD, CFPB, CFPB, CFPB, HUD, CFPB, HUD, CFPB, FDIC, HUD, CFPB, FRS, CFPB, NCUA, CFPB, NCUA, HUD, FRS, HUD, CFPB, NCUA, HUD, C...
$ action_taken_name <fct> Loan originated, Loan originated, Loan originated, Application denied by financial institution, Loan originated, Application...