
I have a fixed-width file from the U.S. Census: the one in the zip called "orgeo2010.sf1" (the zip itself is large). I want to read that file into a table in PostgreSQL 12.1. This is how I built the table:

create table census_2010.geo_header_sf1
 (
    fileid          varchar(6), 
    stusab          varchar(2), 
    sumlev          varchar(3), 
    geocomp         varchar(2), 
    chariter            varchar(3), 
    cifsn           varchar(2), 
    logrecno            integer PRIMARY KEY, 
    region          varchar(1), 
    division            varchar(1), 
    state           varchar(2), 
    county          varchar(3), 
    countycc            varchar(2), 
    countysc            varchar(2), 
    cousub          varchar(5), 
    cousubcc            varchar(2), 
    cousubsc            varchar(2), 
    place           varchar(5), 
    placecc         varchar(2), 
    placesc         varchar(2), 
    tract           varchar(6), 
    blkgrp          varchar(1), 
    block           varchar(4), 
    iuc         varchar(2), 
    concit          varchar(5), 
    concitcc            varchar(2), 
    concitsc            varchar(2), 
    aianhh          varchar(4), 
    aianhhfp            varchar(5), 
    aianhhcc            varchar(2), 
    aihhtli         varchar(1), 
    aitsce          varchar(3), 
    aits            varchar(5), 
    aitscc          varchar(2), 
    ttract          varchar(6), 
    tblkgrp         varchar(1), 
    anrc            varchar(5), 
    anrccc          varchar(2), 
    cbsa            varchar(5), 
    cbsasc          varchar(2), 
    metdiv          varchar(5), 
    csa         varchar(3), 
    necta           varchar(5), 
    nectasc         varchar(2), 
    nectadiv            varchar(5), 
    cnecta          varchar(3), 
    cbsapci         varchar(1), 
    nectapci            varchar(1), 
    ua          varchar(5), 
    uasc            varchar(2), 
    uatype          varchar(1), 
    ur          varchar(1), 
    cd          varchar(2), 
    sldu            varchar(3), 
    sldl            varchar(3), 
    vtd         varchar(6), 
    vtdi            varchar(1), 
    reserve2            varchar(3), 
    zcta5           varchar(5), 
    submcd          varchar(5), 
    submcdcc            varchar(2), 
    sdelem          varchar(5), 
    sdsec           varchar(5), 
    sduni           varchar(5), 
    arealand            integer, 
    areawatr            integer, 
    name            varchar(90), 
    funcstat            varchar(1), 
    gcuni           varchar(2), 
    pop100          integer, 
    hu100           integer, 
    intptlat            varchar(11), 
    intptlon            varchar(12), 
    lsadc           varchar(2), 
    partflag            varchar(1), 
    reserve3            varchar(6), 
    uga         varchar(5), 
    statens         varchar(8), 
    countyns            varchar(8), 
    cousubns            varchar(8), 
    placens         varchar(8), 
    concitns            varchar(8), 
    aianhhns            varchar(8), 
    aitsns          varchar(8), 
    anrcns          varchar(8), 
    submcdns            varchar(8), 
    cd113           varchar(2), 
    cd114           varchar(2), 
    cd115           varchar(2), 
    sldu2           varchar(3), 
    sldu3           varchar(3), 
    sldu4           varchar(3), 
    sldl2           varchar(3), 
    sldl3           varchar(3), 
    sldl4           varchar(3), 
    aianhhsc            varchar(2), 
    csasc           varchar(2), 
    cnectasc            varchar(2), 
    memi            varchar(1), 
    nmemi           varchar(1), 
    puma            varchar(5), 
    reserved            varchar(18)
);

I tried to read from the file directly with:

Decennial_2010=# COPY census_2010.geo_header_sf1
Decennial_2010-# FROM 'D:\projects_and_data\data\PostgreSQL\data\data\or2010.sf1\orgeo2010.sf1';
ERROR:  value too long for type character varying(6)
CONTEXT:  COPY geo_header_sf1, line 1, column fileid: "SF1ST OR04000000  00000014941      
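
If I understand the error, COPY only understands the text, CSV, and binary formats — there is no fixed-width mode — so it tries to load the entire line into fileid, which overflows varchar(6). The line would first have to be sliced by the column widths in the table definition. A minimal sketch of that slicing, with Python standing in for my R session, only the first ten columns listed, and the full width list left to the SF1 technical documentation:

```python
# Widths/names for the first ten columns, taken from the CREATE TABLE above.
WIDTHS = [6, 2, 3, 2, 3, 2, 7, 1, 1, 2]
NAMES = ["fileid", "stusab", "sumlev", "geocomp", "chariter",
         "cifsn", "logrecno", "region", "division", "state"]

def slice_fields(line, widths=WIDTHS):
    """Cut a fixed-width line into stripped fields, one per width."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

# The start of the line from the COPY error above.
line = "SF1ST OR04000000  00000014941"
row = dict(zip(NAMES, slice_fields(line)))
```

This splits the error line into fileid "SF1ST", sumlev "040", logrecno "0000001", and so on — matching the first row of the R data frame shown below.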

When that didn't work, I thought maybe I could import the file into R (I know enough R to manipulate data), edit it, and write a new fixed-width file. I tried COPY again with the new file and got the same result.

Decennial_2010=# COPY census_2010.geo_header_sf1
Decennial_2010-# FROM 'D:/projects_and_data/data/PostgreSQL/data/data/geo_a' CSV HEADER;
ERROR:  value too long for type character varying(6)
CONTEXT:  COPY geo_header_sf1, line 2, column fileid: "SF1ST OR 40000     14941  

Below is the first row of the data frame that holds the file in R, followed by the write.fwf() call that wrote the new file that didn't work. I know I should use dput(td[1,]), but that prints all the levels for every factor in all 200,000+ rows, and the output doesn't even fit in the console's printable space. So I'll just copy and paste the row as it prints by default. Sorry.

td[1,]
  fileid stusab sumlev geocomp chariter cifsn logrecno region division state
1 SF1ST      OR     40      00        0    NA        1      4        9    41
  county countycc countysc cousub cousubcc cousubsc place placecc placesc tract
1     NA                NA     NA                NA    NA              NA    NA
  blkgrp block iuc concit concitcc concitsc aianhh aianhhfp aianhhcc aihhtli
1     NA    NA  NA     NA       NA       NA     NA       NA                 
  aitsce aits aitscc ttract tblkgrp anrc anrccc cbsa cbsac metdiv csa necta
1     NA   NA     NA                  NA     NA   NA    NA     NA  NA    NA
  nectasc nectadiv cnecta cbsapci nectapci ua uasc uatype ur cd sldu sldl vtd
1      NA       NA     NA                  NA   NA     NA NA NA   NA   NA  NA
  vtdi reserve2 zcta5 submcd submcdcc sdelem sdsec sduni     arealand
1            NA    NA     NA       NA     NA    NA    NA 248607802255
    areawatr
1 6191433228
                                                                                        name
1 Oregon                                                                                    
  funcstat gcuni  pop100   hu100 intptlat  intptlon lsadc partflag reserve3 uga
1        A     ! 3831074 1675562 43.97152 -120.6226    00                NA  NA
  statens countyns cousubns placens concitns aianhhns aitsns anrcns submcdns
1 1155107       NA       NA      NA       NA       NA     NA     NA       NA
  cd113 cd114 cd115 sldu2 sldu3 sldu4 sldl2 sldl3 sldl4 aianhhsc csasc cnectasc
1    NA    NA    NA    NA    NA    NA    NA    NA    NA       NA    NA       NA
  memi nmemi puma reserved
1   NA    NA   NA       NA


library(gdata)
write.fwf(td, "D:/projects_and_data/data/PostgreSQL/data/data/geo_a", sep="")
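
One conversion I could try instead of write.fwf() is writing a proper CSV, with blanks for missing values (COPY's CSV mode treats unquoted empty fields as NULL by default). A sketch in Python rather than R, with the width and name lists truncated to the first seven columns for illustration — the real lists would have to cover every column in the table:

```python
import csv
import io

# Truncated layout for illustration only (first seven geo-header columns).
WIDTHS = [6, 2, 3, 2, 3, 2, 7]
NAMES = ["fileid", "stusab", "sumlev", "geocomp", "chariter", "cifsn", "logrecno"]

def fwf_to_csv(lines, out):
    """Slice each fixed-width line and write it as a CSV row."""
    writer = csv.writer(out)
    writer.writerow(NAMES)
    for line in lines:
        fields, pos = [], 0
        for w in WIDTHS:
            fields.append(line[pos:pos + w].strip())  # blank field -> NULL on COPY
            pos += w
        writer.writerow(fields)

buf = io.StringIO()
fwf_to_csv(["SF1ST OR04000000  0000001"], buf)

# The load would then be roughly:
#   COPY census_2010.geo_header_sf1 (fileid, stusab, sumlev, geocomp,
#                                    chariter, cifsn, logrecno)
#   FROM 'D:/.../geo_a.csv' WITH (FORMAT csv, HEADER);
```

Listing the columns in the COPY call keeps the fields lined up even with a truncated layout; with the full width list, the column list could be dropped.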

What needs to be changed in the table or the starting file to get the file to copy into the table in PostgreSQL?

John Polo
  • Why not write it out from R as a CSV and read the CSV in psql? It looks like the SQL parser isn't able to parse the fixed-width file correctly. – sempervent Dec 20 '19 at 17:39
  • @JoshuaGrant thanks for the suggestion. I just tried it and I get the error "ERROR: extra data after last expected column CONTEXT: COPY geo_header_sf1, line 2: ""1","SF1ST ","OR",40,"00",0,NA,1,4,9,41,NA," ",NA,NA," ",NA,NA," ",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,..."" Since the fields aren't lining up, should I rewrite the table and leave out the length for each of the character fields? Or are those necessary? – John Polo Dec 20 '19 at 17:47
  • Unless you're concerned with saving space in the database or you know for sure the max number of characters, I wouldn't worry about specifying how many characters. You could also use the `RPostgres` package's `dbWriteTable` and not worry about specifically creating the table. – sempervent Dec 20 '19 at 19:04
  • https://rconsortium.github.io/censusguide/r-packages-all.html This may be of interest and may have an R package that can help you out. – sempervent Dec 20 '19 at 19:11
  • @JoshuaGrant thank you for the additional resources, they seem promising. I will look at them and see what works. When I find the right one, I'll let you know, so you can add it as an answer and then I'll mark the question solved. – John Polo Dec 20 '19 at 23:18
  • @JoshuaGrant I downloaded and used the `censusapi` library from rconsortium.github.io/censusguide/r-packages-all.html. Using `lapply()`, `getCensus()`, and a list of the numbers representing the counties in the `regionin` parameter of `getCensus()`, I was able to get the population tables that I wanted for all of the census blocks in Oregon. Thanks for pointing out those tools. – John Polo Dec 29 '19 at 00:06

0 Answers