How can I get this ASCII text file into a usable data format?

Question

I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string correspond to different columns, but I don't know how to tell R or Stata to import the file while cutting at those positions. Any help here would be much appreciated!

The first few rows of the data look like this:

350AK001046 921  00 20000  0000000     N0000008198                                                            KETCHIKAN                Alaska                                                                                                                                                                                                                                                                                                                                                                                                                                       
350AK001046 921  01100210               000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000001000000004000000003000000003000000001000000000000000000000000001000000001000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000001000000001000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000003000000000000000000000000008                    
350AK001046 921  01100240               000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000002000000003000000004000000000000000000000000002000000001000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000001000000001000000002000000000000000003000000000000000000000000000000000000000000000000000000000000000000000000000000000000000014000000000000000004000000000000000000000000005                    
350AK001046 921  01100260               000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000002000000006000000005000000009000000005000000005000000006000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000001000000003000000002000000003000000002000000002000000001000000000000000000000000000000000000000000000000000000000000000000000000000000035000000000000000015000000000000000000000000025

It is not obvious to me after clicking the link which file you want among the files I see there. The files I see are csv files. In R you can use `read.csv(...)` in Stata you can use `import delimited`. Please provide more specific details or perhaps even a reproducible example with the error if this does not solve your question. — TheIceBear, Feb 23 '23 at 14:59
Sorry I didn't make it more clear! If you scroll down to "Master File Downloads" and select "Arrests" and "2021" from the dropdown menus, you should be able to download the data as a .txt file and a help .xlsx file. — leecarvallo, Feb 23 '23 at 15:12
Could you include a portion of a file right in the question so it is clear what the question is about. — G. Grothendieck, Feb 23 '23 at 15:14
I've included what the first few rows of the data look like. I'm not sure how else to show what I'm asking about. — leecarvallo, Feb 23 '23 at 15:16

G. Grothendieck · Answer 1 · 2023-02-23T15:50:12.697

Assume that we have the file lee.dat generated in the Note at the end. It is made up of lines that are 8 or 4 fields and we assume that the 8 field rows are to be prepended to the 4 field rows that follow it. Now use count.fields to find the number of fields in each row and then use cumsum to create a grouping variable g that will group the 8 field row with the following 4 field rows. Then paste each group and read everything in.

File <- "lee.dat"
g <- cumsum(count.fields(File) == 8)
File |>
  readLines() |>
  tapply(g, function(x) paste(x[1], x[-1])) |>
  unlist() |>
  read.table(text = _, colClasses = "character")

giving

           V1  V2 V3    V4      V5          V6        V7     V8          V9 V10      V11 V12
1 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100210 000
2 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100240 000
3 350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska 350AK001046 921 01100260 000

Note

We have reduce the long field in the 4 field rows to 3 characters for ease of exposition.

Lines <- "350AK001046 921 00 20000 0000000 N0000008198 KETCHIKAN Alaska
350AK001046 921 01100210 000
350AK001046 921 01100240 000
350AK001046 921 01100260 000
"
cat(Lines, file = "lee.dat")

score 0 · Answer 2 · answered Feb 23 '23 at 21:02

In Stata this would be a job for the infix command.

If you put your example data in a file called test.raw, this code would extract the first four columns:

# delim ;
infix  
    ID 1-1 
    numeric_state 2-3 
    str ori_code 4-10
using test.raw;

Here's what the output from a list command looks after that:

     +--------------------------+
     | ID   numeri~e   ori_code |
     |--------------------------|
  1. |  3         50    AK00104 |
  2. |  3         50    AK00104 |
  3. |  3         50    AK00104 |
  4. |  3         50    AK00104 |
     +--------------------------+

I got the initial column headers and positions from clicking "Download the Help file for the Arrest File" under "Master File Downloads" at the link you provided. Here's a screenshot of the PDF:

score 0 · Answer 3 · answered Feb 23 '23 at 22:51

The best tool for such a task, is usually "Excel" but with "Big Data" it fails at line 653??+ (classically Highlighted most recently by the Loss of Covid Patient tests) Hence I think the best approach would be to split the file in to parts of 65000 lines, with heads or tails or a file splitter, THEN use file import into Spreadsheets with columns.

How can I get this ASCII text file into a usable data format?

3 Answers3

Note