2

Revision of a previous question to include edge cases.

I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:

d <- as.data.table(read.csv('[filepath]'))
print(d)

Classifications                    ucr_ncic_code
SOVEREIGNTY                        NA
Treason                            101
Treason Misprison                  102
Espionage                          103
Sovereignty                        199
MILITARY (restricted to agencies)  NA
Military Desertion                 201
Military                           299 
IMMIGRATION                        NA
Illegal Entry                      301
False Citizenship                  302
Smuggling Aliens                   303
Immigration                        399
CRIMES AGAINST PERSON              7099
HOMICIDE                           NA
Homicide Family-Gun                901
Homicide Family-Weapon             902
Homicide Nonfam-Gun                903
PROPERTY CRIMES                    7199
<TRUNCATED>

As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers and most have an NA code (e.g SOVEREIGNTY NA). However, some headers include non-caps characters (e.g. MILITARY (restricted to agencies)), and some headers don't have any sub-categories and therefore have a valid code (e.g. CRIMES AGAINST PERSON 7099). What I would like to do is reformat the data so that these headers are their own categorial column in the table.

Here is my initial solution, which I am almost sure is not the best approach, but produces the desired result:

d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)] 
#the substring is for some edge cases that I don't show here

setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]

prev.row <- NA
for (i in seq(1,d[,.N])) {
  current.row <- d$Category[i]  
  if (is.na(current.row) & !(is.na(prev.row))){
    d$Category[i] <- prev.row
  } 
  prev.row <- d$Category[i]
}

#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]

print(d)

Classifications   ucr_ncic_code   Category
Treason                 101       SOVEREIGNTY
Treason Misprison       102       SOVEREIGNTY
Espionage               103       SOVEREIGNTY
Sovereignty             199       SOVEREIGNTY
Military Desertion      201       MILITARY (restricted to agencies)
Military                299       MILITARY (restricted to agencies)
Illegal Entry           301       IMMIGRATION
False Citizenship       302       IMMIGRATION
Smuggling Aliens        303       IMMIGRATION
Immigration             399       IMMIGRATION
CRIMES AGAINST PERSON   7099      CRIMES AGAINST PERSON
Homicide Family-Gun     901       HOMICIDE
Homicide Family-Weapon  902       HOMICIDE
Homicide Nonfam-Gun     903       HOMICIDE
PROPERTY CRIMES         7099      PROPERTY CRIMES
<TRUNCATED>

What would be a better way to utilize the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop that I designed, but many simpler solutions are hindered by the inconsistencies of character formatting in the headers and their codes, or lack thereof (see previous question).

Community
  • 1
  • 1
grrothman
  • 171
  • 1
  • 11

1 Answers1

3

It should only take one line:

dt[,Category := Classifications[(x=grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]][]
#                       Classifications ucr_ncic_code                          Category
#  1:                       SOVEREIGNTY            NA                       SOVEREIGNTY
#  2:                           Treason           101                       SOVEREIGNTY
#  3:                 Treason Misprison           102                       SOVEREIGNTY
#  4:                         Espionage           103                       SOVEREIGNTY
#  5:                       Sovereignty           199                       SOVEREIGNTY
#  6: MILITARY (restricted to agencies)            NA MILITARY (restricted to agencies)
#  7:                Military Desertion           201 MILITARY (restricted to agencies)
#  8:                          Military           299 MILITARY (restricted to agencies)
#  9:                       IMMIGRATION            NA                       IMMIGRATION
# 10:                     Illegal Entry           301                       IMMIGRATION

Explanation

Try creating an index that marks the changing categories. We need a pattern that can identify each change like "^[A-Z]{2,}". This is a simple regular expression that matches two or more capital letters at the start of Classifications. After identifying the heading rows, we can take the cumulative sum of that index. It sounds odd at first but what's happening under the hood is a conversion from logical to numeric. Each TRUE will become 1. When added together it becomes a subsettable index (i.e. 1 1 1 2 2 3 3 3...):

I should also mention an R trick in there. I created a new variable and used it in the same line. You are allowed to do (x=1+1) + x to get 4 in R for example.

Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • 1
    A +1 for `(x=1+1) + x` or, I should say a +`(z = sum(1/factorial(seq(0,10))))*0 + log(z)` – Frank Nov 14 '15 at 21:13