Revision of a previous question to include edge cases.
I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:
d <- as.data.table(read.csv('[filepath]'))
print(d)
Classifications ucr_ncic_code
SOVEREIGNTY NA
Treason 101
Treason Misprison 102
Espionage 103
Sovereignty 199
MILITARY (restricted to agencies) NA
Military Desertion 201
Military 299
IMMIGRATION NA
Illegal Entry 301
False Citizenship 302
Smuggling Aliens 303
Immigration 399
CRIMES AGAINST PERSON 7099
HOMICIDE NA
Homicide Family-Gun 901
Homicide Family-Weapon 902
Homicide Nonfam-Gun 903
PROPERTY CRIMES 7199
<TRUNCATED>
As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers and most have an NA code (e.g SOVEREIGNTY NA
). However, some headers include non-caps characters (e.g. MILITARY (restricted to agencies)
), and some headers don't have any sub-categories and therefore have a valid code (e.g. CRIMES AGAINST PERSON 7099
). What I would like to do is reformat the data so that these headers are their own categorial column in the table.
Here is my initial solution, which I am almost sure is not the best approach, but produces the desired result:
d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)]
#the substring is for some edge cases that I don't show here
setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]
prev.row <- NA
for (i in seq(1,d[,.N])) {
current.row <- d$Category[i]
if (is.na(current.row) & !(is.na(prev.row))){
d$Category[i] <- prev.row
}
prev.row <- d$Category[i]
}
#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]
print(d)
Classifications ucr_ncic_code Category
Treason 101 SOVEREIGNTY
Treason Misprison 102 SOVEREIGNTY
Espionage 103 SOVEREIGNTY
Sovereignty 199 SOVEREIGNTY
Military Desertion 201 MILITARY (restricted to agencies)
Military 299 MILITARY (restricted to agencies)
Illegal Entry 301 IMMIGRATION
False Citizenship 302 IMMIGRATION
Smuggling Aliens 303 IMMIGRATION
Immigration 399 IMMIGRATION
CRIMES AGAINST PERSON 7099 CRIMES AGAINST PERSON
Homicide Family-Gun 901 HOMICIDE
Homicide Family-Weapon 902 HOMICIDE
Homicide Nonfam-Gun 903 HOMICIDE
PROPERTY CRIMES 7099 PROPERTY CRIMES
<TRUNCATED>
What would be a better way to utilize the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop that I designed, but many simpler solutions are hindered by the inconsistencies of character formatting in the headers and their codes, or lack thereof (see previous question).