Subset by Count of Factors

Question

I am working with Uniform Crime Reporting data for Nebraska cities (a generous classification) and have calculated crime rates for the major classifications from 1995 to 2010 in 5-year increments.

I would like to plot the crime rates across the years. However, not all cities have reported values for all years because of the way reporting in the UCR works.

I'm fairly new to R, but a colleague suggested that I write a for loop that counts the observations for each unique city name. I could then use these counts to drop or subset the data so that every city has at least three observations to plot. The loop below is about as far as I have gotten, and it does not work. Unfortunately, I need to focus on some more pressing issues for the rest of the week, so I thought I'd throw it to the community to get some insight.

The code and the names data are below. Thanks.

drop = NULL
city.names <- unique(cnames)

for (i in 1:length(city.names)){
  x = sum(cnames==i)
 if (x < 3) {c(drop,i)}
}

There are 191 observations with 64 unique names. Data are csv and imported as

data <- read.csv("cities.csv", header=TRUE, sep=",")

"","year","cnames"
"1",1995,"Beatrice"
"2",1995,"Bellevue"
"3",1995,"Columbus"
"4",1995,"Fremont"
"5",1995,"Grand Island"
"6",1995,"Hastings"
"7",1995,"Kearney"
"8",1995,"La Vista"
"9",1995,"Lincoln"
"10",1995,"Norfolk"
"11",1995,"North Platte"
"12",1995,"Omaha"
"13",1995,"Papillion"
"14",1995,"Scottsbluff"
"15",1995,"South Sioux City"
"16",2000,"Bellevue"
"17",2000,"Columbus"
"18",2000,"Fremont"
"19",2000,"Grand Island"
"20",2000,"Hastings"
"21",2000,"Kearney"
"22",2000,"La Vista"
"23",2000,"Lincoln"
"24",2000,"Norfolk"
"25",2000,"Omaha"
"26",2000,"Papillion"
"27",2000,"Scottsbluff"
"28",2000,"South Sioux City"
"29",2005,"Alliance"
"30",2005,"Ashland"
"31",2005,"Auburn"
"32",2005,"Bayard"
"33",2005,"Beatrice"
"34",2005,"Bellevue"
"35",2005,"Blair"
"36",2005,"Bridgeport"
"37",2005,"Broken Bow"
"38",2005,"Central City"
"39",2005,"Chadron"
"40",2005,"Columbus"
"41",2005,"Cozad"
"42",2005,"Crete"
"43",2005,"David City"
"44",2005,"Elkhorn"
"45",2005,"Falls City"
"46",2005,"Fremont"
"47",2005,"Gering"
"48",2005,"Gothenburg"
"49",2005,"Grand Island"
"50",2005,"Hastings"
"51",2005,"Holdrege"
"52",2005,"Imperial"
"53",2005,"Kearney"
"54",2005,"La Vista"
"55",2005,"Lexington"
"56",2005,"Lincoln"
"57",2005,"Lyons"
"58",2005,"Madison"
"59",2005,"McCook"
"60",2005,"Milford"
"61",2005,"Minden"
"62",2005,"Mitchell"
"63",2005,"Nebraska City"
"64",2005,"Norfolk"
"65",2005,"North Platte"
"66",2005,"Ogallala"
"67",2005,"Omaha"
"68",2005,"O'Neill"
"69",2005,"Ord"
"70",2005,"Papillion"
"71",2005,"Plainview"
"72",2005,"Plattsmouth"
"73",2005,"Ralston"
"74",2005,"Schuyler"
"75",2005,"Scottsbluff"
"76",2005,"Seward"
"77",2005,"Sidney"
"78",2005,"South Sioux City"
"79",2005,"St. Paul"
"80",2005,"Superior"
"81",2005,"Valley"
"82",2005,"Wahoo"
"83",2005,"West Point"
"84",2005,"Wymore"
"85",2005,"York"
"86",2010,"Alliance"
"87",2010,"Ashland"
"88",2010,"Auburn"
"89",2010,"Aurora"
"90",2010,"Bayard"
"91",2010,"Beatrice"
"92",2010,"Bellevue"
"93",2010,"Bennington"
"94",2010,"Blair"
"95",2010,"Bridgeport"
"96",2010,"Broken Bow"
"97",2010,"Central City"
"98",2010,"Chadron"
"99",2010,"Columbus"
"100",2010,"Cozad"
"101",2010,"Crete"
"102",2010,"Falls City"
"103",2010,"Fremont"
"104",2010,"Gering"
"105",2010,"Gothenburg"
"106",2010,"Grand Island"
"107",2010,"Hastings"
"108",2010,"Holdrege"
"109",2010,"Imperial"
"110",2010,"Kearney"
"111",2010,"La Vista"
"112",2010,"Lexington"
"113",2010,"Lincoln"
"114",2010,"Lyons"
"115",2010,"Madison"
"116",2010,"McCook"
"117",2010,"Milford"
"118",2010,"Minden"
"119",2010,"Nebraska City"
"120",2010,"Norfolk"
"121",2010,"North Platte"
"122",2010,"Ogallala"
"123",2010,"Omaha"
"124",2010,"O'Neill"
"125",2010,"Papillion"
"126",2010,"Plainview"
"127",2010,"Plattsmouth"
"128",2010,"Ralston"
"129",2010,"Scottsbluff"
"130",2010,"Seward"
"131",2010,"Sidney"
"132",2010,"South Sioux City"
"133",2010,"Superior"
"134",2010,"Valentine"
"135",2010,"Valley"
"136",2010,"Wahoo"
"137",2010,"Wayne"
"138",2010,"West Point"
"139",2010,"Wilber"
"140",2010,"York"
"141",2013,"Alliance"
"142",2013,"Ashland"
"143",2013,"Aurora"
"144",2013,"Beatrice"
"145",2013,"Bellevue"
"146",2013,"Bennington"
"147",2013,"Blair"
"148",2013,"Bridgeport"
"149",2013,"Broken Bow"
"150",2013,"Central City"
"151",2013,"Chadron"
"152",2013,"Columbus"
"153",2013,"Cozad"
"154",2013,"Crete"
"155",2013,"Falls City"
"156",2013,"Fremont"
"157",2013,"Gering"
"158",2013,"Gordon"
"159",2013,"Gothenburg"
"160",2013,"Grand Island"
"161",2013,"Hastings"
"162",2013,"Holdrege"
"163",2013,"Imperial"
"164",2013,"Kearney"
"165",2013,"Kimball"
"166",2013,"La Vista"
"167",2013,"Lexington"
"168",2013,"Lincoln"
"169",2013,"Madison"
"170",2013,"McCook"
"171",2013,"Milford"
"172",2013,"Minden"
"173",2013,"Mitchell"
"174",2013,"Nebraska City"
"175",2013,"Norfolk"
"176",2013,"Ogallala"
"177",2013,"Omaha"
"178",2013,"O'Neill"
"179",2013,"Papillion"
"180",2013,"Plattsmouth"
"181",2013,"Ralston"
"182",2013,"Scottsbluff"
"183",2013,"Seward"
"184",2013,"South Sioux City"
"185",2013,"Superior"
"186",2013,"Valentine"
"187",2013,"Valley"
"188",2013,"Wahoo"
"189",2013,"West Point"
"190",2013,"Wilber"
"191",2013,"York"

How do you read the data into R? It's missing from your code. — Aleksandr Blekh, Jan 08 '15 at 04:26
@AleksandrBlekh Apologies. I hope the changes clear things up. — pophealth, Jan 08 '15 at 05:11
No problem, no need to apologize. It's clearer now, though a bit long. Advice for the future: if possible, try to use [Github Gists](https://help.github.com/articles/about-gists), [Pastebin](http://pastebin.com) or similar services. [RPubs](http://rpubs.com) is an R-focused one. [Figshare](http://figshare.com) is also nice and more comprehensive. — Aleksandr Blekh, Jan 08 '15 at 05:33
Two more notes: 1) I don't think that your code does what you want; 2) I don't see a statistics component in this question, so perhaps migrating it to StackOverflow with `r` tag will result in more attention and faster help (I will flag the question for you). — Aleksandr Blekh, Jan 08 '15 at 05:41

akrun · Answer 1 · 2015-01-08T06:32:29.420

There are many ways to subset by the "frequency" of a column's values, both in base R and in other packages. One option is to call the table function on the "cnames" column to get the frequencies. The output is a named table of counts, with one "key/value" pair (name/frequency) for each unique "cnames". Checking whether the values are less than 3 (tbl < 3) gives a logical index of "TRUE/FALSE". Subset the names of "tbl" with that index, then match those names against the "cnames" column using %in%. I am showing two methods, one using < with negation (!), the other using >=

 tbl <- table(data$cnames)
 data[!data$cnames %in% names(tbl)[tbl <3],]

Or

 data[data$cnames %in% names(tbl)[tbl >=3],]
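
As a rough illustration of the two table-based forms above, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only, not the real UCR data.

```r
# Hypothetical toy data standing in for the real file
toy <- data.frame(
  year   = c(1995, 2000, 2005, 1995, 2005, 2010, 2013, 2010),
  cnames = c("Omaha", "Omaha", "Omaha",
             "Lincoln", "Lincoln", "Lincoln", "Lincoln",
             "Wayne"),
  stringsAsFactors = FALSE
)

tbl <- table(toy$cnames)        # number of rows per city
rare   <- names(tbl)[tbl < 3]   # cities with fewer than 3 rows
common <- names(tbl)[tbl >= 3]  # cities with at least 3 rows

kept_a <- toy[!toy$cnames %in% rare, ]   # drop the rare cities
kept_b <- toy[toy$cnames %in% common, ]  # keep the common cities
```

Both kept_a and kept_b should contain the same rows, since excluding the rare names and keeping the common names are complementary selections.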

Or use ave to get the number of rows for each unique "cnames" and build the logical index with the >= operator. ave returns its output in the same order as the original dataset, so it can be used directly for subsetting.

 data[with(data, ave(seq_along(cnames), cnames, FUN=length)>=3),]
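
To see what ave contributes here, the sketch below prints the per-row counts before using them as a filter. The data frame toy and its values are made up for illustration.

```r
# Hypothetical toy data with the cities interleaved
toy <- data.frame(
  year   = c(1995, 1995, 2000, 2005, 2005, 2010, 2010),
  cnames = c("Omaha", "Lincoln", "Omaha", "Wayne",
             "Lincoln", "Omaha", "Lincoln"),
  stringsAsFactors = FALSE
)

# One count per row, aligned with the rows of toy
n_per_row <- with(toy, ave(seq_along(cnames), cnames, FUN = length))
print(n_per_row)

# Rows whose city appears at least 3 times
kept <- toy[n_per_row >= 3, ]
```

Because n_per_row has one entry per row of toy, no lookup by name is needed: the comparison n_per_row >= 3 is already a row-wise logical index.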

If you are using data.table, the code is more compact and faster for big datasets. Convert the "data.frame" to a "data.table" using setDT, assign the counts (n:=.N) for each unique "cnames", and finally subset the dataset with >=. Note that := adds the column n to data by reference.

library(data.table)
setDT(data)[,n:=.N, cnames][n>=3]
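
For comparison, the loop in the question fails for two reasons: it compares cnames against the loop index i rather than against a city name, and the result of c(drop, i) is never assigned, so drop stays NULL. A minimal sketch of a repaired loop, run on a made-up vector of names (the values are assumptions, not the real data):

```r
# Hypothetical vector of city names
cnames <- c("Omaha", "Lincoln", "Omaha", "Wayne",
            "Lincoln", "Omaha", "Lincoln")

drop <- character(0)
for (city in unique(cnames)) {
  # Count the rows for this city, comparing against the name itself
  if (sum(cnames == city) < 3) {
    drop <- c(drop, city)   # assign the result so it is kept
  }
}

kept <- cnames[!cnames %in% drop]
```

The vectorised versions above do the same job without an explicit loop, which is usually the more idiomatic choice in R.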
