
I currently use .N to find the number of unique rows in a file, grouping with by=.

For example, to find the count of unique rows over col1 and col2 in a data.table, dt, the query would be:

dt[, .N, by="col1,col2"]

For very large files this can take a very long time. If the table is sorted, is there a faster way to do this? Basically, you could keep a counter and, every time a new unique row is encountered, record a single entry and update it with the number of times that row appears. I can't use a for loop as that would take forever.
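To make the idea concrete, here is a minimal base-R sketch of that counter approach, assuming the table really is sorted by col1 and col2 (the column names come from the example above, and the separator used to build the key is an arbitrary choice):

# run-length encoding of the concatenated key columns of an already-sorted table:
# each run corresponds to one unique (col1, col2) combination
key <- paste(dt$col1, dt$col2, sep = "\001")
runs <- rle(key)
n_unique <- length(runs$lengths)  # number of unique rows
counts <- runs$lengths            # how many times each unique row appears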

xbsd
  • Once again, absolutely no reason to add the 'rcpp' tag. Please edit -- and while you're at it, maybe learn how to format code here. It's all documented rather well... – Dirk Eddelbuettel Jul 13 '13 at 18:40
  • I appreciate everybody's edits but I would still appreciate some appropriate effort by the OP himself. – Dirk Eddelbuettel Jul 13 '13 at 19:29
  • Sure, understood. Posted from an iPhone, probably didn't enter the spaces for the code properly. Either way, thanks for giving us rcpp, after a long time I can finally claim and prove that R is good enough to handle all those projects that mgmt used to hand over to developers of mainstream languages that had approved budgets. – xbsd Jul 13 '13 at 19:33
  • I'm a bit puzzled. With the function `table`, I've never had "very long" delays. – IRTFM Jul 13 '13 at 19:40
  • 1
    I would also like to see some proof that the data.table query shown in the question takes "a very long time". How do you define "very long" here? – Roland Jul 13 '13 at 19:44
  • I have not tried this, but tabulating a large sorted table with the standard table function will probably take much longer than the .N method given above. The tables in this context are temp tables that are typically several hundred million rows long. – xbsd Jul 13 '13 at 19:51
  • What I typically end up doing is splitting the data set recursively, finding the unique counts of each set, and summing up .N at the end, a sort of map-reduce (a rough sketch follows this comment thread). I was mainly thinking, however, along the lines of something like the gzip algorithm, which essentially looks at rows that are similar and sets a pointer, so the compression ratio of a file containing the sequence 'aaaabbbb' 1 mil times is much higher than that of a file with 'abababab' ... . – xbsd Jul 13 '13 at 20:03
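A rough sketch of that split-and-recombine idea, for illustration only (the split into four chunks is arbitrary, and col1/col2 are the column names from the example):

library(data.table)

# count within chunks, then sum the partial counts per group
idx <- split(seq_len(nrow(dt)), cut(seq_len(nrow(dt)), 4, labels = FALSE))
partial <- rbindlist(lapply(idx, function(i) dt[i, .N, by = c("col1", "col2")]))
result <- partial[, .(N = sum(N)), by = c("col1", "col2")]  # per-row counts
nrow(result)                                                # number of unique rows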

2 Answers


unique.data.table is very different from base R's unique in that, if a key is set, it returns unique rows based only on the key columns of the data.table. To see this with an example, try the following:

library(data.table)

dt <- data.table(x=c(1,1,1,2,2), y=c(5,6,6,7,8))
unique(dt) # no key set, similar to 'unique.data.frame' output (4 rows)

# set key now
setkey(dt, x)
unique(dt) # unique based on just column x (2 rows, one per value of x)

Therefore, if you want just the total number of unique rows, try the following:

setkeyv(dt, c("col1", "col2"))  # key (and sort) by the grouping columns
nrow(unique(dt))                # number of unique key combinations
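As a side note, newer versions of data.table (around 1.9.6 onwards) also provide uniqueN(), which returns this count directly; a minimal sketch, assuming the same col1 and col2 columns:

library(data.table)

# count distinct (col1, col2) combinations in one call
uniqueN(dt, by = c("col1", "col2"))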
Arun
Ricardo Saporta
  • That seems to take the same time as running just unique on the table as-is...? Is there any way to specify that the table being read in is already sorted, e.g. when reading a sorted csv file, basically avoiding having to run setkey? – xbsd Jul 13 '13 at 23:14
  • Just realised that if `x` is already sorted then, instead of setting the key, you could set `setattr(dt, 'sorted', 'x')` so that the key is marked as `x` directly instead of re-sorting... and this could save quite some time! Thanks to @eddi for this hack ([**go here**](http://stackoverflow.com/a/17751249/559784)); a minimal sketch follows below. – Arun Jul 21 '13 at 19:54
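Applying that hack to the question's col1/col2 example (this assumes the rows really are already in col1, col2 order; if they are not, subsequent keyed operations will silently give wrong results):

library(data.table)

# mark dt as already sorted/keyed by col1 and col2 without re-sorting
setattr(dt, "sorted", c("col1", "col2"))
key(dt)           # now reports "col1" "col2"
nrow(unique(dt))  # unique rows over the key columns, no sort performed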

On your question:

dt[, .N, by="col1,col2"]

actually does not give you the number of unique rows, while either of these two does:

dt[, .N, by="col1,col2"][, .N] # data.table solution
nrow(dt[, .N, by="col1,col2"]) # data.frame syntax applied to data.table

My answer to your question:

A core feature of the data.table package is working with a key. On p. 2 of the short introduction to the data.table package, it reads:

Furthermore, the rows are sorted by the key. Therefore, a data.table can have at most one key, because it cannot be sorted in more than one way.

Thus, unless you have a column defining the sort order that you can set as the key, the fact that your data are sorted will be of no advantage; you need to set the key. For your purpose (large data files, presumably with many columns), you would want to include all of the columns in your dataset when setting the key:

setkeyv(dt, names(dt)) # use key(dt) to check whether this went as expected
unique(dt)[, .N] # or nrow(unique(dt))

PS: please provide us with a reproducible dataset, so we can assess what you consider fast or slow.

Richard