This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see the previous post for details on the import process and where the strata variable came from):
> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata,
data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)
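Judging from the overflow warnings, the table(ids[, 1], strata[, 1]) call seems to be trying to allocate one cell for every combination of serial and strata, and that count blows past the 32-bit integer range. A rough back-of-the-envelope check (just a sketch, reusing the ipums data frame loaded above):
n.serial <- length(unique(ipums$serial))   # distinct cluster ids
n.strata <- length(unique(ipums$strata))   # distinct strata
as.numeric(n.serial) * n.strata            # cells table() would need, computed in doubles
n.serial * n.strata                        # same product in integer arithmetic: NA plus an overflow warning once past 2^31 - 1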
This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max:
> .Machine$integer.max <- 2^40
and, when that didn't work, by replacing the whole source code of tabulate:
> tabulate <- function(bin, nbins = max(1L, bin, na.rm = TRUE))
{
    if (!is.numeric(bin) && !is.factor(bin))
        stop("'bin' must be numeric or a factor")
    # if (nbins > .Machine$integer.max)
    if (nbins > 2^40)  # replacement line
        stop("attempt to make a table with >= 2^31 elements")
    .C("R_tabulate",
       as.integer(bin),
       as.integer(length(bin)),
       as.integer(nbins),
       ans = integer(nbins),
       NAOK = TRUE,
       PACKAGE = "base")$ans
}
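As a sanity check on the second attempt (this is my understanding of how namespace lookup works, not something I have verified against the R sources): table() lives in the base namespace, so it resolves tabulate there and never sees a copy defined in the workspace, and the same applies to the shadowed .Machine:
environment(table)                   # <environment: namespace:base>
identical(tabulate, base::tabulate)  # FALSE once the masking copy exists in .GlobalEnv
# table() finds base::tabulate inside namespace:base before it ever looks at the
# workspace, so the masked version above is never called; likewise the assignment
# to .Machine$integer.max only creates a modified copy in .GlobalEnv.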
Neither circumvented the problem. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that would avoid this problem, I would only be able to access 2^31 elements at a time. My hope was to use sql (either sqlite or postgresql) to get around the memory problems, but I'm afraid I'll spend a while getting that to work, only to run into the same fundamental limit.
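For what it's worth, the survey package does appear to support database-backed designs directly (the dbtype and dbname arguments to svydesign), so the sqlite route might look something like the sketch below. I haven't tried it on this data, the file and table names are placeholders, and whether it dodges the table() blow-up or just hits the same wall later is exactly what I don't know.
require(RSQLite)
con <- dbConnect(SQLite(), dbname = "ipums.db")
dbWriteTable(con, "ipums", ipums)    # push the data frame into sqlite
dbDisconnect(con)
ipums.db.design <- svydesign(id = ~serial, strata = ~strata, weights = ~perwt,
                             data = "ipums", dbtype = "SQLite", dbname = "ipums.db")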
Attempting to switch back to Stata doesn't solve the problem either. Again, see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:
svy: mean age, over(strata)
Whether throwing more memory at it will solve the problem, I don't know. I run R on my desktop, which has 16 GB of RAM, and I use Stata through a Windows server, currently with the memory allocation set to 2000 MB, though I could theoretically experiment with increasing that.
So in sum:
- Is this a hard limit in R?
- Would sql solve my R problems?
- If I split it up into many separate files, would that fix it (a lot of work...)?
- Would throwing a lot of memory at Stata do it?
- Am I seriously barking up the wrong tree somehow?