
I am using glmnet for web data. The data is typically categorical (high-cardinality factors) and has millions of samples. I am dealing with 'big data' and want to be memory-efficient.

Because it's categorical, one can represent the data more efficiently by grouping and passing the number of successes and failures for each group, e.g. 'male','30-35': 30 successes, 50 failures.

The issue I face is with crossvalidation: because of the grouping, I cannot just partition my dataset (call X the original grouped data set). What I would like is to pass in the original grouped independent variables, but partition the outcomes across the folds: e.g. the 30 successes and 50 failures would be split into 10 (#success, #failure) pairs, replicating what would happen if I did crossvalidation on the original ungrouped data. Is there any way of running cv.glmnet like this? The alternative of replicating all the data k times (with different success/failure values) is obviously less memory-efficient.

Assume I have 2 groups and 2 folds:

'male','30-35': 30 successes, 50 failures

'female','30-35': 50 successes, 30 failures

Then what I would like is:

X =

    [['male','30-35'],
     ['female','30-35']]

y =

    [[[20,30], [10,20]],
     [[25,30], [25, 0]]]

So the X variable contains n groups. The dependent variable y then has n rows and 2 columns, with each entry a (success, failure) tuple and each column representing a fold. Now I am not saying I am after this particular data structure, just that for grouped data I do not want to create a kN-row X_fold matrix and a corresponding kN-row y_fold matrix,

i.e. X_fold =

    [['male','30-35',...],
     ['female','30-35',...],
     ['male','30-35',...],
     ['female','30-35',...]]

and y_fold =

    [[20,30],
     [25,30],
     [10,20],
     [25, 0]]

The point is that X has many rows and columns, and I do not want to replicate it when the independent data is the same; only the number of successes and failures changes (allowing for folds with 0 successes and 0 failures for some rows).
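For concreteness, generating the fold-level counts themselves is straightforward. Here is a minimal sketch (split_counts is just an illustrative helper, not anything in glmnet), assuming each ungrouped observation is assigned to one of k folds uniformly at random, so that each group's fold counts are multinomial:

# illustrative helper: split one group's (successes, failures) across
# k folds, mimicking uniform fold assignment of the ungrouped rows
split_counts <- function(successes, failures, k) {
    cbind(succ = rmultinom(1, successes, rep(1, k))[, 1],
          fail = rmultinom(1, failures, rep(1, k))[, 1])
}

set.seed(1)
split_counts(30, 50, k = 2)  # e.g. the 'male','30-35' group above

The hard part is not generating these counts, but getting cv.glmnet to consume fold-level counts rather than fold-level rows.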

I am assuming this cannot be done without modifying the source code, but I wanted to double-check that no one else had come across this problem.

seanv507
  • Could you offer an example of what your desired folds would look like? – Megatron Feb 24 '17 at 15:18
  • I have clarified. I guess I am suggesting that for grouped data, the standard way of partitioning rows between folds doesn't make sense. Instead one should partition the outcome counts (so the y data becomes n rows by num_outcomes * num_folds columns, where num_outcomes would be 2 for logistic regression). I am assuming this must be done in code, but perhaps there is already a fork of the code? – seanv507 Feb 24 '17 at 16:03
  • What do you mean by X and Y folds? – Megatron Feb 24 '17 at 16:05
  • I think it's easier if you ask yourself how you would perform crossvalidation on a data set assuming that the data is grouped (so the outcome variable is a count of successes and failures) and it has millions of rows and tens of factors. – seanv507 Feb 24 '17 at 16:27

1 Answer


You mention crossvalidation, but that's not really the issue here. What you want, when all your variables are categorical, is to summarise your dataset into a contingency table of factors and fit a model to the counts of responses and non-responses. This is a fairly well-known technique in logistic regression, and it applies both to fitting the base model and to crossvalidating it.

To see how it works, let's generate an example dataset (1 million rows):

set.seed(12345)
# four categorical predictors with 10, 20, 5 and 15 levels,
# plus a binary response with a 10% success rate
df <- data.frame(
    x1 = factor(sample(10, 1e6, TRUE)),
    x2 = factor(sample(20, 1e6, TRUE)),
    x3 = factor(sample(5, 1e6, TRUE)),
    x4 = factor(sample(15, 1e6, TRUE)),
    y = rbinom(1e6, 1, 0.1))

Now collapse it down to a contingency table (15000 cells/rows):

library(dplyr)
# collapse to one row per combination of factor levels:
# y = number of successes, ny = number of failures in each cell
# (inside summarise, the y in n() - y is the freshly computed sum)
dfsmry <- df %>%
    group_by(x1, x2, x3, x4) %>%
    summarise(y = sum(y), ny = n() - y)
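As a quick sanity check (optional, assuming the df and dfsmry objects built above), the collapsed counts should reproduce the raw totals:

# the grouped success/failure counts must sum back to the raw data
stopifnot(sum(dfsmry$y) == sum(df$y),
          sum(dfsmry$ny) == nrow(df) - sum(df$y))
nrow(dfsmry)  # close to 15000 = 10 * 20 * 5 * 15 possible cells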

Now fit an elastic net model. When fitting a logistic regression to summarised data, the response is a 2-column matrix with the total failures and successes in each cell (glmnet treats the second column as the target class, hence cbind(ny, y) below).

# to make life easier: see https://github.com/hong-revo/glmnetUtils
library(glmnetUtils)

# base model
mod <- glmnet(cbind(ny, y) ~ x1 + x2 + x3 + x4, data = dfsmry, family = "binomial")

# do crossvalidation
cvmod <- cv.glmnet(cbind(ny, y) ~ x1 + x2 + x3 + x4, data = dfsmry, family = "binomial")
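If glmnetUtils can't be used (see the comments below), a roughly equivalent sketch with base glmnet should work, assuming the same dfsmry as above; the sparse design matrix comes from Matrix::sparse.model.matrix, and glmnet accepts the same 2-column count response:

library(glmnet)
library(Matrix)

# sparse dummy-coded design matrix; drop the intercept column,
# since glmnet fits its own intercept
X <- sparse.model.matrix(~ x1 + x2 + x3 + x4, data = dfsmry)[, -1]
# 2-column response: failures first, successes (the target class) second
Y <- with(dfsmry, cbind(ny, y))

mod2 <- glmnet(X, Y, family = "binomial")
cvmod2 <- cv.glmnet(X, Y, family = "binomial")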
Hong Ooi
  • Thank you for generating example code, I should have done that! I will try to generate my own version (one that doesn't require glmnetUtils, which is incompatible with my R). – seanv507 Feb 27 '17 at 19:29
  • The problem is what sampling scheme cv.glmnet uses for grouped data (i.e. during crossvalidation). I am claiming that cv.glmnet currently samples *rows* (i.e. groups) uniformly (to test, add the keep=TRUE parameter in cv.glmnet; I haven't been able to run it for this example), whereas what is required is sampling the *ungrouped* data uniformly. – seanv507 Feb 27 '17 at 19:35
  • If you really have millions of rows and tens of columns, it doesn't matter what sampling scheme you use. If you have tens of rows and millions of columns, that's another matter. – Hong Ooi Feb 27 '17 at 23:19
  • Also, glmnetUtils doesn't require anything more than what glmnet itself requires, so if you can use one, you can use both. – Hong Ooi Feb 27 '17 at 23:20