I am using glmnet on web data. The data is typically categorical (high-cardinality factors) and has millions of samples, so I am dealing with 'big data' and want to be memory efficient.
Because the data is categorical, it can be represented much more compactly by grouping rows with the same covariate pattern and passing the number of successes and failures for each group, e.g. 'male','30-35': 30 successes, 50 failures.
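For concreteness, the kind of grouped fit I mean looks roughly like this (a minimal sketch with made-up data; glmnet's binomial family accepts a two-column matrix of counts, and sparse.model.matrix keeps the one-hot design matrix sparse):

```r
library(Matrix)
library(glmnet)

## Toy grouped data: one row per covariate pattern, with outcome counts
grouped <- data.frame(
  gender   = c("male", "female", "male", "female"),
  age_band = c("30-35", "30-35", "36-40", "36-40"),
  succ     = c(30, 50, 10, 20),
  fail     = c(50, 30, 20, 10)
)

## Sparse one-hot design matrix: one row per group, not per observation
X <- sparse.model.matrix(~ gender + age_band, data = grouped)

## Two-column count response; glmnet treats the second column as the
## target class, so P(success) is what gets modelled
fit <- glmnet(X, cbind(grouped$fail, grouped$succ), family = "binomial")
```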
The issue I face is with cross-validation. Because of the grouping I cannot simply partition my dataset (call the grouped dataset X). What I would like is to pass in the grouped independent variables unchanged, but partition the outcomes across the folds: e.g. the 30 successes / 50 failures above would be split into 10 (#successes, #failures) pairs, replicating what would happen if I cross-validated the original ungrouped data. Is there any way of running cv.glmnet in this way? The alternative of replicating all the data k times (with different success/failure counts) is obviously less memory efficient.
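For comparison, the fully ungrouped route that cv.glmnet handles out of the box would look something like the sketch below (same toy data as above); it is exactly this one-row-per-observation expansion of X that I want to avoid:

```r
library(Matrix)
library(glmnet)

## Same toy grouped data as in the previous sketch
grouped <- data.frame(gender   = c("male", "female", "male", "female"),
                      age_band = c("30-35", "30-35", "36-40", "36-40"),
                      succ     = c(30, 50, 10, 20),
                      fail     = c(50, 30, 20, 10))

## Expand back to one row per underlying observation (N rows, not n groups)
idx <- rep(seq_len(nrow(grouped)), grouped$succ + grouped$fail)
X   <- sparse.model.matrix(~ gender + age_band, data = grouped[idx, ])
y   <- unlist(Map(function(s, f) c(rep(1, s), rep(0, f)),
                  grouped$succ, grouped$fail))

## Plain cross-validation on the ungrouped rows
cvfit <- cv.glmnet(X, y, family = "binomial", nfolds = 10)
```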
Assume I have 2 groups and 2 folds:
'male','30-35'   : 30 successes, 50 failures
'female','30-35' : 50 successes, 30 failures
Then what I would like is:
X =
[['male','30-35'],
 ['female','30-35']]

y =
[[[20,30], [10,20]],
 [[25,30], [25, 0]]]
So the X variable contains n groups (one row per group). The y dependent variable then has n rows and 2 columns, each entry containing a success/failure tuple, and each column representing a fold. Now I am not saying I am after this particular data structure, just that for grouped data I do not want to create a kN-row X_fold matrix and a corresponding kN-row y_fold matrix, i.e.
X_fold =
[['male','30-35',...],
 ['female','30-35',...],
 ['male','30-35',...],
 ['female','30-35',...]]
and y_fold =
[[20,30],
 [25,30],
 [10,20],
 [25, 0]]
The point is that X has many rows and columns, and I do not want to replicate it when the independent data is identical and only the numbers of successes and failures change (allowing for folds with 0 successes and 0 failures for some rows).
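To make the desired partitioning concrete, here is a rough sketch of the hand-rolled loop I would otherwise have to write: each group's counts are randomly split into per-fold success/failure counts (mimicking random fold assignment of the underlying observations), the single sparse X is built once and reused, and only the two-column count response changes between folds. The toy data, the split_counts helper and k = 5 are purely illustrative, not glmnet API:

```r
library(Matrix)
library(glmnet)

## Toy grouped data again: one row per covariate pattern
grouped <- data.frame(gender   = c("male", "female", "male", "female"),
                      age_band = c("30-35", "30-35", "36-40", "36-40"),
                      succ     = c(30, 50, 10, 20),
                      fail     = c(50, 30, 20, 10))
k <- 5

## Randomly split one group's counts into k (successes, failures) pairs,
## mimicking random fold assignment of the underlying observations.
## Assumes each group has at least k observations.
split_counts <- function(succ, fail, k) {
  folds   <- sample(rep_len(seq_len(k), succ + fail))  # fold id per observation
  outcome <- c(rep(1, succ), rep(0, fail))
  cbind(succ = tapply(outcome, folds, sum),
        fail = tapply(1 - outcome, folds, sum))        # k x 2 matrix
}

set.seed(1)
per_fold <- lapply(seq_len(nrow(grouped)), function(i)
  split_counts(grouped$succ[i], grouped$fail[i], k))

## The sparse design matrix is built once and reused for every fold
X <- sparse.model.matrix(~ gender + age_band, data = grouped)

## Common lambda path so the fold-wise deviances line up
lambda_path <- glmnet(X, cbind(grouped$fail, grouped$succ),
                      family = "binomial")$lambda

dev_by_fold <- vector("list", k)
for (f in seq_len(k)) {
  y_test  <- t(sapply(per_fold, function(m) m[f, ]))   # held-out (succ, fail) per group
  y_train <- cbind(succ = grouped$succ, fail = grouped$fail) - y_test
  fit <- glmnet(X, y_train[, c("fail", "succ")],       # glmnet wants (failures, successes)
                family = "binomial", lambda = lambda_path)
  p   <- predict(fit, X, type = "response")            # P(success), one column per lambda
  ## held-out binomial deviance per lambda for this fold
  dev_by_fold[[f]] <- -2 * colSums(y_test[, "succ"] * log(p) +
                                   y_test[, "fail"] * log(1 - p))
}
## averaging dev_by_fold and picking the minimising lambda would mimic cv.glmnet
```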
I am assuming this cannot be done without modifying the source code, but wanted to double-check that no one else has come across this problem.