I have a data frame where 'Earning' is numeric and A, B, C, D, E, ... are binary (0/1) vectors.
Earning A B C D E ... (1000 such binary vector columns)
21 1 0 0 1 1
45 0 0 0 1 1
67 0 0 0 1 1
23 0 0 0 0 1
44 0 0 0 1 1
77 1 1 0 0 1
89 0 1 0 1 1
90 1 0 0 0 0
Among the 1000 columns A, B, C, ..., I want to retain the 400 columns with the largest colSums. The other 600 columns should be binned into a single column named 'Other' holding 0 or 1 (each row entry in 'Other' is the logical OR of that row's values across the 600 columns with the smallest colSums).
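To make the intended transformation concrete, here is a minimal sketch in base R of what I am after. The helper name `bin_rare_columns` and its arguments `response` and `k` are hypothetical, made up for illustration. The data is the five-column sample above, so `k = 2` stands in for the real `k = 400`; ties between equal column sums are resolved in whatever order `sort()` returns.

```r
# Sample data from the question (only the five indicator columns shown).
df <- data.frame(
  Earning = c(21, 45, 67, 23, 44, 77, 89, 90),
  A = c(1, 0, 0, 0, 0, 1, 0, 1),
  B = c(0, 0, 0, 0, 0, 1, 1, 0),
  C = c(0, 0, 0, 0, 0, 0, 0, 0),
  D = c(1, 1, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)

# Hypothetical helper: keep the k indicator columns with the largest
# column sums and OR all remaining indicators into one `Other` column.
bin_rare_columns <- function(data, response = "Earning", k = 400) {
  indicators <- setdiff(names(data), response)
  totals <- colSums(data[indicators])

  # Column names ordered from most to least popular.
  ranked <- names(sort(totals, decreasing = TRUE))
  keep <- head(ranked, k)
  rest <- setdiff(indicators, keep)

  out <- data[c(response, keep)]
  # For 0/1 columns, "row sum > 0" is the same as a row-wise OR.
  out$Other <- as.integer(rowSums(data[rest]) > 0)
  out
}

# k = 2 only because the sample has five indicator columns.
binned <- bin_rare_columns(df, k = 2)
print(binned)
```

The result could then presumably be passed to something like `lm(Earning ~ ., data = binned)`, though I have not checked whether that is the best way to set up the regression.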
The overall intention is to use the 400 most 'popular' columns among A, B, C, D, E, ... (where popularity is measured by the number of 1s in the binary vector) in a linear regression on Earning.