
I have a data frame where 'Earning' is numeric and A, B, C, D, E, ... are binary vectors.

Earning A B C D E ... (1000 such binary vector columns)
  21    1 0 0 1 1
  45    0 0 0 1 1
  67    0 0 0 1 1
  23    0 0 0 0 1
  44    0 0 0 1 1
  77    1 1 0 0 1
  89    0 1 0 1 1
  90    1 0 0 0 0

Among the 1000 columns A, B, C, ..., I want to retain the 400 columns with the largest colSums. I want to collapse the other 600 columns into a single column named 'Other' that holds a 0 or 1: each row entry in 'Other' is the logical OR of that row's values across the 600 columns with the smallest colSums.

Overall, the intention is to use the 400 most 'popular' columns among A, B, C, D, E, ... (where popularity is measured by the number of 1s in the binary vector) as predictors in a linear regression on Earning.

ausworli

1 Answer


Suppose that dfs is a data.frame containing your data, with 'Earning' as the first column.

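# Rank the binary columns by column sum, largest first.
# dfs[, -1] drops 'Earning' (column 1), so +1 maps the ranks back to positions in dfs.
new_order = order(colSums(dfs[, -1], na.rm = TRUE), decreasing = TRUE) + 1
res = cbind(
    dfs[, c(1, new_order[1:400])],
    # 'other' is 1 if any of the remaining columns is 1 in that row, otherwise 0
    other = 1*(rowSums(dfs[, new_order[-(1:400)], drop = FALSE], na.rm = TRUE) > 0)
    )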

res is the resulting data.frame: 'Earning' first, then the top 400 columns in decreasing order of column sum, then 'other'.
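To check the idea on something small, here is a sketch that applies the same steps to the eight sample rows from the question. The cutoff k = 2 is only a hypothetical stand-in for 400, the names k, keep, rest and fit are introduced just for this illustration, and the final lm() call is an assumption about how the regression step might look.

```r
# Toy data copied from the question (only columns A-E are shown there)
dfs <- data.frame(
  Earning = c(21, 45, 67, 23, 44, 77, 89, 90),
  A = c(1, 0, 0, 0, 0, 1, 0, 1),
  B = c(0, 0, 0, 0, 0, 1, 1, 0),
  C = c(0, 0, 0, 0, 0, 0, 0, 0),
  D = c(1, 1, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)

k <- 2  # hypothetical cutoff; the question uses 400

# Column positions in dfs, sorted by decreasing column sum (+1 skips 'Earning')
new_order <- order(colSums(dfs[, -1], na.rm = TRUE), decreasing = TRUE) + 1
keep <- new_order[seq_len(k)]    # the k most popular columns
rest <- new_order[-seq_len(k)]   # everything else, to be collapsed

res <- cbind(
  dfs[, c(1, keep)],
  # row-wise OR over the remaining columns, stored as 0/1
  other = 1 * (rowSums(dfs[, rest, drop = FALSE], na.rm = TRUE) > 0)
)

names(res)  # "Earning" "E" "D" "other"

# Assumed regression step: Earning on the kept columns plus 'other'
fit <- lm(Earning ~ ., data = res)
```

Here E and D have the largest column sums (7 and 5), so they are kept, and A, B and C are folded into 'other'. Note that order() breaks ties by position, so with tied column sums at the cutoff the earlier column wins.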

Gregory Demin