0

I have a dataset containing the dates and post_id for which writer of the post received a reaction:

userid date postid
u1      d1     a
u1      d1     b
u1      d1     c
u1      d2     a
u1      d2     b
u1      d2     d
u1      d3     b
u1      d3     d
u2      d1     g
u2      d3     g
u2      d3     h
...

there are 4M postids and each postid is related to only 1 user.

My final dataset is a balanced data of user-day and a variable showing whether a user received a reaction in that day:

userid date reaction
u1      d1     1
u1      d2     1
u1      d3     1
u2      d1     1
u2      d2     0
u2      d3     1

Using the first dataset, I need to indicate which post each user received a reward for (I need to control for the unobserved characteristics of the post using fixed effects). Theoretically, the most straightforward solution is to have 4M dummies for each postid and make it 0 or 1:

userid date reaction post_a post_b post_c post_d post_g post_h 
u1      d1     1        1     1       1       0     0      0
u1      d2     1        1     1       0       1     0      0
u1      d3     1        0     1       0       1     0      0
u2      d1     1        0     0       0       0     1      0
u2      d2     0        0     0       0       0     0      0
u2      d3     1        0     0       0       0     1      1

this will require 4M dummies to be created, impossible for me to handle. Even if I take advantage of the fact that each postid is unique to 1 user, I need more than 4k dummies, since some users receive a reaction to 4k different posts in a day. my idea is to combine these dummies into categorical variables:

userid date reaction generic_post1 generic_post2 generic_post3
u1      d1     1          b              a             c     
u1      d2     1          b              a             d   
u1      d3     1          b              0             d   
u2      d1     1          g              0             0   
u2      d2     0          0              0             0    
u2      d3     1          g              h             0   

for this specific example, I was able to reduce the number of variables needed to 3.

I need an algorithm that leads to fewest categorical variables with fewest 0's. However, there are two conditions:

  1. each postid needs to be under only 1 generic_postid (generic_post2 for u1 in d3 (row 3) cannot be d).
  2. generic_post's need to be filled from left to right so that 0's in generic_post1 are not more than 0's in generic_post2. Because of computing limitations, I may need to use the first 5 to 10 generic_post variables, so I need the maximum number of postids in them.
ozaroos
  • 61
  • 4

0 Answers0