0

I have a question of the most efficient way to generate markov tranform maxtrix for about one million IDs. Each account has 24 hours/day x 90 days/year x 10 years data. The data are in the format as follows:

ID Yr Day HE01 HE02 ... HE23 HE24
1 1 1 5.0 10 ... 2.1 1.6
1 1 2 4.7 5.1 ... 1.7 5.5
... ... ... ... ... ... ... ...
1 10 90 4.75 4.2 ... 1.5 1.2
2 1 1 400 420 ... 500 420
... ... ... ... ... ... ... ...
2 10 90 322 322 ... 155 100
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
1,000,000 10 90 0.5 0.1 ... 0.2 0.7

I'd like to treat each hour as a time stamp, So there are going to be 24 time steps. Then, create markov transform possiblity matrix from it. The followings are the steps:

  1. Normalization data for each customer. Since each customer has different data scale, it is required to be normalized to the same range of [0, 1] by divding the max value for each ID itself. So the data above will be transfered as
ID Yr Day HE01 HE02 ... HE23 HE24
1 1 1 5.0/10yr_max(ID1) 10yr_max(ID1) ... 2.1/10yr_max(ID1) 1.6/10yr_max(ID1)
... ... ... ... ... ... ... ...
2 1 1 400/10yr_max(ID2) 420/10yr_max(ID2) ... 500/10yr_max(ID2) 420/10yr_max(ID2)
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
1,000,000 10 90 0.510/yr_max(ID1,000,000) 0.1/yr_max(ID1,000,000) ... 0.2/yr_max(ID1,000,000) 0.7/yr_max(ID1,000,000)
  1. Convert each customers normalized value into sevarl bins. These bins will be the possible states of markov tranform. For example, if we divide values between [0,1] in to 4 bins. Then the value between [0, 0.25) be bin 1 (state 1); [0.25, 0.5) be bin 2 (state 2); [0.5, 0.75) be bin 3 (state 3); [0.75, 1] be bin 4 (state 4). The data above in step 1 result will be tansformed something as below:
ID Yr Day HE01 HE02 ... HE23 HE24
1 1 1 state 3 state 3 ... state 1 state 1
... ... ... ... ... ... ... ...
2 1 1 state 2 stat 2 ... state 2 state 2
... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
1,000,000 10 90 state 3 state 1 ... state 1 state 3
  1. Calculate Markov Transformation Matrix for each customer, each time step. For example, for each customer, there will be 23 transformation matrix, such as M12 indicate the tranformation matrix from hour 1 to hour 2. As an example, the ID 1 customer M12 will be a 4 x 4 matrix since we have 4 states. The format of this matrix is
P11 P12 P13 P14
P21 P22 P23 P24
P31 P32 P33 P34
P41 P42 P43 P44

in which: Pij = (total count of state i at hour 1)/(total count of state j at hour 2 which was state 1 at hour 1)

So there will be 1,000,000 customers x 23 Matrix/customer x (4x4) dimension/matrix.

I am a newbee in the pyspark dataframe. Is there any suggestion for efficient code to make the above transformation?

Thanks for your help in advance

Cu Buffalo
  • 59
  • 2

0 Answers0