I have a question about the most efficient way to generate Markov transition matrices for about one million IDs. Each ID has 24 hours/day x 90 days/year x 10 years of data. The data are in the following format:
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 5.0 | 10 | ... | 2.1 | 1.6 |
1 | 1 | 2 | 4.7 | 5.1 | ... | 1.7 | 5.5 |
... | ... | ... | ... | ... | ... | ... | ... |
1 | 10 | 90 | 4.75 | 4.2 | ... | 1.5 | 1.2 |
2 | 1 | 1 | 400 | 420 | ... | 500 | 420 |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 10 | 90 | 322 | 322 | ... | 155 | 100 |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | 0.5 | 0.1 | ... | 0.2 | 0.7 |
I'd like to treat each hour as a time step, so there are going to be 24 time steps. Then I want to create a Markov transition probability matrix from the data. The steps are as follows:
- Normalize the data for each customer. Since each customer's data are on a different scale, they need to be normalized to the common range [0, 1] by dividing by the max value for that ID. The data above will then be transformed as shown below (a PySpark sketch of this step follows the table):
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 5.0/10yr_max(ID1) | 10/10yr_max(ID1) | ... | 2.1/10yr_max(ID1) | 1.6/10yr_max(ID1) |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 1 | 1 | 400/10yr_max(ID2) | 420/10yr_max(ID2) | ... | 500/10yr_max(ID2) | 420/10yr_max(ID2) |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | 0.5/10yr_max(ID1,000,000) | 0.1/10yr_max(ID1,000,000) | ... | 0.2/10yr_max(ID1,000,000) | 0.7/10yr_max(ID1,000,000) |
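For step 1, something like this is what I have in mind so far (a minimal sketch; `df` is the raw DataFrame in the format above, and names like `hour_cols` and `id_max` are placeholders of my own):

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

hour_cols = [f"HE{h:02d}" for h in range(1, 25)]

# Per-ID max over all 24 hour columns and all rows (10 years x 90 days).
w = Window.partitionBy("ID")
row_max = F.greatest(*[F.col(c) for c in hour_cols])  # max within one row
df = df.withColumn("id_max", F.max(row_max).over(w))  # max across all rows per ID

# Divide every hour column by that ID's 10-year max.
for c in hour_cols:
    df = df.withColumn(c, F.col(c) / F.col("id_max"))
```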
- Convert each customer's normalized values into several bins. These bins will be the possible states of the Markov chain. For example, if we divide the values in [0, 1] into 4 bins, then values in [0, 0.25) become bin 1 (state 1); [0.25, 0.5) bin 2 (state 2); [0.5, 0.75) bin 3 (state 3); and [0.75, 1] bin 4 (state 4). The result of step 1 will be transformed into something like the following (a binning sketch follows the table):
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | state 3 | state 3 | ... | state 1 | state 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 1 | 1 | state 2 | state 2 | ... | state 2 | state 2 |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | state 3 | state 1 | ... | state 1 | state 3 |
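For step 2 (continuing the sketch above, with `F` and `hour_cols` as defined there), I think equal-width binning can be done with plain column arithmetic rather than a UDF:

```python
# floor(x * n_bins) + 1 maps [0, 0.25) -> state 1, ..., [0.75, 1) -> state 4;
# x == 1 (the ID's own max) is clamped into the top bin.
n_bins = 4
for c in hour_cols:
    df = df.withColumn(
        c, F.least(F.floor(F.col(c) * n_bins) + 1, F.lit(n_bins)).cast("int")
    )
```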
- Calculate the Markov transition matrix for each customer and each time step. For each customer there will be 23 transition matrices; for example, M12 denotes the transition matrix from hour 1 to hour 2. Customer ID 1's M12 will be a 4 x 4 matrix since we have 4 states (a sketch of this step follows below). The format of this matrix is
P11 | P12 | P13 | P14 |
P21 | P22 | P23 | P24 |
P31 | P32 | P33 | P34 |
P41 | P42 | P43 | P44 |
in which: Pij = (count of transitions from state i at hour 1 to state j at hour 2) / (total count of state i at hour 1)
So there will be 1,000,000 customers x 23 matrices/customer, each matrix of dimension 4 x 4.
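For step 3, my rough idea (again a sketch continuing from the above; `step`, `s_from`, `s_to`, and `p` are placeholder names of mine) is to reshape the state columns into one long DataFrame of (from, to) pairs, then count and row-normalize, so each (ID, step) group is one 4 x 4 matrix in long form:

```python
# One row per (ID, hour step, from-state, to-state) observation.
pairs = None
for h in range(1, 24):  # step h covers hour h -> hour h+1
    p = df.select(
        "ID",
        F.lit(h).alias("step"),
        F.col(f"HE{h:02d}").alias("s_from"),
        F.col(f"HE{h + 1:02d}").alias("s_to"),
    )
    pairs = p if pairs is None else pairs.unionByName(p)

counts = pairs.groupBy("ID", "step", "s_from", "s_to").count()

# Pij = count(state i at hour h and state j at hour h+1) / count(state i at hour h)
w_row = Window.partitionBy("ID", "step", "s_from")
trans = counts.withColumn("p", F.col("count") / F.sum("count").over(w_row))
```

I am not sure whether building 23 unions like this scales well to one million IDs, which is part of what I am asking.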
I am a newbie with PySpark DataFrames. Are there any suggestions for efficient code to perform the above transformation?
Thanks in advance for your help.