I have a question about the most efficient way to generate Markov transition matrices for about one million IDs. Each ID has 24 hours/day x 90 days/year x 10 years of data. The data are in the following format:
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 5.0 | 10 | ... | 2.1 | 1.6 |
1 | 1 | 2 | 4.7 | 5.1 | ... | 1.7 | 5.5 |
... | ... | ... | ... | ... | ... | ... | ... |
1 | 10 | 90 | 4.75 | 4.2 | ... | 1.5 | 1.2 |
2 | 1 | 1 | 400 | 420 | ... | 500 | 420 |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 10 | 90 | 322 | 322 | ... | 155 | 100 |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | 0.5 | 0.1 | ... | 0.2 | 0.7 |
I'd like to treat each hour as a time step, so there are going to be 24 time steps. Then I want to create a Markov transition probability matrix from the data. The steps are as follows:
- Normalize the data for each customer. Since each customer's data are on a different scale, they need to be normalized to the common range [0, 1] by dividing by the max value for that ID. The data above will then be transformed as shown below (a PySpark sketch of this step follows the table):
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 5.0/10yr_max(ID1) | 10/10yr_max(ID1) | ... | 2.1/10yr_max(ID1) | 1.6/10yr_max(ID1) |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 1 | 1 | 400/10yr_max(ID2) | 420/10yr_max(ID2) | ... | 500/10yr_max(ID2) | 420/10yr_max(ID2) |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | 0.5/10yr_max(ID1,000,000) | 0.1/10yr_max(ID1,000,000) | ... | 0.2/10yr_max(ID1,000,000) | 0.7/10yr_max(ID1,000,000) |
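For step 1, something like this is what I have in mind so far (a minimal sketch; `df` is the raw DataFrame in the format above, and names like `hour_cols` and `id_max` are placeholders of my own):

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

hour_cols = [f"HE{h:02d}" for h in range(1, 25)]

# Per-ID max over all 24 hour columns and all rows (10 years x 90 days).
w = Window.partitionBy("ID")
row_max = F.greatest(*[F.col(c) for c in hour_cols])  # max within one row
df = df.withColumn("id_max", F.max(row_max).over(w))  # max across all rows per ID

# Divide every hour column by that ID's 10-year max.
for c in hour_cols:
    df = df.withColumn(c, F.col(c) / F.col("id_max"))
```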
- Convert each customer's normalized values into several bins. These bins will be the possible states of the Markov chain. For example, if we divide the values in [0, 1] into 4 bins, then values in [0, 0.25) become bin 1 (state 1); [0.25, 0.5) bin 2 (state 2); [0.5, 0.75) bin 3 (state 3); and [0.75, 1] bin 4 (state 4). The result of step 1 will be transformed into something like the following (a binning sketch follows the table):
ID | Yr | Day | HE01 | HE02 | ... | HE23 | HE24 |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | state 3 | state 3 | ... | state 1 | state 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2 | 1 | 1 | state 2 | state 2 | ... | state 2 | state 2 |
... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
1,000,000 | 10 | 90 | state 3 | state 1 | ... | state 1 | state 3 |
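For step 2 (continuing the sketch above, with `F` and `hour_cols` as defined there), I think equal-width binning can be done with plain column arithmetic rather than a UDF:

```python
# floor(x * n_bins) + 1 maps [0, 0.25) -> state 1, ..., [0.75, 1) -> state 4;
# x == 1 (the ID's own max) is clamped into the top bin.
n_bins = 4
for c in hour_cols:
    df = df.withColumn(
        c, F.least(F.floor(F.col(c) * n_bins) + 1, F.lit(n_bins)).cast("int")
    )
```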
- Calculate the Markov transition matrix for each customer and each time step. For each customer there will be 23 transition matrices; for example, M12 denotes the transition matrix from hour 1 to hour 2. Customer ID 1's M12 will be a 4 x 4 matrix since we have 4 states (a sketch of this step follows below). The format of this matrix is
P11 | P12 | P13 | P14 |
P21 | P22 | P23 | P24 |
P31 | P32 | P33 | P34 |
P41 | P42 | P43 | P44 |
in which: Pij = (count of transitions from state i at hour 1 to state j at hour 2) / (total count of state i at hour 1)
So there will be 1,000,000 customers x 23 matrices/customer, each matrix of dimension 4 x 4.
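For step 3, my rough idea (again a sketch continuing from the above; `step`, `s_from`, `s_to`, and `p` are placeholder names of mine) is to reshape the state columns into one long DataFrame of (from, to) pairs, then count and row-normalize, so each (ID, step) group is one 4 x 4 matrix in long form:

```python
# One row per (ID, hour step, from-state, to-state) observation.
pairs = None
for h in range(1, 24):  # step h covers hour h -> hour h+1
    p = df.select(
        "ID",
        F.lit(h).alias("step"),
        F.col(f"HE{h:02d}").alias("s_from"),
        F.col(f"HE{h + 1:02d}").alias("s_to"),
    )
    pairs = p if pairs is None else pairs.unionByName(p)

counts = pairs.groupBy("ID", "step", "s_from", "s_to").count()

# Pij = count(state i at hour h and state j at hour h+1) / count(state i at hour h)
w_row = Window.partitionBy("ID", "step", "s_from")
trans = counts.withColumn("p", F.col("count") / F.sum("count").over(w_row))
```

I am not sure whether building 23 unions like this scales well to one million IDs, which is part of what I am asking.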
I am a newbie with PySpark DataFrames. Are there any suggestions for efficient code to perform the above transformation?
Thanks in advance for your help.