Given the following input dataframe:
npos = 3
inp = spark.createDataFrame([
    ['1', 23, 0, 2],
    ['1', 45, 1, 2],
    ['1', 89, 1, 3],
    ['1', 95, 2, 2],
    ['1', 95, 0, 4],
    ['2', 20, 2, 2],
    ['2', 40, 1, 4],
], schema=["id","elap","pos","lbl"])
From it, a dataframe which looks like this needs to be constructed:
out = spark.createDataFrame([
    ['1', 23, [2,0,0]],
    ['1', 45, [2,2,0]],
    ['1', 89, [2,3,0]],
    ['1', 95, [4,3,2]],
    ['2', 20, [0,0,2]],
    ['2', 40, [0,4,2]],
], schema=["id","elap","vec"])
The input dataframe has tens of millions of records.
Some details, which can be seen in the example above (by design):

- `npos` is the size of the vector to be constructed in the output
- `pos` is guaranteed to be in `[0, npos)`
- at each time step (`elap`) there will be at most 1 `lbl` for a `pos`
- if `lbl` is not given at a time step, it has to be inferred from the last time it was specified for that `pos`
- if `lbl` was not previously specified, it can be assumed to be 0
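
For concreteness, here is a minimal sketch of one possible approach (untested at the stated scale, and assuming Spark 2.4+ for `array_sort` and `transform`): build a grid of every observed (`id`, `elap`) pair crossed with the `npos` positions, left-join the labels that were actually specified, forward-fill per (`id`, `pos`) with `last(..., ignorenulls=True)` over a window, and collect the filled labels into the vector.

from pyspark.sql import functions as F, Window

npos = 3

# One row per observed (id, elap) time step crossed with every position 0..npos-1
steps = inp.select("id", "elap").distinct()
positions = spark.range(npos).select(F.col("id").alias("pos"))
grid = steps.crossJoin(positions)

# Attach the labels that were actually specified at each step (null where not given)
labeled = grid.join(inp, ["id", "elap", "pos"], "left")

# Forward-fill the last known lbl per (id, pos); default to 0 if never specified
w = (Window.partitionBy("id", "pos").orderBy("elap")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
filled = labeled.withColumn(
    "lbl", F.coalesce(F.last("lbl", ignorenulls=True).over(w), F.lit(0)))

# Collect the filled labels into a vector ordered by pos
out = (filled
       .groupBy("id", "elap")
       .agg(F.array_sort(F.collect_list(F.struct("pos", "lbl"))).alias("pl"))
       .withColumn("vec", F.expr("transform(pl, x -> x.lbl)"))
       .drop("pl"))

Note that the cross join multiplies the row count by `npos` before aggregating, so with tens of millions of records the shuffle may be substantial; that trade-off is part of what makes this question interesting at scale.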