I have a list of three-element tuples that represent windowed sequences. Using pyspark, I need to be able to get the third element of a tuple given its first two. In other words, I need it to learn three-element sequences based on their frequency.
This is what I am doing:
from pyspark.mllib.fpm import PrefixSpan

data = [[['a','b','c'],['b','c','d'],['c','d','e'],['d','e','f'],
         ['e','f','g'],['f','g','h'],['a','b','c'],['d','e','f'],
         ['a','b','c'],['b','c','d'],['f','g','h'],['d','e','f'],
         ['b','c','d']]]
rdd = spark.sparkContext.parallelize(data, 2)
rdd.cache()
model = PrefixSpan.train(rdd, minSupport=0.2, maxPatternLength=3)
print(sorted(model.freqSequences().take(100)))
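For reference, here is a plain-Python check of how `data` is nested (no Spark needed). Note the double outer brackets: the outer list has a single element, so `parallelize` would hand PrefixSpan one sequence of thirteen itemsets rather than thirteen sequences. I am not sure whether this nesting is what the algorithm expects:

```python
# Same dataset as above; note the double outer brackets.
data = [[['a','b','c'],['b','c','d'],['c','d','e'],['d','e','f'],
         ['e','f','g'],['f','g','h'],['a','b','c'],['d','e','f'],
         ['a','b','c'],['b','c','d'],['f','g','h'],['d','e','f'],
         ['b','c','d']]]

# The outer list is what parallelize() splits into records.
print(len(data))     # number of sequences handed to PrefixSpan: 1
print(len(data[0]))  # itemsets inside that single sequence: 13
```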
I would expect the sequences and their frequencies to follow the alphabetical windows I defined, but they don't.
Instead I am getting sequences like:
FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
FreqSequence(sequence=[[u'g'], [u'c'], [u'c']], freq=1)
neither of which appears among the windows I defined. Obviously there is a problem in the way I have structured my features, or I am missing something about the purpose and functionality of this algorithm.
Thank you!