I have a Python function that takes a block of text and returns a special 2D vector/dictionary representation of it, parameterized by a chosen length n. An example output might look like this:
1: [6, 8, 1]
2: [6, 16, 4, 4, 5, 11, 5, 8]
3: [4, 7, 8, 4]
..
..
n: [5, 2, 1, 4, 5, 6]
The keys from 1 to n represent positions within the input text; e.g., if n = 12, the key 5 would hold data that is ~5/12 of the way into the document.
The length of the list of ints at each key is arbitrary; thus, another block of text, for the same n value, could very well produce this:
1: [4, 5, 16, 7, 6]
2: None
3: [7, 9, 12]
..
..
n: [3]
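
In Python terms, the structure is basically the following (text_to_rep is just a placeholder name I'm using here for my actual function):

```python
from typing import Dict, List, Optional

# The representation: keys 1..n, each holding a list of ints or None.
Rep = Dict[int, Optional[List[int]]]

def text_to_rep(text: str, n: int) -> Rep:
    """Placeholder for my real function -- returns {1: [...], 2: None, ..., n: [...]}."""
    ...
```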
I want to create a similarity measure for any two such vectors of the same length n. One thing I've tried is to consider only the average of each integer list in the dictionary, which gives me simple 1D vectors for an easy cosine comparison.
But this loses a little more information than I'd like (not to mention the trouble with occasional None values).
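
For reference, the averaging idea is roughly this (a rough sketch, not my exact code; it works on the text_to_rep-style dicts above, and I'm just treating None or empty lists as 0.0, which is part of what feels lossy):

```python
import math
from typing import Dict, List, Optional

Rep = Dict[int, Optional[List[int]]]

def rep_to_means(rep: Rep, n: int) -> List[float]:
    """Collapse each key's int list to its mean; None or an empty list becomes 0.0."""
    means = []
    for k in range(1, n + 1):
        values = rep.get(k)
        means.append(sum(values) / len(values) if values else 0.0)
    return means

def cosine(a: List[float], b: List[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity(rep_a: Rep, rep_b: Rep, n: int) -> float:
    """Compare two representations by the cosine of their per-key means."""
    return cosine(rep_to_means(rep_a, n), rep_to_means(rep_b, n))
```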
Since I can create different vectors/different 'granularities' of the representation by choosing different *n*s, would there be value in taking two documents, creating multiple vector pairs over a range of matched *n*s, and then doing some kind of average of averages?
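
Concretely, I'm picturing something like this (again only a sketch; text_to_rep and similarity are the hypothetical pieces from the snippets above, passed in as callables):

```python
from statistics import mean
from typing import Callable, Sequence

def multi_n_similarity(
    text_a: str,
    text_b: str,
    text_to_rep: Callable,   # the (hypothetical) text -> representation function
    similarity: Callable,    # per-n similarity, e.g. the cosine-of-means sketch above
    ns: Sequence[int] = (4, 8, 16, 32),
) -> float:
    """Compare two documents at several granularities and average the per-n scores."""
    scores = []
    for n in ns:
        rep_a = text_to_rep(text_a, n)
        rep_b = text_to_rep(text_b, n)
        scores.append(similarity(rep_a, rep_b, n))
    return mean(scores)
```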
Or would it be better to approach things entirely differently? I could just represent input texts as 1D vectors and still capture the idea I want, but they would end up being different lengths, which might complicate the comparison. (Come to think of it, the varied lengths at each key in the original representation don't exactly solve that problem either...ha. But still...)