
I have a Python function that takes in a block of text and returns a special 2D vector/dictionary representation of it, depending on a chosen length n. An example output might look like this:

1: [6, 8, 1]
2: [6, 16, 4, 4, 5, 11, 5, 8]
3: [4, 7, 8, 4]
..
..
n: [5, 2, 1, 4, 5, 6]

The keys from 1 to n represent positions within the input text; e.g., if n = 12, the key 5 would hold data that is ~5/12 of the way into the document.

The length of the list of ints at each key is arbitrary; thus, another block of text, for the same n value, could very well produce this:

1: [4, 5, 16, 7, 6]
2: None
3: [7, 9, 12]
..
..
n: [3]

I want to create a similarity measure for any two such vectors of the same length n. One thing I've tried is to consider only the average of each integer list in the dictionary, thus producing simple 1D vectors for an easy cosine comparison.

But this loses a little more information than I'd like (not to mention the trouble with occasional None values).
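For concreteness, here's roughly what I mean by the averaging approach (just a minimal sketch; treating None or empty positions as 0.0 is one arbitrary choice among several):

```python
from math import sqrt

def to_1d(vec, n):
    """Collapse the {position: [ints]} dict into a length-n list of means.
    None or empty lists become 0.0 (an arbitrary choice)."""
    out = []
    for k in range(1, n + 1):
        values = vec.get(k)
        out.append(sum(values) / len(values) if values else 0.0)
    return out

def cosine(u, v):
    """Plain cosine similarity between two equal-length lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# The two example vectors above, truncated to n = 3 for brevity:
vec_a = {1: [6, 8, 1], 2: [6, 16, 4, 4, 5, 11, 5, 8], 3: [4, 7, 8, 4]}
vec_b = {1: [4, 5, 16, 7, 6], 2: None, 3: [7, 9, 12]}
print(cosine(to_1d(vec_a, 3), to_1d(vec_b, 3)))
```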

Since I can create different vectors/different 'granularities' of the representation by choosing different *n*s, would there be value in taking two documents, creating multiple vector pairs over a range of matched *n*s, and then doing some kind of average of averages?
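Something like the following is what I have in mind (again just a sketch: `vectorize(text, n)` stands in for my existing function that builds the {position: [ints]} dict, the particular set of n values is arbitrary, and it reuses `to_1d`/`cosine` from the sketch above):

```python
def multi_scale_similarity(text_a, text_b, ns=(4, 8, 16, 32)):
    """Compare two documents at several granularities and average the scores."""
    scores = []
    for n in ns:
        u = to_1d(vectorize(text_a, n), n)  # vectorize() = my existing function
        v = to_1d(vectorize(text_b, n), n)
        scores.append(cosine(u, v))
    return sum(scores) / len(scores)
```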

Or would it be better to approach things entirely differently? I could just represent input texts as 1D vectors and still capture the idea I want, but they would end up being different lengths, which might complicate the comparison. (Come to think of it, the varied lengths at each key in the original representation don't exactly avoid that problem either... ha. But still...)

norman
  • Similarity for what / in what way? What is it about two vectors (in the context of the actual data, the problem) that you'd like to be similar? There are many metrics that could apply here: you could interpret each vector as a 2D matrix and then just take e.g. Euclidean (or discrete) distance on R^(n^2). – gabe Sep 04 '14 at 22:05
  • Are you wanting to compute cosine similarity so that you can calculate tf-idf? – EdChum Sep 04 '14 at 22:06
  • To follow up on @EdChum: if you're using tf-idf, then why break the document up into n chunks? – gabe Sep 04 '14 at 22:08
  • You could store the terms and their positions, which will create a sparse matrix, but I don't see how this is useful for calculating similarity. The standard approaches are Bayesian/tf-idf. Also, position is pretty meaningless when calculating similarity unless you are doing phrase query matches, in which case the order would be important. – EdChum Sep 04 '14 at 22:12
  • @EdChum - Sort of. I could treat the ints in each list as terms and gather up counts of them for each document, but then I'd lose the positional information. Unless each term somehow combined position and the int... but that might make the problem too large. – norman Sep 08 '14 at 01:46
  • 1
    In my opinion the position and order depending on what you are trying to do is of little value, however there are well documented techniques for efficiently storing postings lists and compressing them. Your problem is a little wooly and I think is better suited for a different site like crossvalidated, data science, computer science or have a look at [which site](http://stackexchange.com/sites) may be suitable – EdChum Sep 08 '14 at 06:17

0 Answers