I am trying to find a good way to do nearest-neighbor (or approximate nearest-neighbor) search over documents.
Right now I am using tf-idf as the vector representation of each document. My data is pretty big (N ~ 1 million documents). When I use annoy with the tf-idf vectors, I run out of memory. I figured this is because of tf-idf's high dimensionality (my vocabulary is about 2,000,000 Chinese words).
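A quick back-of-the-envelope calculation (assuming annoy stores every item as a dense float32 vector, since it has no sparse-vector support) shows why this blows up:

```python
# Rough memory estimate for storing tf-idf vectors densely,
# which is effectively what annoy does (no sparse support).
n_docs = 1_000_000        # ~ a million documents
vocab_size = 2_000_000    # ~ two million vocabulary terms
bytes_per_float = 4       # float32 components

dense_bytes = n_docs * vocab_size * bytes_per_float
print(f"{dense_bytes / 1024**4:.0f} TiB")  # on the order of terabytes, far beyond RAM
```

So the out-of-memory error is expected with dense storage at this dimensionality; either the vectors need to stay sparse or the dimensionality needs to come down.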
I then tried PySparNN, which works great since it operates on sparse vectors. However, my concern is that as my data grows, PySparNN will build an ever-larger index, and eventually it might not fit into RAM. This is a problem because PySparNN does not persist its index to a static file the way annoy does.
I am wondering what a good solution for nearest-neighbor search over text data might be. Right now I am looking into using gensim's AnnoyIndexer together with doc2vec.