I am writing an ML module in Python to predict tags for a Stack Overflow question (title + body). My corpus is around 5 million questions, with a title, body and tags for each. I'm splitting this 3:2 into training and test sets. I'm plagued by the curse of dimensionality.
Work Done
- Pre-processing: markup removal, stopword removal, special character removal, and a few other bits and pieces. The cleaned text is stored in MySQL. This almost halves the size of the data.
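A simplified sketch of this cleaning step (the stopword list and regexes here are illustrative only; the real code does more and then writes the result to MySQL):

```python
import re

# Illustrative only -- the real pipeline uses a full English stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "or", "for"}

MARKUP_RE = re.compile(r"<[^>]+>")            # markup removal
SPECIAL_RE = re.compile(r"[^a-z0-9\s#+.\-]")  # special character removal (keeps c#, c++, .net, ruby-on-rails)

def preprocess(text):
    text = MARKUP_RE.sub(" ", text.lower())
    text = SPECIAL_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopword removal
    return " ".join(tokens)
```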
- n-gram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags, stored in redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. For example:
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: there are two problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an n-gram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be removed haphazardly. The number of tags associated with a given n-gram easily runs into the thousands, most of them unrelated and irrelevant.
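For reference, the redis structure is maintained roughly like this (a minimal sketch using redis-py hashes; the `ngram:` key prefix and connection details are illustrative):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def index_question(text, tags):
    """For every unigram/bigram in a question, bump the counter of each of its tags."""
    tokens = text.split()
    for gram in ngrams(tokens, 1) + ngrams(tokens, 2):
        for tag in tags:
            # one redis hash per n-gram: field = tag, value = frequency
            r.hincrby("ngram:" + gram, tag, 1)

index_question("continuous integration with jenkins", ["ci", "jenkins"])
```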
- tfidf: to help select which n-grams to keep, I calculated the tfidf score over the entire corpus for each unigram and bigram and stored the corresponding idf value alongside the associated tags. For example:
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document × feature sparse.csr_matrix (generated by fit_transform()), and I'm not sure how I can leverage that at the moment.
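For context, that matrix comes from something along these lines (a minimal sketch; the real run is over the full corpus, and the idf values are then copied into the redis hashes as `__idf__`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# docs = cleaned "title + body" strings; two toy examples here
docs = ["continuous integration with jenkins", "django deployment on windows"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)                # scipy.sparse csr_matrix, shape (n_documents, n_features)

idf = vectorizer.idf_  # one idf value per feature, aligned with the indices in vectorizer.vocabulary_
```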
Questions
- How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA, but the examples always talk about a set of documents and a vocabulary, and I'm not sure where the tags from my set come in (see the sketch below for what I mean). Also, given the way my data is stored (redis + sparse matrix), it is difficult to use an already-implemented module (sklearn, nltk, etc.) for this task.
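To make the question concrete, the examples I've seen boil down to something like this (a sketch using sklearn's TruncatedSVD on the sparse matrix above; n_components is an arbitrary choice), and it reduces documents, not my n-gram → tag structure:

```python
from sklearn.decomposition import TruncatedSVD

# X is the document x feature csr_matrix produced by fit_transform() above;
# n_components must be smaller than the number of features
svd = TruncatedSVD(n_components=200)
X_reduced = svd.fit_transform(X)  # dense array of shape (n_documents, 200)

# This compresses the document representation, but it isn't clear how the
# per-n-gram tag lists stored in redis map onto the 200 latent components.
```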
Once the feature set is reduced, I plan to use it as follows (a simplified sketch of the lookup/scoring step follows the list):
- Preprocess the test data.
- Find the unigrams and bigrams.
- For the n-grams stored in redis, find the corresponding best-k tags.
- Apply some kind of weighting to the title and body text.
- Apart from this, I might also search for exact known-tag matches in the document. For example, if "ruby-on-rails" occurs in the title/body, there's a high probability that it's also a relevant tag.
- Also, for tags predicted with high probability, I might leverage a tag graph (an undirected graph in which tags that frequently occur together are connected by weighted edges) to predict more tags.
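In code, the lookup/scoring step I have in mind is roughly this (reusing `preprocess()`, `ngrams()` and the redis client `r` from the sketches above; the weights and k are placeholders to be tuned):

```python
from collections import Counter

TITLE_WEIGHT, BODY_WEIGHT = 2.0, 1.0  # placeholder weights

def predict_tags(title, body, k=5):
    """Score candidate tags by summing the redis tag frequencies over the question's n-grams."""
    scores = Counter()
    for text, weight in ((preprocess(title), TITLE_WEIGHT), (preprocess(body), BODY_WEIGHT)):
        tokens = text.split()
        for gram in ngrams(tokens, 1) + ngrams(tokens, 2):
            for tag, freq in r.hgetall("ngram:" + gram).items():  # empty dict if the n-gram is unseen
                if tag == b"__idf__":
                    continue
                scores[tag.decode()] += weight * int(freq)
    # exact tag-name matches (e.g. "ruby-on-rails" in the text) and the tag
    # co-occurrence graph would then adjust these scores (not shown)
    return [tag for tag, _ in scores.most_common(k)]
```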
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
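(By "classifier" I mean something like sklearn's multi-label one-vs-rest setup over the tfidf matrix, as in the toy sketch below, but I don't know whether that scales to 5 million questions and tens of thousands of tags.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

docs = ["continuous integration with jenkins", "django deployment on windows"]
tags = [["ci", "jenkins"], ["django", "windows"]]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
Y = MultiLabelBinarizer().fit_transform(tags)  # one binary column per tag

clf = OneVsRestClassifier(SGDClassifier())     # one linear model per tag
clf.fit(X, Y)
predicted = clf.predict(X)                     # binary tag indicator matrix
```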
Footnote
I have a 16-core machine with 16GB of RAM. The redis data (the server will be moved to a different machine) lives entirely in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using ipython clusters, roughly as sketched below.
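(The parallel part is just an IPython-cluster map over chunks of question ids; `process_chunk` below stands in for the real worker that reads from MySQL and writes to redis.)

```python
from ipyparallel import Client  # "from IPython.parallel import Client" on older IPython versions

rc = Client()                   # connect to the running ipcluster
view = rc.load_balanced_view()

def process_chunk(question_ids):
    # placeholder: fetch the rows from MySQL, preprocess them, push n-gram/tag counts to redis
    return len(question_ids)

chunks = [range(0, 1000), range(1000, 2000)]  # illustrative id ranges
results = view.map_sync(process_chunk, chunks)
```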