
I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus consists of around 5 million questions, each with a title, body, and tags. I'm splitting this 3:2 for training and testing. I'm plagued by the curse of dimensionality.


Work Done

  1. Pre-processing: markup removal, stopword removal, special-character removal, and a few other bits and pieces. Stored in MySQL. This almost halves the size of the data.
  2. ngram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags. Stored in redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. For example:

    "continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
    

Note: There are two problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be haphazardly removed. The number of tags associated with a given ngram easily runs into the thousands, most of them unrelated and irrelevant.

  3. tf-idf: to aid in selecting which ngrams to keep, I calculated tf-idf scores over the entire corpus for each unigram and bigram and stored the corresponding idf value alongside the associated tags. For example:

    "continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
    

    The tf-idf scores are stored in a document × feature scipy sparse.csr_matrix (generated by fit_transform()), and I'm not sure how I can leverage that at the moment. A rough sketch of steps 2 and 3 is shown below this list.
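
For concreteness, here is roughly how steps 2 and 3 fit together. This is a simplified sketch, not my actual code: the redis key scheme and helper names are illustrative, and `questions` stands in for the preprocessed title + body strings.

    import redis
    from sklearn.feature_extraction.text import TfidfVectorizer

    r = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True)

    def ngrams(tokens, n):
        """Yield n-grams as space-joined strings."""
        return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def index_question(title_tokens, body_tokens, tags):
        """Step 2: associate every unigram/bigram of a question with its tags."""
        for tokens in (title_tokens, body_tokens):
            for n in (1, 2):
                for gram in ngrams(tokens, n):
                    # one redis hash per ngram: field = tag, value = frequency
                    for tag in tags:
                        r.hincrby("ngram:" + gram, tag, 1)

    # Step 3: idf over the whole corpus, unigrams and bigrams together.
    # `questions` is an iterable of preprocessed "title + body" strings.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(questions)  # document x feature csr_matrix
    feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
    idf_by_ngram = dict(zip(feature_names, vectorizer.idf_))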


Questions

  1. How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA, but the examples always talk about a set of documents and a vocabulary; I'm not sure where the tags from my set come in. Also, because of the way my data is stored (redis + sparse matrix), it is difficult to use an already-implemented module (sklearn, nltk, etc.) for this task.
  2. Once the feature set is reduced, I plan to use it as follows (a rough sketch follows this list):

    • Preprocess the test data.
    • Find the unigrams and bigrams.
    • For the ones stored in redis, find the corresponding best-k tags.
    • Apply some kind of weighting to the title and body text.
    • Apart from this, I might also search for exact known-tag matches in the document. For example, if "ruby-on-rails" occurs in the title/body, then there is a high probability that it is also a relevant tag.
    • Also, for tags predicted with high probability, I might leverage a tag graph (an undirected graph in which tags that frequently occur together have weighted edges between them) to predict more tags.

    Are there any suggestions on how to improve upon this? Can a classifier come in handy?
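
In code, the lookup-and-weighting part of that plan would look roughly like this (continuing the sketch above, so `r` and `ngrams()` are the same; the weights, boost, and k are placeholders I would still have to tune):

    from collections import Counter

    TITLE_WEIGHT = 3.0  # assumption: title ngrams count more than body ngrams
    BODY_WEIGHT = 1.0
    TOP_K = 5

    def predict_tags(title_tokens, body_tokens, known_tags):
        """Score candidate tags from the redis ngram -> tag-frequency hashes.

        known_tags is the set of all tags seen in the training corpus.
        """
        scores = Counter()
        for tokens, weight in ((title_tokens, TITLE_WEIGHT), (body_tokens, BODY_WEIGHT)):
            for n in (1, 2):
                for gram in ngrams(tokens, n):
                    for tag, freq in r.hgetall("ngram:" + gram).items():
                        scores[tag] += weight * int(freq)
        # exact known-tag matches in the text get a boost
        for tag in known_tags & (set(title_tokens) | set(body_tokens)):
            scores[tag] += 100  # arbitrary boost, to be tuned
        return [tag for tag, _ in scores.most_common(TOP_K)]

The tag-graph expansion would then run over the tags this returns.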


Footnote

I have a 16-core machine with 16 GB of RAM. The redis server (which I'll move to a different machine) keeps everything in RAM and is ~10 GB. All the tasks mentioned above (apart from tf-idf) are done in parallel using IPython clusters.

vinayakshukl

2 Answers


Use the public API of Dandelion; this is a demo.
It extracts concepts from a text, so, in order to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
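
For example, something along these lines. This is only a sketch: the endpoint path, parameter names, and response fields are assumptions based on Dandelion's entity-extraction ("nex") API, so check the official docs before relying on them.

    import requests

    # Hypothetical call to Dandelion's entity-extraction endpoint;
    # the URL, parameters, and response fields are assumptions -- verify against the docs.
    resp = requests.get(
        "https://api.dandelion.eu/datatxt/nex/v1",
        params={
            "text": question_title + " " + question_body,  # placeholders
            "token": "YOUR_API_TOKEN",                     # placeholder
            "min_confidence": 0.6,
        },
    )
    concepts = [ann["spot"] for ann in resp.json().get("annotations", [])]
    # use `concepts` as features instead of raw unigrams/bigrams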


A baseline statistical approach would treat this as a classification problem. Features are bags of words processed by a maximum entropy classifier like Mallet (http://mallet.cs.umass.edu/classification.php). Maxent (aka logistic regression) is good at handling large feature spaces. Take the probability associated with each tag (i.e., the class labels) and choose a decision threshold that gives you a precision/recall tradeoff that works for your project. Some of the Mallet documentation even mentions topic classification, which is very similar to what you are trying to do.

The open questions are how well Mallet handles the size of your data (which isn't that big) and whether this particular tool is a non-starter with the technology stack you mentioned. You might be able to train offline (dump the redis database to a text file in Mallet's feature format) and run the Mallet-learned model in Python; evaluating a maxent model is simple. If you want to stay in Python and have this be more automated, there are Python-based maxent implementations in NLTK and probably in scikit-learn. This approach is not at all state of the art, but it'll work okay and be a decent baseline against which to compare more complicated methods.
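
If you do stay in Python, a scikit-learn version of this baseline (one-vs-rest logistic regression over bag-of-words features, thresholding the per-tag probabilities) would look roughly like the sketch below; `texts`, `tag_lists`, `new_texts`, and the threshold are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # `texts`: preprocessed "title + body" strings; `tag_lists`: parallel list of tag lists.
    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5)
    X = vectorizer.fit_transform(texts)

    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(tag_lists)

    # one binary logistic regression (maxent) per tag
    clf = OneVsRestClassifier(LogisticRegression(), n_jobs=-1)
    clf.fit(X, Y)

    # keep every tag whose probability clears a threshold chosen for the
    # precision/recall tradeoff you want
    probs = clf.predict_proba(vectorizer.transform(new_texts))
    threshold = 0.3
    predicted = [[tag for tag, p in zip(binarizer.classes_, row) if p >= threshold]
                 for row in probs]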

romanows