I'm a complete newbie in machine learning, NLP, and data analysis, but I'm very motivated to understand them better. I'm reading a couple of books on NLTK, scikit-learn, etc. I discovered the Python module "TextBlob" and found it super easy to get started with. So I created a sample demo Python script, hosted at: https://gist.github.com/dpnishant/367cef57a8033138eb0a. I'm trying to figure out the best-suited algorithm for sentiment analysis and text classification. My questions are as follows:
Why is sentiment analysis with the NaiveBayesClassifier slow even on such a small training set? Is this time constant, or will it grow as I add more training data? The result is also incorrect (see the script output: it says "negative" for the input text "sandwich is good"). What am I doing wrong?
I read in TextBlob's documentation that the NaiveBayesClassifier is trained on the movie_reviews corpus. Is there an API for changing it to something else, nps_chat maybe? What is not clear to me is the role of a corpus. We are training the classifier with our own sample training data, so how would a more specific corpus (e.g. nps_chat, product_reviews, movie_reviews) help?
I understand that I need to train a classifier for it to work on unlabelled data. But if the training data gets huge, what is the best way to handle it? Should the program build the model from the training data every time, or is there a way to save the model to a file (something like pickle) and load it from there? Is this possible with TextBlob, and would it improve performance?
In the last block of my script I'm trying to evaluate the SklearnClassifier via the NLTKClassifier module, but I'm having no luck. It throws some cryptic error messages. Could you please help me resolve them? Also, if possible, could you show some examples on TextBlob's documentation website of how to use the algorithms/classifiers available in the nltk.classify package, e.g. Megam, LogisticRegression, SVM, BernoulliNB, GaussianNB? A use case showing where each algorithm applies would clear up a lot of doubts for beginners like me.
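To make the question about saving the model concrete, this is roughly the workflow I have in mind: train once, write the model to disk with pickle, and load it on later runs instead of retraining. It is only a sketch under assumptions. `train_model`, `classify`, `save_model` and `load_model` are hypothetical helpers I made up, and the "model" is a toy word-count table rather than TextBlob's classifier, because whether TextBlob's classifier objects can be pickled like this is exactly what I don't know.

```python
import os
import pickle
import tempfile
from collections import Counter


def train_model(examples):
    """Toy stand-in for training (NOT TextBlob's API): count words per label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def classify(model, text):
    """Return the label whose training words overlap most with the input."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def save_model(model, path):
    """Serialize the trained model to a file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Read a previously saved model back from a file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up training data, in the (text, label) shape my script uses.
train = [
    ("I love this sandwich", "pos"),
    ("this is a good place", "pos"),
    ("I do not like this restaurant", "neg"),
    ("the food was horrible", "neg"),
]

model_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")

model = train_model(train)       # expensive step, done once
save_model(model, model_path)    # persist the result to disk

restored = load_model(model_path)  # later runs: load instead of retraining
print(classify(restored, "sandwich is good"))
```

If something like this works with a real classifier, I would expect startup to get faster (no retraining), while the time to classify each text stays the same. Please correct me if that assumption is wrong.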