I am having a hard time understanding the process of building a bag-of-words. This is a supervised multiclass classification problem in which a webpage or a piece of text is assigned to one category out of several pre-defined categories. The method that I am familiar with for building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and applying TF-IDF) to obtain the bag-of-words for the category 'Math'.

Question: Another method that I am thinking of is to instead search Google for something like 'List of terms related to Math' to build my bag-of-words. Is this method okay?

Another question: In the context of this question, do bag-of-words and corpus mean the same thing?

Thank you in advance!

1 Answer

This is not what bag of words is. Bag of words is a term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form

word: how many times this word is present in a document

For example, "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that the total vocabulary is finite, so that every document becomes a numeric vector of the same length).
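As a minimal sketch of this representation (the helper names `bag_of_words` and `to_vector` are made up for illustration, and splitting on whitespace is a simplifying assumption, not a proper tokenizer):

```python
from collections import Counter


def bag_of_words(text):
    """Map each lowercased, whitespace-separated token to its count."""
    return Counter(text.lower().split())


def to_vector(bag, vocabulary):
    """Turn a bag of words into a count vector over a fixed vocabulary."""
    # Counter returns 0 for words that are absent from the bag.
    return [bag[word] for word in vocabulary]


doc = "John likes cats and likes dogs"
bag = bag_of_words(doc)

# Assumption: the vocabulary here is just the words of this one document.
# In practice it would be collected from the whole training corpus.
vocabulary = sorted(bag)
vector = to_vector(bag, vocabulary)

print(dict(bag))
print(vocabulary)
print(vector)
```

With a fixed vocabulary, every document maps to a vector of the same length, which is the form most ML methods expect.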

Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, consists of multiple documents, and each of them is independently represented as a bag of words.

In particular, this invalidates your final proposal of asking Google for words that are related to a category - this is not how typical ML methods work. You get a lot of documents, represent them as bags of words (or something else) and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules usually will not be simply "if the word X is present, this is related to Y".
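To make "build a model" a bit more concrete, here is a rough sketch of one possible choice, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The function names and the tiny labeled corpus are invented for illustration; a real problem would need far more documents and would normally use an existing library implementation.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Map each lowercased, whitespace-separated token to its count."""
    return Counter(text.lower().split())


def train_naive_bayes(labeled_docs):
    """Collect the counts needed by the model from (text, category) pairs."""
    doc_counts = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word totals per category
    vocabulary = set()
    for text, category in labeled_docs:
        bag = bag_of_words(text)        # each document is its own bag of words
        doc_counts[category] += 1
        word_counts[category].update(bag)
        vocabulary.update(bag)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-probability score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    bag = bag_of_words(text)
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # log P(category) + sum over words of count * log P(word | category)
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word, count in bag.items():
            if word not in vocabulary:
                continue  # skip words never seen during training
            p = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy corpus: documents are labeled, not categories.
training_data = [
    ("the integral of a function", "Math"),
    ("prove the theorem using algebra", "Math"),
    ("the team won the football match", "Sports"),
    ("the player scored a goal", "Sports"),
]
model = train_naive_bayes(training_data)

print(predict(model, "a theorem about the integral"))
print(predict(model, "the goal in the match"))
```

Note that the decision comes from combining the evidence of all words in the document, weighted by how often each appeared in each category's training documents, rather than from a hand-written list of category words.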

lejlot
  • Can you please give me some examples on what you mean by perform statistical analysis (build a model)? Links would be great as well. –  May 27 '17 at 16:18
  • I mean for example train a Naive Bayes classifier, or Support Vector Machine, or build any other kind of approximation of P(category | samples) – lejlot May 27 '17 at 16:20
  • Oh, I see. Just to clarify: would this be an alright process for a web page categorization problem? Gather a corpus of web pages and label them with predefined categories (like 'Math' or 'Sports') -> Divide the corpus into a training set and a test set -> Using the training set, build a bag of words for a particular category -> Then do a test? –  May 27 '17 at 16:35
  • You don't build a bag-of-words for a category. You do REPRESENT a text document as a bag-of-words and then, perform a classification task of those documents, assigning labels and training a model. – shirowww May 27 '17 at 16:57
  • as @shirowww said, and as is stated in the answer - bag of words is a way to represent **documents**, not a category. Once these documents are in bow form they can be used to **learn** a mapping to a category – lejlot May 27 '17 at 17:52
  • Alright, I think I understand. Another question that comes to mind: can you build a corpus for a particular category? I am not sure, but I think a corpus would be a set of webpages, and I would manually assign each one a category based on what I think it is? –  May 28 '17 at 02:14
  • Usually you build a corpus for a problem rather than for a category, but yes - a corpus is just a set of documents. How you assign the categories is up to you. Sometimes you can assign them perfectly (since you know where the document comes from), sometimes it can be automated (for example, based on the hierarchy in Wikipedia if your corpus is based on wiki pages), and sometimes you end up doing things by hand (which takes **a lot** of time) – lejlot May 28 '17 at 12:14