How to use my own file instead of a hard-coded dataset in this code

Question

I am implementing this code and it gives me the correct output, but I want to save those four lines of "dataset" in a file and then use it. How can I do this? How can I use my own file instead of a manually typed dataset?

from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.trainer import Trainer
from naiveBayesClassifier.classifier import Classifier

nTrainer = Trainer(tokenizer)

# Training data: each entry pairs a text with its category label.
dataSet = [
    {'text': 'hello everyone', 'category': 'NO'},
    {'text': 'dont use words like jerk', 'category': 'YES'},
    {'text': 'what the hell.', 'category': 'NO'},
    {'text': 'you jerk', 'category': 'YES'},
]

# Train on every labelled example.
for n in dataSet:
    nTrainer.train(n['text'], n['category'])

nClassifier = Classifier(nTrainer.data, tokenizer)

unknownInstance = "Even if I eat too much, is not it possible to lose some weight"
classification = nClassifier.classify(unknownInstance)
print(classification)

dporru · Answer 1 · 2015-11-07T22:07:36.880

1

You could store the data set as a JSON file and then load it in your Python code:

import json

with open('data.json') as f:
    dataSet = json.loads(f.read())

# Use dataSet exactly as before, e.g. in the training loop.

edited Nov 07 '15 at 22:07

answered Nov 07 '15 at 10:32

dporru


Traceback (most recent call last):
  File "C:\Python27\dtnbayes.py", line 17, in <module>
    dataSet = json.loads(f.read())
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 2 column 6 (char 7)
– Neha Nov 07 '15 at 10:46
Make sure you load valid JSON. For instance, the single quotes need to be double quotes. You could check the file with http://jsonlint.com/ – dporru Nov 07 '15 at 13:09
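
One way to avoid the quoting problem is to let Python write the file itself, since json.dump always emits double-quoted keys and strings. This is only a minimal sketch: the file name data.json follows the answer above, and the temporary directory is an assumption made so the snippet runs anywhere.

```python
import json
import os
import tempfile

# The four training rows from the question.
dataSet = [
    {'text': 'hello everyone', 'category': 'NO'},
    {'text': 'dont use words like jerk', 'category': 'YES'},
    {'text': 'what the hell.', 'category': 'NO'},
    {'text': 'you jerk', 'category': 'YES'},
]

# Assumption: a temporary directory stands in for wherever you keep data.json.
path = os.path.join(tempfile.mkdtemp(), 'data.json')

# json.dump writes valid JSON (double quotes), so the file can be parsed later.
with open(path, 'w') as f:
    json.dump(dataSet, f, indent=2)

# Load it back; json.load(f) is equivalent to json.loads(f.read()).
with open(path) as f:
    loaded = json.load(f)

# Pasting the Python literal with single quotes is what the parser rejects.
try:
    json.loads("[{'text': 'hello everyone', 'category': 'NO'}]")
    single_quotes_ok = True
except ValueError:
    single_quotes_ok = False
```

After this, loaded holds the same list of dictionaries as dataSet and can be fed to the training loop unchanged.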

score 0 · Answer 2 · answered Nov 07 '15 at 14:28

This line seems to be doing most of the work of training.

nTrainer.train(n['text'], n['category'])

This line seems to be doing the prediction after learning.

classification = nClassifier.classify(unknownInstance)

So if you have a corpus (a list of training texts), a list of corresponding labels, and a list of data you want to predict (unknown instances), you could do something like:

from naiveBayesClassifier import tokenizer
from naiveBayesClassifier.trainer import Trainer
from naiveBayesClassifier.classifier import Classifier

corpus = ['hello everyone', 'dont use words like jerk', 'what the hell.', 'you jerk'] # Your training data
labels = ['NO', 'YES', 'NO', 'YES'] # Your labels
unknown_data = ['Even if I eat too much, is not it possible to lose some weight'] # List of data to be predicted

nTrainer = Trainer(tokenizer)

# model training
for item, category in zip(corpus, labels):
    nTrainer.train(item, category)

nClassifier = Classifier(nTrainer.data, tokenizer)
# one classification result per unknown instance
predictions = [nClassifier.classify(unknownInstance) for unknownInstance in unknown_data]

print(predictions)
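
If the training data lives in a plain text file, the corpus and labels lists can be read from it rather than typed in. The following is a minimal sketch, assuming a hypothetical tab-separated file train.tsv with one text<TAB>category pair per line. The file name and layout are assumptions, not something the library requires, and the sample file is written first only so that the sketch runs on its own.

```python
import csv
import os
import tempfile

# Assumption: a hypothetical train.tsv, one "text<TAB>category" pair per line.
path = os.path.join(tempfile.mkdtemp(), 'train.tsv')
with open(path, 'w') as f:
    f.write('hello everyone\tNO\n'
            'dont use words like jerk\tYES\n'
            'what the hell.\tNO\n'
            'you jerk\tYES\n')

# Read the file back into two parallel lists.
corpus, labels = [], []
with open(path) as f:
    for row in csv.reader(f, delimiter='\t'):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        text, category = row
        corpus.append(text)
        labels.append(category)
```

The resulting corpus and labels can then be passed to the same zip(corpus, labels) training loop shown above.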
