I have been attempting to classify an author using multiple texts written by this author, which I would then use to find similarities in other texts to identify that author in the test group.
I have been successful with some of the predictions, however I am still getting results where it failed to predict the author.
I have done pre-processing the texts beforehand with stemming, tokenizing, stop words, removing punctuation etc. in an attempt to make it more accurate.
I am unfamiliar with how exactly the OneClassSVM parameters work. What parameters could I use to best suit my problem and how could I make my model more accurate in it's predictions?
Here is what I have so far:
vectorizer = TfidfVectorizer()
author_corpus = self.pre_process(author_corpus)
test_corpus = self.pre_process(test_corpus)
train = author_corpus
test = test_corpus
train_vectors = vectorizer.fit_transform(train)
test_vectors = vectorizer.transform(test)
model = OneClassSVM(kernel='linear', gamma='auto', nu=0.01)
model.fit(train_vectors)
test_predictions = model.predict(test_vectors)
print(test_predictions[:10])
print(model.score_samples(test_vectors)[:10])