I am trying to model a single author from multiple texts written by that author, and then use that model to identify texts by the same author in a test group.

Some of the predictions are correct; however, I am still getting results where the model fails to identify the author.

I pre-processed the texts beforehand (tokenizing, stemming, removing stop words and punctuation, etc.) in an attempt to make the model more accurate.

I am unfamiliar with how exactly the OneClassSVM parameters work. What parameters would best suit my problem, and how could I make my model more accurate in its predictions?

Here is what I have so far:

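from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

vectorizer = TfidfVectorizer()

# pre_process does the tokenizing, stemming, stop-word and punctuation removal
author_corpus = self.pre_process(author_corpus)
test_corpus = self.pre_process(test_corpus)

train = author_corpus
test = test_corpus

# Fit the vocabulary on the author's texts only, then reuse it for the test texts
train_vectors = vectorizer.fit_transform(train)
test_vectors = vectorizer.transform(test)

model = OneClassSVM(kernel='linear', gamma='auto', nu=0.01)
model.fit(train_vectors)

# predict returns 1 for inliers (same author) and -1 for outliers
test_predictions = model.predict(test_vectors)
print(test_predictions[:10])

# Raw scores; higher means closer to the training texts
print(model.score_samples(test_vectors)[:10])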

MythKhan

2 Answers

You can use an SVM, but deep learning is really well-suited for this. I did a Kaggle competition on classifying documents, and it worked very well there.

If you don't think you have a big enough dataset, you might want to take an existing text classifier model, re-train the last layer on your author's texts, and then fine-tune the rest of the model.
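
If you do stay with the SVM for now, here is a rough sketch of what `nu` does in OneClassSVM. According to the scikit-learn docs, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so larger values let the model treat more of the author's own texts as outliers. Note also that `gamma` is only used by the 'rbf', 'poly' and 'sigmoid' kernels, so it has no effect with `kernel='linear'`. The toy texts and the variable names below are made up for illustration and stand in for your pre-processed corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-ins for the pre-processed author and test corpora.
author_texts = [
    "the sea was calm and the ship sailed on",
    "the ship sailed past the calm grey sea",
    "calm sea and a slow ship at dawn",
    "the old ship crossed the calm sea",
]
test_texts = [
    "the ship sailed on the calm sea",        # vocabulary overlaps the author's
    "quarterly revenue grew by ten percent",  # no overlap with the author's
]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(author_texts)
test_vectors = vectorizer.transform(test_texts)

results = {}
for nu in (0.01, 0.1, 0.5):
    # Linear kernel as in the question; gamma is omitted because it is unused here.
    model = OneClassSVM(kernel="linear", nu=nu)
    model.fit(train_vectors)

    preds = model.predict(test_vectors)             # 1 = inlier, -1 = outlier
    scores = model.decision_function(test_vectors)  # signed distance to the boundary
    train_outliers = (model.predict(train_vectors) == -1).mean()

    results[nu] = (preds, scores)
    print(f"nu={nu}: predictions={preds}, scores={scores}, "
          f"training texts flagged as outliers={train_outliers:.2f}")
```

On real data you would probably hold out some of the author's texts plus some texts by other people and pick the `nu` (and kernel) that separates them best, rather than guessing a value up front.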

Tdoggo
  • Thanks for the answer. I’m not very familiar with deep learning. Do you have any guides or resources I could use to help me build a deep learning model for authorship attribution? – MythKhan Mar 03 '20 at 09:07

I’ve heard positive things about Andrew Ng’s deep learning class on Coursera. I learned all I know about AI from the Microsoft Professional Certification in AI on edX.

Tdoggo