
I have a large corpus of opinions (2,500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to this task with scikit-learn? Could anybody provide me an example of splitting raw text into test/train sets (I'll probably use a tf-idf representation)?


1 Answer


Suppose your data is a list of strings, i.e.

data = ["....", "...", ]

Then you can split it into training (80%) and test (20%) sets using train_test_split, e.g.:

from sklearn.model_selection import train_test_split

# Hold out 20% of the documents for testing; the rest is for training.
train, test = train_test_split(data, test_size=0.2)

Before you rush into doing it, though, read the docs through. 2,500 documents is not a "large corpus", and you probably want to do something like k-fold cross-validation rather than a single holdout split.
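For reference, a minimal sketch of that cross-validation idea, assuming your opinions come with labels. The corpus, labels and the logistic-regression classifier below are made-up placeholders, not part of the question:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus and labels -- substitute your 2,500 opinions here.
data = ["great product", "simply awful", "really liked it",
        "would not recommend", "works as expected", "waste of money"]
labels = [1, 0, 1, 0, 1, 0]

# tf-idf vectorization followed by a classifier, refitted on each fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 3-fold cross-validation: every document is used for testing exactly once.
scores = cross_val_score(model, data, labels, cv=3)
print(scores.mean())

With the real corpus you would typically use 5 or 10 folds; the pipeline ensures the tf-idf vocabulary is learned only from each fold's training portion.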

  • I would like to do some sentiment analysis in Spanish. Is that a correct approach to splitting the dataset? I have a directory with 2,500 .txt files (opinions). – anon Sep 12 '14 at 00:53
  • 4
    As I said, 2500 is not a large number, so you are better off doing cross-validation to assess your performance. Moreover, you might need to first split off a "final test set" (say, 500 items), use the 2000 for model selection (using cross-validation to select the best model), and once you are settled on a model, check its performance on the originally held-out test set. There may be variations to your approach, depending on a number of factors. – KT. Sep 12 '14 at 01:10
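One possible sketch of the workflow KT. outlines in that comment, again with made-up placeholder data; the tf-idf plus logistic regression pipeline and the grid of C values are only illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Placeholder corpus and labels -- in practice, the 2,500 opinions and their sentiment labels.
data = ["great", "awful", "nice", "bad", "fine", "poor", "good", "worse"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 1. Split off a final test set first (here 25%, i.e. the "500 items" idea).
train_texts, test_texts, train_y, test_y = train_test_split(
    data, labels, test_size=0.25, random_state=42)

# 2. Model selection on the remaining data via cross-validated grid search.
pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression())])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # illustrative hyperparameter grid
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(train_texts, train_y)

# 3. One final evaluation on the held-out test set, once the model is chosen.
print(search.score(test_texts, test_y))

On the real corpus you would use a larger cv value (e.g. 5) and touch the held-out test set only once, after model selection is finished.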