
I am going to use CountVectorizer with a large corpus that I retrieve from Gutenberg (or any data set from nltk). The corpus contains ebooks, and I want to gather all the sentences in those books into a single list, something like this: listsentences = ["SENTENCE#1", "SENTENCE#2", "SENTENCE#3", ...]. I am stuck on how to create the sentence list. Any help is massively appreciated! This is what my code looks like:

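import pandas as pd
from nltk.corpus import gutenberg
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

text = gutenberg.fileids()  # file IDs of the ebooks in the corpus
emma = gutenberg.sents()    # every sentence in the corpus, each as a list of tokens
vectorizer = CountVectorizer(min_df=1, stop_words='english')
dtm = vectorizer.fit_transform(emma)  # expects a list of strings, not lists of tokens
pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out()).head(10)
lsa = TruncatedSVD(3, algorithm='arpack')
dtm_lsa = lsa.fit_transform(dtm)
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)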
Denis
  • What kind of sentence list were you expecting? Sentences don't repeat like words, so you won't be getting a frequency count. There is a POS (part-of-speech) analysis built into the NLTK. – Mike Wise May 10 '15 at 15:58
  • Thank you for your response, Mike! I am running LSA, so eventually I would like to get a term-by-document (or term-by-sentence) matrix. I used a very small data set with CountVectorizer and ran TruncatedSVD. That data set was a list of sentences, so I want to get the nltk corpus texts into the same format. – Denis May 10 '15 at 16:02
  • Try modifying your question and add a little table illustrating what you want to get out of the NLTK. That will help people find you an answer. – Mike Wise May 10 '15 at 16:06
  • Don't you already have all sentences in gutenberg.sents() ? – Ashalynd May 10 '15 at 18:34
  • Your variable `emma` contains all sentences from the gutenberg corpus. You're getting your examples mixed up. – alexis May 10 '15 at 22:15
  • Thank you so much for your responses; I greatly appreciate them. emma has the sentences, but not in the format I want. For instance, this is emma[24]: [u'She', u'dearly', u'loved'], but I would like to have ['She dearly loved']. – Denis May 11 '15 at 03:37
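
The format asked for in the last comment (one string per sentence instead of one token list per sentence) could be produced by joining each token list with spaces. The following is only a minimal sketch: `join_sentences` is a hypothetical helper name, and `sample_sents` is made-up stand-in data shaped like the output of `gutenberg.sents()`, so that the snippet runs without downloading the corpus.

```python
def join_sentences(tokenized_sents):
    """Turn [['She', 'dearly', 'loved'], ...] into ['She dearly loved', ...]."""
    return [" ".join(tokens) for tokens in tokenized_sents]


# Stand-in for gutenberg.sents(), which yields one list of tokens per sentence.
# Assumption: with the NLTK Gutenberg corpus downloaded, gutenberg.sents()
# (or gutenberg.sents('austen-emma.txt') for a single book) is passed instead.
sample_sents = [
    ["She", "dearly", "loved", "her", "father", "."],
    ["Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and", "rich"],
]

listsentences = join_sentences(sample_sents)
print(listsentences)
```

A list built this way is a plain list of strings, which is the input `CountVectorizer.fit_transform` expects. Punctuation tokens end up separated by spaces (for example `father .`), which should not matter for the counts because the vectorizer's default token pattern only keeps word tokens of two or more characters.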

0 Answers