How to count the number of words in a paragraph and exclude some words (from a file)?

Question

I've just started to learn Python so my question might be a bit silly. I'm trying to create a program that would:
- import a text file (got it)
- count the total number of words (got it),
- count the number of words in a specific paragraph, starting with a specific phrase (e.g. "P1", ending with another participant "P2") and exclude these words from my word count. Somehow I ended up with something that counts the number of characters instead :/
- print paragraphs separately (got it)
- exclude "P1" "P2" etc. words from my word count.

My text files look like this:
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.

I ended up with this code:

text = open (r'C:/data.txt', 'r')
lines = list(text)
text.close()
words_all = 0
for line in lines:
    words_all = words_all + len(line.split())
print 'Total words:   ', words_all

words_par = 0
for words_par in lines:
    if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):
        words_par = line.split()
    print len(words_par)
    print words_par.replace('P1', '') #doesn't display it but still counts
else:
    print 'No words'

Any ideas how to improve it?

Thanks

@Jakob Bowyer It is because it is explicit that it is vain. So your sentence means "Sometimes, it's nice to be vain". — eyquem, Sep 09 '11 at 10:59
it should probably be `r'C:\data.txt'`, since the correct directory separator on windows is \, and `'C:\\data.txt'` is too awful. — SingleNegationElimination, Sep 10 '11 at 23:19

eyquem · Answer 1 · 2011-09-09T11:15:52.050

You shouldn't give the name text to the object returned by open ('zery.txt', 'r'). It is not the text of the file, it is the file handle, described as a "file-like object" in the docs (I never understood what "file-like object" means, by the way)

.

with open ('C:/data.txt', 'r')  as f:
    ........
    ........

is better than

f = open ('C:/data.txt', 'r') 
    ......
    .....
f.close()

.

You should read the documentation for split(), where you'll see that you can do:

with open ('C:/data.txt', 'r') as f:
    text = f.read()
words_all = len(text.split())
print 'Total words:   ', words_all
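
As a small illustration of why a single split() on the whole text is enough: called with no argument, split() separates on any run of whitespace, newlines included. The sample string below is only an assumption that stands in for the file contents, copied from the example in the question.

```python
# Stand-in for the contents of C:/data.txt (assumed, taken from the question).
sample = "P1: Bla bla bla.\nP2: Bla bla bla bla.\nP1: Bla bla.\nP3: Bla.\n"

# With no argument, split() breaks the text on any whitespace,
# so line breaks need no special handling.
words = sample.split()

# Note: the speaker labels ("P1:", "P2:", ...) are still part of this count.
words_all = len(words)
```

With this sample, words_all includes the four labels, so they still have to be subtracted if they should not count.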

.

If the structure of your text is:

P1: Bla bla bla. 
P2: Bla bla bla bla. 
P1: Bla bla. 
P3: Bla.

then words_par.endswith("P1" or "P2" or "P3") is always False, hence the desired splitting isn't performed.

Consequently, words_par doesn't become a list. It remains a string, which is why the characters are counted.

.

Also, your code is certainly wrong.

If the splitting were performed, it would be line, which still holds the last line read in the first for-loop at the beginning of the code, that would be split over and over.

So, instead of

for words_par in lines: 
    if words_par.startswith("P1" or "P2" or "P3"):
        words_par = line.split()

it should certainly be:

for line in lines: 
    if line[0:2] in ("P1","P2","P3") :
        words_par = line.split()
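
A possible alternative to slicing, as a minimal sketch: str.startswith() also accepts a tuple of prefixes. The sample lines and the name labels are assumptions made up for this illustration, and subtracting 1 is just one way to leave the label out of the count.

```python
# Made-up sample lines; the third one has no speaker label.
lines = ["P1: Bla bla bla.\n", "P2: Bla bla bla bla.\n", "Intro without a label\n"]

# startswith() accepts a tuple of prefixes (a list is not accepted).
labels = ("P1", "P2", "P3")

counts = []
for line in lines:
    if line.startswith(labels):
        words_par = line.split()
        # Subtract 1 so the label itself ("P1:", "P2:") is not counted.
        counts.append(len(words_par) - 1)
```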

`line.startswith("P1" or "P2" or "P3")` is equivalent to `line.startswith("P1")` and misleading at best. — MattH, Sep 09 '11 at 10:59
@MattH Oh ! I didn't see that. I went to your last answer (Linux non-blocking FIFO) and upvoted it — eyquem, Sep 09 '11 at 11:14

score 2 · Answer 2 · answered Sep 09 '11 at 10:54

Maybe I didn't understand the requirements completely, but I'll do my best.

The first part about counting all words is quite ok. I'd shorten it a bit:

with open('C:/data.txt', 'r') as textfile:
    lines = list(textfile)
words_all = sum([len(line.split()) for line in lines])
print 'Total words:   ', words_all

In the second part, something seems to go wrong.

words_par = 0 # You can leave out this line,
              # 'words_par' is initialized in the for-statement

More problems here:

    if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):

"P1" or "P2" or "P3" evaluates to "P1" (non-empty strings are "truthy" values). So you could shorten the line to

    if words_par.startswith("P1") & words_par.endswith("P1"):

which is probably not what you wanted.
When the condition evaluates to False, the split-method is not called and words_par remains a string (and not a list of strings as expected). So len(words_par) returns the number of characters instead of the number of words.
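
To make the difference concrete, here is a tiny sketch using one line from the question's sample (the variable names are only for illustration):

```python
line = "P1: Bla bla bla."

# len() of a string counts characters (spaces and punctuation included).
chars = len(line)

# len() of the list returned by split() counts words.
words = len(line.split())
```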

(A little digression on names: IMHO this error arose from an inaccurate naming of a variable. A different naming

for line in lines:
    if line.startswith(...):
        words_par = line.split()
    print len(words_par)

would have produced a clear error message. On a second reading, that must have been what you meant anyway.)
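
For what it's worth, this is the kind of error the digression has in mind, sketched under the assumption that words_par still holds the initial 0 when len() is reached:

```python
words_par = 0  # the initial value from the question's code

try:
    len(words_par)  # an int has no length, so this raises TypeError
except TypeError as exc:
    message = str(exc)
```

The message states that an int has no len(), which points straight at the variable that was never split.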

James Hurford · Accepted Answer · 2011-09-10T22:22:23.550

The first part is ok where you get the total words and print the result.

Where you fall down is here

words_par = 0
for words_par in lines:
    if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):
        words_par = line.split()
    print len(words_par)
    print words_par.replace('P1', '') #doesn't display it but still counts
else:
    print 'No words'

At first, words_par is a string containing a line from the file. Under a condition which will never be met, it is turned into a list by the

line.split()

expression. If the condition

words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3")

were ever to return True, this would always split the last line of your file, because line was last assigned in the first part of your program, where you did the full count of the words in the file. That should really be

words_par.split()

Also

words_par.startswith("P1" or "P2" or "P3")

will always be

words_par.startswith("P1")

since

"P1" or "P2" or "P3"

always evaluates to its first operand that is truthy, which is the first string in this case (any non-empty string is truthy). Read http://docs.python.org/reference/expressions.html if you want to know more.
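
A quick way to convince yourself, as a small sketch (the names prefix, wrong and right are only for illustration):

```python
# `or` returns its first truthy operand, not a "set of alternatives".
prefix = "P1" or "P2" or "P3"

line = "P2: Bla bla bla bla."

# Equivalent to line.startswith("P1"), so a P2 line is not recognised.
wrong = line.startswith("P1" or "P2" or "P3")

# Spelling out each test gives the intended result.
right = line.startswith("P1") or line.startswith("P2") or line.startswith("P3")
```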

While we are at it, unless you actually want a bitwise operation, avoid doing

something & something

instead do

something and something

The first will evaluate both expressions no matter what the result of the first one is, whereas the second will only evaluate the second expression if the first is True. If you do this your code will operate a little more efficiently.
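
The difference in evaluation can be observed with a small experiment. The helper check() below is hypothetical and exists only to record which operands actually get evaluated.

```python
calls = []

def check(name, result):
    """Hypothetical helper: record that it was called, then return result."""
    calls.append(name)
    return result

# `and` short-circuits: the second operand is skipped because the first is False.
check("first", False) and check("second", True)
calls_with_and = list(calls)

del calls[:]  # reset the record

# `&` is an ordinary binary operator: both operands are evaluated before it applies.
check("first", False) & check("second", True)
calls_with_amp = list(calls)
```

Here calls_with_and ends up holding only "first", while calls_with_amp holds both names.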

The

print len(words_par)

on the next line is always going to count the number of characters in the line, since the if statement is always going to evaluate to False and words_par never gets split into a list of words.

Also, the else clause of a for loop runs whenever the loop finishes without hitting a break. Your loop contains no break, so 'No words' will always be printed, whether the sequence is empty or not. Have a look at http://docs.python.org/reference/compound_stmts.html#the-for-statement for more information.
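
A minimal sketch of how for/else behaves. The function scan() and its stop_at parameter are made up for this illustration and are not part of your program.

```python
def scan(lines, stop_at=None):
    """Hypothetical helper: log what the loop does, including its else clause."""
    events = []
    for line in lines:
        if line == stop_at:
            events.append("break")
            break  # leaving via break skips the else clause
        events.append(line)
    else:
        # Runs when the loop ends normally, even for an empty sequence.
        events.append("else")
    return events

normal = scan(["a", "b"])
empty = scan([])
stopped = scan(["a", "b"], stop_at="b")
```

Only the third call, which leaves the loop through break, skips the else clause.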

I wrote a version of what I think you are after as an example. I tried to keep it simple and avoid using things like list comprehensions, since you say you are just starting to learn, so it is not optimal, but hopefully it will be clear. Also note I made no comments, so feel free to hassle me to explain things for you.

words = None
with open('data.txt') as f:
    words = f.read().split()
total_words = len(words)
print 'Total words:', total_words

in_para = False
para_count = 0
para_type = None
paragraph = list()
for word in words:
  if ('P1' in word or
      'P2' in word or
      'P3' in word ):
      if in_para == False:
         in_para = True
         para_type = word
      else:
         print 'Words in paragraph', para_type, ':', para_count
         print ' '.join(paragraph)
         para_count = 0
         del paragraph[:]
         para_type = word
  else:
    paragraph.append(word)
    para_count += 1
else:
  if in_para == True:
    print 'Words in last paragraph', para_type, ':', para_count
    print ' '.join(paragraph)
  else:
    print 'No words'

EDIT:

I actually just noticed some redundant code in the example. The variable para_count is not needed, since the words are being appended to the paragraph variable. So instead of

print 'Words in paragraph', para_type, ':', para_count

You could just do

print 'Words in paragraph', para_type, ':', len(paragraph)

One less variable to keep track of. Here is the corrected snippet.

in_para = False
para_type = None
paragraph = list()
for word in words:
  if ('P1' in word or
      'P2' in word or
      'P3' in word ):
      if in_para == False:
         in_para = True
         para_type = word
      else:
         print 'Words in paragraph', para_type, ':', len(paragraph)
         print ' '.join(paragraph)
         del paragraph[:]
         para_type = word
  else:
    paragraph.append(word)
else:
  if in_para == True:
    print 'Words in last paragraph', para_type, ':', len(paragraph)
    print ' '.join(paragraph)
  else:
    print 'No words'
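
If every paragraph sits on a single line, as in the sample from the question, a line-based variant is also possible. This is only a sketch under that assumption: the function name paragraph_word_counts and its labels parameter are made up here, and it returns the counts instead of printing them so that it runs unchanged on Python 2 and Python 3.

```python
def paragraph_word_counts(lines, labels=("P1", "P2", "P3")):
    """Return (label, word_count) pairs, one per labelled line.

    Assumes each paragraph is a single line that starts with its label.
    """
    results = []
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        label = words[0].rstrip(":")  # "P1:" becomes "P1"
        if label in labels:
            # The label itself is left out of the count.
            results.append((label, len(words) - 1))
    return results

sample = ["P1: Bla bla bla.\n", "P2: Bla bla bla bla.\n", "P1: Bla bla.\n", "P3: Bla.\n"]
counts = paragraph_word_counts(sample)
total_without_labels = sum(count for label, count in counts)
```

Unlike the word-by-word loop, this version only accepts a word as a label when it is the first word of a line.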

thanks guys! @james: you got it right, it works as I wanted. now I have to digest all the knowledge and try to understand what went wrong :) — epo3, Sep 10 '11 at 17:13
@epo3 You're welcome. Have a look at my corrected snippet for a better way of doing it. — James Hurford, Sep 10 '11 at 22:24
I don't understand this bit: `if in_para == False: in_para = True`. How can I add all the values for a certain paragraph? e.g. summing up all P1 word counts. I tried writing a code but didn't come up with anything that would make sense :/ — epo3, Sep 13 '11 at 11:59
in_para is a flag that records whether you have already encountered a P1, P2 or P3 word, so nothing that comes before the first such word is counted. How to sum up the count of words in all P1 paragraphs sounds like a new question, which, if you posted it, I would be happy to answer. — James Hurford, Sep 13 '11 at 20:23
http://stackoverflow.com/questions/7429845/python-how-to-sum-up-the-word-count-for-each-person-in-a-dialogue — epo3, Sep 15 '11 at 11:14
