Find and print text in quotation marks from a text file with python

Question

I am a Python beginner and want to capture all text in quotation marks from a text file. I have tried the following:

filename = raw_input("Enter the full path of the file to be used: ")
input = open(filename, 'r')
import re
quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
print quotes

I get the error:

Traceback (most recent call last):
  File "/Users/nithin/Documents/Python/Capture Quotes", line 5, in <module>
    quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Can anyone help me out?

does it need to work across multiple lines? e.g., does it need to work for 'one two "thr\nee"' ? — aleph_null, Jan 30 '13 at 19:39
The error is due to the fact that you are passing a `file` object, when `re.findall` wants a string. Use `re.findall(the-regex, input.read())` — Bakuriu, Jan 30 '13 at 19:44
On top of everything else… you're trying to match a `unicode` expression against an 8-bit `str` (the file contents). You need to know what charset the file uses, and either `encode` the file contents or `decode` the regex. — abarnert, Jan 30 '13 at 20:13
Also, the regex you've written looks for a `"`, 0 or more characters from the set `^` `u`, `2`, `0`, `1`, or `d`, then a `"` or `”`. That doesn't sound like the same thing you described in your text. Are you looking for anything enclosed in either `"…"` or `”…”`? Or…? — abarnert, Jan 30 '13 at 20:18

Justin O Barber · Accepted Answer · 2013-01-30T20:52:07.157

3

As Bakuriu has pointed out, you need to add .read() like so:

quotes = re.findall(ur'[^\u201d]*[\u201d]', input.read())

open() merely returns a file object, whereas calling .read() on that object returns the file's contents as a string. In addition, I'm guessing you want everything between two quotation marks, rather than just zero or more occurrences of [\^u201d] before a quotation mark. So I would try this:

quotes = re.findall(ur'[\u201d][^\u201d]*[\u201d]', input.read(), re.U)

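The re.U flag makes classes such as \w and \b Unicode-aware; this pattern doesn't use them, so the flag is harmless here but not strictly required. Or, if your text is enclosed in plain straight quotation marks (") rather than pairs of right double quotation marks, and you don't need unicode: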

quotes = re.findall(r'"[^"]*"', input.read(), re.U)

Finally, you may want to choose a different variable name than input, since input is the name of a built-in function in Python and assigning to it shadows that function.

Your result might look something like this:

>>> input2 = """
cfrhubecf "ehukl wehunkl echnk
wehukb ewni; wejio;"
"werulih"
"""
>>> quotes = re.findall(r'"[^"]*"', input2, re.U)
>>> print quotes
['"ehukl wehunkl echnk\nwehukb ewni; wejio;"', '"werulih"']
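Putting the pieces above together, a minimal sketch might look like the following. It is written to run on Python 3 as well as 2.7, and it assumes the file is UTF-8 encoded (the question doesn't say which charset is in use). The names QUOTE_RE, find_quotes and read_quotes_from_file are hypothetical helpers made up for this example, and the pattern is one guess at the intent: it accepts either straight quotes or a left/right curly pair.

```python
import io
import re

# Matches a straight-quoted span, or a span opened by a left curly quote
# (\u201c) and closed by a right curly quote (\u201d). No re.U flag is
# passed because the pattern uses no \w, \b or similar classes.
QUOTE_RE = re.compile(u'"[^"]*"|\u201c[^\u201d]*\u201d')


def find_quotes(text):
    """Return every quoted span in text, quotation marks included."""
    return QUOTE_RE.findall(text)


def read_quotes_from_file(path, encoding="utf-8"):
    """Decode the file to text, then search it (UTF-8 is an assumption)."""
    with io.open(path, "r", encoding=encoding) as handle:
        return find_quotes(handle.read())


sample = u'cfrhubecf "ehukl wehunkl echnk\nwehukb ewni; wejio;"\n"werulih"\n'
print(find_quotes(sample))
```

Using io.open with an explicit encoding means the regex and the file contents are both unicode text, which sidesteps the str/unicode mismatch mentioned in the comments on the question. Because [^"] also matches newlines, quotes that span several lines are captured without any extra flag.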

edited Jan 30 '13 at 20:52

answered Jan 30 '13 at 19:51

Justin O Barber


Which one did you try? That last one? – Justin O Barber Jan 30 '13 at 20:30
After you assign your input variable, try to print type(input) and len(input) to see if it is what you expect. – Justin O Barber Jan 30 '13 at 20:34
If you are dealing with unicode, you might need to add re.U to the end of your search string, like I have done above. – Justin O Barber Jan 30 '13 at 20:39
What do you mean by input variable? – user2026718 Jan 30 '13 at 20:44
The variable that you call input in this line: input = open(filename, 'r'). Follow that with print type(input), len(input) – Justin O Barber Jan 30 '13 at 20:46
Traceback (most recent call last): File "/Users/nithin/Documents/Python/Capture Quotes.py", line 4, in print type(data), len(data) TypeError: object of type 'file' has no len() I use the variable data instead of input now. – user2026718 Jan 30 '13 at 21:02
Sorry. Add .read() to the end of that assignment to make it a string. – Justin O Barber Jan 30 '13 at 21:08

score 0 · Answer 2 · answered Jan 30 '13 at 19:47

Instead of using regular expressions, you could try some Python built-ins. I'll let you do the hard work:

message = '''
"some text in quotes", some text not in quotes. Some more text 'In different kinds of quotes'.
'''
list_of_single_quote_items = message.split("'")
list_of_double_quote_items = message.split('"')

The challenging part will be interpreting what your split list means and dealing with all edge conditions (only one quote in string, escape sequences, etc.)
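One way to read the split list, sketched under simple assumptions: after splitting on the quote character, the pieces at odd indexes are the ones that sat between an opening and a closing quote. The helper name quoted_parts is made up for this example, and it deliberately ignores escape sequences and apostrophes inside words.

```python
def quoted_parts(text, quote='"'):
    """Return the pieces of text that sit between pairs of quote characters."""
    pieces = text.split(quote)
    # An even number of pieces means an odd number of quote characters,
    # so the final quote never closes; drop the trailing unclosed piece.
    if len(pieces) % 2 == 0:
        pieces = pieces[:-1]
    # Pieces at odd indexes lie between an opening and a closing quote.
    return pieces[1::2]


message = '''
"some text in quotes", some text not in quotes. Some more text 'In different kinds of quotes'.
'''
print(quoted_parts(message))
print(quoted_parts(message, "'"))
```

With single quotes this breaks as soon as the text contains an apostrophe (as in "don't"), which is one of the edge conditions that would still need handling.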
