
I am a Python beginner and want to capture all text enclosed in quotation marks from a text file. I have tried the following:

filename = raw_input("Enter the full path of the file to be used: ")
input = open(filename, 'r')
import re
quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
print quotes

I get the error:

Traceback (most recent call last):
  File "/Users/nithin/Documents/Python/Capture Quotes", line 5, in <module>
    quotes = re.findall(ur'"[\^u201d]*["\u201d]', input)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Can anyone help me out?

user2026718
  • does it need to work across multiple lines? e.g., does it need to work for 'one two "thr\nee"' ? – aleph_null Jan 30 '13 at 19:39
  • The error is due to the fact that you are passing a `file` object, when `re.findall` wants a string. Use `re.findall(the-regex, input.read())` – Bakuriu Jan 30 '13 at 19:44
  • Yes, it should work across multiple lines. – user2026718 Jan 30 '13 at 20:13
  • On top of everything else… you're trying to match a `unicode` expression against an 8-bit `str` (the file contents). You need to know what charset the file uses, and either `encode` the file contents or `decode` the regex. – abarnert Jan 30 '13 at 20:13
  • Also, the regex you've written looks for a `"`, 0 or more characters from the set `^` `u`, `2`, `0`, `1`, or `d`, then a `"` or `”`. That doesn't sound like the same thing you described in your text. Are you looking for anything enclosed in either `"…"` or `”…”`? Or…? – abarnert Jan 30 '13 at 20:18

2 Answers


As Bakuriu has pointed out, you need to add .read() like so:

quotes = re.findall(ur'[^\u201d]*[\u201d]', input.read())

open() merely returns a file object, whereas input.read() returns the file's contents as a string. In addition, I'm guessing you want everything between two quotation marks rather than just zero or more occurrences of the characters in [\^u201d] before a quotation mark. So I would try this:

quotes = re.findall(ur'[\u201d][^\u201d]*[\u201d]', input.read(), re.U)

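The re.U flag makes character classes such as \w and \s Unicode-aware; it is harmless here, although this particular pattern does not depend on it. Or (if you don't have two sets of right double quotation marks and don't need unicode):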

quotes = re.findall(r'"[^"]*"', input.read(), re.U)

Finally, you may want to choose a different variable name than input, since input is the name of a built-in function in Python and assigning to it shadows that function.

Your result might look something like this:

>>> input2 = """
cfrhubecf "ehukl wehunkl echnk
wehukb ewni; wejio;"
"werulih"
"""
>>> quotes = re.findall(r'"[^"]*"', input2, re.U)
>>> print quotes
['"ehukl wehunkl echnk\nwehukb ewni; wejio;"', '"werulih"']
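In Python 3, where text read from a file is already Unicode, the same idea could be sketched roughly as follows. This is a minimal sketch rather than the code from the question: the names `QUOTE_RE`, `find_quoted` and `find_quoted_in_file`, the UTF-8 default encoding, and the decision to accept both straight quotes and curly pairs (“…”) are all assumptions.

```python
import re

# Assumption: a quotation is either "..." (straight quotes) or a curly pair
# running from U+201C to U+201D. Negated character classes also match
# newlines, so a quotation may span several lines.
QUOTE_RE = re.compile('"[^"]*"|\u201c[^\u201d]*\u201d')


def find_quoted(text):
    """Return every quoted span in text, including the quote marks."""
    return QUOTE_RE.findall(text)


def find_quoted_in_file(path, encoding="utf-8"):
    """Read the whole file as text and return its quoted spans.

    The encoding is an assumption; use whatever the file was saved with.
    """
    with open(path, encoding=encoding) as handle:
        return find_quoted(handle.read())
```

Passing the encoding to open() means the file contents and the pattern are both Unicode text, which avoids the str/unicode mismatch raised in the comments.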
Justin O Barber

Instead of using regular expressions, you could try some of Python's built-in string methods. I'll let you do the hard work:

message = '''
"some text in quotes", some text not in quotes. Some more text 'In different kinds of quotes'.
'''
list_of_single_quote_items = message.split("'")
list_of_double_quote_items = message.split('"')

The challenging part will be interpreting what your split list means and dealing with all edge conditions (only one quote in string, escape sequences, etc.)
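One way to interpret the split list, sketched under the assumption that quote characters come in matched pairs and are never escaped: the pieces at odd indices are the ones that sat between quotes. The helper name `quoted_pieces` is hypothetical.

```python
def quoted_pieces(message, quote='"'):
    """Return the text found between successive pairs of quote characters.

    Assumes quotes are never escaped; if the number of quote characters is
    odd, the trailing unmatched piece is ignored.
    """
    parts = message.split(quote)
    # parts[0] precedes the first quote, parts[1] is inside the first pair,
    # parts[2] lies between pairs, and so on: odd indices are inside quotes.
    inside = parts[1::2]
    if len(parts) % 2 == 0:
        # An even number of parts means an odd number of quote characters,
        # so the last odd-indexed piece has no closing quote.
        inside = inside[:-1]
    return inside
```

With single quotes this breaks as soon as the text contains an apostrophe, which is one of the edge cases mentioned above.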

Paul Seeb