
I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes -

" and “

There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex

\"*\“*

I am afraid, though, that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break for unseen data?

Edit -

My input data consists of HTML files and I am escaping HTML entities and URLs to ASCII

escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))

where line is each line in the HTML file. I need to 'ignore' non-ASCII bytes because not all files in my database have the same encoding, and I don't know the encoding prior to reading the file.
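
For readers on Python 3, here is a rough sketch of the same pipeline (the line above is Python 2). `html.unescape` and `urllib.parse.unquote` are the Python 3 counterparts of the calls used above; the function name `unescape_line` and the sample inputs are made up for illustration. It suggests why a curly quote can still show up after decoding with 'ignore': raw non-ASCII bytes are dropped, but an entity such as `&ldquo;` is plain ASCII until it is unescaped.

```python
import html
from urllib.parse import unquote


def unescape_line(raw):
    """Decode bytes as ASCII (dropping other bytes), then undo URL and HTML escaping."""
    text = raw.decode("ascii", "ignore")
    return html.unescape(unquote(text))


# Hypothetical sample line: %40 is a URL escape, &ldquo;/&rdquo; are HTML entities.
sample = b'<a href="mailto:ada%40example.com">&ldquo;Ada&rdquo;</a>'
unescaped = unescape_line(sample)

# Raw UTF-8 bytes for curly quotes are silently dropped by 'ignore'.
dropped = unescape_line(b"\xe2\x80\x9cHi\xe2\x80\x9d")
```

So the curly quotes that survive are most likely coming from entities, not from the raw bytes.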

Edit2

I am unable to do so using replace function. I tried replace('"','') but it doesn't replace the other type of quote '“'. If I add it in another replace function it throws me NON-ASCII character error.

Condition

No external libraries are allowed; only native Python libraries can be used.

Sam Hosseini
Dexter
  • Replacing quotes is hardly a task for regular expressions. I'd get a list of (unicode?) quotes and do an ordinary `replace`. – Lev Levitsky Mar 25 '12 at 13:23
  • @Lev Levitsky, How exactly would unicode work here? I am unable to do so using replace function. I tried replace('"','') but it doesn't replace the other type of quote '“'. If I add it in another replace function it throws me NON-ASCII character error. I am a newbie to unicode. – Dexter Mar 25 '12 at 13:28
  • Looks like your call to `urllib.unquote` runs into the following yet unresolved Python bug: http://bugs.python.org/issue8136 – Abel Mar 25 '12 at 13:41
  • @Abel What can I do in this case? – Dexter Mar 25 '12 at 13:44
  • @mcenley: you are escaping HTML as if it is a URL. Maybe you don't need escaping at all. Consider reading the HTML as UTF-8 (which it may already be, or fix it at the source), that way you don't need any escaping. – Abel Mar 25 '12 at 14:22
  • @Abel Here's the reason why I am escaping - http://stackoverflow.com/questions/9856990/encode-decode-of-strings-python – Dexter Mar 25 '12 at 14:25
  • @Abel The issue here is these lines " ada@graphics.maestro.com " While the ahref works fine the @ isn't converted to @ without escaping HTML entities. – Dexter Mar 25 '12 at 14:31
  • That post clearly explains what you should do in this case. I.e. it says "if you don't care about non-ASCII characters, do X". Just don't do X. Use the other approaches bernie explains in that answer. – Abel Mar 25 '12 at 14:32
  • @Abel I don't care about non-ASCII characters "after" escaping. For example, I care about "ada@graphics.maestro.com " being converted to ada@graphics.maestro.com and "" but I don't care about different types of quotes. – Dexter Mar 25 '12 at 14:37
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9278/discussion-between-mcenley-and-abel) – Dexter Mar 25 '12 at 14:39
  • Maybe I misunderstand, but you clearly state in your question "two types of quotes". The second is non-ASCII. So how come you "don't care" about non-ASCII? Later you say you don't know the encoding, but you'll have to find out somehow, as without that you cannot do anything (apart from the first 127 positions, all encodings are different and quotes are, when available and apart from x22, encoded above x1F). – Abel Mar 25 '12 at 14:41
  • "it throws me NON-ASCII character error. " >> in other words, your data is **NOT** ASCII-only, hence you **MUST** find out the encoding before doing any data manipulation or searches. Sorry... – Abel Mar 25 '12 at 14:43
  • @Abel Probably I am miscommunicating too. The issue is that I care about some non-ASCII characters while not bothering about others. I care about @ which is @ (also care about %20 but that's a URL encode) but I don't care about the “ . Hence, the issue. I hope I have communicated more clearly now. – Dexter Mar 25 '12 at 14:44
  • @Abel Not all my data is the same encoding. How do I find out the encoding dynamically? Bernie on the other thread suggests that it's not recommended. – Dexter Mar 25 '12 at 14:45
  • See chat: http://chat.stackoverflow.com/rooms/9278/discussion-between-mcenley-and-abel – Abel Mar 25 '12 at 15:32

3 Answers


I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.

You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
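
A minimal sketch of that idea, written for Python 3 with only the standard library. The names `QUOTE_CHARS`, `QUOTE_RE` and `strip_quotes` are made up for illustration, and the list is a short subset, not a complete inventory of quotation marks.

```python
import re

# Illustrative subset of quotation-mark characters; extend as needed.
QUOTE_CHARS = ['"', "'", "\u00ab", "\u00bb", "\u2018", "\u2019", "\u201c", "\u201d"]

# Build a character class such as ["'«»‘’“”], escaping each entry defensively.
QUOTE_RE = re.compile("[" + "".join(re.escape(c) for c in QUOTE_CHARS) + "]")


def strip_quotes(text):
    """Remove every character listed in QUOTE_CHARS from text."""
    return QUOTE_RE.sub("", text)


cleaned = strip_quotes('\u201cHi\u201d said "Bob"')
```

Keeping the characters in a list means a newly discovered quote type is a one-line change and the regex itself never has to be edited by hand.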

kristaps
  • Sorry to bother you on this, but how exactly would it work? I am thrown a NON-ASCII character error on the replace function (check edited question). – Dexter Mar 25 '12 at 13:30
  • A few things I would try: make sure your editor saves files encoded as utf-8, put a # coding: utf-8 comment at the top of your source file, put a "u" before the string containing the unicode quotation chars, like this: u"»". – kristaps Mar 25 '12 at 13:37
  • If you use Matthew Barnett’s `regex` library for Python 2 or 3, you get to use `\p{qmark}`. – tchrist Mar 25 '12 at 13:54
  • @tchrist I unfortunately can't use any external libraries. Need to do this with native python libraries only. – Dexter Mar 25 '12 at 14:20

I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.

How many different types of quotes exist?

29, according to Unicode, see below.

The Unicode standard provides a definitive text file on Unicode properties, PropList.txt, which includes a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:

// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022   \u0027    \u00AB    \u00BB
\u2018    \u2019    \u201A    \u201B
\u201C    \u201D    \u201E    \u201F
\u2039    \u203A    \u300C    \u300D
\u300E    \u300F    \u301D    \u301E
\u301F    \uFE41    \uFE42    \uFE43
\uFE44    \uFF02    \uFF07    \uFF62
\uFF63]
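
For the "normalize to just one type" part of the question, one possible sketch (Python 3, standard library only) maps every code point from the class above to a plain ASCII double quote with `str.translate`. The names `QUOTE_CODE_POINTS`, `NORMALIZE` and `normalize_quotes` are made up for illustration, and the choice of `"` as the target is an assumption.

```python
# The 29 code points from the character class above.
QUOTE_CODE_POINTS = [
    0x0022, 0x0027, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F, 0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E, 0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62, 0xFF63,
]

# str.translate expects a mapping from code point to replacement string.
NORMALIZE = {cp: '"' for cp in QUOTE_CODE_POINTS}


def normalize_quotes(text):
    """Replace every listed quotation mark with an ASCII double quote."""
    return text.translate(NORMALIZE)


normalized = normalize_quotes("\u201ca\u201d \u2018b\u2019 \u00abc\u00bb")
```

Note that U+0027 and U+2019 also serve as apostrophes, so a word like don't would become don"t with this table; drop those entries from the list if that matters for your data.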

As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.

Abel
  • Thanks but I can't use any external libraries. I have edited the question to specify this. – Dexter Mar 25 '12 at 14:39
  • @mcenley: I see, so choose the other option and use the character class. Just copy and paste and remove the spaces (but also: fix your issue with encodings, before that all bets are off ;). – Abel Mar 25 '12 at 14:45
  • How do I fix the encoding issues? This is really getting to me, not what I signed up for. :-( – Dexter Mar 25 '12 at 14:48
  • @mcenley: see chat http://chat.stackoverflow.com/rooms/9278/discussion-between-mcenley-and-abel – Abel Mar 25 '12 at 15:33

Turns out there's a much easier way to do this. Just put the 'u' prefix in front of the regex string literal you write in Python.

regexp = ur'\"*\“*'

Make sure you use the re.UNICODE flag when you compile/search/match your regex against your string.

re.findall(regexp, string, re.UNICODE)

Don't forget to include the

#!/usr/bin/python
# -*- coding:utf-8 -*-

at the start of the source file to make sure unicode strings can be written in your source file.
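
For anyone reading this on Python 3, a small sketch of the equivalent: there, str literals are already Unicode and source files default to UTF-8, so neither the 'u' prefix nor the coding comment should be needed. The sample text and the names `QUOTES`, `found` and `cleaned` are made up for illustration. The sketch uses a character class because the `\"*\“*` pattern can also match the empty string.

```python
import re

# Hypothetical sample mixing curly and straight double quotes.
text = "He said \u201chello\u201d and \"bye\""

# One character class instead of chained starred terms.
QUOTES = re.compile('["\u201c\u201d]')

found = QUOTES.findall(text)    # every quote character, in order
cleaned = QUOTES.sub("", text)  # the same text with those quotes removed
```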

Dexter
  • This answers your second edit, not your original question, _"How many different types of quotes exist? Is there way to normalize these to just one type so that my regex won't break for unseen data?"_ is not answered with this. First part: 29 types of quotes according to Unicode, second part: `\p{QuotationMark}` (but requires external libs currently). – Abel Mar 26 '12 at 08:35
  • @Abel Fair enough but I can now add it in the regex (29 types) which is what I wanted. – Dexter Mar 26 '12 at 10:43