I need to write a regex that does not match a word if it is inside an HTML tag.

Here is the sample of text:

asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow">  qwe 

My regex for now looks like this:

(?!(\<.+))[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](<class="bad-word"(?: style="[^"]+")?>)?(qwe)(<>)?[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ](?!.+\>)

It is a little complicated, but everything works, except that when I test it on regex101.com and regexr.com it only matches words that come after the HTML tag.

Any idea why?

Edit:

I do not want to use an HTML parser or DOM manipulation; I do not want to change that much code.

def test_tagged_word_present(self):
    input = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words'
    expected = 'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"><strong class="bad-word" style="color:red">qwe</strong> some other words'
    parser = self.get_test_parser(input, search_word='qwe')
    text = parser.mark_words()
    self.assertEqual(text, expected)

Everything works perfectly, except that the regex still catches qwe in the title.

Cosaquee
  • How about using a parser, feeding the text content of the html back to you and then match against the text content? By doing so, no text within tags will be returned to you. – hwnd Oct 12 '15 at 06:49
  • Are you trying to match everything outside the <> tags ? – Ephreal Oct 12 '15 at 06:53
  • @Ephreal I'm trying to match every occurrence of the given word that is not inside any sort of HTML tag. – Cosaquee Oct 12 '15 at 06:56
  • an alternative, use a html parser http://stackoverflow.com/a/2613246/3526330 – saikumarm Oct 12 '15 at 07:04
  • I don't think I can answer your question. It's true that it does not do what you say on regex101, but if it works, why not use it? Are you looking for a simpler example? – Ephreal Oct 12 '15 at 08:03

2 Answers

To exclude content within HTML tags, a good trick is to use a negative lookahead ('not followed by') that includes the angle bracket characters. For example, your regex ends with this:

(?!.+\>)

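Which presumably should mean 'not followed by one or more characters and a closing angle bracket.'

However, that 'one or more characters' is too broad and will match more than you want: it can run from plain text into a later tag, so any closing bracket later on the same line blocks the match. That is why only words after the tag are matched. If you make it stricter, it can no longer reach that far: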

(?![^<>]*>)

So that's 'not followed by non-angle-brackets and a closing bracket.'

That way it will only do the replacement if the word is OUTSIDE an HTML tag: if the word is inside a tag, the lookahead pattern matches, so the negative lookahead stops it from being replaced.
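As a minimal sketch of the difference between the two lookaheads, the snippet below runs both against a shortened version of the sample text (the shortened string and the variable names are illustrative assumptions, not taken from the question). It records where each pattern finds the word:

```python
import re

# Shortened, illustrative version of the sample text from the question.
s = 'asdd qwe <a href="x" title="qwe"> qwe '

# Original lookahead: rejects the word if ANY '>' appears later on the line.
broad = re.compile(r'qwe(?!.+>)')
# Stricter lookahead: rejects the word only if the next angle bracket is '>'.
strict = re.compile(r'qwe(?![^<>]*>)')

broad_starts = [m.start() for m in broad.finditer(s)]
strict_starts = [m.start() for m in strict.finditer(s)]

print(broad_starts)   # [34]    -> only the word after the tag
print(strict_starts)  # [5, 34] -> the words before and after the tag
```

In both cases the occurrence inside the title attribute (index 28) is skipped; only the stricter version also keeps the word that comes before the tag.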

You may need to include <> in other character classes as well to limit them.

Note that this isn't strictly 100% compliant, in that attributes can legally contain those characters; however, in many cases you know enough about your input that you can safely use [^<>] to simplify the task without causing any issues.
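To make that caveat concrete, here is a small sketch with two made-up inputs (both strings are assumptions for illustration) where the [^<>] shortcut gives the wrong answer: an attribute value containing '<', and plain text containing a stray '>'.

```python
import re

pattern = re.compile(r'qwe(?![^<>]*>)')

# A '<' inside an attribute value hides the tag's closing '>' from the
# lookahead, so the word inside the tag is wrongly matched.
inside_tag = '<a title="qwe < b">link</a>'
false_positive = pattern.findall(inside_tag)

# A stray '>' in plain text makes the word look as if it were inside a tag,
# so the word outside any tag is wrongly skipped.
outside_tag = 'qwe > 3 is true'
false_negative = pattern.findall(outside_tag)

print(false_positive)  # ['qwe']
print(false_negative)  # []
```

If your input can contain either case, this pattern is not safe on its own.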

$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> mystring = 'asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow">  qwe '
>>> import re
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)')
>>> p.findall(mystring)
['asdd', 'qwe', 'qwe']
>>>
$

Second test:

$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe some other words'
>>> p=re.compile(r'([^\s<>]+)(?![^<>]*>)')
>>> p.findall(mystring)
['words', 'qwe', 'some', 'other', 'words']
>>> mystring = r'words <a href="example.com" title="title with word qwe" class="external-link" rel="nofollow"> qwe <strong class="bad-word" style="color:red">podmiotu</strong> some other words'
>>> p.findall(mystring)
['words', 'qwe', 'podmiotu', 'some', 'other', 'words']
>>>

Note that 'qwe' is in both strings, outside of an HTML tag, so it SHOULD match I think.

To search for a specific word, just use that in the regex:

Find the word 'some' if it's outside HTML:

>>> p=re.compile(r'(some)(?![^<>]*>)')
>>> p.findall(mystring)
['some']
>>>

Find the word 'external' if it's outside HTML (fails, correctly):

>>> p=re.compile(r'(external)(?![^<>]*>)')
>>> p.findall(mystring)
[]
>>>
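For the test case in the question, the same lookahead could be combined with re.sub to wrap the word. This is only a sketch: mark_word is a hypothetical helper, not the asker's mark_words method, and the word boundaries (\b) are an added assumption to avoid matching inside longer words.

```python
import re

def mark_word(text, word):
    """Wrap `word` in a <strong> tag wherever it appears outside an HTML tag."""
    # re.escape guards against regex metacharacters in the search word.
    pattern = re.compile(r'\b(' + re.escape(word) + r')\b(?![^<>]*>)')
    replacement = r'<strong class="bad-word" style="color:red">\1</strong>'
    return pattern.sub(replacement, text)

text = ('words <a href="example.com" title="title with word qwe" '
        'class="external-link" rel="nofollow"> qwe some other words')
marked = mark_word(text, 'qwe')
print(marked)
```

Here the qwe in the title attribute is left alone and only the one after the tag is wrapped. Unlike the expected string in the question's test, this version keeps the space between the tag and the word, because the pattern does not consume the surrounding characters.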
Jeremy Jones
  • It works like a charm, but not in Python. Any idea why? In my test I have the same text as in my question, but after changing the regexp the test does not pass; the word in the link is still matched. – Cosaquee Oct 12 '15 at 10:58
  • Could you include your expected output? I'm not clear what you're actually trying to match & end up with. Thanks. – Jeremy Jones Oct 12 '15 at 11:05
  • test case is now in question – Cosaquee Oct 12 '15 at 11:10

Why don't you use the following: first delete any html tags from the string and then search for the word?

>>> import re
>>> s = 'asdd qwe <a href="http://example.com" title="Some title with word qwe" class="external-link" rel="nofollow">  qwe '
>>> re.findall(r"\bqwe\b", re.sub(r"<[^>]*>", "", s))
['qwe', 'qwe']
emvee
  • I need this HTML tag to stay in the text. I really like your idea, but not in this case. – Cosaquee Oct 12 '15 at 07:59
  • This modifies a *copy* of the text. So you can easily do: `if re.findall(...): do_something_with_string(s)`; it's just making it easy to test if the word you're looking for occurs outside any tag. – emvee Oct 12 '15 at 08:10
  • Maybe use re.search instead and then use the index to slice the string any way you want – Ephreal Oct 12 '15 at 08:28