Asked Jun 04 '18 at 16:33

Active Jun 04 '18 at 17:47

Viewed 184 times

I'm not very familiar with regex, and I need to extract everything between the tags in `<td> NEED HERE </td>`. So far I have only managed to match the `<td>` tag itself when it has attributes. I need to skip the tags (`<table>`, `<tr>`, `<td>`), whether or not they have attributes.

`<td[^>]*>`

Example:

`<table height="100%" border="0" cellpadding="0" cellspacing="0"> <tr><td width="4" class="cll">Hello</td> <td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>`

Output Desired:

`Hello, Output Status, 100%`

In some cases there will be `&nbsp;` between these tags, and I'd like to skip those too.

Tags: html, regex

asked Jun 04 '18 at 16:33 by Shinomoto Asakura · edited Jun 04 '18 at 16:51 by Zoe

If you tell us the environment you want to do this from someone can suggest an appropriate HTML Parser - which is the correct way to parse HTML. – Alex K. Jun 04 '18 at 16:35

Is this regex on backend script processing or on frontend after a user action? – CSSBurner Jun 04 '18 at 16:36

https://stackoverflow.com/a/1732454/616443 – j08691 Jun 04 '18 at 16:38

Did you say, the backend engine? Using Python – Shinomoto Asakura Jun 04 '18 at 16:43

sniperd · Accepted Answer · 2018-06-04T17:36:25.590

You'll want to use an HTML parser like BeautifulSoup, since you mentioned that your backend is Python. If you don't already have it, install it with pip:

pip install beautifulsoup4

This should give you exactly what you are looking for:

from bs4 import BeautifulSoup

html_doc = """
<p class="story">...</p>
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

# Parse with Python's built-in html.parser backend.
soup = BeautifulSoup(html_doc, 'html.parser')

# Collect the text of every <td>, regardless of its attributes.
td_list_text = [td.get_text() for td in soup.find_all('td')]

my_string = ", ".join(td_list_text)
print(my_string)

Output:

Hello, Output Status, 100%

You can read more about the options available here: https://www.crummy.com/software/BeautifulSoup/
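
The question also mentions `&nbsp;` between tags; the snippet above would leave those in the output as non-breaking spaces. Below is a minimal sketch of one way to drop them. It uses only the standard library's `html.parser` rather than BeautifulSoup (a swapped-in alternative, not what this answer uses), in case installing a package is not an option. The class name `TdTextExtractor` and the extra `&nbsp;` cells in the sample are assumptions added for illustration.

```python
from html.parser import HTMLParser


class TdTextExtractor(HTMLParser):
    """Collect the text of each <td>, ignoring attributes (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; reaches handle_data as U+00A0.
        super().__init__()
        self.cells = []
        self._in_td = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            # Turn non-breaking spaces into plain spaces, then trim.
            text = "".join(self._parts).replace("\xa0", " ").strip()
            if text:  # drop cells that held only &nbsp; or whitespace
                self.cells.append(text)

    def handle_data(self, data):
        if self._in_td:
            self._parts.append(data)


# Sample from the question, with &nbsp; entities added as an assumption.
html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">&nbsp;Hello</td>
<td class="tex" nowrap>Output&nbsp;Status</td><td>&nbsp;</td><td width="4" class="clr">100%</td></tr></table>
"""

parser = TdTextExtractor()
parser.feed(html_doc)
parser.close()

result = ", ".join(parser.cells)
print(result)  # Hello, Output Status, 100%
```

The same `replace("\xa0", " ").strip()` step should also work on the strings returned by BeautifulSoup's `td.get_text()`. This sketch does not try to handle nested tables.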

Jared Smith · Answer 2 · 2018-06-04T17:47:59.697

Caveat upfront:

Using regexes on HTML is inherently error-prone, and many well-intentioned people will tell you to never do it ever. I generally recommend using an HTML parser like in sniperd's answer.

But for simple data extraction (e.g. no tag nesting) regexes are sometimes just fine:

import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

Let's break that down:

"<td"         # start td tag
"[\w\"'=\s]*" # match any word character, white space, =, ', " zero or more times
">"           # close opening td tag
"([^><]+)"    # capture group gets anything that is *not* > or <
"<\/td"       # closing td tag

The capture group will contain the inner td contents.
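
As a rough sketch of how the pattern might be applied to the sample from the question, `findall` returns the contents of the capture group for every match. The variable names `html_doc`, `cells` and `result` are assumptions for illustration.

```python
import re

# Pattern from this answer: simple attributes allowed, then tag-free cell contents.
extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Sample markup from the question.
html_doc = (
    '<table height="100%" border="0" cellpadding="0" cellspacing="0"> '
    '<tr><td width="4" class="cll">Hello</td> '
    '<td class="tex" nowrap>Output Status</td>'
    '<td width="4" class="clr">100%</td></tr></table>'
)

# With a single capture group, findall yields a list of the captured strings.
cells = extract_td_regex.findall(html_doc)
result = ", ".join(cells)
print(result)  # Hello, Output Status, 100%
```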

Here's the regex101

Note that this will fail if there are tags (like spans) inside the td elements.
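
To make that limitation concrete, here is a small sketch on a made-up row (the markup and variable names are assumptions). A cell that wraps its text in a `span` is silently skipped. So is a cell whose attributes contain characters outside the pattern's character class, such as the `:` in an inline style. An opening-tag pattern like the question's `<td[^>]*>` is more tolerant of attributes, but it still cannot see through nested tags.

```python
import re

# Pattern from this answer: attribute characters limited to word chars, quotes, = and whitespace.
extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Hypothetical row: one plain cell, one cell wrapping a <span>, one with an inline style.
row = (
    '<tr><td class="a">plain</td>'
    '<td class="b"><span>wrapped</span></td>'
    '<td style="color: red">styled</td></tr>'
)

found = extract_td_regex.findall(row)
print(found)  # ['plain']

# <td[^>]*> accepts any attribute text, so the styled cell is picked up;
# the cell containing a <span> is still skipped.
tolerant = re.findall(r"<td[^>]*>([^<>]+)</td", row)
print(tolerant)  # ['plain', 'styled']
```

Neither pattern raises an error for the cells it misses, which is one reason a parser is usually the safer choice.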