
I'm not very familiar with regex, and I need to extract all the text between <td> and </td>. So far I have only managed to match the opening <td> tag when it carries attributes (width, class, and so on). I need to skip the tags themselves (<table>, <tr>, <td>), with or without attributes, and keep only the cell contents.

<td[^>]*>

Example:

<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>

Output Desired:

Hello, Output Status, 100%

In some cases there will be &nbsp; between these tags, and I'd like to skip those too.

Zoe
Shinomoto Asakura

2 Answers


You'll want to use an HTML parser such as BeautifulSoup. You mentioned that your backend is Python; if you don't already have BeautifulSoup, install it with pip:

pip install beautifulsoup4

This should give you exactly what you are looking for:

from bs4 import BeautifulSoup

html_doc = """
<p class="story">...</p>
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

td_list = soup.find_all('td')
td_list_text = []

for td in td_list:
    td_list_text.append(td.text)

my_string = ", ".join(td_list_text)
print(my_string)

Output:

Hello, Output Status, 100%

You can read more about the options available here: https://www.crummy.com/software/BeautifulSoup/
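If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. To be clear, this swaps BeautifulSoup for a different technique, and the class name TdTextExtractor is made up for illustration. The sample also adds a <td>&nbsp;</td> cell and a <span> that are not in the original markup, to show that empty cells are skipped and nested tags are tolerated. It assumes there are no nested tables.

```python
from html.parser import HTMLParser


class TdTextExtractor(HTMLParser):
    """Collect the text of each <td> cell (hypothetical helper, not a library class)."""

    def __init__(self):
        # convert_charrefs=True makes &nbsp; arrive in handle_data as "\xa0"
        super().__init__(convert_charrefs=True)
        self.cells = []      # text of every non-empty cell seen so far
        self._parts = None   # fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            # str.strip() also removes non-breaking spaces
            text = "".join(self._parts).strip()
            if text:  # skip cells that held only whitespace or &nbsp;
                self.cells.append(text)
            self._parts = None


# Sample based on the question, with an &nbsp; cell and a <span> added (assumptions)
html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output <span>Status</span></td><td width="4" class="clr">100%</td></tr></table>
"""

parser = TdTextExtractor()
parser.feed(html_doc)
parser.close()

result = ", ".join(parser.cells)
print(result)
```

Because the text is collected per cell rather than per match, tags nested inside a td do not break the extraction.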

sniperd

Caveat upfront:

Using regexes on HTML is inherently error-prone, and many well-intentioned people will tell you to never do it ever. I generally recommend using an HTML parser like in sniperd's answer.

But for simple data extraction (e.g. no tag nesting) regexes are sometimes just fine:

import re
extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

Let's break that down:

"<td"         # start td tag
"[\w\"'=\s]*" # match any word character, white space, =, ', " zero or more times
">"           # close opening td tag
"([^><]+)"    # capture group: one or more characters that are *not* > or <
"<\/td"       # closing td tag

The capture group will contain the inner td contents.
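As a rough usage sketch, here is the same pattern run with findall over the markup from the question. The extra <td>&nbsp;</td> cell is an assumption added to show one way of dropping the &nbsp; cells the question mentions; the variable names are just for illustration.

```python
import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Markup from the question, plus one &nbsp; cell (added for illustration)
html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

# With a single capture group, findall returns the captured text of each match
cells = extract_td_regex.findall(html_doc)

# Treat &nbsp; as whitespace, trim, and drop cells that end up empty
cleaned = [cell.replace("&nbsp;", " ").strip() for cell in cells]
cleaned = [cell for cell in cleaned if cell]

result = ", ".join(cleaned)
print(result)
```

The regex sees the raw entity text, so &nbsp; has to be handled by hand here; an HTML parser would decode it for you.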

Here's the regex101

Note that this will fail if the td elements contain nested tags (such as spans).
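To make that limitation concrete, here is a small sketch with made-up input strings. It also shows a second case worth knowing about: the attribute character class only allows word characters, quotes, = and whitespace, so an attribute value containing something like a colon stops the match as well.

```python
import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Hypothetical inputs, chosen only to illustrate the edge cases
plain = '<td class="tex">Output Status</td>'
nested = '<td class="tex"><span>Output</span> Status</td>'
styled = '<td style="width:4px">Hello</td>'

plain_cells = extract_td_regex.findall(plain)    # flat cell: captured
nested_cells = extract_td_regex.findall(nested)  # <span> inside the cell: no match
styled_cells = extract_td_regex.findall(styled)  # ":" is not in the attribute class: no match

print(plain_cells, nested_cells, styled_cells)
```

The pattern from the question, <td[^>]*>, is more permissive about attributes, but the nested-tag problem remains, which is where a parser becomes the safer choice.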

Jared Smith