I'm trying to extract the price from the following HTML.

<td>$75.00/<span class='small font-weight-bold text-danger'>Piece</span></small> *some more text here* </td>

What regular expression would extract the number 75.00?

Is it something like:

<td>$*/<span class='small font-weight-bold text-danger'>
jwpfox
  • Most often it is .*, but regex syntax varies between languages. You then usually have to mark the part you want to capture with parentheses, so it would be "(.*)/". – user unknown Apr 01 '18 at 01:30
  • Obligatory, [You Shouldn't Be Using Regex To Parse HTML](https://stackoverflow.com/a/1732454/1547004). Use an actual DOM parser like `beautifulsoup` or `requests-html` – Brendan Abel Apr 01 '18 at 02:54
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Ulysse BN Apr 01 '18 at 02:54

3 Answers

The dollar sign is a special character in regex, so you need to escape it with a backslash. Also, you only want to capture digits, so you should use character classes.

<td>\$(\d+[.]\d\d)/<span

As the other respondent mentioned, regex changes a bit with each implementing language, so you may have to make some adjustments, but this should get you started.
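
As a minimal sketch of how this pattern might be applied, here is a Python version. It assumes the HTML from the question sits on a single line and that the price is always followed by the slash shown there; the variable names are only illustrative.

```python
import re

# HTML from the question, joined onto one line.
html = ("<td>$75.00/<span class='small font-weight-bold text-danger'>"
        "Piece</span></small> *some more text here* </td>")

# \$ matches a literal dollar sign. The parentheses capture one or more
# digits, a literal period, and exactly two more digits before the slash.
pattern = re.compile(r"<td>\$(\d+[.]\d\d)/<span")

match = pattern.search(html)
price = match.group(1) if match else None
print(price)  # 75.00
```

Checking for `None` before calling `group` avoids an exception when the markup does not have the expected shape.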

bikeonastick

I think you can go with /[0-9]+\.[0-9]+/.

  • [0-9] matches a single digit. In this example it matches the 7.
  • The + afterwards means "one or more of the preceding item". So [0-9]+ will match 75. It stops there because the character after 5 is a period.
  • Next we add a period to the regex and make sure it's escaped. An unescaped period means "any single character". Escaped, it matches only a literal period. So we have /[0-9]+\./ so far.
  • Finally we just need to add another [0-9]+ so it will find the digits after the period too.

It's important that you don't give it the global flag, as in /[0-9]+\.[0-9]+/g, unless you want it to find more than just the first number/period combination.
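
The difference between stopping at the first match and collecting every match can be sketched in Python, where `search` plays the role of the non-global regex and `findall` roughly corresponds to the global flag. The sample text with two prices is made up for illustration and is not from the question.

```python
import re

# Hypothetical input containing two prices.
text = "<td>$75.00/<span>Piece</span> or $12.50/<span>Box</span></td>"

# One or more digits, a literal period, one or more digits.
pattern = re.compile(r"[0-9]+\.[0-9]+")

first = pattern.search(text).group(0)   # first match only
all_prices = pattern.findall(text)      # every non-overlapping match

print(first)       # 75.00
print(all_prices)  # ['75.00', '12.50']
```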


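There is another regex you can use. It uses parentheses to group the part you're looking for, like this: /<td>\$(.+)<span/

It will match everything from <td>$ up to <span. From there you can pull out the group you're looking for. Note that .+ captures every character in between, so with the HTML from the question the group would be 75.00/ including the trailing slash; moving the slash outside the parentheses, as in /<td>\$(.+)\/<span/, leaves it out. See the examples below.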

// JavaScript

const text  = "<td>$something<span class='small font-weight..."
const regex = /<td>\$(.+)<span/  // no g flag, so exec() always starts at the beginning
const match = regex.exec(text)   // returns an Array, or null if nothing matched

console.log( match[1] ) // prints out "something"

# Python

import re

text = "<td>$something<span class='small font-weight..."
regex = re.compile(r"<td>\$(.+)<span")

print( regex.search(text).group(1) )  # prints out "something"
TommyS

As an alternative you could use a DOMParser.

Wrap your <td> inside a table (the HTML parser discards a <td> that is not inside one), use for example querySelector to get your element, and get the first node from its childNodes.

That would give you $75.00/.

To remove the $ and the trailing forward slash you could use slice or use a regex like \$(\d+\.\d+) and get the value from capture group 1.

let html = `<table><tr><td>$75.00/<span class='small font-weight-bold text-danger'>Piece</span></small> *some more text here* </td></tr></table>`;
let parser = new DOMParser();
let doc = parser.parseFromString(html, "text/html");
let result = doc.querySelector("td");
let textContent = result.childNodes.item(0).nodeValue;
console.log(textContent.slice(1, -1));
console.log(textContent.match(/\$(\d+\.\d+)/)[1]);
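
Outside the browser, the same parse-first, extract-second idea can be sketched in Python with the standard library's `html.parser`. This is only an illustration under assumptions: the class name `FirstCellText` is hypothetical, and it assumes the price is the first text node directly inside the first `<td>`.

```python
import re
from html.parser import HTMLParser


class FirstCellText(HTMLParser):
    """Remembers the first text node that directly follows an opening <td>."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.first_text = None

    def handle_starttag(self, tag, attrs):
        # Only text that comes right after <td> counts; any other tag resets it.
        self.in_td = (tag == "td")

    def handle_data(self, data):
        if self.in_td and self.first_text is None:
            self.first_text = data


html = ("<table><tr><td>$75.00/<span class='small font-weight-bold text-danger'>"
        "Piece</span></small> *some more text here* </td></tr></table>")

parser = FirstCellText()
parser.feed(html)
parser.close()

print(parser.first_text)  # $75.00/
price = re.search(r"\$(\d+\.\d+)", parser.first_text).group(1)
print(price)  # 75.00
```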
The fourth bird