I have an HTML document that contains CSS inside a <style> tag under the <head> tag. I want to use a regex to extract all the text in the HTML, only the plain text between the HTML tags. I tried:

console.log(HTML_TEXT.replace(/(<([^>]+)>)/g, ""))

This replaces every tag (everything between < and >) with an empty string. The problem is that the CSS code inside the <style> tag is still there, so I want to know how to write a regular expression that also removes the CSS code inside <style> tags.

How do I solve this problem?

Emma
Ali Salhi
  • 4
    Have you tried .innerText? That's what it's for. – Diodeus - James MacFarlane Apr 30 '19 at 17:35
  • 2
    [Obligatory reference](https://stackoverflow.com/a/1732454/1715579) -- try using DOM methods instead. e.g. [Parse an HTML string with JS](https://stackoverflow.com/questions/10585029/parse-an-html-string-with-js) – p.s.w.g Apr 30 '19 at 17:38
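
The DOM route suggested in the comments can be sketched as follows. This is a minimal sketch that assumes a browser environment where `DOMParser` is available; the helper name `extractTextWithDom` is hypothetical and not from the original post. It removes `<style>` and `<script>` elements before reading the text, because `textContent` would otherwise include their contents.

```javascript
// Hypothetical helper; assumes a browser environment where DOMParser exists.
function extractTextWithDom(htmlText) {
  // Parse the string into a detached HTML document.
  const doc = new DOMParser().parseFromString(htmlText, 'text/html');
  // Drop elements whose contents are code rather than readable text.
  doc.querySelectorAll('style, script').forEach((el) => el.remove());
  // textContent returns the concatenated text of the remaining nodes.
  return doc.body.textContent;
}

// Only run the demo where DOMParser is defined (for example, in a browser).
if (typeof DOMParser !== 'undefined') {
  console.log(extractTextWithDom('<style>p{color:red}</style><p>Hello</p>'));
}
```

Because the parser handles nesting, attributes, and comments itself, this avoids most of the edge cases a regex has to guess at.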

1 Answer

This RegEx might help you to do so:

(\>)(.+)(<\/style>)
  • The right boundary is a capturing group: (<\/style>)
  • The left boundary is another capturing group: (\>), which you can extend with additional boundaries if necessary
  • Between them is an unbounded middle capturing group, (.+), which holds your target. You can refer to it as $2 and replace it with an empty string or anything else.
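
To make the three groups concrete, here is a small sketch on a single-line input. The input string and the narrower left boundary `(<style>)` are assumptions chosen for illustration, not part of the original pattern.

```javascript
// Illustrative input; the narrower left boundary (<style>) is an assumption.
const html = '<style>p{color:red}</style><p>Hi</p>';

// Group 1 = opening tag, group 2 = the CSS, group 3 = closing tag.
const pattern = /(<style>)(.+)(<\/style>)/;

// Inspect what each group captured.
const [, left, middle, right] = html.match(pattern);

// Keep the boundaries ($1 and $3) and drop the middle ($2).
const emptied = html.replace(pattern, '$1$3');

console.log(middle);  // p{color:red}
console.log(emptied); // <style></style><p>Hi</p>
```

In the replacement string, `$1` and `$3` are written without backslashes; `replace` substitutes them with the text of the corresponding groups.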

I'm not sure, as I did not test it, but your code might look something like:

console.log(HTML_TEXT.replace(/(\>)(.+)(<\/style>)/g, '$1$3'))
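
One caveat is worth checking before relying on this: in JavaScript, `.` does not match line breaks unless the `s` flag is set, so `(.+)` cannot span a stylesheet written over several lines. The sketch below compares the two behaviours on a hypothetical multi-line input; `[\s\S]` is a common way to match any character including line breaks.

```javascript
// Hypothetical multi-line input for illustration.
const html = '<style>\nbody { margin: 0; }\n</style><p>Hello</p>';

// `.` stops at line breaks, so this pattern finds nothing to replace.
const withDot = html.replace(/(>)(.+)(<\/style>)/g, '$1$3');

// [\s\S] matches any character, including line breaks.
const withClass = html.replace(/(<style>)([\s\S]+?)(<\/style>)/g, '$1$3');

console.log(withDot === html); // true
console.log(withClass);        // <style></style><p>Hello</p>
```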

This post explains how to do a string replace in JavaScript.

Edit:

Based on the comment, this RegEx might help you to filter your tags using $1:

(\<style type=\"text\/css\"\>)([\s\S]*)(\<\/style\>)
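
Two things may need adjusting in the pattern above. The greedy `[\s\S]*` runs from the first opening tag to the last `</style>`, so with several style blocks it would also swallow the text between them, and the left boundary only matches an exact `type="text/css"` attribute. The following is a sketch under those assumptions: the helper name `extractText` and the sample input are hypothetical, the quantifier is made lazy, and `[^>]*` accepts any attributes.

```javascript
// Hypothetical helper: remove <style> blocks first, then strip remaining tags.
function extractText(htmlText) {
  return htmlText
    // Lazy [\s\S]*? stops at the nearest </style>; [^>]* allows any attributes.
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, '')
    // The tag-stripping pattern from the question.
    .replace(/(<([^>]+)>)/g, '')
    .trim();
}

const sample =
  '<head><style type="text/css">h1 { color: red; }</style></head>' +
  '<body><h1>Title</h1><style>p { margin: 0; }</style><p>Body text</p></body>';

// For comparison: the greedy version also removes the heading between the blocks.
const greedy = sample
  .replace(/<style\b[^>]*>[\s\S]*<\/style>/gi, '')
  .replace(/(<([^>]+)>)/g, '');

console.log(extractText(sample)); // TitleBody text
console.log(greedy);              // Body text
```

Even with these changes a regex remains fragile on real-world HTML (for example, a `>` inside an attribute value), which is why the comments on the question point to DOM methods.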

Emma