Regex to remove everything before and after the first <p> tag

Question

I need to get the content of the first p tag in a string (but without the actual tags).

Example:

<h1>I don't want the title</h1>
<p>This is the text I want</p>
<p>I don't want this</p>
<p>I also don't want this</p>

I guess I need to find everything else and replace it with nothing? But how do I create the regex?

REGEX is not the right tool to parse HTML ! Use a proper parser. Are you using Linux shell ? See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Gilles Quénot, Nov 27 '14 at 18:12
@sputnick that's unfortunately not an option. I need to do this in vbscript. — Peter Schrøder, Nov 27 '14 at 18:16

score 1 · Answer 1 · answered Nov 27 '14 at 21:15

Try something like this:

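' Read the file and let the HTMLFile COM object parse the markup.
Set fso  = CreateObject("Scripting.FileSystemObject")
Set html = CreateObject("HTMLFile")
html.write fso.OpenTextFile("C:\path\to\your.html").ReadAll
' Index 0 of the returned collection is the first <p> element.
Set p = html.getElementsByTagName("p")
' innerText returns the element's content without the tags.
WScript.Echo p(0).innerText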
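
Since the question starts from a string rather than a file, the same parser approach can be wrapped in a small function. This is only a sketch: the function name FirstParagraphText and the sample input are made up for illustration, and it assumes the HTMLFile object is available on the machine.

```vbscript
Option Explicit

' Hypothetical helper: returns the text of the first <p> in an HTML string,
' or an empty string if the markup contains no <p> element.
Function FirstParagraphText(htmlString)
    Dim doc, paragraphs
    Set doc = CreateObject("HTMLFile")
    doc.write htmlString
    doc.close
    Set paragraphs = doc.getElementsByTagName("p")
    If paragraphs.length > 0 Then
        FirstParagraphText = paragraphs(0).innerText
    Else
        FirstParagraphText = ""
    End If
End Function

WScript.Echo FirstParagraphText("<h1>Title</h1><p>This is the text I want</p><p>Not this</p>")
```

The length check avoids a runtime error when the markup has no <p> at all. In classic ASP, Response.Write would take the place of WScript.Echo.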

Ansgar Wiechers

alpha bravo · Accepted Answer · 2014-11-27T21:38:23.053

Use this pattern to capture what you want (the text ends up in capturing group 1):

^[\s\S]*?<p>([^<>]*?)<\/p>

^               # Start of string (or line, in multiline mode)
[\s\S]          # Any character, including line breaks
*?              # Zero or more times (lazy)
<p>             # Literal "<p>"
(               # Start of capturing group 1
  [^<>]         # Any character except "<" or ">"
  *?            # Zero or more times (lazy)
)               # End of capturing group 1
<\/p>           # Literal "</p>"
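
A minimal sketch of how this pattern might be used from VBScript with the built-in RegExp object. The sample string is taken from the question; the variable names and the IgnoreCase setting are assumptions added for illustration.

```vbscript
Option Explicit

' Sample input based on the question's example.
Dim html
html = "<h1>I don't want the title</h1>" & vbCrLf & _
       "<p>This is the text I want</p>" & vbCrLf & _
       "<p>I don't want this</p>"

Dim re, matches
Set re = New RegExp
re.Pattern = "^[\s\S]*?<p>([^<>]*?)<\/p>"
re.IgnoreCase = True   ' assumption: also accept <P>...</P>
re.Global = False      ' only the first match is needed

Set matches = re.Execute(html)
If matches.Count > 0 Then
    ' SubMatches(0) holds capturing group 1, the text between the tags.
    WScript.Echo matches(0).SubMatches(0)
Else
    WScript.Echo "No <p> element found"
End If
```

Because group 1 excludes "<" and ">", a first paragraph that contains nested tags such as <b> will not be captured by this pattern. In classic ASP, Response.Write would take the place of WScript.Echo.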

Or use this pattern to match everything else and replace it with nothing:

^[\s\S]*?<p>|<\/p>[\s\S]*$

^               # Start of string (or line, in multiline mode)
[\s\S]          # Any character, including line breaks
*?              # Zero or more times (lazy)
<p>             # Literal "<p>"
|               # OR
<\/p>           # Literal "</p>"
[\s\S]          # Any character, including line breaks
*               # Zero or more times (greedy)
$               # End of string (or line, in multiline mode)
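
A minimal sketch of the replace variant in VBScript, again using the built-in RegExp object. The sample string and variable names are assumptions for illustration. Note that Global has to be True; otherwise only the part before the first <p> is removed and the trailing markup stays.

```vbscript
Option Explicit

' Sample input based on the question's example.
Dim html
html = "<h1>I don't want the title</h1>" & vbCrLf & _
       "<p>This is the text I want</p>" & vbCrLf & _
       "<p>I don't want this</p>"

Dim re
Set re = New RegExp
re.Pattern = "^[\s\S]*?<p>|<\/p>[\s\S]*$"
re.Global = True       ' replace both the leading and the trailing match
re.IgnoreCase = True   ' assumption: also accept <P>...</P>

' Everything up to the first <p> and everything from the first </p> onward
' is replaced with an empty string.
WScript.Echo re.Replace(html, "")
```

If the input contains no <p> at all, neither alternative matches the way it is intended to, so it is worth checking the result before using it.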

score 0 · Answer 3 · edited May 23 '17 at 12:20

You can do it properly with an XPath expression:

//p[1]/text()

Adapted from "Navigating XML nodes in VBScript, for a Dummy":

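Set objDoc = CreateObject("MSXML.DOMDocument")
objDoc.async = False   ' wait until the document is fully loaded
objDoc.Load "C:\Temp\Test.xml"

' Find a particular node using XPath:
Set objNode = objDoc.selectSingleNode("//p[1]/text()")

' selectSingleNode returns Nothing if loading failed or nothing matched.
' A text node has no attributes, so read its nodeValue instead.
If objNode Is Nothing Then
    MsgBox "No matching node found"
Else
    MsgBox objNode.nodeValue
End If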

answered Nov 27 '14 at 18:15

Gilles Quénot

How would that give me the content of the first p tag? – Peter Schrøder Nov 27 '14 at 18:17
I tried your example with a few modifications, but the last line returns "Object required". Set objDoc = CreateObject("MSXML.DOMDocument") objDoc.LoadXML(myHtmlString) Set objNode = objDoc.selectSingleNode("//p[1]/text()") response.write objNode.getAttribute("value") – Peter Schrøder Nov 27 '14 at 19:15
HTML and XML are not the same. – Ansgar Wiechers Nov 27 '14 at 21:16
http://stackoverflow.com/questions/9822520/parsing-xhtml-with-xpath-using-microsoft-xmlhttp-in-vbscript (found with a two-second search!) – Gilles Quénot Nov 27 '14 at 21:44
