3

I would like to remove any paragraph from the article body that contains curly brackets.

For example, from this piece of content:

<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here&nbsp;if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology &amp; What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five &nbsp;−&nbsp; &nbsp;=&nbsp; 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>

I would like to remove this part:

<p>five &nbsp;−&nbsp; &nbsp;=&nbsp; 2 .hide-if-no-js { display: none !important; } </p>

Using the following regex: <p>.*?\{.*?\}.*?</p>

It removes the whole article instead of this paragraph that contains curly braces, for some strange reason...

What am I doing wrong with the regex code? Thanks!

Andrey Kurnikovs
  • What language are you using? In Java, I would match all `<p>` tags and then check each one for the presence of braces. – Tim Biegeleisen Feb 22 '16 at 16:47
  • I'm not sure if this is what's resulting in the entire article being removed but you should escape the slash in the closing paragraph tag. `<\/p>` because it can be interpreted as a delimiter. – Michael Feb 22 '16 at 16:49
  • @TimBiegeleisen: Exactly implemented this in PHP - see the answer below :) – Jan Feb 22 '16 at 18:17

4 Answers

0

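Lazy and greedy quantifiers do not always work as intended. Your pattern starts matching at the first <p>, and the lazy .*? is still allowed to run across </p> boundaries until it finds a {, so the match stretches from the first paragraph to the one with the braces. Instead, match any character except <, which keeps the match inside a single paragraph (as long as the paragraph contains no nested tags). This works for me: <p>[^<]*\{[^<]*</p>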
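To see the difference between the two patterns, here is a minimal Python sketch. The shortened `html` string is a made-up stand-in for the article body from the question, and the variable names are only illustrative.

```python
import re

# Shortened, hypothetical stand-in for the article body in the question.
html = (
    "<p>First paragraph.</p> <h2>Related Posts</h2> "
    "<p>Name</p> <p>2 .hide-if-no-js { display: none !important; } </p>"
    "<h2>Recent Posts</h2>"
)

# Pattern from the question: the lazy .*? may still cross </p> boundaries.
lazy = re.compile(r"<p>.*?\{.*?\}.*?</p>")

# Pattern from this answer: [^<]* cannot leave the current paragraph.
bounded = re.compile(r"<p>[^<]*\{[^<]*</p>")

lazy_result = lazy.sub("", html)
bounded_result = bounded.sub("", html)

print(lazy_result)     # everything from the first <p> onwards was swallowed
print(bounded_result)  # only the paragraph containing braces was removed
```

The negated character class assumes that the paragraphs contain no nested tags; for markup with inline elements a real HTML parser is the safer choice.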

Máté Juhász
  • @Shafizadeh: "need to create a capturing group and replace it with $1 to remove `<p>` and `</p>`" - he wants to remove everything, not only `<p>` and `</p>`. – Máté Juhász Feb 22 '16 at 17:10
  • @Michael: thanks for your comment, however depending on the flavour it can work without escaping (I've tried with python:)) – Máté Juhász Feb 22 '16 at 17:11
0

Try this:

var str = '<p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here&nbsp;if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology &amp; What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five &nbsp;−&nbsp; &nbsp;=&nbsp; 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>';
// [^<]* keeps the match inside one paragraph; the g flag removes every such paragraph
var result = str.replace(/<p>[^<]*\{[^<]*<\/p>/g, '');
console.log(result);

Regex Demo

Shafizadeh
  • Just ran the snippet and there's still `<p>five  −   =  2 .hide-if-no-js { display: none !important; }` at the end of the string. – Michael Feb 22 '16 at 17:00
0

I'd suggest a two step approach (parsing and analyzing the text node).
Below you'll find examples for both Python and PHP (they could be adapted for other languages, obviously):

Python:

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup

html = """
<html>
    <p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here&nbsp;if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology &amp; What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five &nbsp;−&nbsp; &nbsp;=&nbsp; 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
# match anything between curly brackets
regex = r'\{[^}]+\}'
for p in soup.find_all('p', string=re.compile(regex)):
    # drop the whole <p> element from the tree
    p.decompose()

print(soup)

PHP:

<?php
$html = "<html>
            <p>While orthotic inserts are able to provide great support and pain relief, they aren’t quite as good as a specialty shoe. Remember that an ill-fitting insert can cause permanent damage and talk to a podiatrist about your foot pain for the best recommendation. Click here&nbsp;if you want to learn more about pain in the foot arch unrelated to plantar fasciitis.</p> <h2>Related Posts</h2> <h2>So What Are These Socks Really Good For?</h2> <h2>Are the bottom of your feet causing you problems?</h2> <h2>A PF Relief Guide</h2> <h2>What is Foot Reflexology &amp; What is it Good For?</h2> <h2>Leave a Reply Cancel reply</h2> <p>Your email address will not be published. Required fields are marked *</p> <p>Name</p> <p>Email</p> <p>Website</p> <p>five &nbsp;−&nbsp; &nbsp;=&nbsp; 2 .hide-if-no-js { display: none !important; } </p><h2>Food For Thought January 2016</h2> <h2>Show Us Some Social Love!!</h2> <h2>Recent Posts</h2> <li> The Climate Pledge of Resistance</li> <li> Green Activism in Boulder, Colorado</li> <li> The Truth About Money and Happiness</li> <li> Why Is There So Much Skepticism About Climate Change?</li> <li> Which Device Would Work Best For You?</li>
        </html>";

$html = str_replace('&nbsp;', ' ', $html); // only because of the &nbsp;
$xml = simplexml_load_string($html);

# look for p tags
$lines = $xml->xpath("//p");

# the actual regex - match anything between curly brackets
$regex = '~{[^}]+}~';

for ($i=0;$i<count($lines);$i++) {
    if (preg_match($regex, $lines[$i]->__toString())) {
        # unset it if it matches
        unset($lines[$i][0]); 
    }
}
// the matching paragraph has vanished without a trace
print_r($xml);

// convert it back to a string
$html = $xml->asXML();
echo $html;
?>
Jan