Given the following text:

<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p> // Should match
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p> // Should match
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p> // Should match
<p>Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p> // Should match

I am trying to find any p elements which match the following criteria:

  • They begin with an asterisk character
  • They have a margin-left inline style
  • The preceding content is either:
    • A p element which has no margin-left
    • A p element with a margin-left which is lower than the matched element
    • Any other element

So in the example, I need to match the following elements:

<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but doesn't have any margin-left)
<p style="margin-left: 20px">* Sub Item 1a</p> (preceding element is a p but has a lower margin-left value)
<p style="margin-left: 20px">* Sub Item 1b</p> (preceding element is a p but has a lower margin-left value)
<p style="margin-left: 30px">* Sub Item 1c</p> (preceding element is a p but has a margin-left value lower than the current matched element)
<p style="color:blue; margin-left: 10px">* Item 1</p> (preceding element is a p but has no margin-left value)

I cannot use DOMDocument because the markup I receive is not always valid (it generally comes from a Microsoft Office to HTML conversion), so I am using regular expressions to solve the problem.

My current regex is:

(?!<p style=".*?(margin-left:\s?(?!\k'margin')px;).*?">\* .*?<\/p>)<p style="(?P<styles>.*?)margin-left:\s?(?P<margin>[0-9]{1,3})px;?">\* (?P<listcontent>.*)<\/p>

But this only matches based on the existence of a preceding p element with a margin-left.

How can I factor in the matched margin-left group and return values which are greater than the previous match?

I have created an online regex to demonstrate the problem, with sample data and my current output.

Amo
  • Does this have to be done in a single pass/method? Could you match all tags then use PHP to reduce the set? – sjdaws Jun 11 '17 at 11:38
  • As it's part of a greater series of manipulations it needs to be done as a single regex if possible. – Amo Jun 11 '17 at 12:45
  • I don't believe this is possible in a single pass since you need to compare values with other values. You will be able to find `p` elements which contain `margin-left` but you'll need a secondary process to do the comparisons. – sjdaws Jun 11 '17 at 12:50
  • If there's no way to do it as one regex and you have a way to solve it using two passes, please write it up as an answer. – Amo Jun 11 '17 at 13:03
  • I'll write up an answer, one question: how does the last element match? It's preceded by a p element which doesn't have a margin left therefore it fails the test 'Any element other than a p element'? – sjdaws Jun 11 '17 at 13:21
  • I've updated to make it clearer. I meant any p tag which has no margin set, or any other element. – Amo Jun 11 '17 at 14:04
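
The limitation raised in the comments can be illustrated with a short sketch. A backreference re-matches the same text, so a pattern can test whether two margins are identical, but it has no way to test whether one is larger. The pattern and sample strings below are illustrative assumptions, not taken from the question.

```php
<?php
// Illustrative pattern (an assumption, not from the question): \1 re-matches the
// exact digits captured by the first group, which is an equality test on text.
$sameMargin = '/margin-left: (\d+)px.*?margin-left: \1px/s';

$equal  = "margin-left: 10px\nmargin-left: 10px";
$deeper = "margin-left: 10px\nmargin-left: 20px";

var_dump(preg_match($sameMargin, $equal));  // int(1): the two margins are identical
var_dump(preg_match($sameMargin, $deeper)); // int(0): no way to ask for "greater than"
```

Any numeric comparison therefore has to happen in PHP once the values have been captured.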

1 Answer

This code works as expected: a regex grabs every element, then a loop iterates over the matches and applies the business logic:

<?php

$data = '<p style="color: blue">Some text</p>
<p style="color:blue; margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>
<p style="margin-left: 20px">* Sub Item 2a</p>
<p style="margin-left: 10px">* Item 3</p>
<p style="margin-left: 20px">* Sub Item 1b</p>
<p style="margin-left: 20px">* Sub Item 2b</p>
<p style="margin-left: 30px">* Sub Item 1c</p>
<div>Some text</div>
<p style="color:blue; margin-left: 10px">* Item 1</p>';

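// Get all HTML elements: the tag name in [1], the attributes (style etc.) in [2], the content in [3].
// Single quotes keep the \1 backreference literal, so each closing tag must match its opening tag.
preg_match_all('/<(\w+)\b([^>]*)>(.*?)<\/\1>/', $data, $matches);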

$results = [];

// Keep track of the last element's margin-left; if it is missing it is set to 0, so the next
// element is included automatically if it has a margin-left
$lastMarginLeft = 0;

// Loop through matches and apply business rules
for ($i = 0; $i < count($matches[0]); $i++) {
    /**
     * Business rules:
     * - Contents begins with an asterisk character
     * - Elements have a margin-left inline style
     * - The preceding content is either:
     *   - A p element which has no margin-left
     *   - A p element with a margin-left which is lower than the matched element
     *   - Any other element
     */

    // Assume no margin-left found by default
    $marginLeft = 0;

    // Check element has a margin-left
    if (strpos($matches[2][$i], 'margin-left') !== false) {
        // Extract margin-left value
        preg_match("/margin-left:\s?(\d+)/", $matches[2][$i], $value);
        $marginLeft = isset($value[1]) ? (int) $value[1] : 0;

        // Check if this margin is greater than the last
        if ($marginLeft > $lastMarginLeft) {
            // Check content
            if (strpos($matches[3][$i], '*') === 0) {
                $results[] = $matches[0][$i];
            }
        }
    }

    // Capture margin left for next run
    $lastMarginLeft = $marginLeft;
}

print_r($results);

// Results:
// Array
// (
//     [0] => <p style="color:blue; margin-left: 10px">* Item 1</p>
//     [1] => <p style="margin-left: 20px">* Sub Item 1a</p>
//     [2] => <p style="margin-left: 20px">* Sub Item 1b</p>
//     [3] => <p style="margin-left: 30px">* Sub Item 1c</p>
//     [4] => <p style="color:blue; margin-left: 10px">* Item 1</p>
// )
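
If the check has to sit inside a chain of regex-based manipulations, one possible variation is to keep the "last margin" state inside a `preg_replace_callback()` closure, so matching and comparison happen in one call. This is only a sketch under assumptions: the helper name `markListStarts` and the `<!--list-start-->` marker are hypothetical, and the pattern expects flat, non-nested elements like the sample data.

```php
<?php
// Hypothetical helper (not part of the code above): prefixes every paragraph
// that opens a new list level with a marker, in one preg_replace_callback() call.
function markListStarts(string $html): string
{
    $lastMargin = 0; // margin-left of the previous element; 0 when it had none

    return preg_replace_callback(
        '/<(\w+)\b([^>]*)>(.*?)<\/\1>/s',
        function (array $m) use (&$lastMargin): string {
            // Only p elements contribute a margin; any other element counts as 0.
            $margin = 0;
            if (strtolower($m[1]) === 'p'
                && preg_match('/margin-left:\s*(\d+)px/i', $m[2], $value)) {
                $margin = (int) $value[1];
            }

            // A list level starts where an asterisk paragraph is indented
            // further than the element immediately before it.
            $startsList = $margin > $lastMargin
                && strpos(ltrim($m[3]), '*') === 0;

            $lastMargin = $margin;

            // The marker is a placeholder assumption; swap in whatever the
            // later manipulations need.
            return $startsList ? '<!--list-start-->' . $m[0] : $m[0];
        },
        $html
    );
}

// Usage with a shortened sample
$html = '<p>Intro</p>
<p style="margin-left: 10px">* Item 1</p>
<p style="margin-left: 10px">* Item 2</p>
<p style="margin-left: 20px">* Sub Item 1a</p>';

echo markListStarts($html), PHP_EOL;
```

With this shortened sample, the marker is placed before "Item 1" and "Sub Item 1a" only. The comparison still happens in PHP rather than in the pattern itself; the closure simply carries the previous margin from one match to the next.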
sjdaws