
I have HTML strings, from which I need to extract HTML substrings (summary, keywords, ...). The strings look like:

const content = "<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
...
</p>"

The aim is to get:

summary = "<br />Some text with <strong>HTML</strong> tags...<br /><br />"
keywords = "keyword1, keyword2,..."

For the parsing I use the Cheerio library, which makes it possible to use jQuery methods on the parsed HTML. I have tried, for example, the following approaches, but none of them works:

Simple nextUntil():

const $ = cheerio.load(content);
console.log($("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" ).html());
// Returns: "Summary" 

nextUntil() with a for loop:

const $ = cheerio.load(content);
let container = $('<container/>');
for (let i = 0; i < $("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" ).length; i++) {
  container.append($("strong:contains('Summary')").nextUntil( "strong:contains('Keywords')" )[i]);
}
console.log('container: ', container.html());
// Returns: "<strong>Summary</strong>" 
Antonín Slejška

3 Answers


The approach with nextUntil() does not work because there are no sibling elements to the given <strong> DOM elements that contain any usable content (HTML). Instead, the content exists only as text within the parent <p> element.

We will have to apply some kind of regex-matching method, as shown below. Please be aware that if the Summary and Keywords sections appear more than once, only the last occurrence of each will be kept:

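const content = $("<p>\n\
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />\n\
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />\n\
...\n\
</p>").html(); // jQuery's html() extracts the innerHTML of the outer <p> element

// The capture group keeps the matched heading labels in the split result:
// arr = [before, "Summary", summaryHtml, "Keywords", keywordsHtml]
const arr = content.split(/<strong>(Summary|Keywords)<\/strong>/);

// Collect the sections in an object instead of creating globals on window
const sections = {};
for (let i = 1; i < arr.length; i += 2) sections[arr[i]] = arr[i + 1];

console.log('\nsummary:', sections.Summary, '\nkeywords:', sections.Keywords);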
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
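The split-based idea can also be wrapped in a small reusable function that needs no DOM library. The following is only a sketch: `splitSections`, `sampleHtml`, and `parsedSections` are hypothetical names introduced here for illustration, and it assumes every heading has exactly the form `<strong>Label</strong>`.

```javascript
// Hypothetical helper (not part of Cheerio or jQuery): split an HTML string
// into sections keyed by the heading label that precedes each one.
function splitSections(html, labels) {
  // Escape regex metacharacters so each label is matched literally.
  const escaped = labels.map((label) => label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  // The capture group makes split() keep each matched label in the result.
  const pattern = new RegExp("<strong>(" + escaped.join("|") + ")</strong>");
  const parts = html.split(pattern);

  // parts = [before, label1, body1, label2, body2, ...]
  const result = {};
  for (let i = 1; i < parts.length; i += 2) {
    result[parts[i]] = parts[i + 1]; // a repeated label overwrites the earlier one
  }
  return result;
}

const sampleHtml =
  "<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />" +
  "<strong>Keywords</strong> keyword1, keyword2,...<br /><br />";

const parsedSections = splitSections(sampleHtml, ["Summary", "Keywords"]);
console.log(parsedSections.Summary);
console.log(parsedSections.Keywords);
```

Because only the listed labels act as separators, the inner `<strong>HTML</strong>` stays inside the Summary section. The Keywords value still carries its leading space and trailing `<br /><br />`, so it would need trimming to match the output asked for in the question.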
Carsten Massmann
  • `"there are no sibling elements to the given <strong> DOM elements"` - I might be misunderstanding, but aren't the `<br>` elements the sibling elements? **Edit:** doh, I thought you meant my answer; ignore me!
    – user7290573 Aug 28 '19 at 15:00
  • Yes, strictly speaking you are right! The `<br>`s are siblings of the `<strong>` elements. But, unfortunately, they are of little help when you need to extract the HTML content between them.
    – Carsten Massmann Aug 28 '19 at 15:03

I think the problem stems from the Summary and Keywords text not being sibling elements of their respective headings.

You could instead parse the HTML string with a regex:

const content = '<p>' + 
'<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />' +
'<strong>Keywords</strong> keyword1, keyword2,...<br /><br />' +
'</p>';

var summary = content.match('<strong>Summary</strong><br />(.*?)<br /><br />');
var keywords = content.match('<strong>Keywords</strong> (.*?)<br /><br />');
alert (summary[1]);
alert (keywords[1]);
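Two details of the match-based approach are worth guarding against: `match()` returns `null` when nothing matches, and `.` does not match line breaks, so a section that wraps across lines would be missed. The sketch below addresses both. `extractSection` and `multilineHtml` are hypothetical names, and the code assumes the label contains no regex metacharacters and that every section ends with `<br /><br />`.

```javascript
// Hypothetical helper: return the HTML that follows <strong>label</strong>,
// up to the first double <br />, or null if the label is not found.
function extractSection(html, label) {
  // [\s\S] matches any character, including newlines (unlike ".").
  const pattern = new RegExp("<strong>" + label + "</strong>([\\s\\S]*?)<br /><br />");
  const match = html.match(pattern);
  // Guard against a missing section instead of indexing into null.
  return match ? match[1].trim() : null;
}

const multilineHtml = `<p>
<strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
<strong>Keywords</strong> keyword1, keyword2,...<br /><br />
</p>`;

console.log(extractSection(multilineHtml, "Summary"));
console.log(extractSection(multilineHtml, "Keywords"));
console.log(extractSection(multilineHtml, "Authors"));
```

Unlike the snippet above, this version keeps the leading `<br />` of the summary, but it still drops the trailing `<br /><br />`, which would have to be re-appended if it is needed.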
Scott Cook

Here's a different approach; hacky, but working:

const content = `<p>
    <strong>Summary</strong><br />Some text with <strong>HTML</strong> tags...<br /><br />
    <strong>Keywords</strong> keyword1, keyword2,...<br /><br />
    ...
    </p>`,
    html = $(content);

const summary  = getHtml(html.find("strong:contains(Summary)"));
const keywords = getHtml(html.find("strong:contains(Keywords)"));

console.log(summary);
console.log(keywords);

function getHtml(html) {
    const summary = [];
    let currentEl = html.prop("nextSibling");

    // Stop when the siblings run out, so a missing <br><br> terminator cannot throw
    while (currentEl) {
        const nextEl = currentEl.nextSibling;

        // If the current and next element are both <br>, the end is reached
        if (currentEl.tagName === "BR" && nextEl && nextEl.tagName === "BR") {
            // If this is "Keywords", don't add the trailing <br> elements
            if (html.text().trim() !== "Keywords") {
                // summary.push("<br><br>") would also work here
                summary.push(currentEl.outerHTML, nextEl.outerHTML);
            }
            break;
        }

        // nodeType 1 = element
        // nodeType 3 = text
        const content = currentEl.nodeType === 1 ? currentEl.outerHTML : currentEl.textContent;

        // Push HTML string and continue
        summary.push(content);
        currentEl = nextEl;
    }

    return summary.join("").trim();
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
user7290573