Wrapping Text That Isn't Inside an Anchor Tag With Nokogiri

Question

I have some HTML:

<p>Lorem ipsum example laoreet. <a href="#">example</a>Cum porttitor</p>
<p>Phasellus <a href="#">gravida tempor example</a> posuere. Fusce vitae urna eu example magna</p>

I need to wrap a span around every instance of the text 'example' in the HTML unless it is inside an anchor tag, so that the above would become:

<p>Lorem ipsum <span class="something">example</span> laoreet. <a href="#">example</a>Cum porttitor</p>
<p>Phasellus <a href="#">gravida tempor example</a> posuere. Fusce vitae urna eu <span class="something">example</span> magna</p>

I can select the content of paragraphs that isn't inside an anchor tag using:

doc.xpath('//p//text()') - doc.xpath('//p//a/text()')

I can wrap tags around the text content of another tag using:

doc.search('div.some-class text()').wrap('<span class="something"></span>')

But how do I wrap tags around text within that content?

Just as an aid to those helping you, reduce your HTML to the bare minimum needed to show the problem or act as sample input. Try to fit it into as small a space as possible while keeping it readable. In this case the HTML is so long, because of a bunch of unnecessary Lorem text, that it's scrolling when there's no need for that. – the Tin Man, Aug 14 '13 at 14:34

score 1 · Answer 1 · edited May 23 '17 at 12:05

The text() XPath selector can also be used to match on text content, as described in the related question "Using XPath, How do I select a node based on its text content and value of an attribute?":

doc.xpath("//p//text()[contains(., 'example')]")

But I don't think this would work:

doc.search("div.some-class text()='example'").wrap('<span class="something"></span>')

answered Aug 14 '13 at 11:54

MurifoX

This doesn't answer my question. – Undistraction Aug 14 '13 at 11:56

score 1 · Answer 2 · edited May 23 '17 at 10:25

You will probably have to manipulate the text node in question in Ruby, and then replace it in the document with the new text that Nokogiri will parse for you.

doc.xpath('//p/descendant-or-self::node()[name() != "a"]/text()[contains(., "example")]').each do |n|
  n.replace(n.content.gsub(/(example)/, '<span class="something">\1</span>'))
end

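In this example I’ve used a slightly more complex XPath query than you have. It selects all text node descendants of any p element unless their parent is an a element, which I think is what you want. Note that text nested deeper inside a link, as in <a><b>example</b></a>, would still be matched; //p//text()[not(ancestor::a)] excludes that too. (I don’t know if this is better for you; try it and see.)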

The bit that answers your question is the contents of the block. Here I take the string content of each of these text nodes and use gsub to create a new string of markup with the new span elements in. I then use replace to put this fragment in place of the original text node in the document. Nokogiri will parse this string and add the created nodes in place of the original text node. This is in many ways similar to the Tin Man’s answer but is more targeted as it only involves using gsub and re-parsing the text nodes in question.

the Tin Man · Answer 3 · 2013-08-14T15:10:20.007

Here's how I'd do it:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<p>Lorem ipsum example sit amet. <a href="#">example</a>Sed porttitor</p>
<p>Phasellus <a href="#">tempor example</a> posuere. Example </p>
EOT

a_tags = doc.search('a')

new_doc = Nokogiri::HTML(
  doc.to_html.gsub(
    /\b (example) \b/ix,
    '<span class="foo">\1</span>'
  )
)
new_doc.search('a').each do |a_tag|
  a_tag.replace(a_tags.shift)
end

puts new_doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>Lorem ipsum <span class="foo">example</span> sit amet. <a href="#">example</a>Sed porttitor</p>
# >> <p>Phasellus <a href="#">tempor example</a> posuere. <span class="foo">Example</span> </p>
# >> </body></html>

Basically it does this:

a_tags = doc.search('a') grabs all the existing <a> tags to remember them.
I convert the doc DOM back into HTML using Nokogiri for consistency using to_html, then do a global search and replace to wrap all "example" instances in a <span>, then reparse it into a new DOM. Notice I'm using /\b (example) \b/ix for the search and \1 in the replace. Why I'm using a capture and the flags are for you to research but you should notice it's letting me find and process either "example" or "Example".
Loop through the document looking for the <a> tags again, and replace each one with its original version. This will clean up any that were mangled by the gsub in the previous step.
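To make the capture and the flags concrete, here is a small sketch in plain Ruby with no Nokogiri involved. The names PATTERN and wrap_word are made up for this illustration.

```ruby
# x: whitespace inside the pattern is ignored, so the spaces are only for readability.
# i: the match is case-insensitive.
# \b: word boundary, so "examples" is not matched.
PATTERN = /\b (example) \b/ix

def wrap_word(text)
  # \1 re-inserts whatever the capture group matched, preserving its original case.
  text.gsub(PATTERN, '<span class="foo">\1</span>')
end

puts wrap_word('An example and an Example, but not examples.')
```

Because the replacement uses \1 rather than a literal "example", a capitalised "Example" stays capitalised inside the span.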

It's a little more brute-force than I like, but it's also straightforward. This will break if the word "example" appears inside a tag itself, for instance in an attribute value, because the gsub runs over the raw HTML.

Maybe one of the smart XPath folks will chime in with something more elegant.

score 0 · Answer 4 · answered Aug 14 '13 at 18:11

Here's how I did it in the end:

doc = Nokogiri::HTML(html)
# Select paragraph content that isn't inside an anchor tag
elements = doc.xpath('//p//text()') - doc.xpath('//p//a/text()')
# Iterate over the text nodes, wrapping 'phrase' in an anchor tag
elements.each do |element|
  element.content = element.content.gsub(phrase, "<a href='#' class='glossary-term-link' data-content='#{definition.html_safe}'>#{phrase}</a>")
end
# Fix Nokogiri's lust for escaping angle brackets no matter what
doc.xpath('//body')[0].inner_html.gsub("&lt;", "<").gsub("&gt;", ">")
