Replace markup (as a string) including certain inline elements

Question

My intent is to modify a sentence within a tag.

For example change:

<div id="1">
  This is text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
     <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

To this:

<div id="1">
  This is modified text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
   </div>
</div>

That means I need to traverse the nodes, grab a tag, and collect all of its text and style nodes without grabbing the child tags, then modify the sentences and put them back. I would need to do this for each tag that contains text until all the content was modified.

For example grabbing the text and style nodes for div#1 would be: "This is text in the TD with strong tags" but as you can see, none of the other text underneath would be grabbed. It should be accessible and modifiable through a variable.

div#1.text_with_formating= "This is modified text in the TD with <strong> strong </strong> tags"

The code below removes all content, not just the child tags, whereas keeping contents leaves everything in place, even the tags under div#1. Therefore, I'm not sure how to proceed.

Sanitize.clean(h,{:elements => %w[b em i strong u],:remove_contents=>'true'})

How would you recommend solving this?

Note that it is syntactically illegal to have an `id="..."` attribute that starts with a number. I have changed your ids in my answer below. — Phrogz, Jan 15 '13 at 20:17
I don't fully understand your question. Is it: (a) how do I modify text in an HTML document without affecting existing markup, or (b) how do I strip out specific style-like elements (replacing them with their text content), or (c) how do I get and set the HTML markup for an element including elements like `<strong>`? I've answered below assuming that you want (a). — Phrogz, Jan 15 '13 at 20:25
I'm sorry I wasn't clear. The question is closest to (c), but I would modify it: (c) How do I get and set text() together with inline style elements? For example, I want to modify the whole line: "This is text in the TD with strong tags", from start to end, but nothing else. — user1896290, Jan 15 '13 at 21:40
This is a duplicate of http://stackoverflow.com/questions/14224594/nokogiri-grab-text-with-formating-and-link-tags-em-strong-a-etc-etc — the Tin Man, Jan 16 '13 at 15:38
Tin Man, you stated these are duplicates, but you also asked me to clarify my questions, so I did in another post. This question got much better responses than the original. I tried deleting the other ones, but I could not because they had been answered, even though the answers were not the solution. Additionally, you closed this post, which was actually starting to become helpful. How would you like me to resolve this? If you have advice on how to better use Stack Overflow, I will listen. — user1896290, Jan 16 '13 at 16:02
I vote to re-open this question: while it may be a "bad" idea to perform a gsub on markup (compared to just gsubbing the text itself), it is not (IMO) an uncommon desire. Perhaps this was closed because the question was misinterpreted; I've edited the title based on new understanding of the question based on comments. — Phrogz, Jan 16 '13 at 18:57
The question and the subsequent one were closed because the voters don't see it as having much use by the community at large. The question has been asked three times by the same user, with minor variations. The original question remains open and should be the one allowed to continue, not reopening this one which would become a duplicate again. This question will remain visible, and can be referenced using the "Linked" section on the right side of the page. — the Tin Man, Jan 17 '13 at 00:32
Thanks, @theTinMan; for some reason I totally missed your duplicate comment. — Phrogz, Jan 17 '13 at 02:39
@theTinMan Can you delete the other post? There is no need for duplicates, and this one resolves the issue. It seems you have higher access than I do. — user1896290, Jan 17 '13 at 15:18

Phrogz · Answer 1 · 2013-01-16T19:23:54.317

If you want to find all the text nodes underneath an element, use:

text_pieces = div.xpath('.//text()')

If you want to find only the text that is an immediate child of an element, use:

text_pieces = div.xpath('text()')

For each text node, you can change the content any way you like. Just be sure to assign through my_text_node.content = ... rather than calling my_text_node.content.gsub!(...), which only mutates a temporary string and leaves the node unchanged.

# Replace text that is a direct child of an element
def gsub_my_text!( el, find, replace=nil, &block )
    el.xpath('text()').each do |text|
        next if text.content.strip.empty?
        text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
    end
end

# Replace text beneath an element.
def gsub_text!( el, find, replace=nil, &block )
    el.xpath('.//text()').each do |text|
        next if text.content.strip.empty?
        text.content = replace ? text.content.gsub(find,replace,&block) : text.content.gsub(find,&block)
    end
end


d1 = doc.at('#d1')
gsub_my_text!( d1, /[aeiou]+/ ){ |found| found.upcase }

puts d1
#=> <div id="d1">
#=>   ThIs Is tExt In thE TD wIth <strong> strong </strong> tAgs
#=>   <p>This is a child node. with <b> bold </b> tags</p>
#=>   <div id="d2">
#=>       "another line of text to a <a href="link.html"> link </a>"
#=>      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
#=>   </div>
#=> </div>


gsub_text!( d1, /\w+/, '(\\0)' )
puts d1
#=> <div id="d1">
#=>   (ThIs) (Is) (tExt) (In) (thE) (TD) (wIth) <strong> (strong) </strong> (tAgs)
#=>   <p>(This) (is) (a) (child) (node). (with) <b> (bold) </b> (tags)</p>
#=>   <div id="d2">
#=>       "(another) (line) (of) (text) (to) (a) <a href="link.html"> (link) </a>"
#=>      <p> (This) (is) (text) (inside) (a) (div) <em>(inside)<em> (another) (div) (inside) (a) (paragraph) (tag)</em></em></p>
#=>   </div>
#=> </div>

Edit: Here is code that allows you to extract runs of text+inline markup as a string, run a gsub on that, and replace the result with new markup.

require 'nokogiri'

doc = Nokogiri.HTML '<div id="d1">
  Text with <strong>strong</strong> tag.
  <p>This is a child node. with <b>bold</b> tags.</p>
  <div id=d2>And now we are in <a href="foo">another</a> div.</div>
  Hooray for <em>me!</em>
</div>'

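module Enumerable
  # http://stackoverflow.com/q/4800337/405017
  # Split into runs of consecutive items, dropping every item for which the
  # block is truthy. A nil key tells chunk to drop the item and end the run.
  def split_on() chunk{|o| yield(o) ? nil : true }.map{|_,run| run } end
end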

require 'set'
# Given a node, call gsub on each run of text plus allowed inline markup among its children
def gsub_markup!( node, find, replace=nil, &replace_block )
  allowed = Set.new(%w[strong b em i u strike])
  runs  = node.children.split_on{ |el| el.node_type==1 && !allowed.include?(el.name) }
  runs.each do |nodes|
    orig   = nodes.map{ |node| node.node_type==3 ? node.content : node.to_html }.join
    next if orig.strip.empty? # Skip whitespace-only nodes
    result = replace ? orig.gsub(find,replace) : orig.gsub(find,&replace_block)
    puts "I'm replacing #{orig.inspect} with #{result.inspect}" if $DEBUG
    nodes[1..-1].each(&:remove)
    nodes.first.replace(result)
  end
end

d1 = doc.at('#d1')

$DEBUG = true
gsub_markup!( d1, /[aeiou]+/, &:upcase )
#=> I'm replacing "\n  Text with <strong>strong</strong> tag.\n  " with "\n  TExt wIth <strOng>strOng</strOng> tAg.\n  "
#=> I'm replacing "\n  Hooray for <em>me!</em>\n" with "\n  HOOrAy fOr <Em>mE!</Em>\n"

puts doc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><div id="d1">
#=>   TExt wIth <strong>strOng</strong> tAg.
#=>   <p>This is a child node. with <b>bold</b> tags.</p>
#=>   <div id="d2">And now we are in <a href="foo">another</a> div.</div>
#=>   HOOrAy fOr <em>mE!</em>
#=> </div></body></html>
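
The run-splitting step can be tried in isolation on a plain array. This is a sketch built on `Enumerable#chunk`, where a `nil` key drops an element and ends the current run. The helper name `runs_between` and the symbols standing in for nodes are made up for illustration.

```ruby
# Group consecutive elements into runs, dropping every element for which
# the block is truthy (those elements act as separators).
def runs_between(items)
  items.chunk { |item| yield(item) ? nil : true }.map { |_key, run| run }
end

# Symbols stand in for nodes: :p and :div play the block-level separators,
# everything else is text or allowed inline markup.
children = [:text, :strong, :text, :p, :text, :div, :text, :em, :text]
runs = runs_between(children) { |name| %i[p div].include?(name) }

p runs
#=> [[:text, :strong, :text], [:text], [:text, :em, :text]]
```

The middle run corresponds to the whitespace-only text between the `<p>` and the inner `<div>`, which is why the answer's code skips runs whose joined markup is empty after `strip`.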

This might be a better example of what I want to do. Replace the full line: "This is text in the TD with strong tags" with another sentence like "This is the replacement sentence". The key is grabbing the text() along with all the inline style tags. The style tags are seen as nodes, and all I want is the text with its style tags, but not the block-level or other non-style tags, so I can set and get them. I hope this helps. I feel I'm not expressing myself well. — user1896290, Jan 15 '13 at 21:47
You are expressing yourself clearly now. There are edge cases that I'm not sure how you want to handle, however. How should `<div>Hello World!<div id="d2">NO NO NO</div>OH YES!</div>` be handled? Clearly `NO NO NO` should not appear when you read, but should `OH YES!` show up when you read the content? If you get `"New result"` back, where should it go? Before d2, after d2, on both sides? — Phrogz, Jan 15 '13 at 23:48
The ideal scenario would be to parse "Hello World!" as one string, then grab "OH YES!" as a second string. Therefore the placement of "New result" should not be an issue. — user1896290, Jan 16 '13 at 03:07
See my edit for new code that I believe does what you are asking for. — Phrogz, Jan 16 '13 at 18:55
+1 for sticking with it and getting something to work. You actually deserve some contracting fees being paid, but you'll need to take that up with the OP. See the comment by @pguardiario in his answer about hiring someone. :-) — the Tin Man, Jan 17 '13 at 00:35
+100 to Phrogz! I really can't thank you enough, it helped so much! — user1896290, Jan 17 '13 at 15:17

score 0 · Answer 2 · answered Jan 15 '13 at 05:56

The easiest way would be:

div = doc.at('div#1') 
div.replace div.to_s.sub('text', 'modified text')

pguardiario

The above was just an example. The div would be unknown, as the code will be traversing down the HTML page to find the text. Additionally, the modified text was not a literal example. The key to the question is how to grab and modify the text with style tags, but not grab the child tags like <p> and div#2, etc. Thank you for your feedback. – user1896290 Jan 15 '13 at 18:41
In that case use: `div = doc.at('div[text()*="text to match"]')` – pguardiario Jan 15 '13 at 23:52
I am parsing arbitrary HTML documents, so I need to parse the text to get the text with styles; hence I can't match on specific text or sentences. The criterion is any text with its styling tags. It's more like I need text() | em/text() | strong/text() | b/text(), but only one level down, not two levels down. However, when I try to use XPath, I don't seem to get all the nodes. – user1896290 Jan 16 '13 at 02:44
It sounds like you need to give a better example of what you're trying to accomplish, but then again it appears that this question has been closed for being too localized, so therefore I suggest you might consider hiring someone to help you. – pguardiario Jan 16 '13 at 09:56
Solved above, thank you for the input. – user1896290 Jan 17 '13 at 15:19
