Questions tagged [nokogiri]

An HTML, XML, SAX and Reader parser for Ruby with the ability to search documents via XPath or CSS3 selectors… and much more

Nokogiri (鋸) is an HTML, XML, SAX and Reader parser for Ruby. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.

See the Nokogiri cheat-sheet for tips on using Nokogiri.

A digest of most of the methods documented at nokogiri.org. Reading the source can help, too.

From the Nokogiri readme:

XML is like violence - if it doesn’t solve your problems, you are not using enough of it.

3699 questions
18
votes
3 answers

How can I add a child to a node at a specific position?

I have a node which has two children: an HTML text and an HTML element.

Installation on server

In this case the HTML text is: Installation on server…
Manuel
  • 461
  • 1
  • 3
  • 8
18
votes
1 answer

Ruby open-uri can't open url (m1 mac)

I'm starting to learn Ruby and scraping. When I try to open a URL with open, I get: lib/scrapper.rb:7:in `initialize': No such file or directory @ rb_sysopen - https://en.wikipedia.org/wiki/Douglas_Adams (Errno::ENOENT) from lib/scrapper.rb:7:in `open'…
Maxime Crespo
  • 199
  • 1
  • 10
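
The `rb_sysopen` error quoted above is what `Kernel#open` raises when it treats a URL string as a local file path, which is the behavior on Ruby 3.0 and later, where open-uri no longer overrides `Kernel#open`. A minimal sketch of the usual workaround, assuming that is the cause here: call `URI.open` explicitly. The helper name `fetch_page` is hypothetical and does not come from the question.

```ruby
require "open-uri"

# Hypothetical helper (not from the question): read a resource through open-uri.
# URI.open handles http, https and ftp URLs; for any other string it falls
# back to Kernel#open, so a plain file path is read from disk.
def fetch_page(url)
  URI.open(url, &:read)
end

# Usage (needs network access):
#   html = fetch_page("https://en.wikipedia.org/wiki/Douglas_Adams")
```

Calling `URI.open` directly also makes it obvious at the call site that a network request may happen, which a bare `open` does not.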
18
votes
1 answer

Find and replace entire HTML nodes with Nokogiri

I have an HTML document that should be transformed by replacing some tags with other tags. I don't know these tags in advance, because they will come from the database. So the set_attribute or name methods of Nokogiri are not suitable for me. I need to do it, in a way,…
AntonAL
  • 16,692
  • 21
  • 80
  • 114
18
votes
2 answers

How can I create a nokogiri case insensitive Xpath selector?

I'm using Nokogiri to select the 'keywords' attribute like this: puts page.parser.xpath("//meta[@name='keywords']").to_html. One of the pages I'm working with has the keywords label with a capital "K", which has motivated me to make the query case…
Rick
  • 810
  • 7
  • 10
18
votes
6 answers

WARNING: Nokogiri was built against LibXML version 2.7.3, but has dynamically loaded 2.7.8

After making a fresh install of Mac OS X 10.8 Mountain Lion, and after installing Ruby 1.9.3 and Ruby on Rails 3.2.6, I started the Rails console and I got this warning message: WARNING: Nokogiri was built against LibXML version 2.7.3, but has …
David Morales
  • 17,816
  • 12
  • 77
  • 105
17
votes
7 answers

XPath axis, get all following nodes until

I have the following example of HTML:

Foo bar

lorem

ipsum

etc

Bar baz

dum dum dum

poopfiddles

I'm looking to extract all paragraphs following…
Lee Jarvis
  • 16,031
  • 4
  • 38
  • 40
17
votes
2 answers

Parsing simple XML with Nokogiri

I have the following XML: Title 1 http://www.example.com/url-1 Title 2 http://www.example.com/url-2 Title…
asked Oct 15 '10 at 01:25
Vincent
  • 16,086
  • 18
  • 67
  • 73
17
votes
10 answers

Failure to install nokogiri libiconv is missing on Yosemite Mac OS X 10.10

Trying to install Nokogiri, I’m getting the following error: Maxims-MacBook-Air:ScrapingTheApple maximveksler$ gem install nokogiri Fetching: nokogiri-1.6.2.1.gem (100%) Building native extensions. This could take a while... Building nokogiri using…
asked Jul 12 '14 at 09:19
Maxim Veksler
  • 29,272
  • 38
  • 131
  • 151
17
votes
1 answer

set tag attribute and add plain text content to the tag using nokogiri builder (ruby)

I am trying to build XML using Nokogiri with some tags that have both attributes and plain text inside the tag. So I am trying to get to this: <?xml version="1.0"?> <Transaction requestName="OrderRequest"> <Option…
asked Apr 25 '13 at 15:51
fflyer05
  • 873
  • 11
  • 18
17
votes
4 answers

Inserting and deleting XML nodes and elements using Nokogiri

I want to extract parts of an XML file and make a note that I extracted some part in that file, like "here something was extracted". I'm trying to do this with Nokogiri, but it doesn't seem to be documented how to: delete all children of a…
asked Aug 13 '09 at 21:44
hans zeckinger
16
votes
2 answers

Can't install Nokogiri gem, "libxml/parser.h" not found, but its there, why?

I tried to install Nokogiri but I always get a compile error: checking for libxml/parser.h... *** extconf.rb failed *** but I've installed it and all other dependencies. I tried to give the installer hints like this: %> gem install nokogiri --…
asked Mar 06 '12 at 18:54
Manuel Görlich
  • 169
  • 1
  • 4
16
votes
3 answers

Strip style attributes with nokogiri

I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes. How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to blacklist remove not…
asked May 23 '11 at 11:03
keepitterron
  • 1,052
  • 1
  • 8
  • 12
16
votes
3 answers

How do I search for "text" then traverse the DOM from the found node?

I have a webpage that I need to scrape some data from. The problem is, each page may or may not have specific data, or it may have extra data above or below it in the DOM, and there are no CSS ids to speak of. Typically I could use either CSS ids or…
asked Apr 25 '11 at 03:42
Nick Faraday
  • 598
  • 5
  • 23
16
votes
2 answers

Using Nokogiri HTML Builder to create fragment with multiple root nodes

I have a simple problem with Nokogiri. I want Nokogiri::HTML::Builder to make an HTML fragment of the following form: <div> #Some stuff in here </div> <div> #Some other stuff in here </div> When trying to do: @builder =…
asked Feb 05 '11 at 11:51
Gerry
  • 5,326
  • 1
  • 23
  • 33
16
votes
3 answers

How to parse a HTML table with Nokogiri?

I'm trying to parse a table but I don't know how to save the data from it. I want to save the data in each row to look like: ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] The sample table is: html = <<EOT <table class="open"> …
asked Jan 14 '16 at 04:09
verrom
  • 419
  • 6
  • 18