1

I am trying to parse an HTML page using Nokogiri to get some company names.

names = []
names << Nokogiri::HTML(mypage).css(".name a").text

My result is:

["MikeGetsLeadsUruBlondeLaunch LIVERynoRyderBoyer ProductionsStrangerxCerealLume CubeKatapyMacaulay Outdoor PromotionsFlixit ABMedia MosaicLiftCast.TVcool.mediaPeekKLIKseeStreamingo SolutionsPvgnaalughaUser"]

But what I'd like to get is:

["MikeGetsLeads", "Uru", "Blonde", "Launch LIVE", "RynoRyderBoyer Productions", "Stranger", "xCereal", "Lume Cube", "Katapy", "Macaulay Outdoor Promotions", "Flixit AB", "Media Mosaic", "LiftCast.TV", "cool.media", "Peek", "KLIKsee", "Streamingo Solutions", "Pvgna", "alugha", "User"]

I tried to use .split, but it does not give me the right result either. On this page, each name belongs to its own <div>, so the names are clearly separated in the HTML structure.

The HTML structure looks like this:

<div class='name'>
<a href="https://angel.co/mikegetsleads-2" class="startup-link" data-id="1217822" data-type="Startup">MikeGetsLeads</a>
</div>
Eric
  • Can I take a look at the HTML you want to parse? Can you paste it in your question? – maicher Jun 24 '16 at 12:35
  • Based on the result of your Nokogiri snippet, it does not appear to be possible to generate the array you want. Some more details about where you are getting this data from would be helpful. – Sinstein Jun 24 '16 at 12:37
  • Thanks for your comments! – Eric Jun 24 '16 at 12:39
  • The result you say you're getting from Nokogiri has a lot more information in it than the HTML snippet you've posted. My Nokogiri is rusty but I suspect what you really want is something like `Nokogiri::HTML(mypage).css(".name a").map(&:text)` – Jordan Running Jun 24 '16 at 12:54
  • Many thanks @jordan, it works!! You rock :) – Eric Jun 24 '16 at 12:57

2 Answers

0
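require 'nokogiri'
require 'pp'

names = []
# File.read returns the file's contents and closes the file afterwards.
mypage = File.read("myhtml.html")
# Call text on each matched link, not on the whole NodeSet.
Nokogiri::HTML(mypage).css(".name a").each do |item|
  names << item.text
end

pp names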

returns:

["MikeGetsLeads", "MikeGetsLeads2", "MikeGetsLeads3"]
rwaffen
0

The problem is, you are using text with a NodeSet, not with individual nodes. With a NodeSet all the text is concatenated into a single String. Per the NodeSet.inner_text AKA text documentation:

Get the inner text of all contained Node objects

and the actual code is:

def inner_text
  collect(&:inner_text).join('')
end

whereas Node.content AKA text or inner_text

Returns the content for this Node

Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div>
  <p>foo</p>
  <p>bar</p>
</div>
EOT

doc.css('p').class # => Nokogiri::XML::NodeSet
doc.css('p').text # => "foobar"

Instead, you need to use text on individual nodes:

doc.css('p').map{ |n| n.class } # => [Nokogiri::XML::Element, Nokogiri::XML::Element]
doc.css('p').map{ |n| n.text } # => ["foo", "bar"]

The previous line can be simplified to:

doc.css('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".

the Tin Man