Properly separate String elements in an Array

Question

I am trying to parse an HTML page using Nokogiri to get some company names.

names = []
names << Nokogiri::HTML(mypage).css(".name a").text

My result is:

["MikeGetsLeadsUruBlondeLaunch LIVERynoRyderBoyer ProductionsStrangerxCerealLume CubeKatapyMacaulay Outdoor PromotionsFlixit ABMedia MosaicLiftCast.TVcool.mediaPeekKLIKseeStreamingo SolutionsPvgnaalughaUser"]

But what I'd like to get is:

["MikeGetsLeads", "Uru", "Blonde", "Launch LIVE", "RynoRyderBoyer Productions", "Stranger", "xCereal", "Lume Cube", "Katapy", "Macaulay Outdoor Promotions", "Flixit AB", "Media Mosaic", "LiftCast.TV", "cool.media", "Peek", "KLIKsee", "Streamingo Solutions", "Pvgna", "alugha", "User"]

I tried to use .split, but it does not give me the right result either. On this page, each name sits in its own <div>, so the names are clearly separated in the HTML structure.

The HTML structure looks like this

<div class='name'>
<a href="https://angel.co/mikegetsleads-2" class="startup-link" data-id="1217822" data-type="Startup">MikeGetsLeads</a>
</div>

Can I see what the HTML you want to parse looks like? Can you paste it in your question? – maicher, Jun 24 '16 at 12:35
Based on the result of your Nokogiri snippet, it does not appear to be possible to generate the array you want. Some more details about where you are getting this data from would be helpful. – Sinstein, Jun 24 '16 at 12:37
The result you say you're getting from Nokogiri has a lot more information in it than the HTML snippet you've posted. My Nokogiri is rusty, but I suspect what you really want is something like `Nokogiri::HTML(mypage).css(".name a").map(&:text)` – Jordan Running, Jun 24 '16 at 12:54

score 0 · Answer 1 · answered Jun 24 '16 at 13:01

require 'nokogiri'
require 'pp'

# Read the saved page; File.read closes the file handle for us.
mypage = File.read("myhtml.html")

# Collect the text of each matching link individually.
names = []
Nokogiri::HTML(mypage).css(".name a").each do |item|
  names << item.text
end

pp names

returns:

["MikeGetsLeads", "MikeGetsLeads2", "MikeGetsLeads3"]

rwaffen


Great, thanks @rwaffen, it works too! I am sorry, I am currently learning Ruby, so I'm kind of a noob... – Eric Jun 24 '16 at 13:08
Maybe use `names = Nokogiri::HTML(mypage).css(".name a").map(&:text)` – Lukas Baliak Jun 24 '16 at 13:24
@Eric You shouldn't apologize. Nokogiri's behavior in this situation (calling `text` on a NodeSet object) is a little counterintuitive. – Jordan Running Jun 24 '16 at 14:55
@LukasBaliak okay cool... i'm not so used to the map method... should get used to it :) – rwaffen Jun 27 '16 at 12:18

score 0 · Accepted Answer · edited May 23 '17 at 12:02

The problem is that you are calling text on a NodeSet, not on individual nodes. On a NodeSet, the text of every contained node is concatenated into a single String. Per the documentation for NodeSet#inner_text (aliased as text):

Get the inner text of all contained Node objects

and the actual code is:

def inner_text
  collect(&:inner_text).join('')
end

whereas Node#content (aliased as text and inner_text):

Returns the content for this Node
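
To see why .split cannot undo this, here is a plain-Ruby sketch of the same collect-and-join pattern. FakeNode and FakeNodeSet are hypothetical stand-ins written for this illustration; they are not Nokogiri classes, and only the text method mirrors the source quoted above.

```ruby
# FakeNode and FakeNodeSet are hypothetical stand-ins, not Nokogiri classes.
FakeNode = Struct.new(:text)

class FakeNodeSet
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # Same shape as the NodeSet#inner_text source quoted above.
  def text
    collect(&:text).join('')
  end
end

set = FakeNodeSet.new([FakeNode.new('Launch LIVE'), FakeNode.new('Lume Cube')])

joined   = set.text        # => "Launch LIVELume Cube"
separate = set.map(&:text) # => ["Launch LIVE", "Lume Cube"]

# No delimiter survives the join, so split cannot recover the original names.
p joined.split(' ') # => ["Launch", "LIVELume", "Cube"]
```

Because join('') inserts nothing between the pieces, the boundaries between names are gone by the time you hold the String. The fix has to happen before the join, by asking each node for its own text.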

Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div>
  <p>foo</p>
  <p>bar</p>
</div>
EOT

doc.css('p').class # => Nokogiri::XML::NodeSet
doc.css('p').text # => "foobar"

Instead, you need to use text on individual nodes:

doc.css('p').map{ |n| n.class } # => [Nokogiri::XML::Element, Nokogiri::XML::Element]
doc.css('p').map{ |n| n.text } # => ["foo", "bar"]

The previous line can be simplified to:

doc.css('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".
