
I am trying to use Nokogiri's CSS method to get some names from my HTML.

This is an example of the HTML:

<section class="container partner-customer padding-bottom--60">
    <div>
        <div>
            <a id="technologies"></a>
            <h4 class="center-align">The Team</h4>
        </div>
    </div>
    <div class="consultant list-across wrap">
        <div class="engineering">
            <img class="" src="https://v0001.jpg" alt="Person 1"/>
            <p>Person 1<br>Founder, Chairman &amp; CTO</p>
        </div>
        <div class="engineering">
            <img class="" src="https://v0002.png" alt="Person 2"/></a>
            <p>Person 2<br>Founder, VP of Engineering</p>
        </div>
        <div class="product">
            <img class="" src="https://v0003.jpg" alt="Person 3"/></a>
            <p>Person 3<br>Product</p>
        </div>
        <div class="Human Resources &amp; Admin">
            <img class="" src="https://v0004.jpg" alt="Person 4"/></a>
            <p>Person 4<br>People &amp; Places</p>
        </div>
        <div class="alliances">
            <img class="" src="https://v0005.jpg" alt="Person 5"/></a>
            <p>Person 5<br>VP of Alliances</p>
        </div>

What I have so far in my people.rake file is the following:

  require 'open-uri'

  staff_site = Nokogiri::HTML(URI.open("https://www.website.com/company/team-all"))
  all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)

I am having a little trouble getting the value of each alt attribute (the name of the person), as the img tags are nested under a few divs.

Currently, using div.consultant, it gets all the names plus the roles, e.g. Person 1Founder, Chairman; CTO, instead of just the person's name in alt=.

How can I get just the value of the alt attribute?

the Tin Man
theGreenCabbage
  • Please read "[mcve]". Your HTML is invalid; please make sure that closing tags are in the right places. Without those, Nokogiri will put them where it thinks they should be, which can vary wildly from what you expect. What is your expected output? – the Tin Man Aug 24 '16 at 21:10

1 Answer

Your desired output isn't clear and the HTML is broken.

Start with this:

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]

Calling text on the result of css isn't a good idea. css returns a NodeSet, and text on a NodeSet concatenates the text of every node it contains. The result is often mangled, and you end up writing ugly code to pull the pieces apart again:

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"

This behavior is documented in NodeSet#text:

Get the inner text of all contained Node objects

Instead, call text (also known as inner_text or content) on each individual node. That returns the exact text for that node, which you can then join however you want:

Returns the content for this Node

doc.search('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".

the Tin Man