SCRAPING RUBY with REGEX /w

Question

I want to scrape the website https://www.bananatic.com/es/forum/games/

and extract the "name", "views" and "replies" elements. My main problem is getting only the non-empty content of the "name" element. Can you help me? I need to keep only the elements that actually contain text.

This is my code. I have three variables:

per saves what is inside the replies.
pir saves what is inside the views.
res saves what is inside the names.

Each array should contain only the elements that have some content, but for the names, entries like [" "] are saved as well, and I do not want those in my array.

    require 'nokogiri'
    require 'open-uri'
    require 'pp'
    require 'csv'


    unless File.readable?('data.html')
      url = 'https://www.bananatic.com/de/forum/games/'
      data = URI.open(url).read
      File.open('data.html', 'wb') { |f| f << data }
    end
    data = File.read('data.html')
    document = Nokogiri::HTML(data)


    per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\d+/] }

    p per

    pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\w+/] }

    p pir

    links2 = document.css('.topics ul li div')
    res = links2.map do |lk|
      name = lk.css('.name  p a').inner_text
      [name]
    end
    p res

To fix it I tried a regular expression, but the attempt failed. I just replaced .inner_text with .to_s[/\w+/], and it does not do what I expected.

Now I have an array with nil values and some "a" letters, and I don't know where they come from.

score 1 · Accepted Answer · answered Jan 20 '23 at 21:59

This might help with both XPath and CSS.

For your CSS check this out: https://kittygiraudel.github.io/selectors-explained/

The following will get you what you are looking for:

document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map {|node| node.to_s.strip}

If you want to understand where the values in your array come from, take one step back and just print out lk.css('.name p a').to_s. The real issue, though, is that your selectors are off.
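To see where the stray "a" and the nil entries come from, it helps to look at the regex on its own. A minimal sketch, assuming the two strings below stand in for what .to_s returns for a div that contains a link and for one that does not (the markup is hypothetical, not copied from the page):

```ruby
# Hypothetical stand-ins for what `lk.css('.name p a').to_s` returns.
with_link    = '<a href="/topic/1">Some topic</a>' # div that contains the link
without_link = ''                                  # div with no matching link

# String#[] with a regex returns the first match, or nil when nothing matches.
p with_link[/\w+/]    # => "a"  (the first run of word characters is the tag name)
p without_link[/\w+/] # => nil  (nothing to match in an empty string)
```

`\w+` stops at the first non-word character, so on serialized HTML it captures the tag name rather than the link text. On an empty string there is nothing to match, hence the nil.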

All that being said, looking at the structure of the page, you would be better off with something like this:

require 'nokogiri'
require 'open-uri'

url = "https://www.bananatic.com/de/forum/games/"

doc = Nokogiri::HTML(URI.open(url))
# Set a root node set to start from
topics = doc.xpath('//div[@class="topics"]/ul/li')

# Loop over the set; filter_map drops the nil returned by `next`
details = topics.filter_map do |topic|
  next unless topic.at_xpath('.//div[@class="name"]') # skip items without the needed info

  # Map the details into a Hash
  {
    name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
    post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
    replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip,
    views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip
  }
end

The result of details would be:

[{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
 {:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
 {:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
 {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
 {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
 {:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
 {:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
 {:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
 {:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
 {:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
 {:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
 {:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
 {:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
 {:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
 {:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
 {:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
 {:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
 {:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
 {:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
 {:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
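
If the three separate arrays from the question are still wanted, they can be derived from details afterwards. A small sketch, using two rows copied from the result above plus one hypothetical row with a blank name to show the filtering:

```ruby
details = [
  { name: "tokyobreez", post_year: "2021", replies: "2", views: "18" },
  { name: "bebb", post_year: "2019", replies: "7", views: "22" },
  { name: " ", post_year: nil, replies: "0", views: "0" } # hypothetical blank entry
]

# Keep only the rows whose name contains something other than whitespace
named = details.reject { |d| d[:name].strip.empty? }

res = named.map { |d| d[:name] }         # names
per = named.map { |d| d[:replies].to_i } # replies as integers
pir = named.map { |d| d[:views].to_i }   # views as integers

p res # => ["tokyobreez", "bebb"]
p per # => [2, 7]
p pir # => [18, 22]
```

Filtering once on the hashes keeps the three arrays aligned, which separate per-column queries do not guarantee.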

@CindyAdrianaBohrquezSantana okay then target that via `'.//div[@class="name"]/p/a/text()'`. See we target the div by class and then just walk down to the text `p` -> `a` -> `text()` — engineersmnky, Jan 21 '23 at 02:49
Your css could be changed to `'.topics > ul > li div.name > p > a'` to obtain the same concept — engineersmnky, Jan 21 '23 at 03:09
