2

I'm iterating through a list of URLs. The URLs come in different formats, like:

https://twitter.com/sdfaskj... 
https://www.linkedin.com/asdkfjasd...
http://google.com/asdfjasdj...

etc.

I would like to use gsub or something similar to strip everything but the name of the website, so that I get only "twitter", "linkedin", and "google", respectively.

Ideally, I would like something like a .gsub that can check for multiple possible prefixes (url.gsub("https:// or https://www. or http:// etc.", "")) and replace whichever one is found with an empty string "". It also needs to delete everything after the name, such as ".com/wkadslflj...".

attributes.css("a").each do |attribute|
  attribute_url = attribute["href"]
  attribute_scrape = attribute_url.gsub("https://", "")
  binding.pry
end
alexnewby

2 Answers

6

I would consider combining URI.parse, to get the hostname from the URL, with the PublicSuffix gem, to get the second-level domain:

require 'public_suffix'
require 'uri'

url  = 'https://www.linkedin.com/asdkfjasd'
host = URI.parse(url).host                 # => 'www.linkedin.com'
PublicSuffix.parse(host).sld               # => 'linkedin'
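
In a scraping loop, some href values may be missing, relative, or malformed. In those cases URI.parse either returns a nil host or raises an error. A minimal sketch of a guard, assuming a hypothetical helper named host_for (not part of URI or PublicSuffix):

```ruby
require 'uri'

# Hypothetical helper: returns the host of an href, or nil when the
# href is missing, relative (no host), or cannot be parsed as a URI.
def host_for(href)
  return nil if href.nil? || href.strip.empty?

  URI.parse(href.strip).host
rescue URI::InvalidURIError
  nil
end

host_for('https://www.linkedin.com/asdkfjasd') # => "www.linkedin.com"
host_for('/about')                              # => nil
```

Non-nil hosts can then be passed to PublicSuffix.parse as in the snippet above.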
Sergio Tulentsev
spickermann
  • Personally I'd ditch the gem and do `host.split(?.)[-2]`. The domain name spec is pretty stable. – Max Dec 20 '17 at 21:11
  • 3
    @Max `split(?.)[-2]` doesn't reliably return the most important part of the domain, for example, for valid domains like `www.google.com.au` or `www.amazon.co.uk` it would return `com` or `co`. Whereas the `PublicSuffix` gem would return `google` and `amazon`. – spickermann Dec 21 '17 at 06:47
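
The difference described in the comment above can be reproduced with plain Ruby. This is only an illustration of the naive split approach on the hostnames mentioned in the comments; the variable names are made up for the example:

```ruby
# Hostnames taken from the answer and the comment above.
hosts = ['www.linkedin.com', 'www.google.com.au', 'www.amazon.co.uk']

# Naive approach: take the second-to-last dot-separated label.
naive = hosts.map { |host| host.split('.')[-2] }
# => ["linkedin", "com", "co"]
```

For hosts with a multi-part public suffix such as com.au or co.uk, the second-to-last label belongs to the suffix rather than to the site name.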
1

You can use this gsub regexp:

gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '')

Output:

list = ["https://twitter.com/sdfaskj...", "https://www.linkedin.com/asdkfjasd...", "http://google.com/asdfjasdj..."]

list.map { |u| u.gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '') }
 => ["twitter", "linkedin", "google"] 
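
A regexp that lists suffixes only handles the suffixes it names. As a different technique, one could capture the first label after the scheme and an optional "www." instead of deleting the rest. The following is a sketch under that assumption; site_name is a hypothetical helper, and for URLs with other subdomains (for example blog.example.com) it would return the subdomain rather than the site name:

```ruby
# Hypothetical helper: captures the first host label that follows
# the scheme and an optional "www." prefix. Returns nil if no match.
def site_name(url)
  url[%r{\Ahttps?://(?:www\.)?([^./]+)}, 1]
end

list = ["https://twitter.com/sdfaskj...",
        "https://www.linkedin.com/asdkfjasd...",
        "http://google.com/asdfjasdj..."]

names = list.map { |u| site_name(u) }
# => ["twitter", "linkedin", "google"]
```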
Alex Kojin