
We know that most websites have sitemaps listing the site's major categories. I have a list of sitemap URLs (more than 100K), and I want to extract a specific category's URL from each of them. For example, Microsoft's sitemap has a section called "news", so for that one site I can simply use XPath to get the URL. But what if I have a huge number of sites and want to extract all 'news' URLs from them, wherever they exist? My first thought was to train a model to recognize news. However, I am very new to machine learning; if that is the way to solve this, can someone explain how to approach it and what steps would be needed? Or, if that is not the best way, are there any other suggestions? Thank you.

jason

1 Answer


If you are actually working with news sites, there is a library for this called newspaper3k: https://github.com/codelucas/newspaper/
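A minimal sketch of how that might look; cnn.com stands in here for any site from your list:

    import newspaper

    # build() discovers the site's category and article URLs.
    paper = newspaper.build('https://cnn.com', memoize_articles=False)

    # Keep only category URLs that mention "news".
    for url in paper.category_urls():
        if 'news' in url.lower():
            print(url)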

You can extract all news links with something like this:

response.css('a:contains("News")::attr(href)').extract()

You can use XPath to make the above call a little more robust and ignore case if necessary.
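For example, since XPath 1.0 (which Scrapy uses) has no lower-case() function, translate() is the usual workaround for a case-insensitive match:

    response.xpath(
        '//a[contains(translate(., "NEWS", "news"), "news")]/@href'
    ).extract()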

I imagine there are many other links, and you want to extract them from all of your sitemaps. You can use CrawlSpider and LinkExtractor rules to crawl these sitemaps, as in the sketch below.
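A minimal sketch of such a spider, assuming the sitemaps are ordinary HTML pages; the start_urls placeholder would be replaced by your own list of 100K+ sitemap URLs:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsLinkSpider(CrawlSpider):
        name = 'news_links'
        # Placeholder: load your 100K+ sitemap URLs here instead.
        start_urls = ['https://www.example.com/sitemap']

        rules = (
            # Extract links whose URL contains "news" from each sitemap
            # page and hand them to parse_item; don't crawl any deeper.
            Rule(LinkExtractor(allow=(r'news',)),
                 callback='parse_item', follow=False),
        )

        def parse_item(self, response):
            yield {'news_url': response.url}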

See this answer: Scrapy - Understanding CrawlSpider and LinkExtractor

ThePyGuy