Questions tagged [web-crawler]

A Web crawler (also known as Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or – especially in the FOAF community – Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
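The link-checking use mentioned above starts with the same step every crawler performs: extracting the hyperlinks from a downloaded page. A minimal sketch using only Python's standard-library `html.parser` (the sample HTML and URLs are made up for illustration; a real checker would then issue an HTTP request per link and flag non-2xx responses):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags -- the parsing step a
    link-checking crawler runs on every downloaded page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p><a href="/about">About</a> <a href="https://example.com">Ext</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/about', 'https://example.com']
```

Relative links like `/about` would still need to be resolved against the page's base URL (e.g. with `urllib.parse.urljoin`) before they can be checked.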

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
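The seed/frontier loop described above can be sketched in a few lines. This is a toy breadth-first version: `fetch_links` stands in for downloading a page and extracting its hyperlinks, and the in-memory link graph is invented for the example; a real crawler would fetch over HTTP and apply politeness and selection policies.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: start from seed URLs, follow discovered links."""
    frontier = deque(seeds)  # the crawl frontier: URLs still to visit
    visited = set()
    order = []
    while frontier and len(visited) < max_pages:  # size limit as a simple policy
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):  # discovered hyperlinks extend the frontier
            if link not in visited:
                frontier.append(link)
    return order

# Toy link graph standing in for the Web
pages = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
print(crawl(["a"], lambda u: pages.get(u, [])))  # ['a', 'b', 'c']
```

Swapping the `deque` for a priority queue ordered by some importance score turns this into the prioritized download scheme the next paragraph motivates.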

The large volume of the Web implies that the crawler can only download a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler visits a page, it might already have been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
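The 48-URL arithmetic above (4 × 3 × 2 × 2) can be reproduced directly, along with one common mitigation: canonicalizing URLs by dropping parameters that only affect presentation. The host, path, and parameter names here are hypothetical, matching the gallery example; which parameters are "presentation-only" is site-specific knowledge a crawler would have to be configured with.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical gallery options: 4 sort orders x 3 thumbnail sizes
# x 2 file formats x 2 comment settings = 48 URLs for the same content.
params = {
    "sort": ["name", "date", "size", "rating"],
    "thumb": ["small", "medium", "large"],
    "format": ["jpeg", "png"],
    "comments": ["on", "off"],
}

urls = [
    "http://example.com/gallery?" + urlencode(dict(zip(params, combo)))
    for combo in product(*params.values())
]
print(len(urls))  # 48 distinct URLs, all serving the same gallery

PRESENTATION_PARAMS = {"thumb", "format", "comments"}  # assumed site-specific

def canonicalize(url):
    """Drop presentation-only parameters so duplicate pages collapse."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(sorted(kept))))

canonical = {canonicalize(u) for u in urls}
print(len(canonical))  # 4 -- one per sort order
```

After canonicalization the crawler only needs to fetch one URL per genuinely distinct page, rather than all 48 variants.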

9683 questions
2
votes
2 answers

How to get scrapyrt's POST meta data?

In scrapyrt's POST documentation we can pass a JSON request like this, but how do you access the meta data like category and item in start_requests? { "request": { "meta": { "category": "some category", "item":…
Nobody
  • 21
  • 1
2
votes
0 answers

Change Selenium send_keys() encoding?

Lately I've been building a web scraper, and everything was working nicely, but today I needed to scrape a page that uses the character "ñ" and I haven't been able to do it. Say I want to scrape foobar.com. I tried the following: url =…
Juan C
  • 5,846
  • 2
  • 17
  • 51
2
votes
0 answers

Crawling Google Play Store Apps

I want to crawl the Google Play store and get all the app ids of a particular category. When I executed the code below, I only got the app ids of the first 49 apps, not more than that. But I want to get all the app ids. How can I achieve this? And the…
Darshil
  • 69
  • 7
2
votes
3 answers

Why can't I download a midi file with python requests?

I'm trying to download a series of classical music midi files with python and the requests library. Unfortunately, I can't seem to actually download the midi files themselves. The only thing I'm downloading is HTML files. I have searched SO and…
Hanzy
  • 394
  • 4
  • 17
2
votes
1 answer

Scraping multiple single pages from different domains(mostly) with different structure

I have a list of very specific urls that I need to scrape data from (different selectors/fields). There are a total of around 1000 links from around 300 different websites that have different structures (selector/xpath). I am trying to see if anyone has…
SorishK
  • 21
  • 2
2
votes
1 answer

Python web-crawling error: 'no such element: Unable to locate element'

I am studying web crawling with Selenium and I ran into an error. My code is as follows. from selenium import webdriver as wd main_url = 'https://searchad.naver.com/' driver = wd.Chrome(executable_path='chromedriver.exe') and I logged in #…
2
votes
1 answer

Unable to locate element: css selector or xpath in python crawling

I want to write a Python crawler. I have already logged in to this website and want to send_keys(keyword) and click the button. I tried to find a CSS selector or XPath, but there is an error, as follows. from selenium import webdriver as wd import…
2
votes
1 answer

Python threading module - GUI still freezing

I built a twitter crawler with GUI that fetches the latest 20 tweets from any given Twitter Account and saves them into a csv file. The crawler should repeat crawling every x minutes (timeIntervall) for y times (times). The GUI freezes when the…
heslegend
  • 86
  • 8
2
votes
2 answers

Scrapy CrawlSpider doesn't quit

I have a problem with scrapy Crawlspider: basically, it doesn't quit, as it is supposed to do, if a CloseSpider exception is raised. Below is the code: from scrapy.spiders import CrawlSpider, Rule from scrapy.exceptions import CloseSpider from…
Luigi Tiburzi
  • 4,265
  • 7
  • 32
  • 43
2
votes
0 answers

Python Crawler - What do you think is the best setup?

I have been building a crawler in Python for the last 10 months. This crawler is using threading and Queue to hold all the visited and non-visited links. I use BeautifulSoup and request to access the urls and pick up page title, meta description,…
Dannie
  • 31
  • 4
2
votes
1 answer

How to enable the Selenium plugin in StormCrawler

How can we configure and enable the Selenium plugin in StormCrawler, for example in its archetype project? There is code for using Selenium in StormCrawler, but I don't know how to use it.
2
votes
2 answers

Classifying websites

I need to scrape a thousand websites that share the same structure: they all have a menu, a title, some text and a rating, much like a blog. Unfortunately, they are also coded very differently and some are manually, so I cannot reutilize CSS…
konr
  • 2,545
  • 2
  • 20
  • 38
2
votes
3 answers

how to use Beautifulsoup4 to check if parent tag has a direct child whose name is not "div"

I want to check if parent tag has a direct child whose name is not "div", so I'd like to check all the direct children of a tag. I tried like this: from bs4 import BeautifulSoup import urllib.request url =…
Forsworn
  • 112
  • 1
  • 10
2
votes
1 answer

beautifulsoup web crawling search id list

I am attempting to crawl the NCBI eutils webpage. I want to crawl the Id list from the page as shown below: Here's the code for it: import requests from bs4 import BeautifulSoup def get_html(url): """get the content of the…
Thomas.Q
  • 377
  • 1
  • 4
  • 12
2
votes
2 answers

Web crawler can not retrieve results from google search

I am in the process of creating a simple web crawler and I would like it to scrape the result webpage of a Google search query such as "Donald Trump". I have written the following code: # import requests from urllib.request import urlopen as…
Petris
  • 135
  • 2
  • 10