DO NOT USE THIS TAG. It is undergoing active cleanup: https://meta.stackoverflow.com/q/305314 Use [web-scraping] if your question is about scraping information from web resources (there is also [screen-scraping]), or use [pdf-scraping] if your question is about scraping information from PDF files. Use [data-extraction] if you need to extract data from other resources.
Questions tagged [scrape]
1204 questions
10
votes
2 answers
Web page scraping gems/tools available in Ruby
I'm trying to scrape web pages in a Ruby script that I'm working on. The purpose of the project is to show which ETFs and stock mutual funds are most compatible with the value investing philosophy.
Some examples of pages I'd like to scrape…

jhsu802701
- 573
- 1
- 7
- 23
10
votes
2 answers
PHP Curl following redirects
I'm trying to be a bit sneaky and, as part of a learning process, improve my page scraping skills.
One thing I've come across that I have yet to be able to solve is that certain sites will use an internal link which then redirects to an…

David
- 34,836
- 11
- 47
- 77
9
votes
0 answers
Pinyin in Google Translate API
I want to scrape the pinyin from the Google Translate API instead of having to scrape it from some other website (which might change its format in ten thousand ways over time and across different requests). The JSON that it returns doesn't seem to…

gideonite
- 1,211
- 1
- 8
- 13
9
votes
2 answers
Python - save requests or BeautifulSoup object locally
I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally so that next time I can save time. Here is the…

bill999
- 2,147
- 8
- 51
- 103
8
votes
3 answers
Python data scraping
I want to download a couple songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.
The problem is that when I urllib2 open the site with my video ID plugged in, http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ, I get the site but…

Oliver
- 2,182
- 5
- 24
- 31
8
votes
5 answers
How can I input data into a webpage to scrape the resulting output using Python?
I am familiar with BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result that I want to scrape is returned?
I'm trying to obtain the geographic distance between two…

user728166
- 247
- 1
- 3
- 10
8
votes
1 answer
Python web scraping for javascript generated content
I am trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc. but can't…

Nick
- 115
- 2
- 3
- 7
7
votes
2 answers
How to download images from BeautifulSoup?
Image https://i.stack.imgur.com/S1BR2.png
import requests
from bs4 import BeautifulSoup
r = requests.get("xxxxxxxxx")
soup = BeautifulSoup(r.content)
for link in links:
    if "http" in link.get('src'):
        print link.get('src')
I get the…

Fist Heart
- 73
- 1
- 1
- 4
7
votes
3 answers
Accessing Metacritic API and/or Scraping
Does anybody know where the documentation for the Metacritic API is, or whether the API still works? There used to be a Metacritic API at https://market.mashape.com/byroredux/metacritic-v2#get-user-details which disappeared today.
Otherwise I'm trying to scrape the…

boblikesoup
- 302
- 1
- 3
- 16
6
votes
5 answers
Python: the right URL to download pictures from Google Image Search
I'm trying to obtain images from Google Image Search for a specific query. But the page I download has no pictures, and it redirects me to Google's original one. Here's my code:
AGENT_ID = "Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1)…

slwr
- 1,105
- 6
- 16
- 35
6
votes
1 answer
Scrapy 403 response because of Cloudflare (clutch.co)
I'm trying to scrape some info regarding different agencies from clutch.co. When I look up the URLs in my browser everything is fine, but with Scrapy I get a 403 response. From all I read on the related issues, I suppose it's coming from…

Fateme Fouladkar
- 160
- 7
6
votes
3 answers
How to 'scrape' content from a page's source?
I have this code which gets the HTML source of a page:
$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);
I want to scrape some content from it. For example, say the page's source contains this: …

Joey Morani
- 25,431
- 32
- 84
- 131
6
votes
2 answers
How to scrape tables inside a comment tag in html with R?
I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I…

David Sung
- 519
- 1
- 6
- 14
6
votes
2 answers
How do I scrape information off ASP.NET websites when paging and JavaScript links are being used?
I have been given a staff list which is supposed to be up to date but it doesn't match an intranet People Finder which is written in ASP.NET.
As the information is sensitive I am not able to access the database the People Finder is using so the only…

Ian Roke
- 1,774
- 1
- 19
- 27
6
votes
2 answers
How to crawl a site given only the domain URL with Scrapy
I am trying to use Scrapy to crawl a website, but there's no sitemap or page index for the website. How can I crawl all pages of the website with Scrapy?
I just need to download all the pages of the site without extracting any item. Do I only…

David Thompson
- 149
- 2
- 7