Questions tagged [scraper]

Synonym of [web-scraping]


349 questions
0 votes · 2 answers

Unable to scrape website: URL returned a bad HTTP response code

I noticed that this has been asked before, but the earlier question never received an answer, so I'll try my best to ask too. For the last several months, my WordPress website, http://geekvision.tv/ , has been undetectable by Facebook's debugger. I…
Zach Hurst · 1
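A first diagnostic step (a sketch of the idea, not Facebook's actual pipeline) is to request the page while identifying as the crawler and check the status code yourself; many "bad HTTP response" reports come down to the server returning a non-2xx code to bots:

```python
import urllib.error
import urllib.request

def is_bad_status(code):
    """Treat anything outside the 2xx range as a bad response for a scraper."""
    return not (200 <= code < 300)

def fetch_status(url, user_agent="facebookexternalhit/1.1"):
    """Request `url` the way Facebook's crawler identifies itself
    and return the HTTP status code the server sends back."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code for 4xx/5xx responses
        return err.code
```

If `fetch_status` returns something like 403 or 503 only for the crawler's User-Agent, a security plugin or firewall rule is likely blocking bots.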
0 votes · 1 answer

PHP: IMDb poster scraper

I have an IMDb scraper from another site. It worked very well, but IMDb changed its HTML output and the regular expression doesn't find the poster anymore. I'm a noob at regex, so maybe someone can help me. This is the line: $arr['poster'] =…
Bubbleboy · 71
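Since the original line is elided, here is a hypothetical sketch of the technique: a deliberately loose pattern run against a made-up snippet (the markup and class name are assumptions, not IMDb's real current HTML). Note that regex against HTML is fragile by nature, which is exactly why the scraper broke; an HTML parser is the more durable fix.

```python
import re

# Hypothetical snippet standing in for IMDb's markup; the real page differs.
html = '<img class="poster" src="https://example.com/images/poster123.jpg" alt="Poster">'

# A tolerant pattern: allow any attributes between <img and the "poster"
# marker, and capture the src value that follows it.
match = re.search(r'<img[^>]*poster[^>]*src="([^"]+)"', html, re.IGNORECASE)
poster = match.group(1) if match else None
```

The pattern still assumes `src` comes after the "poster" marker; whenever the site reorders attributes, a regex like this breaks again.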
0 votes · 1 answer

How to get a clean result when scraping data from a website using Scrapy

I am new to Python and I am trying to scrape data from Yellow Pages. I was able to scrape it, but I get a messy result. This was the result I got: 2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp) 2013-03-24 20:26:47+0800…
user2176372
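"Messed up" Scrapy output is very often just the page's own whitespace (newlines, tabs, indentation) coming along with the extracted text. A minimal post-processing sketch, independent of Scrapy itself:

```python
def clean(value):
    """Collapse runs of whitespace (newlines, tabs, repeated spaces)
    into single spaces and trim the ends."""
    return " ".join(value.split())

def clean_all(values):
    """Clean a list of scraped strings and drop entries that were pure whitespace."""
    cleaned = (clean(v) for v in values)
    return [v for v in cleaned if v]
```

Running each extracted string through `clean` before yielding the item usually turns the raw selector output into readable values.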
0 votes · 2 answers

Prevent or delete duplicates in a text scraper?

I have code that parses through text files in a folder and saves a predefined number of words around certain search words. For example, it looks for words such as "date" and "year". If it finds both in the same sentence, it will save the sentence…
Seeb · 199
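The usual pattern for this (a sketch, assuming the saved snippets are plain strings) is to normalize each snippet and track what has already been seen, so a sentence matched by both "date" and "year" is kept only once:

```python
def dedupe(snippets):
    """Drop duplicate snippets while preserving first-seen order."""
    seen = set()
    unique = []
    for s in snippets:
        # Normalize whitespace and case so trivial variants count as duplicates
        key = " ".join(s.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```

Preventing the duplicates is the same idea applied earlier: check the `seen` set before saving each hit instead of filtering afterwards.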
0 votes · 0 answers

JavaScript to find and list the images of a website?

I'd like to do some hygiene on a bloated images folder/directory for a website of mine. I'm a grade just above novice at working with JavaScript, and it seems like it might be possible to achieve a solution using JavaScript… The solution I'm searching for…
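The asker wants JavaScript, but the underlying task (collect every `src` referenced by an `<img>` tag, then compare against the folder's contents) is language-agnostic. A stdlib Python sketch of the same idea:

```python
from html.parser import HTMLParser

class ImageLister(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

def list_images(html):
    parser = ImageLister()
    parser.feed(html)
    return parser.images
```

Any image file in the directory that never appears in the collected list across the site's pages is a candidate for deletion (CSS background images and script-built URLs would need separate handling).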
0 votes · 1 answer

XPath content not saved

It might just be an idiotic bug in the code that I haven't yet discovered, but it's been taking me quite some time: when parsing websites using Nokogiri and XPath and trying to save the content of the XPaths to a .csv file, the CSV file has empty…
Seeb · 199
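A frequent cause of blank CSV cells in this situation is writing the matched node objects rather than their text content. A Python analogue of the Nokogiri/Ruby setup (using the stdlib `xml.etree` and `csv` modules as stand-ins) shows the distinction:

```python
import csv
import io
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><item><name>Alpha</name></item><item><name>Beta</name></item></root>"
)

# The pitfall: writing the element objects themselves yields blank or
# garbage cells. Write each node's text content instead.
rows = [[el.text] for el in doc.findall(".//name")]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
```

In Nokogiri terms, the equivalent fix is calling `.text` (or `.content`) on the nodes returned by the XPath before handing them to the CSV writer.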
0 votes · 1 answer

Error loading GoutteClient when using Behat/Mink

I'm trying to use Behat/Mink in order to load a website. I've used Composer for the installation, this is my composer.json: { "require": { "behat/mink": "*", "behat/mink-goutte-driver": "*", "behat/mink-selenium-driver":…
rfc1484 · 9,441
0 votes · 1 answer

ScraperWiki: How to save HTML so it only gets loaded once

When I execute a scraper, it loads the URL using this method: $html = scraperWiki::scrape("foo.html"); So every time I add new code to the scraper and want to try it, it loads the HTML again, which takes a fair amount of time. Is there any way…
rfc1484 · 9,441
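The standard answer is a disk cache around the fetch call: hit the network the first time, then reuse the saved copy on every later run while iterating on the parsing code. A language-neutral sketch in Python (`fetch` here stands in for whatever actually does the HTTP request, e.g. `scraperWiki::scrape` in the original PHP):

```python
import os

def scrape_cached(url, fetch, cache_dir="cache"):
    """Return the HTML for `url`, calling `fetch(url)` only the first time.

    The result is stored on disk and reused on later runs, so editing
    the parsing code doesn't trigger a re-download.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Sanitize the URL into a filesystem-safe cache filename
    path = os.path.join(cache_dir, "".join(c if c.isalnum() else "_" for c in url))
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Deleting the cache directory (or a single cached file) forces a fresh download when the live page is needed again.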
0 votes · 1 answer

How to download image and save image name based on URL?

How do I download all images from a web page and prefix the image names with the web page's URL (all symbols replaced with underscores)? For example, if I were to download all images from http://www.amazon.com/gp/product/B0029KH944/, then the main…
thdoan · 18,421
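The naming half of the question ("all symbols replaced with underscores") is a small, self-contained transform. A sketch (the download step itself would just be an HTTP GET per image URL):

```python
import re

def url_to_prefix(url):
    """Turn a page URL into a filesystem-safe prefix: every run of
    non-alphanumeric characters becomes one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", url).strip("_")

def image_filename(page_url, image_name):
    """Prefix an image's filename with the sanitized URL of the page it came from."""
    return url_to_prefix(page_url) + "_" + image_name
```

For the Amazon example in the question, `url_to_prefix("http://www.amazon.com/gp/product/B0029KH944/")` yields `http_www_amazon_com_gp_product_B0029KH944`, which is then prepended to each image's own name.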
0 votes · 1 answer

HTML tag counting: rate-of-change formula

I've been trying to find a statistics-esque formula for calculating the rate of change for HTML tags which are either added to or removed from various websites. So, for example, with the scraper I'm writing, I obtain the initial tag count and then…
zeboidlund · 9,731
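The simplest formula that fits the description is relative change between two crawls, (new − old) / old. A minimal sketch:

```python
def tag_rate_of_change(old_count, new_count):
    """Relative change between two tag counts: (new - old) / old.

    Positive means tags were added, negative means removed;
    0.10 reads as "10% more tags than the previous crawl".
    """
    if old_count == 0:
        raise ValueError("rate of change is undefined for an initial count of zero")
    return (new_count - old_count) / old_count
```

Dividing the result by the elapsed time between crawls would turn it into a per-hour or per-day rate if the scraper runs on an irregular schedule.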
0 votes · 2 answers

A good methodology for obtaining the number of HTML tags for a page

I'm looking for a good way to do this: my current method seems not to allow search depths beyond 30-40, even after editing the php.ini settings in hopes of increasing the default execution time as well as the max memory usage. Basically, as soon as the…
zeboidlund · 9,731
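A depth limit in the 30-40 range suggests the current method walks the DOM recursively. An event-driven (SAX-style) parser sidesteps that entirely, since it counts tags in a single streaming pass with no recursion. A sketch using Python's stdlib parser (the original is PHP, where the same idea is available through its stream-based XML/HTML parsers):

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count every start tag in one streaming pass; nesting depth
    never becomes a limit because nothing recurses."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def count_tags(html):
    counter = TagCounter()
    counter.feed(html)
    return counter.counts
```

`sum(count_tags(html).values())` gives the total tag count, and the per-tag breakdown comes for free from the `Counter`.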
0 votes · 2 answers

PHP scrape remote images that do not have extensions

I've developed an image scraper that will scrape specific images from remote sites and display them upon pasting into a text field. The logic includes finding images that end in .jpg, .jpeg, .png, etc. I'm running into an issue where a lot of sites…
Chris Favaloro · 79
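When the URL carries no extension, the reliable signals are the response's Content-Type header or, failing that, the file's leading magic bytes. A sketch of the magic-byte approach (the original scraper is PHP; the byte signatures themselves are language-independent):

```python
def sniff_image_type(data):
    """Identify an image by its leading magic bytes instead of a file extension."""
    signatures = {
        b"\xff\xd8\xff": "jpeg",          # JPEG/JFIF
        b"\x89PNG\r\n\x1a\n": "png",      # PNG
        b"GIF87a": "gif",                 # GIF, both common versions
        b"GIF89a": "gif",
    }
    for magic, kind in signatures.items():
        if data.startswith(magic):
            return kind
    return None
```

Fetching only the first few bytes of each candidate URL and running them through a check like this lets the scraper keep extensionless images while still rejecting non-image responses.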
0 votes · 1 answer

Facebook Open Graph Scraping URL

I'm trying to develop 'want' and 'own' buttons. If I use the Facebook debug tool it tells me the final URL is the home page and this has happened because the page has been redirected, which I don't want. I want the fetched URL to be scraped. As a…
Matt · 25
0 votes · 2 answers

Scripted Browser Scraper

What can I use to achieve the following: script a browser or otherwise make requests to the server, log in, browse the site, e.g. find links and navigate to those links. For now, since I am into NodeJS, I was looking at node.io. It allows you to…
Jiew Meng · 84,767
0 votes · 2 answers

ScraperWiki scrape query: using lxml to extract links

I suspect this is a trivial question, but I hope someone can help me with an lxml issue in a scraper I'm trying to build. https://scraperwiki.com/scrapers/thisisscraper/ I'm working line-by-line through tutorial 3 and have got so far…
elksie5000 · 7,084
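The asker is using lxml, but the core of link extraction is the same in any parser: collect each `<a>` tag's `href` and resolve relative paths against the page's URL. A stdlib sketch of that idea (with lxml the equivalent is roughly `doc.xpath('//a/@href')` plus the same `urljoin` step):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags, resolved against the page's URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin turns relative paths like "/about" into absolute URLs
                self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The `urljoin` step matters for crawlers in particular: without it, relative links cannot be fed back into the queue of pages to fetch.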