
Is there a library that would support synchronous JavaScript functions like the following?

function getPageHTML(url){
     // scrape HTML from external web page
     return html;
}

function getPageJS(url){
     // scrape final JavaScript variable results from external web page
     return js;
}

I like the concept behind pjscrape, but I don't want to use a command-line script. I don't mind using PHP, but I want my function to be synchronous.

zdebruine
    yes. it's called ajax. – dandavis Jul 23 '16 at 17:00
    Is this supposed to run on a browser? Or on a node.js server? FYI, synchronous networking in Javascript is either not supported at all (depends upon environment) or a really bad idea. – jfriend00 Jul 23 '16 at 17:03
  • @jfriend00 Well I'm looking for an equivalent to jQuery $.get for PHP scraping. I just want the result returned in my function. – zdebruine Jul 23 '16 at 17:10
    So, this is a PHP question? If so, why does your question show Javascript? It is not clear at all what you are asking. – jfriend00 Jul 23 '16 at 17:14
  • @jfriend00 I'm asking for a javascript function which returns the javascript variables and HTML content at the url which is passed into the function – zdebruine Jul 23 '16 at 17:19
  • And, I repeat my original question. Does this Javascript function need to run in the browser or in node.js? Those are different environments with different networking tools. – jfriend00 Jul 23 '16 at 17:25
  • @jfriend00 browser – zdebruine Jul 23 '16 at 17:26

1 Answer


There is no Javascript environment in which synchronous networking is the recommended way to retrieve data from an external server. That is simply not how Javascript is designed. Javascript is built around asynchronous I/O, where the result is delivered via a promise or a callback and cannot be returned directly from your function call.

The "A" in "Ajax" stands for asynchronous. That is a cornerstone of making networked requests from Javascript in the browser. The browser can technically make a synchronous Ajax call, but that is not recommended for a variety of reasons (for one, it hangs the browser UI for the duration of the call), and it is being deprecated in many circumstances because synchronous Ajax is almost never a good idea. In addition, Ajax calls from the browser are limited to either the same origin your web page came from or to servers that explicitly allow cross-origin requests. So, you can't expect to make an Ajax call to fetch an arbitrary page on the internet; most other pages will refuse a browser-based Ajax request.

What the browser is good at is asynchronous networking, where the result is returned asynchronously via a callback or promise sometime in the future while the rest of your Javascript continues to run. This is how you should structure your network requests.

If you want scraped results from some external site in a browser, the preferred architecture is to set up a server that does the work for you. The Javascript in your web page makes an Ajax call to your own server, asking it to scrape a specific web site. The server (which has no cross-origin limitations on which hosts it can request) then fetches the content, scrapes it into the desired results, and returns the scraped data to your Ajax call.


So, you could design a promise based interface in your client that could work asynchronously like this:

getPageJS(someUrl).then(function(data) {
    // process data here
}).catch(function(err) {
    // process error here
});
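One possible way to implement that client-side function, assuming your own server exposes a /scrape endpoint (a hypothetical name) that returns the scraped data as JSON, is with the browser's fetch API, which already returns a promise (jQuery's $.get would work similarly):

```javascript
// Sketch of a promise-based client, assuming a same-origin /scrape
// endpoint (hypothetical) that responds with JSON.
function getPageJS(url) {
    return fetch('/scrape?url=' + encodeURIComponent(url)).then(function (res) {
        if (!res.ok) {
            throw new Error('Scrape request failed: ' + res.status);
        }
        return res.json();
    });
}
```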
jfriend00