
I want to crawl the NCBI website and send a request for a protein local alignment, available at this link: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch

I would like to know whether I can submit a POST request to this address using PHP and get the results, which come on a new page. There is also an issue: before the final results are shown, the page goes through multiple redirects. You can test this using the following input, which goes into the text area:

MHSSIVLATVLFVAIASASKTRELCMKSLEHAKVGTSKEAKQDGIDLYKHMFEHYPAMKKYFKHRENYTP
ADVQKDPFFIKQGQNILLACHVLCATYDDRETFDAYVGELMARHERDHVKVPNDVWNHFWEHFIEFLGSK
TTLDEPTKHAWQEIGKEFSHEISHHGRHSVRDHCMNSLEYIAIGDKEHQKQNGIDLYKHMFEHYPHMRKA
FKGRENFTKEDVQKDAFFVNKDTRFCWPFVCCDSSYDDEPTFDYFVDALMDRHIKDDIHLPQEQWHEFWK
LFAEYLNEKSHQHLTEAEKHAWSTIGEDFAHEADKHAKAEKDHHEGEHKEEHH

Here is my attempt:

// $aaText holds the protein sequence given above
$link = 'http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch';

$request = array(
    'http' => array(
        'method'  => 'POST',
        'content' => http_build_query(array(
            'QUERY' => $aaText,
        )),
    )
);

$context = stream_context_create($request);

// file_get_html() comes from the Simple HTML DOM library
$html = file_get_html($link, false, $context);
echo $html;

This code gets me the initial page, as if no POST has been done. Thanks


UPDATE

I have tried one of the suggestions below - Goutte.

Here is my new code:

require_once 'goutte.phar';

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', $link);

$form = $crawler->selectButton('b1')->form();

$crawler = $client->submit($form, array('QUERY' => $aaTest));

echo $crawler->html();

Variable $aaTest is the protein sequence I gave above. The good part is that it posts and gets me the new page, but it does not follow all the redirects. How can I make it follow all the redirects?

Madrugada
  • Please explain the downvote and the three close flags. – Madrugada Apr 25 '14 at 23:46
  • Is your crawling done with "Blast" or your "legacy program"? If that's the problem, can we see the code (in either of the above categories) you are having problems with? – halfer Apr 25 '14 at 23:47
  • It is done with PHP. I am attaching my attempt. – Madrugada Apr 25 '14 at 23:48
  • I guess it depends on the kind of redirects (server-side, html / javascript), but you could try cURL. – jeroen Apr 25 '14 at 23:52
  • I suspect the question will close, possibly because Blast is not known here, and thus the question (in its first version) sounded like a request for third-party support. Moreover, there was no code in the question, so it was not possible to answer. You now have code, so if the question closes, it can be re-opened. – halfer Apr 25 '14 at 23:52
  • Voting now to re-open. In the meantime, you may want to use an HTTP library that supports redirects. Have a look at Goutte, which uses Guzzle internally - it's very flexible and easy to use. – halfer Apr 25 '14 at 23:56
  • I have reformulated the question. I am sorry for the lack of clarity. – Madrugada Apr 25 '14 at 23:56

1 Answer


I should think this site is very crawlable. To understand what is going on, turn off JavaScript in your browser and try to browse the site (to do this, I use the Disable->Disable JavaScript menu in Firebug, which is a Firefox plugin).

If you go to your first link and paste in your string, the POST returns a page that effectively says your search is in progress. It will look something like this:

Job Title: Protein Sequence (333 letters)

Request ID: NR8ZP8E1071

Since there is not much of interest on this screen, I am assuming that you do not want to scrape from here - but that is effectively what you are currently doing.

What happens next is that a piece of JavaScript submits a hidden form, using this code:

<SCRIPT LANGUAGE="JavaScript">
setTimeout('document.forms[0].submit();',1000);
</SCRIPT>

My guess is that at times of heavy load, the delay here (presently set to 1000ms i.e. 1 second) would increase a bit. The hidden form looks like this:

<form action="Blast.cgi" enctype="application/x-www-form-urlencoded" method="post" name="RequestFormat" id="RequestFormat&quot;">               
<input name="CMD" value="Get" type="hidden">
<input name="ALIGNMENTS" value="100" type="hidden">
<input name="ALIGNMENT_VIEW" value="Pairwise" type="hidden">
<input name="BLAST_PROGRAMS" value="blastp" type="hidden">
<input name="CDD_RID" value="data_cache_seq:180192" type="hidden">
<input name="CDD_SEARCH" value="on" type="hidden">
<input name="CDD_SEARCH_STATE" value="4" type="hidden">
<input name="CLIENT" value="web" type="hidden">
<input name="COMPOSITION_BASED_STATISTICS" value="2" type="hidden">
<input name="CONFIG_DESCR" value="2,3,4,5,6,7,8" type="hidden">
<input name="DATABASE" value="nr" type="hidden">
<input name="DESCRIPTIONS" value="100" type="hidden">
<input name="EQ_OP" value="AND" type="hidden">
<input name="EXPECT" value="10" type="hidden">
<input name="FILTER" value="F" type="hidden">
<input name="FORMAT_NUM_ORG" value="1" type="hidden">
<input name="FORMAT_OBJECT" value="Alignment" type="hidden">
<input name="FORMAT_TYPE" value="HTML" type="hidden">
<input name="FULL_DBNAME" value="nr" type="hidden">
<input name="GAPCOSTS" value="11 1" type="hidden">
<input name="GET_SEQUENCE" value="on" type="hidden">
<input name="HSP_RANGE_MAX" value="0" type="hidden">
<input name="JOB_TITLE" value="Protein Sequence (333 letters)" type="hidden">
<input name="LAYOUT" value="OneWindow" type="hidden">
<input name="LINE_LENGTH" value="60" type="hidden">
<input name="MASK_CHAR" value="2" type="hidden">
<input name="MASK_COLOR" value="1" type="hidden">
<input name="MATRIX_NAME" value="BLOSUM62" type="hidden">
<input name="MAX_NUM_SEQ" value="100" type="hidden">
<input name="MYNCBI_USER" value="9311188414" type="hidden">
<input name="NEW_VIEW" value="on" type="hidden">
<input name="NUM_DIFFS" value="0" type="hidden">
<input name="NUM_OPTS_DIFFS" value="0" type="hidden">
<input name="NUM_ORG" value="1" type="hidden">
<input name="NUM_OVERVIEW" value="100" type="hidden">
<input name="OLD_BLAST" value="false" type="hidden">
<input name="OLD_VIEW" value="false" type="hidden">
<input name="PAGE" value="Proteins" type="hidden">
<input name="PAGE_TYPE" value="BlastSearch" type="hidden">
<input name="PROGRAM" value="blastp" type="hidden">
<input name="QUERY_INDEX" value="0" type="hidden">
<input name="QUERY_INFO" value="Protein Sequence (333 letters)" type="hidden">
<input name="QUERY_LENGTH" value="333" type="hidden">
<input name="REPEATS" value="5755" type="hidden">
<input name="RID" value="NR8ZP8E1071" type="hidden">
<input name="RTOE" value="21" type="hidden">
<input name="SELECTED_PROG_TYPE" value="blastp" type="hidden">
<input name="SERVICE" value="plain" type="hidden">
<input name="SHORT_QUERY_ADJUST" value="on" type="hidden">
<input name="SHOW_LINKOUT" value="on" type="hidden">
<input name="SHOW_OVERVIEW" value="on" type="hidden">
<input name="USER_DEFAULT_MATRIX" value="4" type="hidden">
<input name="USER_DEFAULT_PROG_TYPE" value="blastp" type="hidden">
<input name="USER_TYPE" value="2" type="hidden">
<input name="WORD_SIZE" value="3" type="hidden">
<input name="db" value="protein" type="hidden">
<input name="stype" value="protein" type="hidden">
<input name="x" value="41" type="hidden">
<input name="y" value="12" type="hidden">
</form>

Submitting this form creates another POST request to the program, and the field of most interest is RID, which links the request back to your initial query parameters. The query is probably stored in a database or temporary file on their side and assigned this ID, which expires in a matter of hours.
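
For example, here is a minimal sketch (untested) of how the wait-and-resubmit step might look with Goutte, which the question is already using. The form name RequestFormat and the RID field are taken from the hidden form above, the five-second pause stands in for the JavaScript setTimeout(), and $aaTest is the sequence from the question; all of these may need adjusting against the live site.

// Rough sketch only: form and field names are taken from the markup above
require_once 'goutte.phar';

use Goutte\Client;

$client = new Client();

// 1. Load the search page and submit the query, as in the question's update
$crawler = $client->request('GET', 'http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch');
$form    = $crawler->selectButton('b1')->form();
$crawler = $client->submit($form, array('QUERY' => $aaTest));

// 2. Emulate the JavaScript setTimeout(): wait a little, then resubmit the
//    hidden "RequestFormat" form, which carries the RID linking to the query
sleep(5);
$hiddenForm = $crawler->filter('form[name=RequestFormat]')->form();
$crawler    = $client->submit($hiddenForm);

// The RID itself is also available if you prefer to poll manually:
// $rid = $hiddenForm->get('RID')->getValue();

echo $crawler->html();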

When this form is submitted, the response contains lots of interesting information. It is possible that one of the fields above specifies the initial number of alignments to show. If you then turn JavaScript back on, you'll find that scrolling to the end of the page (which is already several screenfuls long) loads another chunk using this URL:

http://blast.ncbi.nlm.nih.gov/t2g.cgi?CMD=Get&RID=NR8ZP8E1071&OLD_BLAST=false&DESCRIPTIONS=0&NUM_OVERVIEW=0&GET_SEQUENCE=on&DYNAMIC_FORMAT=on&ALIGN_SEQ_LIST=gi|160797,gi|9816,gi|121273,gi|428230092,gi|417051&HSP_SORT=0&SEQ_LIST_START=1&QUERY_INDEX=0&SHOW_LINKOUT=on&ALIGNMENT_VIEW=Pairwise&MASK_CHAR=2&MASK_COLOR=1&LINE_LENGTH=60

Interestingly, a GET request is used here. Using the network monitor in Firefox, I triggered a series of these to see if I could spot a sequence of incrementing numbers. I spotted that SEQ_LIST_START starts at 1 and increments in blocks of 5, but I am not sure where the elements in ALIGN_SEQ_LIST come from - maybe from the current page. It's worth having a look yourself to see if you can spot anything - especially since you understand the subject matter in a way that I do not.
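
To illustrate, here is a speculative sketch of paging through the results with plain GET requests. The parameter names are copied from the URL above, but the increment of 5, the loop bound of 26, and the way ALIGN_SEQ_LIST is obtained are assumptions you would need to verify against the live site.

// Speculative paging loop: parameters come from the observed URL above
$base = 'http://blast.ncbi.nlm.nih.gov/t2g.cgi';
$rid  = 'NR8ZP8E1071';        // use the RID from your own submission
$gis  = 'gi|160797,gi|9816';  // presumably harvested from the current page

for ($start = 1; $start <= 26; $start += 5) {
    $query = http_build_query(array(
        'CMD'            => 'Get',
        'RID'            => $rid,
        'OLD_BLAST'      => 'false',
        'GET_SEQUENCE'   => 'on',
        'DYNAMIC_FORMAT' => 'on',
        'ALIGN_SEQ_LIST' => $gis,
        'ALIGNMENT_VIEW' => 'Pairwise',
        'SEQ_LIST_START' => $start,
        'LINE_LENGTH'    => 60,
    ));
    $chunk = file_get_contents($base . '?' . $query);
    // ... parse $chunk here ...
    sleep(3); // a polite pause between requests (see below)
}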

You may be able to tinker around with some of the query string parameters in this link to see what controls the number of items returned. However, be careful: if you request a much larger set than their systems are used to, you may be noticed and have a block placed on your IP address.

Further to that, remember that if you crawl a website, you are passing your costs onto a third party. Since the data appears to be available for free, a degree of crawling is presumably acceptable to them, and is part of what their existing funding pays for. However, be mindful of the load you are placing on their server: don't request chunks that are excessively large, and put a few seconds' delay between each request.

If you plan to grab an enormous chunk of data (say more than half a gigabyte), then alternate between a few seconds and a couple of minutes waiting, or perhaps concentrate your downloading during the night (their time) when their servers might be less busy. Failure to "act responsibly" as a crawler may place your IP range on their blocklists, and in the worst cases could constitute a denial of service attack.
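
As an illustration of that advice (this helper is not part of the original answer, just a suggestion), you could randomise the delay so that most requests wait a few seconds and the occasional one waits a couple of minutes:

// Hypothetical helper: mostly short pauses, with an occasional long one
function politePause()
{
    if (mt_rand(1, 10) === 1) {
        sleep(mt_rand(60, 120)); // roughly 1 request in 10 waits 1-2 minutes
    } else {
        sleep(mt_rand(3, 8));    // the rest wait a few seconds
    }
}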

So, to summarise, here's what you need to do:

  • Make the initial POST request that retrieves the form
  • Wait a few seconds
  • Grab the response (in particular the request ID) and resubmit that data in a new POST
  • Harvest the data from the screen
  • Use GET requests in the second program to get new data
  • Harvest the new data from the response

Be willing to tinker with your POST and GET parameters to see the effect, and have fun!

halfer
  • it must have taken you a good while to put together such a complete answer, kudos! (and +1 from me) – Purefan Apr 26 '14 at 12:59
  • Thanks Purefan. Yeah, but I don't mind having a tinker - it's a nice way to spend 20 minutes on a Saturday afternoon! – halfer Apr 26 '14 at 13:00
  • Thanks a lot halfer, this is really helpful. So if I understood well, should I also disable JavaScript in my crawler too, or just ignore it? – Madrugada Apr 26 '14 at 17:43
  • In general, crawlers don't run JavaScript - only headless browsers do that (e.g. PhantomJS). Turning off JS was just a way of checking whether the site is operational without client-side script, and the answer is yes, it is. – halfer Apr 26 '14 at 17:45
  • Thanks. I am going to go step by step with my crawler and check all that. However, I figured out that with Goutte I cannot select a form by its name, only by a button it contains, and on the 2nd redirect page I have a form without a button. Urgh, this should be a simple operation for Goutte... – Madrugada Apr 26 '14 at 18:30
  • No, you just need to persist a bit more. Try something like this (untested): `$form = $crawler->filter('form[name=X]');`. It _really_ is very flexible - its only minor fault is that the docs are a bit sparse. It uses Guzzle internally, which has all manner of events you can attach to. – halfer Apr 26 '14 at 18:40
  • Thanks, that worked. :-) I will accept the answer soon as it was really helpful. – Madrugada Apr 26 '14 at 18:58