Python Mechanize - Open URL with hash mark

Question

I'm using Python Mechanize to open an URL with this format...

https://www.simplewebsite.com?view=discussions#/?page=2

When the page opens...it gets this...

https://www.simplewebsite.com?view=discussions

Completely bypassing what's after the "#" mark...

Any clue how to open the URL? I have spend a lot of time searching the web...without a positive answer...

Koterpillar · Answer 1 · 2013-06-26T23:30:28.460

1

Most likely the site is relying on its JavaScript to parse the rest of the URL (after #); see window.location.

Unless Mechanize can run JavaScript somehow, you won't get the results you want. Try Selenium, Phantom.JS/Phantompy or something like this.

The site might actually support passing the parameters directly, then you can request

https://www.simplewebsite.com?view=discussions&page=2

If not, you'll have to inspect the AJAX queries it makes to request the data you actually wanted.

edited Jun 26 '13 at 23:30

answered Jun 26 '13 at 22:55

Koterpillar

7,883
2
25
41

Thanks Koterpillar...I thought something like that...but was hoping for some hack or something :) – Blag Jun 26 '13 at 22:59
`&page=2` is a hack I'd try. Do you mind posting the actual site URL? – Koterpillar Jun 26 '13 at 23:00
I already tried with &page=2 but doesn't work...and for the actual URL...it's my company collaboration space and needs user and password to be accessed... – Blag Jun 26 '13 at 23:14
Then either use a JavaScript-capable thing or sniff AJAX calls. – Koterpillar Jun 26 '13 at 23:30

score 1 · Answer 2 · answered Jun 26 '13 at 23:01

The part of the URL that appears after the hashtag is a reference to an HTML anchor, these are handled by the client (typically a web browser), and are never sent to the server.

The website is likely loading Javascript code that runs on page load. That code parses the anchor name and updates the page base on that. In this case it is pretty clear that the javascript code will have to send an ajax request to the server to get page 2, then update the HTML document to show that data.

Unfortunately mechanize will not be able to handle this type of website because it depends on running Javascript code on the client. You can probably do something like this with phantom.js, a headless web browser client that can run client side scripts.

Thanks Miguel...I actually need to use Mechanize and Python...so while phantom.js looks cool...I don't think I can use it from my scenario... — Blag, Jun 26 '13 at 23:08
Then you need to make sure the target site does not run client side javascript, because your software cannot do that. This is the same problem that search engines have when trying to index Ajax sites, it is a tough problem. — Miguel Grinberg, Jun 26 '13 at 23:20

score -2 · Answer 3 · edited Oct 07 '21 at 06:31

Are you using the query string:

view=discussions%23%2F%3Fpage%3D2

?? For instance:

import mechanize as mech
from urllib import urlencode

host = "http://localhost:8080/1.php"

data = {"view": "discussions#/?page=2"}
data = urlencode(data)
print "encoded data sent by python:\n\t", data

resp = mech.urlopen(host + "?" + data)
print resp.read()

It certainly 'works'. Whether the other side knows how to properly decode and parse the query string is another matter. For instance, if you request the following php program at http://localhost:8080/1.php:

<?php

parse_str(
    urldecode($_SERVER['QUERY_STRING']),
    $data
);

//You might also call htmlentities() on the query string
//if a browser was going to display the result


echo "php received the following data:\n";

foreach($data as $key => $val)
{
    echo "\t $key ----> $val \n";
}

?>

...the python program outputs:

encoded data sent by python:
    view=discussions%23%2F%3Fpage%3D2

php received the following data:
     view ----> discussions#/?page=2

As for this:

When the page opens...it gets this...

    https://www.simplewebsite.com?view=discussions

Completely bypassing what's after the "#" mark...

an RFC says:

The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI. https://www.rfc-editor.org/rfc/rfc3986#section-3.4

7stud...thanks...but doesn't work...even when replacing the "#" with it's encoded code doesn't help... — Blag, Jun 26 '13 at 23:00
You should try to urlencode the whole query string as shown in the example I posted. — 7stud, Jun 27 '13 at 01:38

Python Mechanize - Open URL with hash mark

3 Answers3