I have a simple app that gets all the links from a web page. I'm using libxml2 to parse the HTML and extract the links, and Qt's QNetworkAccessManager for the HTTP requests. The problem is how to automatically detect the host name of the links. For example, if I have:

<a href="thelink.html" >
or 
<a href="../../../thelink.html" >  
or
<a href="../foo/boo/thelink.html" > 
I need to convert it to a full URL that includes the host, like this
(just an example):
<a href="http://www.myhost.com/thelink.html" >
or 
<a href="http://www.myhost.com/foo/boo/thelink.html" >  
or
<a href="http://www.myhost.com/m/thelink.html" > 

Is there any way to do this programmatically, without doing the string manipulation by hand?

If you know Perl, the equivalent is described as "Return a relative URL if possible" in URI::URL: http://search.cpan.org/~rse/lcwa-1.0.0/lib/lwp/lib/URI/URL.pm

$url->rel([$base])

Code example that doesn't work (Qt), using the page http://qt.digia.com/support/:

QString s("/About-us/");
QString base("http://qt.digia.com");
QString urlForReq;

     if(!s.startsWith("http:"))
     {       
         QString uu = QUrl(s).toString();
         QString   rurl = baseUrl.resolved(QUrl(s)).toString();
         urlForReq = rurl;
     }

The value of urlForReq is "/About-us/".

user63898
  • [The algorithm to resolve URLs to an absolute URL](http://www.whatwg.org/specs/web-apps/current-work/multipage/urls.html#resolving-urls) is defined by the HTML standard. – Joseph Mansfield Oct 29 '12 at 12:49

2 Answers


I have not verified whether this approach fully follows the algorithm mentioned by @sftrabbit, but you can use QUrl::resolved to convert your relative URLs to absolute URLs:

QUrl base("http://www.myhost.com/m/");
qDebug() << base.resolved(QUrl("thelink.html")).toString();
qDebug() << base.resolved(QUrl("../../../thelink.html")).toString();
qDebug() << base.resolved(QUrl("../foo/boo/thelink.html")).toString();

prints

"http://www.myhost.com/m/thelink.html"
"http://www.myhost.com/thelink.html"
"http://www.myhost.com/foo/boo/thelink.html"

I cannot reproduce the failure from the code example in the question. The only problem I see is that the baseUrl object is never declared in that code. The following SSCCE

#include <QApplication>
#include <QUrl>
#include <QDebug>

int main(int argc, char ** argv) {

    QApplication app( argc, argv );

    QString s("/About-us/");
    QString base("http://qt.digia.com");
    QString urlForReq;
    QUrl baseUrl(base);          // this was missing in the code from the question
    if(!s.startsWith("http:")) {       
        QString uu = QUrl(s).toString();
        QString rurl = baseUrl.resolved(QUrl(s)).toString();
        urlForReq = rurl;
    }
    qDebug() << "urlForReq:" << urlForReq;

    return 0;
}

prints

urlForReq: "http://qt.digia.com/About-us/"
Andreas Fester
  • Do you have some specific case which is not working, and what is actually not working? :) – Andreas Fester Oct 29 '12 at 14:06
  • Well, I can't give the real site and info... How can I test on another public site? OK, I just tried with http://qt.digia.com/support/ as in the example. – user63898 Oct 29 '12 at 14:17
  • @user63898: You don't need an actual site, just a syntactically valid URL. Using `ftp://files.example.com/this/doesnt/really/exist` is fine. – MSalters Oct 29 '12 at 16:24
  • I don't have it, I have the site domain. Can I do this with another C++ lib? Maybe uriparser? – user63898 Oct 29 '12 at 17:41
  • I can not reproduce the issue you added to the question, see my edited answer. I get the expected output `http://qt.digia.com/About-us/`. Anyway, if you are looking for something other than Qt, this posting might help: [Absolute URL from relative path](http://stackoverflow.com/questions/8749814/absolute-url-from-relative-path) – Andreas Fester Oct 30 '12 at 06:53

You should have the URL of the web page that you downloaded, e.g. "http://www.myhost.com/examples/useless/test.html".

Take the directory prefix, prefix = "http://www.myhost.com/examples/useless/". Every href that does not start with / or a scheme such as http:// is relative to that directory, and you get the absolute link using prefix + link.

E.g. if link = ../foo/boo/thelink.html, then the result is http://www.myhost.com/examples/useless/../foo/boo/thelink.html, which a browser will then normalize to http://www.myhost.com/examples/foo/boo/thelink.html (the ".." cancels out the "useless/" segment).
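
A minimal sketch of this prefix + link approach in plain C++ (no Qt), including the step that collapses "." and ".." segments the way a browser would. The function names removeDotSegments and resolveHref are made up for illustration and are not part of any library. It assumes the page URL is absolute (contains "://"), that neither URL carries a query string or fragment, and that hrefs of the form "//host/..." do not occur. It is a simplified take on the resolution rules, not a full implementation of them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collapse "." and ".." segments of an absolute path such as "/a/b/../c".
// Hypothetical helper; ".." at the root is silently ignored.
std::string removeDotSegments(const std::string& path) {
    std::vector<std::string> kept;
    std::size_t start = 1;  // skip the leading '/'
    while (start <= path.size()) {
        std::size_t end = path.find('/', start);
        const bool last = (end == std::string::npos);
        if (last) end = path.size();
        const std::string seg = path.substr(start, end - start);
        if (seg == "." || seg == "..") {
            if (seg == ".." && !kept.empty()) kept.pop_back();
            if (last) kept.push_back("");  // path still ends in '/'
        } else {
            kept.push_back(seg);
        }
        start = end + 1;
    }
    std::string out;
    for (const std::string& seg : kept) out += "/" + seg;
    return out.empty() ? "/" : out;
}

// Turn an href found on pageUrl into an absolute URL.
// Assumption: pageUrl looks like "scheme://host/path" with no query part.
std::string resolveHref(const std::string& pageUrl, const std::string& href) {
    if (href.find("://") != std::string::npos) return href;  // already absolute
    const std::size_t schemeEnd = pageUrl.find("://");
    if (schemeEnd == std::string::npos) return href;  // no usable base
    const std::size_t pathStart = pageUrl.find('/', schemeEnd + 3);
    const std::string origin = pageUrl.substr(0, pathStart);
    const std::string pagePath =
        (pathStart == std::string::npos) ? "/" : pageUrl.substr(pathStart);
    if (!href.empty() && href[0] == '/') {
        return origin + removeDotSegments(href);  // relative to the host root
    }
    // The "directory prefix": everything up to and including the last '/'.
    const std::string dir = pagePath.substr(0, pagePath.rfind('/') + 1);
    return origin + removeDotSegments(dir + href);
}
```

With the page URL from above, resolveHref("http://www.myhost.com/examples/useless/test.html", "../foo/boo/thelink.html") returns "http://www.myhost.com/examples/foo/boo/thelink.html". For real-world input (queries, fragments, other schemes) a URL library is the safer choice.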

Zane