Realtime URI-translation of HTML content in C/C++

Question

For a custom reverse proxy (written in C++), I want to translate URIs in HTML content in real time. For example, if I want to access a resource on http://myserver/ via http://my-reverse-proxy/myserver, all absolute and top-level links such as http://myserver/somecontent1.ext or /somecontent2.ext need to be modified.

An HTML tag

<img src="/sample.png">

would therefore be translated to

<img src="/myserver/sample.png">

From my point of view there are two approaches:

1) Use regular expressions with capture groups to find all relevant HTML tags and their paths, then rewrite the matched paths by string replacement.

2) Parse the entire HTML content, transform the parse tree, and pretty-print the result back to a valid HTML resource.
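To make approach 1 concrete, here is a minimal sketch using `std::regex`. The function name `rewrite_uris` and its parameters `origin` and `prefix` are hypothetical, chosen only for illustration. It assumes quoted `src`/`href` attributes, a `prefix` that starts with `/` and contains no `$`, and input that is available as one complete string.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from any library): rewrites quoted src/href
// values so that links to `origin` and root-relative links go through
// the proxy path `prefix`, e.g. origin = "http://myserver",
// prefix = "/myserver".
std::string rewrite_uris(const std::string& html,
                         const std::string& origin,
                         const std::string& prefix)
{
    // Escape regex metacharacters in `origin` so it is matched literally.
    static const std::regex special(R"re([.^$|()\[\]{}*+?\\])re");
    const std::string escaped = std::regex_replace(origin, special, R"(\$&)");

    // Group 1: attribute name, '=', and the opening quote.
    // Then an optional literal origin, then a '/' that is not followed by
    // another '/' (so protocol-relative "//host/..." URLs are skipped).
    const std::regex attr(
        R"re(((?:src|href)\s*=\s*["']))re"
        "(?:" + escaped + R"re()?/(?!/))re",
        std::regex::ECMAScript | std::regex::icase);

    // Keep the attribute start and put the proxy prefix before the path.
    return std::regex_replace(html, attr, "$1" + prefix + "/");
}
```

Called as `rewrite_uris(html, "http://myserver", "/myserver")`, this turns `<img src="/sample.png">` into `<img src="/myserver/sample.png">` and leaves links to other hosts alone. It does not cover unquoted attributes, `srcset`, inline CSS or scripts, or paths that already carry the prefix, and a match that straddles two response chunks would be missed when rewriting a stream.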

And this is what the question is about: do you have any experience with which solution is faster, and perhaps more reasonable? Do you know a framework I could use so that I don't reinvent the wheel? As this process should later be used for CSS and XML-based resources as well, it should not be an HTML-dependent solution.

Thanks in advance!

If you plan to use it for CSS, your option 2 is not possible... – FredericS, Apr 03 '13 at 09:36
@FredericS I could parse and tokenize CSS as well, why shouldn't this work? Using something like [SDF](http://www.program-transformation.org/Sdf/SdfLanguage) I could even parse inline CSS in HTML content. – muffel, Apr 03 '13 at 09:40
Sure, you could parse both CSS and XML, but the languages are not similar at all. You will have a CSS-dependent parser, an HTML/XML-dependent parser, and minimal code reuse (the common transformation part will most likely be your option 1, applied to specific nodes of your parse trees). – FredericS, Apr 03 '13 at 09:43

score 0 · Answer 1 · answered Apr 03 '13 at 09:54

Proxy servers generally work by being servers. They handle all HTTP requests, modify the requested URLs, and then pass the modified request on to the server on the other side.

You should stick to this paradigm. It is far easier and more efficient than mucking around with the files themselves. Anything that is being done real-time can be done at the point of the request.
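For the request side, a minimal sketch is given below. It assumes the scheme from the question, where the first path segment (such as `/myserver`) names the internal host. The function name `split_target` is hypothetical, and the code needs C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: splits a proxy request path such as
// "/myserver/a.png" into the internal host ("myserver") and the path to
// request from it ("/a.png"). Returns std::nullopt if no host segment
// is present.
std::optional<std::pair<std::string, std::string>>
split_target(const std::string& request_path)
{
    if (request_path.size() < 2 || request_path[0] != '/')
        return std::nullopt;

    // Find the slash that ends the first path segment, if there is one.
    const std::size_t slash = request_path.find('/', 1);
    const bool has_rest = (slash != std::string::npos);

    std::string host = has_rest ? request_path.substr(1, slash - 1)
                                : request_path.substr(1);
    if (host.empty())
        return std::nullopt;

    // Without a remaining path, request the root of the internal server.
    std::string rest = has_rest ? request_path.substr(slash)
                                : std::string("/");
    return std::make_pair(std::move(host), std::move(rest));
}
```

A real proxy would also check the extracted host against a list of allowed internal targets before forwarding the request.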

Also, it should probably be asked: why a custom reverse proxy? Such things exist already.

The use case for the server I am developing is to provide external access to internal resources. As the internal resources cannot be modified (think of the web interface of a set-top box or some home automation device), and some external devices like smartphones cannot reach individual resources through (HTTP) proxy servers on their own, I really need it this way: a web browser connects to the remote proxy system. The target web server is identified by the URI path (like `/myserver`), and the reverse proxy is the 'mediator' that sits between the browser and the target web server. – muffel Apr 03 '13 at 10:00
@muffel, why can your server not simply redirect all requests to the proxy? i.e. redirect any request on `http://myserver/` to `http://my-reverse-proxy/myserver`? – Apr 03 '13 at 10:06
Because the server would then not be able to intercept any request after the first one. Example: I want to access 'myserver' using a web browser that has no proxy settings. All I can do is open a URL. So I access `http://reverse-proxy/myserver`, which is translated to `http://myserver`. The result contains an image `` which the browser would load as `http://reverse-proxy/a.png` instead of `http://reverse-proxy/myserver/a.png`. Many mobile browsers cannot use proxy servers on 3G internet connections, but I want to address them as well. – muffel Apr 03 '13 at 10:13
@muffel, just make the browser access `http://myserver`, and have the server transmit that request to the internal server for myserver. Then you don't have to do any translation. This is what a standard reverse proxy does, and it won't be visible to any devices. – Apr 03 '13 at 10:19
I still don't follow. The browser is used externally, so it cannot reach `http://myserver` directly. The hostname is not fully qualified, so I cannot provide a corresponding DNS mapping. The only endpoint the client can reach (and that's intended) is the reverse proxy. So the connection chain would be Browser -> Proxy -> Service. It is therefore up to the proxy to modify the content so that all further requests go through it as well. Because the proxy should be used for several services at once, the additional path /myservice is used to determine the target web server. – muffel Apr 03 '13 at 10:26
@muffel, sorry, I understand what you mean now. – Apr 03 '13 at 10:31