
I have been asked to determine the feasibility of mirroring or caching websites on a mobile server on a train. Unfortunately, this has been dropped on me at the last minute and I have to come up with an answer in a day or two, and I don't have much experience in this area.

The train is:

  1. a long distance passenger carrier
  2. not always connected to the Internet (it comes in and out of range of 3G phone towers)
  3. using a 3G modem to connect to the Internet when it is in range

When out of service range, the guests should be able to continue to access websites that have previously been accessed. These will be general websites that guests might access in their off-train lives that we won't have control over. There will also be websites that are cached or "pulled in" automatically such as news and current affairs, that won't necessarily be guest initiated.

I know that we can mirror or cache visited pages, but I'm concerned about the 'environment' that we'll be operating in.

  1. Most of the mirror sites I see are permanent connections to the Internet, with updates propagating through from the main site using wget or similar. How would an intermittent connection affect this process?
  2. This should be seamless, with the visitor typing in the normal URL of the site. If 3G isn't available, or the cached copy hasn't expired, the cached page should be shown; otherwise it should be loaded from the original website (and cached for later). Is it feasible to mirror the URL as well, or do we need our own domain name?
  3. I'll need to let guests know when they're out of range. I figure a custom version of the error page presented when browsers are not able to access a page, but served from the server, would be the way to go here. Reasonable?
  4. I'm thinking that we'll also need to do something special for managing content that is served through CDNs. (I suspect we'll need a decent amount of storage on this server.) Am I correct?
  5. I'm not sure of the terminology for this (which hinders searching). Can anyone point me to the correct terms for what I want to do?

Any other resources you can point me to would be appreciated.

Thanks.

John Judd
  • This reads similar to what the German train company "Deutsche Bahn" does (or at least did previously): you have WLAN inside the train, and the train connects via mobile towers and via additional methods at the stations. For me it was utterly crap and I did not use it much when the train was moving. – Dennis Nolte Jul 28 '14 at 15:32
  • Thanks for the insight Dennis. Can you tell me why you thought the service was utterly crap? – John Judd Jul 29 '14 at 06:04
  • Basically the issue was (for me as a user) the following: you could load static cached content, but everything else, like social media or HTTPS (as written in the answers below), did not work correctly. Basically you had Internet only at the stations; for the ride from station A to station B you could only "request" to get stuff at the next station. It did not feel like the Internet, rather like someone from the library handing you a page of a newspaper, and you need to wait until the librarian gets to you again before you can read the next page. – Dennis Nolte Jul 29 '14 at 07:13

2 Answers


I think the feasibility is pretty low. You have to consider the following issues, which combined I think rule out the possibility of a "usable" or transparent offline mirror.

  • HTTPS traffic is increasingly common these days, and you won't be able to cache it without installing a CA certificate on each user's device, which users should be very hesitant to do.

  • Many websites rely heavily on client-side HTTP requests (i.e. AJAX) to function, and in most cases sites go out of their way to avoid AJAX requests being cached by appending a timestamp to the URL so that every request is treated as a unique URL.

  • You can basically rule out any stateful site (i.e. one that requires a login). Obviously you can't cache person X's Facebook profile unless they've already viewed it, and even if they have, the value of these sites is severely diminished without real-time updates. Plus, this means your cache lookup will have to depend on the value of a cookie, which decreases the chance that you'll get a hit on a page requested earlier by someone else.

  • How do most people get to sites? People rarely type URLs; typically they search for things, even things they know the URL of, such as Facebook. It would be a challenge to cache complex search engine results because they're likely to be stateful (e.g. if you search Google while logged into a Google account, your results will be different).

  • When browsing the web, what is the percentage of new content vs. content you've seen before? Even when browsing a site you visit a lot, like Facebook, you'll frequently click onto new pages, etc.

  • Some sites now use WebSockets, not sure about the exact details but I imagine it'd be difficult to emulate / replay the WebSockets interaction.
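To make the cache-key problem from the AJAX and login bullets concrete, here is a minimal sketch. It assumes a cache that is keyed on the exact URL plus the session cookie; `cache_key`, the example URLs and the cookie values are all made up for illustration and are not taken from any real proxy.

```python
def cache_key(url, session_cookie=None):
    """Hypothetical cache lookup key: the exact URL plus the session cookie, if any."""
    return (url, session_cookie)


# Two AJAX calls for the same resource, a few seconds apart. The site has
# appended a timestamp parameter, so the URLs differ and the keys never match.
first = cache_key("http://example.com/api/feed?page=2&_=1406347980")
second = cache_key("http://example.com/api/feed?page=2&_=1406347995")
print(first == second)  # False: the second request cannot be a cache hit

# The same page requested by two logged-in users. The cookie is part of the
# key, so a copy cached for one user is never served to the other.
alice = cache_key("http://example.com/home", session_cookie="session-of-alice")
bob = cache_key("http://example.com/home", session_cookie="session-of-bob")
print(alice == bob)  # False
```

In both cases the cache fills up with entries that will never be requested again.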

If you have some reason to believe that your users will be visiting the same set of pages (e.g. a set of documentation) a large percentage of the time, and this content is not stateful, then it might be feasible.

thexacre
  • Thank you. I was wondering about dynamic and stateful websites, but hadn't considered https. That alone could make this difficult. And the searching too. Hmmm. – John Judd Jul 26 '14 at 04:13
  • WebSockets just plain don't work through a transparent proxy and there's nothing you can do about it, except ask the user to explicitly configure the proxy server in their browser. This is probably unreasonable for the proposed service. The latest browsers have workarounds for this, though, so it's not as bad as it was. – Michael Hampton Jul 26 '14 at 04:21

I actually set up something very similar to this once a year for a week-long event that's held in the middle of nowhere, so I have a little experience to share.

First, the TL;DR: You can do it, but it won't work nearly as well as you (or your higher-ups) might hope. It might not be worth bothering, especially if the interruptions are brief. But you might want to do it anyway, in order to save bandwidth and provide a faster experience when you are connected to 3G.


The component you're looking for is a transparent proxy: one which intercepts outgoing HTTP requests that the client did not intend to be proxied and diverts them to a proxy server. Squid is the most common software used for transparent proxying, and it is what I use.

The way this works is: a switch or router intercepts packets intended for port 80 of a remote address and mangles them so that they end up connecting to the proxy instead. The proxy then checks its cache, and if the cache misses it goes to the network. Typical proxy stuff. I do this diversion with some simple Linux iptables rules, though many routers and switches can also be configured to do it.

For your purposes, you will also need to do some significant tweaking to squid's configuration, to override its cache handling. In particular you will want to cause it to serve a stale cached item when it fails to revalidate it on the network. I don't have the configuration for this offhand, since it isn't necessary in my design, where I'm at a fixed point and have continuous wireless service. But some careful documentation reading ought to suggest a way to do it.
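The behaviour described above (serve a stale copy when revalidation fails, and show an explanatory page when nothing is cached) can be sketched roughly as follows. This is not Squid configuration and not how Squid is implemented; it is a toy in-memory model, and `StaleOnErrorCache`, `OFFLINE_PAGE`, the `fetch` callable and the status strings are all hypothetical names chosen for illustration.

```python
import time

# Hypothetical placeholder for a custom "out of range" error page.
OFFLINE_PAGE = "<h1>The train is currently out of mobile range</h1>"


class StaleOnErrorCache:
    """Toy cache that falls back to a stale copy when the network is down."""

    def __init__(self, fetch, max_age=300, clock=time.time):
        self.fetch = fetch      # callable(url) -> body; assumed to raise OSError when offline
        self.max_age = max_age  # seconds a cached copy is considered fresh
        self.clock = clock      # injectable so the behaviour can be tested
        self.store = {}         # url -> (body, time the copy was fetched)

    def get(self, url):
        entry = self.store.get(url)
        now = self.clock()
        if entry is not None and now - entry[1] <= self.max_age:
            return entry[0], "fresh"
        try:
            body = self.fetch(url)
        except OSError:
            if entry is not None:
                return entry[0], "stale"   # expired, but better than nothing
            return OFFLINE_PAGE, "offline"  # never cached: explain the outage
        self.store[url] = (body, now)
        return body, "fetched"
```

A real proxy would also have to honour the origin's caching headers; the sketch ignores them entirely and only shows the fallback order.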

You will also want to create some custom Squid error pages which refer to your company and explain the various out of service conditions to be expected.

And now for the down side.

You won't be able to do this with HTTPS requests at all. While Squid does support a method of intercepting HTTPS requests similarly to HTTP requests, you won't be able to use it as it would require creating a CA and installing a certificate in every client's browser. Easy enough for an enterprise, but not something you can do for a public service. And even if you could, it is not at all user friendly, will set off alarms in any privacy-minded person's mind, and it is illegal to do so in some countries.

In addition, WebSockets, used by many web sites these days, will almost always fail when a transparent proxy is involved, because the proxy -- doing what it is supposed to do -- mangles the upgrade request beyond recognition. There is little you can do about this, except advise users to explicitly use the proxy server. In this case the browser knows to format the request differently, using HTTP CONNECT, so that it will pass through the proxy unmolested.

Finally, after having spoken to some people familiar with traveling on Australia's trains, I learned that these outages can sometimes last 10 to 15 minutes. There's very little that you can do about this; someone browsing the web during that time is quite likely to go try to click on a link to a site you haven't yet cached, and you are not much better off than you are now, though if you have the cache in place you can at least advise the passenger of the situation (at least on HTTP). While the Internet is out, passengers might be better served by looking out the windows and trying to spot the Nullarbor Nymph.


And some basic stats. Last year the service used 42 GB of data and served an additional 17 GB from cache. This year the service used 87 GB of data and served just 744 MB from cache. That's not a mistaken calculation or, as far as I can tell, a configuration error. The majority of the difference between caching last year and this year seems to be that more major web sites are now forcing HTTPS. For instance, last year I was able to cache some YouTube videos. This year I could not, because they are now served over HTTPS.
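As a rough worked example of what those figures mean, the share of delivered bytes that came from the cache can be computed as below. The sketch assumes the "used" figure is traffic fetched over the network, that the cached figure comes on top of it, and that 1 GB = 1000 MB; `cache_share` is just an illustrative helper.

```python
def cache_share(from_network_gb, from_cache_gb):
    """Fraction of all bytes delivered to clients that came from the cache."""
    return from_cache_gb / (from_network_gb + from_cache_gb)


last_year = cache_share(42, 17)
this_year = cache_share(87, 744 / 1000)  # assumption: 1 GB = 1000 MB

print(f"last year: {last_year:.1%}")  # last year: 28.8%
print(f"this year: {this_year:.2%}")  # this year: 0.85%
```

Under those assumptions the cache went from serving more than a quarter of the traffic to serving less than one percent of it.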

With more and more web sites moving to HTTPS, this caching strategy becomes less and less viable every year, and running the cache at all seems to be more and more pointless.

My recommendation is that you not bother. But you could set one up and run a trial on one train, and then measure the results.

You might also experiment with instructing users to configure the proxy explicitly, so that you can handle HTTPS and WebSockets, though in my experience this is something that's difficult for users to get right. You might be able to implement WPAD to configure some users automatically, but be aware that Android and iOS devices have poor or no support for it.

Michael Hampton