14

I want to log access to any files in the /files folder, so I can process it with PHP to generate some statistics.

I don't want to write a custom PHP handler called via RewriteRule because I don't want to have to deal with status codes, MIME-types and caching headers, and file locking issues.

I don't have access to the server configuration, so I can't use CustomLog (I do have access to .htaccess).

I can't use X-Sendfile because it's not enabled.

I don't have access to the access.log.


Looking for an authoritative answer.

Halcyon
  • 2
    Do you have access to the `access_log` for parsing? – jprofitt Jan 31 '12 at 16:54
  • 1
    Are the files in /files downloadable, or are they just files that are served to the user, like images, stylesheets etc? – Marcus Olsson Jan 31 '12 at 17:33
  • 5
    So many limitations ... it doesn't seem worth it. If you want logging you need control. – Guilherme Viebig Jan 31 '12 at 17:42
  • 5
    Seriously? You want to log, but you don't want to do anything in order to log, and you don't have access to the *actual* logs. Well... just make index.php?file=12345.txt one level up from /files and log somewhere that someone requested files/12345.txt then do a header("Location: files/12345.txt") redirect. Of course, anyone who wants can bypass your tracking by just going to files/12345.txt, but... oh well. – gabe. Feb 01 '12 at 23:01
  • @gabe I want to avoid a costly performance or maintenance overhead, is that so weird? It's just that the environment I'm in (which is the environment 99% of us are in) is just really restrictive when it comes to this. I've thought about the header `Location` trick but it just feels 'hacky' and might lead to unforeseen issues. – Halcyon Feb 04 '12 at 00:26
  • 1
    If it's apache mod_php site then try virtual() function. It works similar to X-Sendfile header. See [example](http://www.php.net/manual/en/function.virtual.php#88722). – kupson Feb 09 '12 at 18:17
  • @kupson interesting! I'll look into that. Feel free to post it as an answer because you might get the bounty :P – Halcyon Feb 09 '12 at 19:13
  • Why this question full of artificial limitations is such upvoted? It is apparently not a real life question. – Your Common Sense Feb 10 '12 at 09:30
  • @Col.Shrapnel why do you say that? I'm posting this question exactly because this is my real life situation - and I'm sure it's the same for everyone else who doesn't self-manage their server. – Halcyon Feb 10 '12 at 13:03
  • @kupson `virtual` seems to send the wrong headers. – Halcyon Feb 10 '12 at 13:30
  • @FritsvanCampen I see now, linked example handled this problem. Answer updated. – kupson Feb 10 '12 at 14:00
  • @FritsvanCampen I have also similar [Question](http://webmasters.stackexchange.com/questions/39632/how-to-find-data-usage-of-a-user-on-my-website) for full fill my requirements. Which solution works for you?. Plz help. – Dharmik Jan 04 '13 at 13:46
  • The accepted solution, which, for some reason is at the bottom of the page currently. – Halcyon Jan 04 '13 at 16:00
  • 1
    @Halcyon sorry for my tone in my last comment. I'm not usually such a jerk. – gabe. Mar 22 '16 at 03:08

10 Answers

12

That's quite a few restrictions you've placed there.

You can do this with a custom handler installed via a PHP include at the top of each applicable (or, with __FILE__ parsing, not applicable) script. You must have a script which runs when each file is hit, and you've excluded alterations to the server config (including, I believe, .htaccess when you said RewriteRule wasn't good enough), so that means that you will be doing this through a script-based gatekeeper. You cannot have a solution which meets your constraints and has users go to files without hitting PHP (or another server-side dynamic language) first. Caching can be preserved by redirecting the user to the actual files instead of running static content through PHP.

You can store the log information in a database, or a file in a location writable by the server (watch out for contention if you use files - append mode is tricky).
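The contention concern with file-based logging can be handled by taking an exclusive lock around each append. The following is only a minimal sketch: `append_log_line` and the log path in the usage comment are hypothetical names, not part of any answer here, and it assumes the log lives on a local filesystem where `flock()` is honored.

```php
<?php
// Hypothetical helper: append one line to a log file under an exclusive lock.
// Returns true on success, false if the file could not be opened or written.
function append_log_line($logFile, $line)
{
    $fh = fopen($logFile, 'ab');      // append mode; creates the file if missing
    if ($fh === false) {
        return false;
    }
    $ok = false;
    if (flock($fh, LOCK_EX)) {        // block until no other request holds the lock
        $ok = fwrite($fh, $line . PHP_EOL) !== false;
        fflush($fh);                  // push the data out before releasing the lock
        flock($fh, LOCK_UN);
    }
    fclose($fh);
    return $ok;
}

// Usage (assumed log location):
// append_log_line('/path/to/access.log', date('c') . "\t" . $_SERVER['REQUEST_URI']);
```

Writing one line per request keeps the file easy to parse later. A database table would avoid the locking question entirely, at the cost of a connection per hit.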

EDIT: quickshiftin points out two ways you can get PHP invoked without having to add include calls by hand.

Borealid
  • A `RewriteRule` is fine, but I can only see that resulting in a full-on handler that requires me to mirror MIME-type headers and caching headers (which I don't want to do). – Halcyon Feb 04 '12 at 14:09
  • I'm not necessarily looking for a PHP solution. The `CustomLog` isn't PHP based (but I can't use it). I'm trying to be as open minded as possible because I think my constraints are pretty severe as it is, I just find it strange there doesn't seem to be an easy solution. I don't like @quickshiftin's way of invoking a PHP script. – Halcyon Feb 04 '12 at 14:15
  • @Fritsvacampen You don't like my way of invoking a PHP script but you're fine routing all requests through a PHP file as yes123 suggested? How does that make sense? – quickshiftin Feb 04 '12 at 16:59
  • 1
    @quickshiftin - Routing a request through PHP is fine, because I have control over what PHP does. If I _execute_ any random binary file as PHP I have no idea what's going happen. I do like the `auto_append/prepend` feature - I've used it successfully (for autoloading of stuff) in the past - but it doesn't work on non-PHP files. – Halcyon Feb 06 '12 at 10:34
5

Create an auto_prepend_file and define a function to log whatever you want. You'll need access to .htaccess to set it (the web host will need something like AllowOverride All in the vhost, and php_value is only recognized when PHP runs as an Apache module), or with PHP 5.3 you can use the per-directory INI feature.

.htaccess

php_value auto_prepend_file /path/to/file.php

per-directory .user.ini (PHP 5.3+, CGI/FastCGI SAPI)

auto_prepend_file = /path/to/file.php

Then for your file /path/to/file.php (something more elegant I'm sure ;))

file_put_contents(
    LOG_FILE,
    implode(PHP_EOL . PHP_EOL, array(
                'SERVER: ' . PHP_EOL . print_r($_SERVER, true),
                'REQUEST: ' . PHP_EOL . print_r($_REQUEST, true)
            )),
    FILE_APPEND
);

The beauty of this approach is you'll likely be able to get away with it and you'll only need to define / include the logging code in one place.

EDIT:

On reflection, I see you want this to work for arbitrary types of files... Yes, that would be rather rough. The best bet I can think of is labeling these files as .php or mapping their extensions to the PHP handler in .htaccess. The idea would be to run the files through the PHP interpreter, thereby executing the auto_prepend_file; since there are no PHP tags in the file, the content is sent directly to the client. Maybe even add a tiny bit of PHP atop each content file to set the Content-Type header. I'm not even sure that would work, but it might.

quickshiftin
  • 1
    he doesn't have access to .htaccess! – dynamic Feb 04 '12 at 01:07
  • I see now he says he doesn't have access to server configuration. Originally I took that as vhost level configuration, surprised .htaccess isn't even available. – quickshiftin Feb 04 '12 at 01:10
  • I do have access to `.htaccess`, just not to `httpd.conf` and `php.ini`. To my knowledge `auto_prepend_file` works only on files executed by PHP. I want to log 'regular' binary files. – Halcyon Feb 04 '12 at 14:12
  • 1
    I'm not really comfortable with any random file being passed through the PHP parser. Aside from security issues, it just seems wasteful. – Halcyon Feb 04 '12 at 14:18
  • 1
    @FritsvanCampen If these files are ones you've put on the server then they aren't random; and with the approach I've suggested, clients would never know there was PHP in them (assuming setting the ContentType works correctly). Also, what security issues would there be with PHP sitting on top of a data file that wouldn't be present in any normal PHP file? Lastly regarding wastefulness, be prepared to make some sacrifices in performance (and implementation elegance) given all the restrictions in this scenario. – quickshiftin Feb 04 '12 at 16:52
  • @yes123 Ha, I knew .htaccess was available! Even 1&1 gives access to .htaccess in their shared environments. I've heard of no vhost access (same scenario) but no .htaccess access would be criminal! – quickshiftin Feb 04 '12 at 16:53
  • I'm not the one putting the files on the server. I'm looking for a transparent solution. – Halcyon Feb 05 '12 at 21:49
  • Ok, well now we know a little more about where the files are coming from. How are they being uploaded, through a script of yours? Also, running clients through log.php etc isn't transparent IMO. If you do have control of the upload process, you could sprinkle a little PHP code on top of each one as they are uploaded. As I said before the clients wouldn't know the difference if the headers were set correctly by the script. Looks like **AddHandler** is available in .htaccess as well, so any filetype in that directory could be pre-processed by PHP, and new extensions added on-the-fly. – quickshiftin Feb 05 '12 at 23:45
  • I was afraid you were going to suggest this :/ No, this is not what I want. I want a transparent solution, no touching the files. @Borealid's solution is a step in the wrong direction. (I don't know who upvotes that stuff) – Halcyon Feb 06 '12 at 10:25
  • Well I was aiming for transparency to the clients of the files over the service provider. Given the restrictions I think it's a pretty good effort. I'm curious to see what other suggestions arise. – quickshiftin Feb 06 '12 at 12:26
3

That's pretty simple to do considering you don't need to restrict access.

Build a page logger.php that takes the requested file as input, like:

logger.php?file=abc.exe

In logger.php you just have to log the access and then redirect to the file:

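// one log entry per line; LOCK_EX avoids interleaved writes
file_put_contents('log', $_GET['file'] . ' requested' . PHP_EOL, FILE_APPEND | LOCK_EX);
header('Location: files/' . $_GET['file']);
exit; // stop here so nothing else is sent after the redirect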

Just check the $_GET['file'] for malicious files
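One way to do that check is to resolve the requested name against the files directory and reject anything that ends up outside it. This is only a sketch under stated assumptions: `resolve_tracked_file` is a hypothetical helper, and the `files` directory is assumed to sit next to logger.php.

```php
<?php
// Hypothetical helper: return the path relative to $baseDir when $requested
// resolves to a regular file inside $baseDir, or null otherwise.
function resolve_tracked_file($baseDir, $requested)
{
    $base = realpath($baseDir);
    if ($base === false || !is_string($requested) || $requested === '') {
        return null;
    }
    // realpath() collapses "..", "." and symlinks, and fails for missing files
    $full = realpath($base . DIRECTORY_SEPARATOR . $requested);
    if ($full === false || !is_file($full)) {
        return null;
    }
    $prefix = $base . DIRECTORY_SEPARATOR;
    if (strncmp($full, $prefix, strlen($prefix)) !== 0) {
        return null;                  // resolved outside the base directory
    }
    return substr($full, strlen($prefix));
}

// Usage (assumed layout: logger.php next to the files/ directory):
// $file = resolve_tracked_file(dirname(__FILE__) . '/files', $_GET['file']);
// if ($file === null) { header('HTTP/1.1 404 Not Found'); exit; }
```

Only the value returned by the helper, never the raw query parameter, should then be written to the log and used in the Location header.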

Of course you have to replace links in your site, from:

<a href="files/abc.exe">

to

<a href="logger.php?file=abc.exe">
dynamic
  • Then clients have to access files through the PHP script. Perhaps that's ok. It sounded like OP wanted users to be able to hit files directly and still be able to log, but there isn't really a specific point about that. – quickshiftin Feb 04 '12 at 01:13
  • This solves the problem of having to mirror MIME-Types and other headers, but I'm not sure what the effect is on caching. – Halcyon Feb 04 '12 at 14:08
  • I suppose I could test the effects of a `301 Location` header. – Halcyon Feb 04 '12 at 14:16
  • 2
    The effect? It's just a redirect oO – dynamic Feb 04 '12 at 15:52
  • A redirect requires another GET request. Does the client cache the content properly? Is it a problem that all the URLs don't point to the actual files? What about search engines, do they penalize Location headers? You seem to underestimate the impact of your proposed solution. – Halcyon Feb 05 '12 at 21:47
  • Search engines don't crawl .exe files or other files like that. As for client caching, yes, they cache normally. You seem to have totally confused ideas about this topic – dynamic Feb 05 '12 at 21:49
  • And considering your limitations this is the only real way to log your access – dynamic Feb 05 '12 at 21:55
  • My target files are `.doc` and `.pdf` files (and `.odf` :p), I certainly hope search engines will index those. I know the limitations are severe, that's why I'm asking Stackoverflow. Out of all the solutions I do think this is the best one (so far) as it's the most simple and least intrusive. I was hoping you had some authoritative information to back up the validity of this approach. – Halcyon Feb 06 '12 at 10:29
  • There isn't any problem with 301 redirects regarding Google (or other search engines). There are tons of websites talking about this, for example http://www.seomoz.org/learn-seo/redirection – dynamic Feb 06 '12 at 10:34
  • Which part is relevant to file tracking? The `301/2` header indicates that content has moved, which is the legitimate way to use it. In my case, content hasn't moved, I just need a hack to do my logging .. Also, there is no timestamp on the article you linked, I'm guessing it's at least 5 years old :( And 'a random ISP' is not an authority on correct use of HTTP status codes, sorry. – Halcyon Feb 06 '12 at 17:23
  • You haven't a clue on what you are talking about, sorry. – dynamic Feb 06 '12 at 19:06
  • @yes123, probly a better idea to cite valid argument to your proposal short of going for insults. I think Frits brings up a valid point on SEO which he just wants addressed. – Jakub Feb 07 '12 at 21:23
  • 1
    A simple google search would show you that 301 redirects are perfectly fine for every search engine. Even google suggest you to use 301 Redirects when possibile. – dynamic Feb 09 '12 at 11:37
3

It seems like the intention here is to circumvent all the systems that are inherently in place in Apache and PHP. If these restrictions are actually in place on your server instance, you are much better off asking for a change to your privileges than devising a workaround that your system admin may or may not be happy with you implementing.

PFY
  • These are kind of the restrictions you work with if you get any hosting plan that isn't a self-managed server. So about 99% of us are dealing with these restrictions. I believe I have sufficiently explained why I can't use some solutions. – Halcyon Feb 08 '12 at 19:50
  • 1
    I understand, however these restrictions are in place in these environments for very specific reasons. Attempting to do something that it seems the server admin very explicitly does not want you to do strikes me as a bad idea. – PFY Feb 08 '12 at 20:26
  • Yes, but I don't think logging is something that a admin would not want me to do. It's just that `CustomLog` is a `server config` level setting (which I think is a mistake). I don't think Apache was written with _"security through feature starvation"_ in mind. Blocking access to the `server config` is a really convenient way to protect your server :P – Halcyon Feb 08 '12 at 20:35
  • Yea, which is why I recommend just contacting your administrator. I've had problems like this in the past and in most cases when you point out that the restriction is unreasonable you can have the belt loosened a little. Don't forget Occam's Razor :) – PFY Feb 08 '12 at 20:41
  • I would if I could, but this involves administrators I don't even know yet. I want this piece of code to run on as many environments as possible, so I have to be very defensive about my restrictions. – Halcyon Feb 08 '12 at 20:43
  • Do you have access to run something like p0f? – PFY Feb 08 '12 at 20:53
  • No, I can't install anything, that would be too easy :P – Halcyon Feb 08 '12 at 21:09
  • 1
    I think that given the logical separation between Apache and PHP and the restrictive situation you are under, your only logical choice is one that you have already ruled out; a custom handler script. Since the problem there is dealing with all the issues that can come up, have you looked into existing solutions such as http://www.zubrag.com/scripts/download.php – PFY Feb 08 '12 at 22:05
  • I have a script that does this, I have a whole framework that does this and more but I'm looking for a _better_ solution. – Halcyon Feb 08 '12 at 22:12
3

Might not be exactly what you want but why don't you use a different solution altogether?

You could use Google Analytics VirtualPageviews to track the file downloads via Javascript.

See here for more information: http://support.google.com/googleanalytics/bin/answer.py?hl=en&answer=55529

You could even create your own JS to track the file downloads via the browser without having to bother with GA.

Update:

As I said, you could easily create your own JS to track them without having to bother with GA. Here is a silly example in jQuery that would work (I haven't tested it, I just wrote it off the top of my head):

Code sample:

JS Side:

$(document).ready(function() {
  $("a").click(function() {
    if( $(this).attr('href').match(/\/files\/(.*)/) ) {
      $.ajax({
        url: '/tracking/the/file/downloads.php',
        data: {
          'ok': 'let\'s',
          'add': 'some information',
          'about': 'the user that initiated',
          'the': 'request',
          'file': $(this).attr('href')
        }
      });
    }

    return true;
  });
});
mobius
  • I have considered this but I can't really use this as I want the data readily available. I know there are APIs for Google Analytics (and I'm using those for regular page views) but it's not suitable for what I'm trying to do here. – Halcyon Feb 10 '12 at 13:04
  • 1
    @FritsvanCampen Check my updated answer. There is no need to go with GA. You can do it manually and have your data instantly available. – mobius Feb 10 '12 at 13:42
  • I suppose that could work too. I do see some problems though: First of all you have some overhead on each request, regardless of caching. Secondly, you need to fire this handler again if you're dealing with ajax-ed content. You can't catch file uses if they're referred from other domains, or a page that doesn't include the tracking code. And lastly you have a dependency on JavaScript, - I don't really care about this - but it can be relevant. – Halcyon Feb 10 '12 at 13:48
  • @FritsvanCampen Well there is the extra request, true, but this can be heavily be optimized. You could even skip the AJAX request and make it work using the nginx 1x1 magic image (http://wiki.nginx.org/HttpEmptyGifModule) and get the log from there. (which would be super-fast) This would require a separate server, yes, but on the other side it will not count up to the maximum web requests made by the browser. Also with jquery the events are being chained. So if you were to have a click() listener on all tags it will fire no matter what other clicks you have bound on the link. – mobius Feb 10 '12 at 13:53
3

This works only with mod_php. There is some performance hit, because apache_lookup_uri() performs an additional internal Apache sub-request.

As others pointed out, you need an .htaccess like

RewriteEngine On
# In .htaccess the pattern is matched without the leading slash
RewriteRule ^handler\.php$ - [L]
RewriteRule ^([a-zA-Z0-9.]+)$ /handler.php?filename=$1 [L]

In handler.php file use virtual() function to perform apache subrequest. Example here: http://www.php.net/manual/en/function.virtual.php#88722

Updated and tested (but rather minimal) solution:

<?php
//add some request logging here
$file = $_GET["filename"];

$file_info = apache_lookup_uri($file);
header('content-type: ' . $file_info -> content_type);
// add other headers?
virtual($file);
exit(0);
?>
kupson
2

OK, here's an idea. Bear with me on this, it might at first seem unsuitable, but read the bit at the end. Hopefully it works with what you have in place. In the folder containing your files, you place a .htaccess which rewrites all requests to a PHP handler script in the same directory, something like this (untested):

RewriteEngine On
# In .htaccess the pattern is matched without the leading slash
RewriteRule ^handler\.php$ - [L]
RewriteRule ^([a-zA-Z0-9.]+)$ handler.php?filename=$1 [L]

In the PHP script, you do whatever logging is necessary using file_put_contents(). Then, you create handler.php with this code:

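<?php
// basename() strips directory components so a request cannot leave this folder
$filename = isset($_GET["filename"]) ? basename($_GET["filename"]) : "";
if ($filename === "" || !is_file($filename)) {
    header("HTTP/1.1 404 Not Found");
    //if you have a 404 error page, you can use an include here to show it
    exit(0);
}

header("Content-disposition: attachment; filename=\"$filename\"");
header("Content-type: ".get_mime_type($filename));
readfile($filename);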

function get_mime_type($filename, $mimePath = '/etc') {
    $fileext = substr(strrchr($filename, '.'), 1);
    if (empty($fileext)) return (false);
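    // preg_quote() keeps regex metacharacters in the extension from breaking the pattern
    $regex = "/^([\w\+\-\.\/]+)\s+(\w+\s)*(" . preg_quote($fileext, '/') . "\s)/i";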
    $lines = file("$mimePath/mime.types");
    foreach($lines as $line) {
        if (substr($line, 0, 1) == '#') continue; // skip comments
        $line = rtrim($line) . " ";
        if (!preg_match($regex, $line, $matches)) continue; // no match to the extension
        return ($matches[1]);
    }
    return (false); // no match at all
}
?>

Basically, you are creating a layer between the file request and the actual serving of the file. This PHP layer logs the file access, then serves the file. You said you didn't want to mess around with status codes and MIME types, but the beauty of this is that all that is taken care of. In case the file doesn't exist, it just generates a standard 404, and you can include a custom 404 error page. Yes, the status header is being changed here, but it's nothing complicated. As to MIME types, they are detected for you according to the same MIME type rules Apache uses. Point the get_mime_type function to the mime.types file on your server. If you don't know where it is, just download a copy from here. I'll admit, this solution is probably more technical than you were looking for, but with the restrictions you have it's a good solution. The best part is, it's completely transparent to the end user, as well as those who upload stuff.

Ashley Strout
  • 1
    This is _exactly_ the kind of handler I _don't_ want. I don't have a path to a `mime.types` file. You're not outputting any caching headers and no ETags. The `Content-disposition: attachment` header is also wrong because it forces browsers to download the file. And `file_put_contents`, even with the `FILE_APPEND` flag, is not atomic - but I suppose there are ways to circumvent that. – Halcyon Feb 08 '12 at 19:49
2

The only unobtrusive monitoring you could do without filtering stuff through PHP would be checking up on all files and noting down their file access times each time any PHP file is requested (you just add a function to your php files or use a rewrite). It'll incur a little overhead but it's the only unobtrusive statistic you can get.

Obviously, this way you can't get exact numbers of accesses but more like frequencies so it's some kind of (viable) statistic too. To get something like hit numbers (this was opened 1000k times on march 25th at 2am) you need to have access to logs or pipe it all through a PHP or cgi script -- something just has to do the manual counting.
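A rough sketch of that polling idea follows. The helper names `snapshot_access_times` and `accessed_since` are hypothetical, and the approach assumes the filesystem actually records access times; mounts using `noatime` or `relatime` will under-report or report nothing.

```php
<?php
// Hypothetical helper: map each regular file in $dir to its last access time.
function snapshot_access_times($dir)
{
    $times = array();
    $names = scandir($dir);
    if ($names === false) {
        return $times;
    }
    clearstatcache();                 // PHP caches stat() results within a request
    foreach ($names as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (is_file($path)) {
            $times[$name] = fileatime($path);
        }
    }
    return $times;
}

// Hypothetical helper: names that are new or whose access time has advanced
// since the previous snapshot.
function accessed_since(array $previous, array $current)
{
    $hits = array();
    foreach ($current as $name => $atime) {
        if (!isset($previous[$name]) || $atime > $previous[$name]) {
            $hits[] = $name;
        }
    }
    return $hits;
}
```

The previous snapshot would have to be persisted between requests (for example as a serialized array), and each detected change counts as at least one access, not an exact hit count.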

adioe3
  • Interesting. The only problem I foresee is that using the filebrowser in the backend of my application might 'touch' the files as well so `fileatime` will not be accurate. It's too low-level (OS-level). – Halcyon Feb 10 '12 at 13:07
2

Assuming you're using PHP as a compiled Apache module then the virtual() function could make this happen. See: http://www.php.net/manual/en/function.virtual.php

<?php

$fn = $_GET['fn'];

log_file_access($fn); // You define how you want this to happen    
virtual($fn);

You then reference the files via:

http://example.com/file.php?fn=files/lolcat.jpg

A. R. Younce
  • 1
    That's interesting however quoting one of the users comments on php manual : "Problem with most of the scripts posted below is that virtual() flushes the pending headers before making the subrequest. Requesting an image with virtual() still returns a text/html type document." – Bathz Feb 10 '12 at 09:44
  • `virtual` doesn't seem to send the right headers, see my 'answer'. – Halcyon Feb 10 '12 at 13:50
1

I've tried a great many things and there seems to be no easy solution.

My solution uses the Location header trick proposed by @yes123 but I've tweaked it to match my preferences.

The links to the files are kept intact, so it's still: /files/path/to/my/file.abc I have a RewriteRule:

RewriteRule ^files/(.*) path/to/tracker.php?path=/$1

Then in the file I issue a Location header by adding ?track=no to the URL and an exception to the earlier RewriteRule:

RewriteCond %{QUERY_STRING} !(&|^)track=no(&|$)

I've added one more optimization. I've enabled ETags, so if the client sends an ETag (in an If-Match or If-None-Match header), I check whether it matches the file and return a 304 Not Modified instead of a Location redirect.

$fs = stat($document_root . $path);
$apache_etag = calculate_apache_etag($fs);
if ((isset($_SERVER["HTTP_IF_MATCH"]) && etag_within_range($_SERVER["HTTP_IF_MATCH"], $apache_etag))
    || (isset($_SERVER["HTTP_IF_NONE_MATCH"]) && etag_within_range($_SERVER["HTTP_IF_NONE_MATCH"], $apache_etag))
) {
    header("ETag: " . $apache_etag, true, 304);
    exit;
}

function etag_within_range($etag1, $etag2) {
    list($size1, $mtime1) = explode("-", $etag1);
    list($size2, $mtime2) = explode("-", $etag2);
    $mtime1 = floor(hexdec($mtime1) / 1000000);
    $mtime2 = floor(hexdec($mtime2) / 1000000);
    return $mtime1 === $mtime2 && $size1 === $size2;
}

An implementation of calculate_apache_etag can be found here: How do you make an etag that matches Apache?

etag_within_range solves the issue of comparing against the higher-precision mtime that Apache uses.


Notes on solutions that didn't work

virtual

Test script:

var_dump(apache_response_headers());
virtual("/path/to/image.jpg");
var_dump(apache_response_headers());

Outputs:

array(1) { ["X-Powered-By"]=> string(10) "PHP/5.2.11" }
[[binary junk]]
array(5) { ["X-Powered-By"]=> string(10) "PHP/5.2.11" ["Keep-Alive"]=> string(18) "timeout=5, max=100" ["Connection"]=> string(10) "Keep-Alive" ["Transfer-Encoding"]=> string(7) "chunked" ["Content-Type"]=> string(9) "text/html" }

Content-Type: text/html reaaaaalllly? :(

Perhaps PHP5.3's header_remove function can solve this? I haven't tried.

Halcyon