
A desktop application uploads a file with arbitrary length (it really could be anything from a few MB to multiple GB) to a PHP endpoint.

The PHP server in question is running in Docker. Specifically, it's the Apache 2 + PHP combination (no FastCGI).

While uploading the file using the desktop application, I observe the following:

  • The network statistics show that there is network transmit activity of about 600 MB/s (fluctuates quite a bit) during the upload
  • During the upload, the desktop application shows a progress bar indicating how much the upload has progressed. It aligns with the upload speed observed independently.
  • There is no output in the access or error log of Apache 2, nor is there any log output on the PHP side during the upload
  • I know that the PHP script is not running because I have added an error_log line at the very beginning, and that line does not appear in the error log during the upload.
  • The error_log line is executed as soon as the upload reaches 100%. Network activity drops, the line appears in the Apache error log, and the script starts executing.
  • The request is sent with Transfer-Encoding: chunked. I checked this in the logs.
  • RAM usage stays about the same during upload – only increases a bit during script execution.

Based on this I assume there are two places that could be causing this:

  • Apache buffering the data before handing it off to PHP
  • PHP buffering the data before starting script execution

This is an issue because the script uses php://input to process the upload. Since the upload can be multiple GB, I would like the script to start processing the data before the upload has finished (let's leave the possibility of abrupt connection aborts aside).
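
To illustrate what I mean by processing: the script is supposed to run a streaming read loop over php://input, roughly like this (a minimal sketch; process_chunk() is a hypothetical placeholder for the actual import logic):

$in = fopen("php://input", "rb");
while (!feof($in)) {
  $chunk = fread($in, 1024 * 1024); // read up to 1 MB at a time
  if ($chunk === false) {
    break;
  }
  process_chunk($chunk); // hypothetical import handler
}
fclose($in);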

How can I circumvent this?

I have searched everywhere for a way to disable input buffering and the like, but haven't found anything that helps. Everything only points to output buffering or to adding lines to the script, which does not really help, since the script only starts executing once the file is fully uploaded, so the code itself can't be the cause.

  • _"Apache buffering the data before handing it off to PHP"_ is most likely what is happening here. You could open a UDP/TCP port on the server side (circumventing apache and PHP's HTTP processing) and have the client upload the file in chunks to that port. – Marco Jul 14 '23 at 22:12
  • As far as I know, PHP's built-in file upload works on a request basis, i.e. the script only gets to run once the file has been uploaded to the server (which is not what you want here). – Marco Jul 14 '23 at 22:13
  • Another way would be to split the file in chunks on the client side, using an HTTP request for _each_ chunk, then you could do the processing on a per chunk basis. – Marco Jul 14 '23 at 22:17
  • PHP runs already (that's the SAPI) while your script does not run yet. Speaking of file uploads, if you want to peek into their progress, please see https://www.php.net/manual/en/features.file-upload.post-method.php and the link towards the bottom labelled "Session Upload Progress" (and also read everything Marco already commented). Here is the link target: https://www.php.net/manual/en/session.upload-progress.php – hakre Jul 14 '23 at 23:03
  • Webserver buffering settings aside, PHP [the interpreter] runs to accept the request, including the upload. Your code [the script] does not run until the request has been received and processed in full. – Sammitch Jul 15 '23 at 00:20
  • Also what would you actually want the php script to do while the file is still uploading? What should it do with a partially uploaded file? Obviously we don't know exactly what your php is programmed to do, but I wonder if you've thought through the implications thoroughly – ADyson Jul 15 '23 at 07:36
  • If you are uploading a multi-gigabyte object, you should change your API to upload blocks of data via multiple requests. That will solve your problem, improve reliability (you can retry failed uploads), and support cloud services that have limitations on upload size. – John Hanley Jul 15 '23 at 20:07
  • What is the purpose of `php://input` when everything is read into memory anyway? I understand that PHP was not made for this kind of thing, but in this day and age where PHP is used for pretty much anything, I would expect that streaming a request body would be possible. – Simao Gomes Viana Jul 17 '23 at 05:58
  • One item you might want to learn about. There are several encoding formats when transferring data to a server. That means the data must transfer completely for some types. `multipart/form-data` is an example. Your post does not have the low-level details to even know if your PHP code could begin processing data early in the data stream. If you are transferring more than a few hundred megabytes, then you are not following best practices. Connections timeout, get reset, etc. in the real world. Then there are limitations that some proxies will impose, routers that hang, etc. Design accordingly. – John Hanley Jul 17 '23 at 06:51
  • @JohnHanley I have explicitly stated to leave the possibility of connection abortion aside. The script in question is doing an import of a big file. This big file contains data that was exported and is then read and processed by this script. The import is reentrant and can be aborted without serious consequences. Additionally, a backup will be made before the import runs, with appropriate locking to prevent changes in the meantime. – Simao Gomes Viana Jul 17 '23 at 08:56
  • As for encoding, the data is transferred using `Transfer-Encoding: chunked` which should be enough information to deduce what I'm talking about, especially after mentioning `php://input`. – Simao Gomes Viana Jul 17 '23 at 08:57
  • `Transfer-Encoding: chunked` is only part of the required information. The fact that data is chunked, and most MIME data is chunked, is not sufficient. In your answer, you require writing custom code to send chunks. Then you might as well follow the advice you have been given and do it correctly from the start. – John Hanley Jul 17 '23 at 14:05
  • The receiving end is PHP. All I need to happen is data to travel from a desktop application to a PHP server. The data in question can be multiple GB in size. The question is specifically about circumventing input buffering so that the PHP script can receive the data as a stream. What this data is and what the script does is not really relevant as the question only focuses on receiving a stream of data in PHP. That this isn't possible wasn't clear to me. Now that I know, I have opted to use a FIFO file instead. – Simao Gomes Viana Jul 17 '23 at 19:25
  • A requirement in my specific use case is that it must use HTTP, which is implied by the use of Apache 2, but I'm going to emphasize here that I can't open additional ports or use pure TCP/UDP. This PHP service is behind a reverse proxy and authentication is required to access the endpoint, which makes things more complicated, so I'd rather have the import file transmitted in one go. – Simao Gomes Viana Jul 17 '23 at 19:27

1 Answer


I'm going to answer my own question with the solution that I've come up with.

This will only work on POSIX systems (ones that support FIFOs/named pipes).

First, create a FIFO and start reading from it:

// The permission bits argument of posix_mkfifo() is required
posix_mkfifo("/path/to/fifo", 0600);

// Opening the FIFO for reading blocks until a writer connects
$file = fopen("/path/to/fifo", "rb");
while (!feof($file)) {
  $chunk = fread($file, 65536);
  if ($chunk === false) {
    break;
  }
  // process $chunk here
}
fclose($file);

// You might want to keep reopening the FIFO, depending on how you want to implement the write part

The script will start reading from the FIFO. Now all you need to do is write to this FIFO.

You could use a separate PHP script where you submit one chunk (e.g. 64 MB) per request and use regular fopen/fwrite calls to write to this FIFO.
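
For example, a minimal sketch of such a chunk-receiving endpoint (the FIFO path and chunk size are placeholders, and authentication/error handling are omitted):

// upload_chunk.php - receives one chunk per request and appends it to the FIFO
$fifo = fopen("/path/to/fifo", "wb"); // blocks until the reading script has the FIFO open
$in = fopen("php://input", "rb");
while (!feof($in)) {
  $data = fread($in, 65536);
  if ($data === false) {
    break;
  }
  fwrite($fifo, $data);
}
fclose($in);
// Closing the write end makes the reader see EOF, hence the note above about reopening the FIFO
fclose($fifo);

Note that each chunk request is itself still buffered by PHP before this script runs, but that is acceptable here because the individual chunks are small.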

Alternatively, you can also write a separate service in a different language (like Go, Rust, Kotlin, or whatever suits you) that writes into this FIFO as a continuous stream. (This is what I did.)

PHP does not let the script read the request body as a stream while the client is still uploading it; the entire request is buffered before the script starts executing. So the solution is to delegate the upload implementation to a different script or service and only handle reading the FIFO in the long-running request.


The code example above is just a very simple illustration of the idea. Of course, error checking, locking, etc. still need to be added.