3

I have been working on a strange PHP problem the last few days where the feof() function is returning true before the end of a file. Below is a skeleton of my code:

$this->fh = fopen("bigfile.txt", "r");    

while(!feof($this->fh))
{
    $dataString = fgets($this->fh);

    if($dataString === false && !feof($this->fh))
    {
        echo "Error reading file besides EOF";
    }
    elseif($dataString === false && feof($this->fh))
    {
        echo "We are at the end of the file.\n";

        //check status of the stream
        $meta = stream_get_meta_data($this->fh);
        var_dump($meta);
    }
    else
    {
        //else all is good, process line read in 
    }
}

Through lots of testing I have found that the program works fine on everything except one file:

  • The file is stored on the local drive.
  • This file is around 8 million lines long averaging somewhere around 200-500 characters per line.
  • It has already been cleaned and under close examination with a hex editor, no abnormal characters have been found.
  • The program consistently fails on line 7172714 when it believes it has reached the end of the file (even though it has ~800K lines left).
  • I have tested the program on files that had fewer characters per line but were between 20-30 million lines with no problems.
  • I tried running the code from a comment on http://php.net/manual/en/function.fgets.php just to see if it was something in my code that was causing the issue and the 3rd party code failed on the same line. EDIT: also worth mentioning is that the 3rd party code used fread() instead of fgets().
  • I tried specifying several buffer sizes in the fgets function and none of them made any difference.

The output from the var_dump($meta) is as follows:

 array(9) {
  ["wrapper_type"]=>
  string(9) "plainfile"
  ["stream_type"]=>
  string(5) "STDIO"
  ["mode"]=>
  string(1) "r"
  ["unread_bytes"]=>
  int(0)
  ["seekable"]=>
  bool(true)
  ["uri"]=>
  string(65) "full path of file being read"
  ["timed_out"]=>
  bool(false)
  ["blocked"]=>
  bool(true)
  ["eof"]=>
  bool(true)
}

In attempting to find out what is causing feof to return true before the end of the file I have to guess that either:

A) Something is causing the fopen stream to fail and then nothing is able to be read in (causing feof to return true)

B) There is some buffer somewhere that is filling up and causing havoc

C) The PHP gods are angry

I have searched far and wide to see if anyone else was having this issue and cannot find any instances except in C++ where the file was being read in via text mode instead of binary mode and was causing the issue.

UPDATE: I had my script constantly output the number of times the read function had iterated and the unique ID of the user associated with the entry it found beside it. The script is still failing after line 7172713 out of 7175502, but the unique ID of the last user in the file is showing up on line 7172713. It seems that the problem is for some reason lines are being skipped and are not read. All line breaks are present.

user2395126
  • Is it possible that php ran out of memory reading the file? – Get Off My Lawn Jan 14 '15 at 05:16
  • Forgot to mention, the read function is called for blocks of lines. It reads 500 lines, does some processing, returns a value, and stores its last location in a class-wide variable. The next time it is called, it reads the next 500 lines starting where it left off using the class-wide variable. Everything is properly dealt with using unset, and while monitoring server memory usage I have not noticed anything abnormal. Because that was too complicated to keep testing, I wrote this code and simply unset the line read in on a successful line read. Still seeing the same problem. – user2395126 Jan 14 '15 at 05:19
  • have you tried using `rb` = **Read Binary** instead of just `r`? – Get Off My Lawn Jan 14 '15 at 05:23
  • Didn't know you could do that in PHP since it wasn't on the list of options in the fopen docs. I am going to try it now and will let you know if that works! – user2395126 Jan 14 '15 at 05:25
  • yeah, for some reason it isn't really documented, but it is valid and is used in some of the php.net examples – Get Off My Lawn Jan 14 '15 at 05:26
  • Just saw those examples. I also just started the test and it appears that PHP will let me use that option. I will know in a few minutes if it worked once the script is done executing. – user2395126 Jan 14 '15 at 05:32
  • No luck, it just failed at the same line. But thanks for the binary read tip, I am going to continue to use that as it is much better practice. – user2395126 Jan 14 '15 at 05:41
  • That is too bad, I otherwise don't know. Someone posted code on my website for downloading large files, I haven't read it, but on line `72` there looks like there might be some code that could help you out. http://phpsnips.com/snip-579#.VLYBwc3d9hE – Get Off My Lawn Jan 14 '15 at 05:45

3 Answers

4

You must split your file or raise the limits in php.ini:

upload_max_filesize = 2M 
;or whatever size you want

max_execution_time = 60 ; also, higher if you must

This is because feof() "Returns TRUE if the file pointer is at EOF or an error occurs (including socket timeout); otherwise returns FALSE." See: http://php.net/manual/en/function.feof.php

  • Timeout is set for 72 hours and upload_max_filesize is set to 50G. Also worth mentioning memory limit is set to 2048 MB. – user2395126 Jan 14 '15 at 05:27
  • Maybe your file was closed for security reasons by an antivirus or firewall –  Jan 14 '15 at 05:43
  • I thought about that, disabled everything and no luck. Ran the script with root privileges to see if that would help as well and also no luck. – user2395126 Jan 14 '15 at 05:45
  • Are error_reporting and display_errors turned on (to see errors)? –  Jan 14 '15 at 05:58
  • Absolutely! Just have to remove the text content since it is for user data, you will have to trust me it is only alphanumeric in between the quotes. ["REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED" – user2395126 Jan 14 '15 at 06:06
  • ,"REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","","","","","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED","REDACTED"] – user2395126 Jan 14 '15 at 06:07
  • Strict error reporting and display_error are turned on and so far nothing has shown up. – user2395126 Jan 14 '15 at 06:08
  • display all errors and warnings by E_ALL and try again –  Jan 14 '15 at 06:18
2

fgets() is seemingly randomly reading in some lines that do have content as empty. The script actually makes it to the end of the file even though my test that showed the line numbers being read was behind due to the way I did the error checking (and the way the error checking was written in the 3rd party code). Now the real question is what is causing fgets() and fread() to think that a line is empty even though it is not. I will ask that as a separate question as that is a change in topic. Thank you all for your help!

Also, just so no one is left hanging: the 3rd party code did not work because it relied on every line having at least a line break. With fgets and fread returning an empty string, the script has no way of knowing the line ever existed, so it keeps trying to execute past the end of the file. Below is the slightly modified 3rd party script, which I still consider excellent based on its execution speed.

The original script can be found in the comments here: http://php.net/manual/en/function.fgets.php and I take absolutely no credit for it.

<?php

//File to be opened
$file = "/path/to/file.ext";
//Open file (DON'T USE a+ pointer will be wrong!)
$fp = fopen($file, 'r');
//Read 16meg chunks
$read = 16777216;
//\n Marker
$part = 0;

while(!feof($fp))
{
    $rbuf = fread($fp, $read);
    for($i=$read;$i > 0 || $n == chr(10);$i--)
    {
        $n=substr($rbuf, $i, 1);
        if($n == chr(10))break;
        //If we are at the end of the file, just grab the rest and stop loop
        elseif(feof($fp))
        {
            $i = $read;
            $buf = substr($rbuf, 0, $i+1);
            echo "<EOF>\n";
            break;
        }
    }
    //This is the buffer we want to do stuff with, maybe pass it to a function?
    $buf = substr($rbuf, 0, $i+1);

    //output the chunk we just read and mark where it stopped with <break>
    echo $buf . "\n<break>\n";

    //Point marker back to last \n point
    $part = ftell($fp)-($read-($i+1));
    fseek($fp, $part);
}
fclose($fp);

?>
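
For comparison, here is a minimal sketch of the same chunked-read idea that avoids seeking backwards: it keeps the unfinished tail of each chunk and prepends it to the next one. The function name readLinesInChunks and its parameters are hypothetical and not part of the original script, and it assumes lines end in "\n" (a trailing "\r" from CRLF files is left on the line).

```php
<?php
// Hypothetical helper (not from the original script): reads a file in
// fixed-size chunks and passes each complete line to $onLine.
// Returns the number of lines seen, including empty ones.
function readLinesInChunks(string $path, int $chunkSize, callable $onLine): int
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $carry = '';  // unfinished line left over from the previous chunk
    $count = 0;

    while (true) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error, or nothing left to read
        }

        // Split on line feeds; the last piece may be an unfinished line,
        // so hold it back until the next chunk arrives.
        $lines = explode("\n", $carry . $chunk);
        $carry = array_pop($lines);

        foreach ($lines as $line) {
            $onLine($line); // empty lines are passed through as ''
            $count++;
        }
    }

    // The final line may have no trailing line break.
    if ($carry !== '') {
        $onLine($carry);
        $count++;
    }

    fclose($fp);
    return $count;
}
```

Because empty lines reach the callback as empty strings, a caller can count or reject them instead of silently losing track of the line number.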

UPDATE: After hours more of searching, analyzing, hair pulling, etc., it seems that the culprit was an uncaught bad character, in this case a ½ character (hex value BD). While generating the file that I was reading from, the script used stream_get_line() to read each line in from its original source. It was then supposed to remove all bad characters (it appears that my regex was not up to par), use str_getcsv() to convert the content to an array, do some processing, and write to a new file (the one I was trying to read). Somewhere in this process, probably in str_getcsv(), the ½ character caused the whole thing to insert a blank line instead of the data. Several thousand of these were placed throughout the file (wherever the ½ symbol appeared). This made the file appear to be the correct length, but the EOF was reached too early when counting input based on a known number of lines. I want to thank everyone who helped me with this problem, and I am very sorry that the real cause had nothing to do with my question. However, if it hadn't been for everyone's suggestions and questions, I would not have looked in the right places.
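
A quick way to look for this kind of damage is to scan the file for blank lines and for bytes outside the ASCII range. The sketch below is only an illustration: the function name findSuspectLines is hypothetical, and it assumes the data is supposed to be plain ASCII, so a byte such as 0xBD counts as suspect.

```php
<?php
// Hypothetical helper: reports the line numbers of blank lines and of
// lines containing bytes outside the ASCII range (such as 0xBD).
// Assumes the file is expected to contain plain ASCII text only.
function findSuspectLines(string $path): array
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $blank = [];
    $nonAscii = [];
    $lineNo = 0;

    // Let fgets() decide when the file ends.
    while (($line = fgets($fh)) !== false) {
        $lineNo++;
        $content = rtrim($line, "\r\n");

        if ($content === '') {
            $blank[] = $lineNo;
        } elseif (preg_match('/[\x80-\xFF]/', $content) === 1) {
            // Without the /u modifier the pattern is matched byte by byte.
            $nonAscii[] = $lineNo;
        }
    }

    fclose($fh);
    return ['blank' => $blank, 'nonAscii' => $nonAscii];
}
```

Running it over both the original source and the generated file would show whether the blank lines in one line up with the non-ASCII lines in the other.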

Lesson learned from this experience - when EOF is reached too quickly the best place to look is for instances of double line breaks. When writing a script that reads from a formatted file a good practice is to check for these. Below is my original code modified to do just that:

$this->fh = fopen("bigfile.txt", "r");    

while(!feof($this->fh))
{
    $dataString = fgets($this->fh);

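    //Strict comparison, so that a false return from fgets() at EOF is not mistaken for an empty line
    if($dataString === "\n" || $dataString === "\r\n" || $dataString === "")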
    {
        throw new Exception("Empty line found.");
    }

    if($dataString === false && !feof($this->fh))
    {
        echo "Error reading file besides EOF";
    }
    elseif($dataString === false && feof($this->fh))
    {
        echo "We are at the end of the file.\n";

        //check status of the stream
        $meta = stream_get_meta_data($this->fh);
        var_dump($meta);
    }
    else
    {
        //else all is good, process line read in 
    }
}
user2395126
0

Much time has passed, but it will be useful for others.

Regarding the 1st question, my guess is that the read is being cut off by the memory limit: 8M lines × ~200-500 bytes per line ≈ 1600-4000 MB, while your memory limit is 2048 MB. That would put the interruption somewhere between 6M and 8M lines, or around 7M.

About blank lines.

    $str = "hello\r\n";
    echo $str.false; // equivalent to $str.'';

Perhaps fgets returned false and that result was appended after a line break, which is written out as nothing. This may explain why the empty line appears.

Another reason

test.txt

1
2
3
4
5

In the examples, I will indicate the iterations statically, by directly specifying the code, for clarity

    <?php
        $res=fopen(__DIR__."/test.txt", "r");
        var_dump('1=>',fread($res,2),feof($res)); //we read 2 bytes each since there is a line feed byte
        var_dump('2=>',fread($res,2),feof($res));
        var_dump('3=>',fread($res,2),feof($res));
        var_dump('4=>',fread($res,2),feof($res));
        var_dump('5=>',fread($res,1),feof($res)); //We read one byte since there is no line feed
        var_dump('6=>',fread($res,1),feof($res)); //Nothing is left to read: returns an empty string and sets eof

Result

string(3) "1=>"
string(2) "1
"
bool(false)
string(3) "2=>"
string(2) "2
"
bool(false)
string(3) "3=>"
string(2) "3
"
bool(false)
string(3) "4=>"
string(2) "4
"
bool(false)
string(3) "5=>"
string(1) "5"
bool(false)
string(3) "6=>"
string(0) ""
bool(true)

We see that the 5th line was read, but feof($res) === false after it, so there will be one more iteration. That next iteration (line 6) returns an empty string, and only then does feof return true.

    <?php
       $filesize=filesize(__DIR__."/test.txt");
       $res=fopen(__DIR__."/test.txt", "r");
       echo "----\n";
       var_dump(fread($res,$filesize),feof($res)); //reads all 9 bytes, but eof is not set yet
       echo "----\n";
       var_dump(fread($res,$filesize),feof($res)); //nothing left to read, so eof is set

Result

----
string(9) "1
2
3
4
5"
bool(false)
----
string(0) ""
bool(true)

The examples show that there is one extra iteration, because at the moment when all the bytes of the file have been read, feof does not yet report the end of the file.

How can this be fixed?

    <?php
       $filesize=filesize(__DIR__."/test.txt")+1;
       $res=fopen(__DIR__."/test.txt", "r");
       var_dump('0=>',fread($res,$filesize),feof($res));

Notice that I added one to the file size value, so the read asks for one byte more than the file contains and hits the end of the data.

For myself, I call EOF "conditional end file byte".

By itself, 'feof' does not compute anything. It depends on the stream's metadata, which is updated by the read functions (fread, fgetc, fgets, and others). A read function checks whether the data ends within the requested length. If so, it sets the eof flag to true. If it has not met the end of the data within $length, eof stays false. This behavior is necessary because other processes can append data while the file is being read ($mode = 'a+'), so feof cannot reliably calculate the end of the file on its own. Only the read function can determine whether it has reached the end of the file.
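
Since the read function is what discovers the end of the data, one common pattern is to let its return value drive the loop and to consult feof only afterwards, to tell a normal end of file from a read error. This is a minimal sketch; the function name countLines is hypothetical.

```php
<?php
// Hypothetical helper: counts lines by letting fgets() signal the end
// of the file, instead of checking feof() before each read.
function countLines(string $path): int
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $lines = 0;
    // fgets() returns false when there is nothing more to read,
    // so the loop body never sees the extra empty iteration.
    while (($line = fgets($fh)) !== false) {
        $lines++;
    }

    // After the loop, feof() distinguishes a normal end of file
    // from a read error.
    $reachedEof = feof($fh);
    fclose($fh);

    if (!$reachedEof) {
        throw new RuntimeException('fgets() failed before the end of the file');
    }
    return $lines;
}
```

With the five-line test.txt used above, the loop body runs five times whether or not the last line ends with a line break.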

Calculating the length of the last data block for fread

briefly

    <?php
        $filesize=filesize(__DIR__."/test.txt");
        $down_size=0;
        $length=8192;
        $data=[];
        $res=fopen(__DIR__."/test.txt", "r");
        $buf='';
        while(!feof($res)){
            if(($down_size+$length)===$filesize){$length++;}
            $buf=fread($res,$length);
            $down_size+=strlen($buf);
        }
AlexeyP0708