I have written code to convert an input UCS-2LE file to plain 8-bit ISO-8859-1 text. After converting it, I split the text into words using the strtok function. When I then apply strlen to each word, I get strange word lengths that I cannot explain.

<?php
$fileData = file('input.txt');

foreach( $fileData as $txt ){

    $txt = iconv( 'ISO-8859-1', 'UCS-2LE', $txt );
    $tok = strtok($txt, " \n\t");
    while ($tok !== false) {
        echo 'Word = '.$tok.', Length = '.strlen($tok).'<br />';
        $tok = strtok(" \n\t");
    }
}
?>

The input file, input.txt (encoded in UCS-2LE), is

 Slot#  NumJobs ActiveJobID ActiveBatchJob  ActiveProcStartTime
 0  0   1   input3.dat  7:20 PM
 1  0   2   input3.dat  7:20 PM

The output is

Word = ÿþSlot#, Length = 24
Word = NumJobs, Length = 31
Word = ActiveJobID, Length = 47
Word = ActiveBatchJob, Length = 59
Word = ActiveProcStartTime , Length = 83
Word = , Length = 1
Word = 0, Length = 6
Word = 0, Length = 7
Word = 1, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = 1, Length = 6
Word = 0, Length = 7
Word = 2, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = , Length = 2

1) Why is the length not shown correctly?

2) Line 6 of the output is a newline character that strtok has not tokenized properly. Why?

3) I read a little about the BOM and learned that the first two bytes of the file identify the character format. Is there any way to skip these bytes? The first line of the output shows two extra characters.

Thanks in advance for the help.

Prasad
  • Perhaps you should consider mb_strlen() (http://www.php.net/manual/en/function.mb-strlen.php) as an alternative for multibyte character sets such as UTF-8, that's what the `mb` in `mb_strlen` stands for – Mark Baker Feb 27 '13 at 14:15
  • Your character set conversion is going the wrong way. Don't you want `ISO-8859-1` **out**? If so, it should be the second parameter to `iconv`, not the first. Also, consider using `fgetcsv` instead of tokenizing the file yourself. `fgetcsv` allows you to specify the delimiter (so you can use a `\t` instead of the default `,`) – Colin M Feb 27 '13 at 14:25
  • [`iconv`](http://www.php.net/manual/en/ref.iconv.php) has a number of string-related functions, e.g. instead of using `strlen` use `iconv_strlen($str, $charset);`. – Gerard Roche Feb 27 '13 at 16:00
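
Putting the comments together, here is a minimal sketch of what the corrected flow might look like. It assumes the file starts with the UCS-2LE BOM (bytes FF FE, which show up as "ÿþ" in the output above) and that every character fits in ISO-8859-1. The helper name `ucs2leWords` and the in-memory sample string are made up for illustration; with the real file you would probably read it in one piece with `file_get_contents('input.txt')`, since `file()` splits on the single byte 0x0A and so cuts each two-byte UCS-2LE newline in half.

```php
<?php
// Hypothetical helper (not from the question): converts a UCS-2LE string
// to ISO-8859-1 and splits it into whitespace-separated words.
function ucs2leWords(string $raw): array
{
    // Skip the UCS-2LE byte order mark (bytes FF FE) if it is present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }

    // iconv() takes the source charset first and the target charset second.
    $text = iconv('UCS-2LE', 'ISO-8859-1', $raw);
    if ($text === false) {
        return []; // conversion failed, e.g. a character outside ISO-8859-1
    }

    // After conversion each character is one byte, so strtok() and strlen()
    // work per character. "\r" is included in case of Windows line endings.
    $words = [];
    $tok = strtok($text, " \t\r\n");
    while ($tok !== false) {
        $words[] = $tok;
        $tok = strtok(" \t\r\n");
    }
    return $words;
}

// Assumed sample data standing in for the contents of input.txt:
// a BOM followed by UCS-2LE encoded text.
$sample = "\xFF\xFE" . iconv('ISO-8859-1', 'UCS-2LE', "Slot#\tNumJobs\r\n0\t0\r\n");

foreach (ucs2leWords($sample) as $word) {
    echo 'Word = ' . $word . ', Length = ' . strlen($word) . "\n";
}
```

With the arguments the other way round, as in the question, `iconv` treats every byte of the already two-byte text as a separate ISO-8859-1 character and widens it again, which would explain why the reported lengths are so much larger than the visible words.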
