I have written code to convert a UCS-2LE input file to plain 8-bit ISO-8859-1 text. After the conversion, I split the text into words using the strtok function and call strlen on each word, but the lengths I get are strange and I cannot make sense of them.
<?php
$fileData = file('input.txt');
foreach ($fileData as $txt) {
    $txt = iconv('ISO-8859-1', 'UCS-2LE', $txt);
    $tok = strtok($txt, " \n\t");
    while ($tok !== false) {
        echo 'Word = '.$tok.', Length = '.strlen($tok).'<br />';
        $tok = strtok(" \n\t");
    }
}
?>
The input file, input.txt (encoded in UCS-2LE), is:
Slot# NumJobs ActiveJobID ActiveBatchJob ActiveProcStartTime
0 0 1 input3.dat 7:20 PM
1 0 2 input3.dat 7:20 PM
The output is:
Word = ÿþSlot#, Length = 24
Word = NumJobs, Length = 31
Word = ActiveJobID, Length = 47
Word = ActiveBatchJob, Length = 59
Word = ActiveProcStartTime , Length = 83
Word = , Length = 1
Word = 0, Length = 6
Word = 0, Length = 7
Word = 1, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = 1, Length = 6
Word = 0, Length = 7
Word = 2, Length = 7
Word = input3.dat, Length = 43
Word = 7:20, Length = 19
Word = PM , Length = 15
Word = , Length = 1
Word = , Length = 2
1) Why are the lengths not reported correctly?
2) Line 6 of the output is a newline character that strtok has not tokenized properly. Why?
3) I read a little about the BOM and learned that the first two bytes of the file identify the encoding used. Is there any way to skip these bytes? The first line of the output shows two extra characters.
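For reference, here is a minimal sketch of the behaviour I am after. It rests on several assumptions that are not confirmed above: that iconv takes the source encoding first (so the call should go from 'UCS-2LE' to 'ISO-8859-1'), that the data should be converted as a whole before being split (splitting raw UCS-2LE on the "\n" byte would cut a two-byte character in half), that the BOM is the two bytes FF FE, and that lines may end in "\r\n". The function name ucs2leWords is hypothetical, and the sample data is built in memory instead of being read from input.txt.

```php
<?php
// Hypothetical helper: decode a UCS-2LE byte string and split it into words.
function ucs2leWords(string $raw): array
{
    // Assumption: a little-endian BOM is the two bytes FF FE; drop it if present.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $raw = substr($raw, 2);
    }

    // Convert FROM UCS-2LE TO ISO-8859-1 (source encoding is the first argument).
    $text = iconv('UCS-2LE', 'ISO-8859-1', $raw);
    if ($text === false) {
        return []; // conversion failed, e.g. a character has no ISO-8859-1 form
    }

    // Split on runs of space, tab, CR and LF, discarding empty pieces.
    return preg_split('/[ \t\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Sample data: BOM followed by two UCS-2LE encoded lines.
$sample = "\xFF\xFE" . iconv(
    'ISO-8859-1',
    'UCS-2LE',
    "Slot# NumJobs\r\n0 0 1 input3.dat 7:20 PM\r\n"
);

foreach (ucs2leWords($sample) as $word) {
    echo 'Word = ' . $word . ', Length = ' . strlen($word) . "\n";
}
```

With a real file, the raw bytes would presumably come from file_get_contents('input.txt') rather than file(), so that the data is not split before it is decoded.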
Thanks in advance for the help.