I'm trying to parse UTF-8 encoded text files uploaded via a multipart/form-data form. I have built a small .txt file containing some tab-delimited (meaningless) text in both Latin and Japanese characters (I copied and pasted the Japanese characters from a Japanese retail site).

All I am trying to do at this point is replace newlines with (LINE) and tabs with (TAB). Here is my code:

...
$text=file_get_contents($_FILES['upload']['tmp_name']);

$LineArray=array('\r\n','\n\r','\r','\n');
foreach ($LineArray as $value){
  $pieces=(mb_split($value,$text));
  $text=implode ("(LINE)",$pieces);
}
echo "Here is the modified text:<br/>";
echo $text;
echo "<br/>";
var_dump($text);

$tab='\t';
$pieces=(mb_split($tab,$text));
$text=implode ("(TAB)",$pieces);
echo "Here is the modified text:<br/>";
echo $text;
echo "<br/>";
var_dump($text);
...

Here is a var_dump of the text before modification:

string 'John    Fitzgerald  Kennedy

Winston     Churchill

John    Edgar   Hoover

素材の 生地を柿渋で染 めた和柄パンツです





火车票 火车票 火车票 火车票



' (length=175)

The first line of Asian characters has 2 tabs, the last line of the file has 3 tabs.

Here is a var_dump of the text after all modifications:

string 'John(TAB)Fitzgerald(TAB)Kennedy(LINE)Winston(TAB)(TAB)Churchill(LINE)John(TAB)Edgar(TAB)Hoover(LINE)素材の 生地を柿渋で染(TAB)めた和柄パンツです(LINE)(LINE)(LINE)火车票  火车票 火车票 火车票(LINE)(LINE)' (length=235)

How come my code can only identify one of the tabs in the Japanese text part?

JDelage

1 Answer

mb_split uses the value of mb_regex_encoding to determine which encoding to process the string in. This value is probably not set to UTF-8, so mb_split is not reading your string in the correct encoding. Try setting mb_regex_encoding to UTF-8 before calling mb_split.
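
A minimal sketch of that suggestion, assuming the uploaded file really is valid UTF-8. The sample string in $text is a made-up stand-in for the file_get_contents() call in the question, and combining the line-ending patterns into one alternation is my own shortcut rather than something the question's code does:

```php
<?php
// Tell the mbstring regex functions (mb_split, mb_ereg, ...) to treat input as UTF-8.
mb_regex_encoding('UTF-8');

// Hypothetical sample standing in for file_get_contents($_FILES['upload']['tmp_name']).
$text = "John\tFitzgerald\tKennedy\r\n素材の\t生地を柿渋で染\tめた和柄パンツです";

// Replace every kind of line ending first, then every tab.
// The patterns are single-quoted, so the regex engine (not PHP) interprets \r, \n and \t.
$text = implode('(LINE)', mb_split('\r\n|\n\r|\r|\n', $text));
$text = implode('(TAB)', mb_split('\t', $text));

echo $text, PHP_EOL;
// prints: John(TAB)Fitzgerald(TAB)Kennedy(LINE)素材の(TAB)生地を柿渋で染(TAB)めた和柄パンツです
```

The call that matters for mb_split here is mb_regex_encoding; it should run before the first mb_split call on the uploaded text.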

deceze
  • Ding, ding, ding... That works. Thank you. I thought all I had to do to ensure I was UTF-8 throughout had been done. I need to get more info on `mb_internal_encoding` and `mb_regex_encoding`. – JDelage Feb 23 '12 at 01:48