2

I am searching for the string version in text read from a Unicode little-endian file.

With $text containing 'version (apostrophe intended), I get:

echo strpos($text, "r");          // Returns 7.
echo strpos($text, "version");    // Returns null.

I suspect that I need to convert either the needle or the haystack into the same format.

  • I had a look at mb_strpos but it doesn't do text searches in the same way as strpos.
  • I also considered changing my needle string to UTF-8 but haven't tried it yet. It seems a bit messy.

Any ideas?


Update after cmbuckley's answer.

$var = iconv('UTF-16LE', 'UTF-8', $fields[0]); 
// Returns Notice: iconv(): Detected an incomplete multibyte character in ...input string in 

So I checked the existing encoding and found:

echo mb_detect_encoding($fields[0], mb_detect_order(), false);  // Returns 'ASCII'.

This is confusing. If the string is ASCII why was I having trouble with the original strpos function?


Update 2

The hex encoding of 'version is 2700 5600 6500 7200 7300 6900 6f00 6e00.

What encoding is that?

Transistor
  • If there are multibyte characters in the text, then it's not ASCII encoding. Sounds like it can't detect the encoding correctly - perhaps the content is badly encoded? – cmbuckley Sep 11 '18 at 12:43

2 Answers

2

Even if you're using mb_strpos, you'd need to make sure $needle and $haystack are the same encoding anyway.
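As a minimal sketch of what "same encoding" means in practice, the needle can be converted to the haystack's encoding instead of the other way round. This assumes the mbstring extension is available and that the data really is UTF-16LE; the haystack here is rebuilt from the hex bytes quoted in the question (which spell 'Version with a capital V), and the variable names are only illustrative.

```php
<?php
// Haystack: the UTF-16LE bytes quoted in the question ('Version).
$haystack = hex2bin('2700560065007200730069006f006e00');

// Needle: written as UTF-8 in the source file, converted to UTF-16LE.
$needle = mb_convert_encoding('Version', 'UTF-16LE', 'UTF-8');

// Byte-oriented search: the result is a byte offset.
$bytePos = strpos($haystack, $needle);

// Character-oriented search: both strings are interpreted as UTF-16LE.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-16LE');

var_dump($bytePos); // int(2)
var_dump($charPos); // int(1)
```

With plain strpos on UTF-16 data, a match could in principle start at an odd byte offset and straddle two characters, so checking that `$bytePos % 2 === 0` (or using mb_strpos as above) is a sensible precaution.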

I'd suggest you use UTF-8 as much and as soon as possible, which means that I'd convert the UTF-16LE content to UTF-8 using iconv:

$text = file_get_contents('test.txt'); // contains 'version in UTF-16LE

var_dump(strpos($text, 'r'));          // 6
var_dump(strpos($text, 'version'));    // false

$text = iconv('UTF-16LE', 'UTF-8', $text);

var_dump(strpos($text, 'r'));          // 3
var_dump(strpos($text, 'version'));    // 1

Remember to do a strict !== false check (not null, as you mention in your post) as the file contents may start with the string version, in which case strpos would return 0.
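A small sketch of why the strict comparison matters; the sample string and variable names are hypothetical.

```php
<?php
// Hypothetical text that starts with the needle, so the match is at offset 0.
$text = 'version 1.2.3';

$pos = strpos($text, 'version');

// Loose comparison treats offset 0 as "not found".
$looseMiss = ($pos == false);

// Strict comparison distinguishes offset 0 from false.
if ($pos !== false) {
    echo "Found at offset $pos\n";
} else {
    echo "Not found\n";
}
```

Here `$pos` is 0, so `$looseMiss` is true even though the needle was found, while the strict `!== false` branch reports the match correctly.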

cmbuckley
  • 2
    Personally I would use `mb_convert_encoding` instead of `iconv`. See https://stackoverflow.com/questions/8233517/what-is-the-difference-between-iconv-and-mb-convert-encoding-in-php for why – Marco Sep 11 '18 at 10:32
  • @cm: I've added an update. Thanks for the prompt about `!== false`. It was in the back of my mind to check. – Transistor Sep 11 '18 at 11:02
  • @d3L of course, it depends on the situation - in that answer, for my own uses `iconv` still sounds like the better option, and would only use `mb_convert_encoding` if I needed full platform independence. Definitely worth linking though. – cmbuckley Sep 11 '18 at 12:40
  • @Transistor it sounds like you're not actually sure it's UTF-16LE, or if it is there's an issue with the encoding of it. I expect you need to spend some time with a hex editor to work out the exact encoding, or clean up malformed multibyte characters, before you do any string operations. – cmbuckley Sep 11 '18 at 12:42
  • Thanks, chaps. I've added the hex encoding to the question. – Transistor Sep 11 '18 at 15:17
0

I created a file with the hex contents you provided and managed to find a solution:

<?php

$text = file_get_contents(__DIR__.'/test');

$text = mb_convert_encoding($text, 'UTF-8', 'UTF-16LE');

var_dump(strpos($text, "r"));          // int(3)
var_dump(strpos($text, "Version"));    // int(1)

Contents of test (viewed in Hex Fiend):

[screenshot of the hex dump of the test file; image not available]

Version of PHP used: PHP 5.6.36
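One detail worth noting: in the hex from the question, the second code unit is 5600, which is an uppercase V, so the text is 'Version rather than 'version. If the case of the text is not known in advance, a case-insensitive search is one option. A minimal sketch, assuming mbstring is available and using the bytes from the question instead of a file:

```php
<?php
// The UTF-16LE bytes quoted in the question.
$raw = hex2bin('2700560065007200730069006f006e00');

// Convert to UTF-8 first, as in the answer above.
$text = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

// Case-sensitive search for the lowercase needle fails because of the capital V.
$caseSensitive = strpos($text, 'version');

// Case-insensitive search finds it right after the apostrophe.
$caseInsensitive = stripos($text, 'version');

var_dump($caseSensitive);   // bool(false)
var_dump($caseInsensitive); // int(1)
```

stripos only folds case for ASCII letters reliably; for non-ASCII text, mb_stripos with an explicit encoding is probably the safer choice.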

Marco