7

I was surprised that I was unable to find a straightforward answer to this question by searching.

I have a web application in PHP that takes user input. Due to the nature of the application, users may often use extended ASCII characters (a.k.a. "ALT codes").

My specific issue at the moment is with ALT code 26, which is a right arrow (→). This will be accompanied with other text to be stored in the same field (for example, 'this→that').

My column type is NVARCHAR.

Here's what I've tried:

  1. I've tried doing no conversions and just inserting the value as normal, but the value gets stored as thisâ??that.

  2. I've tried converting the value to UCS-2 in PHP using iconv('UTF-8', 'UCS-2', $value), but I get an error saying Unclosed quotation mark after the character string 't'.. The query ends up looking like this: UPDATE myTable SET myColumn = 'this�!that'.

  3. I've tried doing the above conversion and then adding an N before the quoted value, but I get the same error message. The query looks like this: UPDATE myTable SET myColumn = N'this�!that'.

  4. I've tried removing the UCS-2 conversion and just adding the N before the quoted value, and the query works again, but the value is stored as thisâ that.

  5. I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.

So can anyone answer the (seemingly simple) question of, how can I store this value in my database and then retrieve it as it was originally typed?

I'm using PHP 5.5 and MSSQL 2012. If any question of driver/OS version comes into play, it's a Linux server connecting via FreeTDS. There is no possibility of changing this.

Christian Gollhardt
  • 16,510
  • 17
  • 74
  • 111
Travesty3
  • 14,351
  • 6
  • 61
  • 98
  • 1
    Are you sure that the value stored in sql server is incorrect? SSMS does not always do a good job of displaying characters in the extended set. You might check the actual unicode value of that character. It sounds like perhaps there is something on the PHP side you have to do to make this valid for the extended character set. – Sean Lange Jan 30 '17 at 21:24
  • To answer this: "I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.", it's probably because there is no representation for that value in the ISO-8859-1 charset. utf8_decode converts strings into ISO-8859-1 charset. – georaldc Jan 30 '17 at 21:25
  • @SeanLange: When the value is retrieved from the database for display in the application, it displays as I mentioned (broken). – Travesty3 Jan 30 '17 at 21:25
  • I would look in the database and see what value is actually being stored. You can use the UNICODE function to get the code for the character that is problematic. – Sean Lange Jan 30 '17 at 21:26
  • @georaldc: Yeah, I know why it happens. I just mentioned that as one of the things I tried in order to prevent anyone from suggesting to try it. – Travesty3 Jan 30 '17 at 21:26
  • @SeanLange: The values I posted in the answer are the values that are stored in the database. It makes no difference to me how it is stored in the database, as long as it is retrieved and displayed correctly in the application. In any case, I'm using JetBrains DataGrip to pull the values directly from the database and that is how they look (and they look the same in my web application when it populates the existing values). – Travesty3 Jan 30 '17 at 21:28
  • Maybe the connection charset between php and mssql isn't set to utf-8? I read that the setting is case sensitive too: ```ini_set('mssql.charset', 'UTF-8');``` – georaldc Jan 30 '17 at 21:32
  • @georaldc: I tried adding that, same result. – Travesty3 Jan 30 '17 at 22:18

3 Answers3

5

You might try base64 encoding the input, this is fairly trivial to handle with PHP's base64_encode() and base64_decode() and it should handle what ever your users throw at it.

(edit: You can apparently also do the base64 encoding on the SQL Server side. This doesn't seem like something it should be responsible for imho, but it's an option.)

Community
  • 1
  • 1
John Jones
  • 2,027
  • 16
  • 25
  • Good idea! I just tested that and it appears to be working as expected. Kind of a bummer that I have to add those steps in before and after storing it, but this is the only thing that has worked for me so far. Thanks! FYI, it won't let me award the bounty for 16 hours. I guess others will have 16 hours to come up with a better solution! – Travesty3 Feb 03 '17 at 02:42
1

It seems like your freetds.conf is wrong. You need a TDS protocol version >= 7.0 to support unicode. See this for more details.

Edit your freetds.conf:

[global]
# TDS protocol version
tds version = 7.4
client charset = UTF-8

Also make sure to configure PHP correct:

ini_set('mssql.charset', 'UTF-8');
Christian Gollhardt
  • 16,510
  • 17
  • 74
  • 111
  • Well, the `tds version` _should_ be `7.4` as OP is accessing an SQL Server 2012 instance. The document you linked to says: `For best results, use the highest version of the protocol supported by your server.` – Lars Gyrup Brink Nielsen Feb 02 '17 at 19:33
  • @LayZee from the same link: *4.2 Still works with all products, subject to its limitations.* ASCI instead of UTF-8 is a limitation. Just using a new product does not assume you can not use an older protocol. – Christian Gollhardt Feb 02 '17 at 19:40
  • I agree. I just wanted to point out that 7.4 was possible and even recommended in the docs. – Lars Gyrup Brink Nielsen Feb 02 '17 at 19:41
  • My freetds.conf version specifies version 7.1. I did try doing `ini_set('mssql.charset', 'UTF-8')` per a suggestion in the question comments above. Unfortunately, these don't solve the issue for me. – Travesty3 Feb 03 '17 at 02:26
1

The accepted answer seems to do the job; yes you can encode it to base64 and then decode it back again, but then all the applications that use that remote database, should change and support the fields to be base64 encoded. My thought is that if there is a remote MS SQL Server database, there could be an other application (or applications) that may use it, so that application have to also be changed to support both plain and base64 encoding. And you'll have to also handle both plain text and base64 converted text.

I searched a little bit and I found how to send UNICODE text to the MS SQL Server using MS SQL commands and PHP to convert the UNICODE bytes to HEX numbers.

If you go at the PHP documentation for the mssql_fetch_array (http://php.net/manual/ru/function.mssql-fetch-array.php#80076), you'll see at the comments a pretty good solution that converts the text to UNICODE HEX values and then sends that HEX data directly to MS SQL Server like this:

Convert Unicode Text to HEX Data

// sending data to database 
$utf8 = 'Δοκιμή με unicode → Test with Unicode';  // some Greek text for example
$ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8); 

// converting UCS-2 string into "binary" hexadecimal form 
$arr = unpack('H*hex', $ucs2); 
$hex = "0x{$arr['hex']}"; 

// IMPORTANT! 
// please note that value must be passed without apostrophes 
// it should be "... values(0x0123456789ABCEF) ...", not "... values('0x0123456789ABCEF') ..." 
mssql_query("INSERT INTO mytable (myfield) VALUES ({$hex})", $link);

Now all the text actually is stored to the NVARCHAR database field correctly as UNICODE, and that's all you have to do in order to send and store it as plain text and not encoded.

To retrieve that text, you need to ask MS SQL Server to send back UNICODE encoded text like this:

Retrieving Unicode Text from MS SQL Server

// retrieving data from database 
// IMPORTANT! 
// please note that "varbinary" expects number of bytes 
// in this example it must be 200 (bytes), while size of field is 100 (UCS-2 chars) 

// myfield is of 50 length, so I set VARBINARY to 100
$result = mssql_query("SELECT CONVERT(VARBINARY(100), myfield) AS myfield FROM mytable", $link); 

while (($row = mssql_fetch_array($result, MSSQL_BOTH))) 
{ 
    // we get data in UCS-2 
    // I use UTF-8 in my project, so I encode it back 
    echo '1. '.iconv('UCS-2LE', 'UTF-8', $row['myfield'])).PHP_EOL; 
    // or you can even use mb_convert_encoding to convert from UCS-2LE to UTF-8
    echo '2. '.mb_convert_encoding($row['myfield'], 'UTF-8', 'UCS-2LE').PHP_EOL;
} 

The MS SQL Table with the UNICODE Data after the INSERT

MS SQL Table

The output result using a PHP page to display the values

PHP Output

I'm not sure if you can reach my test page here, but you can try to see the live results: http://dbg.deve.wiznet.gr/php56/mssql/test1.php

Christos Lytras
  • 36,310
  • 4
  • 80
  • 113