I have a PHP application with a MySQL database that "should" contain UTF-8 encoded data. With regard to Unicode characters, my application appears to work properly from beginning to end. If someone submits "Strömgren" into my database (via an HTML form), I see "Strömgren" when I get the data back out, etc.
My database tables are all UTF8, and my HTML pages and forms are all charset=utf-8.
I recently noticed that in one portion of my application my Unicode characters appeared to be double-encoded. When I displayed what should be Strömgren, I saw StrÃ¶mgren instead. In bytes, that is Str\xc3\xb6mgren (expected) vs. Str\xc3\x83\xc2\xb6mgren (what I got). If I utf8_decode the bad string, it looks correct again.
I am assuming that this is "double-encoding."
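A minimal sketch of what those two byte sequences suggest, assuming the mbstring extension is available. It uses mb_convert_encoding rather than utf8_decode, because utf8_decode is deprecated as of PHP 8.2. The variable names are only for illustration.

```php
<?php
// "Strömgren" as correct UTF-8: the ö is the two bytes C3 B6.
$correct = "Str\xc3\xb6mgren";

// Treat each byte as a Latin-1 character and encode it as UTF-8 again.
// This is the double-encoding step: C3 becomes C3 83, B6 becomes C2 B6.
$double = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), "\n"; // 537472c3b66d6772656e
echo bin2hex($double), "\n";  // 537472c383c2b66d6772656e

// Removing one layer (the job utf8_decode was doing) restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $correct); // bool(true)
```

The second hex string matches the Str\xc3\x83\xc2\xb6mgren bytes above, which is consistent with one extra Latin-1 to UTF-8 conversion having been applied somewhere.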
I discovered that the portion of the application that was displaying the double-encoded data was using different code to make its database connection, and that code was making this call:
$db->set_charset("utf8")
I had intended to do that for ALL of my database connections, but somehow ended up doing it in only one place. So almost all of my application uses connections without the set_charset call, and Strömgren always looks right there. The lone piece of code that does call set_charset("utf8") (and which only ever reads from the db, never writes to it) displays it incorrectly.
I am not certain what to make of this, but my suspicion is that the data in my database is not really stored as UTF-8. Maybe when I send it Strömgren (without having called set_charset("utf8")), it thinks it is receiving latin1 (or whatever), and when I read that back out I am getting latin1. Since my HTML pages have "charset=utf-8", the result is displayed as Strömgren, when really the database thinks it is sending me StrÃ¶mgren. (I am probably not saying that either correctly OR clearly, but I hope it can be understood.)
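The suspicion above can be simulated without a database. This is only a sketch under stated assumptions: the connection without set_charset is assumed to default to latin1, the column is assumed to be utf8, and ISO-8859-1 stands in for MySQL's latin1 (the two agree for ö). The helper functions storeInUtf8Column and readFromUtf8Column are hypothetical and merely mimic the conversion the server would do between connection charset and column charset.

```php
<?php
// Hypothetical helper: on write, the server converts from the
// connection charset to the column charset (assumed utf8).
function storeInUtf8Column(string $clientBytes, string $connCharset): string
{
    return mb_convert_encoding($clientBytes, 'UTF-8', $connCharset);
}

// Hypothetical helper: on read, the server converts from the
// column charset (assumed utf8) to the connection charset.
function readFromUtf8Column(string $storedBytes, string $connCharset): string
{
    return mb_convert_encoding($storedBytes, $connCharset, 'UTF-8');
}

$formInput = "Str\xc3\xb6mgren"; // UTF-8 bytes submitted by the HTML form

// Written through a connection the server believes is latin1.
$stored = storeInUtf8Column($formInput, 'ISO-8859-1');

// Read back through the same kind of connection, then through a utf8 one.
$viaLatin1 = readFromUtf8Column($stored, 'ISO-8859-1');
$viaUtf8   = readFromUtf8Column($stored, 'UTF-8');

echo bin2hex($stored), "\n";    // 537472c383c2b66d6772656e (double-encoded at rest)
echo bin2hex($viaLatin1), "\n"; // 537472c3b66d6772656e (looks right on a utf-8 page)
echo bin2hex($viaUtf8), "\n";   // 537472c383c2b66d6772656e (shows as StrÃ¶mgren)
```

If this model matches the real setup, the latin1 connections hide the problem because the write and read conversions cancel out, and the set_charset("utf8") connection is the only one showing what is actually stored.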
I have two questions:
First, does any of my thinking here make sense, or am I completely off base?
Second, what is the best way for me to determine whether the data in my database is mis-encoded (i.e. does the database actually contain Strömgren or StrÃ¶mgren)?
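One possible way to check, sketched under assumptions: MySQL's HEX() function returns the stored bytes of a column regardless of the connection charset, so a correctly stored ö should show up as C3B6 and a double-encoded one as C383C2B6. The table name people and the columns id and name are hypothetical, and looksDoubleEncoded is only a heuristic (a value that legitimately contains the text "Ã¶" would be flagged too).

```php
<?php
// Heuristic: a value is probably double-encoded if removing one UTF-8 layer
// is lossless, changes the bytes, and still leaves valid UTF-8.
function looksDoubleEncoded(string $bytes): bool
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        return false; // not valid UTF-8 at all
    }
    $once = mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');
    // Reject lossy conversions (characters outside Latin-1 become "?").
    if (mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1') !== $bytes) {
        return false;
    }
    return $once !== $bytes && mb_check_encoding($once, 'UTF-8');
}

// Hypothetical table and column names; $db is an existing mysqli connection.
function findSuspectRows(mysqli $db): array
{
    $db->set_charset('utf8'); // read the column without a latin1 conversion
    $suspects = [];
    $result = $db->query('SELECT id, name, HEX(name) AS name_hex FROM people');
    while ($row = $result->fetch_assoc()) {
        if (looksDoubleEncoded($row['name'])) {
            $suspects[$row['id']] = $row['name_hex']; // keep the raw bytes for inspection
        }
    }
    return $suspects;
}

var_dump(looksDoubleEncoded("Str\xc3\xb6mgren"));         // bool(false)
var_dump(looksDoubleEncoded("Str\xc3\x83\xc2\xb6mgren")); // bool(true)
```

Running SELECT HEX(name) by hand on a row known to contain ö would answer the same question for a single value without any PHP.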