Monthly Archives: March 2012

HTML forms, PHP, MySQL and UTF-8 – solving the unnerving MS Word special characters problem

Ever had the problem that you paste copy from a MS Word document into a webform, hit the save or submit button and the outcome somehow looks like this…

�The quick brown fox � and so on and so forth�

…when it’s supposed to look like this…

“The quick brown fox – and so on and so forth”

…? Yes, I am specifically talking about the fancy quitation makrs and the elongated dash or hyphen that MS word automatically uses.

Well, welcome to the crazy world of character encoding. Just search the web and you’ll come across gazillion of blog posts and forum entries. All have a set of suggestions and it seems to work for some people while it doesn’t or others. Obviously I had the problem as well, otherwise I wouldn’t write about it now. Now how did I finally manage to solve the issue?

1. Set the correct character encoding on the page header

That was one of the suggestion I read most often, simply because it is the most important rule to follow. The html page that is supposed to display text with non-standard characters needs to be told that it should use the UTF-8 unicode character encoding.

 

Was my problem solved? Nope. Not yet.

2. Set the correct character encoding for html forms

The data that I was trying to display was actually extracted from a MySQL database which itself was populated through an HTML form. When I checked the database entries I realised that the data in there was already screwed up. So I suspected the HTML form to submit the data in an incorrect character encoding. So next step was to set the correct character encoding for the html forms submitting data to MySQL:

 

Note the bit at the end of the opening form tag, i.e. accept-charset=”utf-8″.

Was my problem solved? Nope. Not yet.

But when I sent the php file’s form data to itself and echoed the POST data, it displayed the special characters correctly. So the problem had something to do with the PHP/MySQL interface or MySQL itself.

3. Pass the data from PHP to MySQL with the correct character encoding

After reading lots of stackoverflow.com posts I ended up adjusting my data handling PHP scripts (i.e. the script that adds the form data to the database) to set the character encoding straight after the script the database connection and before the database selection:

$link=mysql_connect('localhost', 'usr', 'pass');
mysql_set_charset('utf8',$link);

Was my problem solved? Nope. Not yet.

4. Make sure the MySQL database & tables are set to use utf8 and utf_unicode_ci

It seems you can screw up a lot with the wrong settings in MySQL. First of all, you can chose the character encoding of your database as well as your individual table fields. On top of that, you then have to  chose the collation of the character set which, as far as I understand it, sets which alphabet will be used and which special characters it contains.

The biggest problem I had was that even though I used utf-8 as main character encoding in both database and table field settings and used utf8_bin as the collation, simply because I didn’t know it any better. But the copy was still stored in the messed up state. And that’s where I started to pull my hair out. I read tons of posts but they more or less all told me to do what I had done already (see points 1-3). Okay, after I while I noticed that it might be the collation that causes the problem and I set everythin to utf8_general_ci, simply because everybody and their dogs seemed to use it.

But it did not solve my problem.

Finally, I stumbled accross this post which to me suggested that the collation uft8_unicode_ci supports more special characters than utf8_general_ci (at the cost of performance). At that stage I couldn’t care less about performance. I just wanted the blood MS word quotation marks and hypens to render. So I changed database and table field settings to

character encoding: utf8
collation:          utf8_unicode_ci

And finally, the stuff ended up in the database just the way MS word threw it in the clipboard. That was it. Simple enough… once you know it.

P.S. I probably should point out that leaving out any of the above 4 points results in garbled text again.