MySQL, UTF8 and my sanity
I’ve never really had to do much with UTF-8, consequently my facility with it has been, shall we say… limited. Recently, though, I found myself needing to delve a bit deeper – it was surprisingly simple once I understood what was going on, but that understanding eluded me for many an hour. I called in the help of someone who had gone through some of these trials and tribulations and together we sounded it out.
Here’s the back story, I had a website for a client that used some spanish – it was a standard LAMP stack with MySQL and Perl. Using default everything, I was able to successfully have my web admin interface and command line tools throw UTF-8 text into MySQL and have the website itself spit out proper looking characters with accents. It was pretty simple to get that working and I gave myself a pat on the back.
The troubles began when the client wanted a simple search implemented – this was just going to be straight sql using the LIKE operator, fine for simple needs over a tiny database. Unfortunately, searching for “nino” refused to also find “niÃƒÂ±o”. I tried a variety of things, which gave me stranger and stranger problems, breaking what I thought was a working implementation.
The first thing I tried was changing the columns and table into a utf-8 table with a variety of different collations. This immediately broke the website changing non-latin1 characters into a garble. Trying “set names utf8″ in the cli showed the same problem. The most frustrating part of this was that I knew that what was in the database was valid utf8 data – mysqldump proved this when viewing the file showed the proper characters (given the proper LANG environment variable) and also od over the characters looked good to me.
I tried using various iconv and Encode tricks over the data and reloading it all of which tended to break things in new and exciting ways. After much trial and error, I finally figured out all the facts of what was working and what wasn’t and pieced together that rather simple picture that had eluded me.
Here it is. MySQL has two settings that matter – the setting on the table/column and the setting of the connection. When the settings on those two are the same MySQL does no conversion, but when they are different MySQL gets its helpful on and converts as best it can. Sigh.
So, when I first started the site everything was latin1 from the connection through to the tables – thus MySQL just passed all the bits through. So as I was entering UTF-8 characters, it was getting stored as UTF-8 because MySQL didn’t care that it wasn’t valid latin1. Unfortunately, when I changed the table to UTF-8 but not the connect, it would convert going in and going out which gave terrible, terrible results.
It also wasn’t enough to change the connection and the table in place – because changing the table and columns from latin1 to UTF-8 caused MySQL to do the conversion on all the data. The solution, simply enough, was to mysqldump the table out and change all instances of latin1 to UTF-8 in the file and reload it (ensuring that the connection itself was UTF-8) – this loaded in all my precious, pre-existing UTF-8 data without twiddling any of the bits.
I finally made sure all my DBI calls had “mysql_enable_utf8=1″ in the connect string. Initially, because I had applied mysql_enable_utf8 improperly (not in the connect string) I tried having it execute a “set names utf8″ right after connecting. This worked for awhile, but apparently on a reconnect due to a timeout or whatever) that effect would not be restarted and problems would arise again. Fortunately, it wasn’t hard to realize that I had misread the docs and wasn’t adding in the enable line correctly.
I also told Apache that the default charset was UTF-8 (in addition to my previous efforts putting the meta header in the html specifying the charset). And at last everything was happy, happier even than it had been because now Perl was always being properly informed of the string’s UTF-8 status (instead of me having to convince it so).
So, while initially it seemed outrageous and doing minor things differently provided a wealth of different results – ultimately it boiled down to how many times conversion was being applied to the data (going in and/or going out). Sometimes it takes a lot of hours and a lot of banging of heads to figure out these simple truths. I wish it didn’t, but at least it’s not a problem I’ll soon forget. I suspect that’s there’s much more to go on my road to masterhood, but at least now I feel like I’ve moved on from novice to journeyman. Not the time travelling kind, though.