I recently joined Automattic as a Code Wrangler (more on that soon), and I decided now was the time to de-spam, combine and resuscitate my old WordPress blogs.
Along the way I noticed a few annoying encoding issues crept in. The Wordpress XML exporter/importer, for example, double-escaped all my quotes so that <a href=”this”> became <a href=”this”>, which as you can imagine broke… everything.
The SQL query to fix this is:
UPDATE `wp_posts`
SET post_content_filtered = REPLACE(post_content_filtered, '\"', '"'),
post_content = REPLACE(post_content, '\"', '"')
WHERE post_content_filtered like '%"%' OR post_content LIKE '%"%'
Next I noticed a lot of double-quotes and ellipses had turned into gibberish like ““” and “—. Here’s the SQL queries to deal with that:
UPDATE wp_posts SET post_content = REPLACE(post_content, '’', '''), post_content_filtered = REPLACE(post_content_filtered, '’', ''');
UPDATE wp_posts SET post_content = REPLACE(post_content, '“', '“'), post_content_filtered = REPLACE(post_content_filtered, '“', '“');
UPDATE wp_posts SET post_content = REPLACE(post_content, 'â€', '”'), post_content_filtered = REPLACE(post_content_filtered, 'â€', '”');
UPDATE wp_posts SET post_content = REPLACE(post_content, '¦', '…'), post_content_filtered = REPLACE(post_content_filtered, '¦', '…');
Once I ran these few queries, I couldn’t see any more formatting errors. Others online have advised disabling the line “define(‘DB_CHARSET’, ‘utf8’)” in the wp-config.php. This seems inadvisable to me as it may allow your database to become even more inconsistent, by not enforcing a single clear and future-proof encoding on all content.
Now that everything’s been cleaned up my old content looks great and I can move on to customising the theme of my site.