PDA

View Full Version : PHP+MYSQL and Charsets


Vege
04-30-2007, 04:21 AM
This tutorial is basically just explaining different somewhat confusing terms.

Have u ever build a database driven site where data needs to be multilingual?
Have you ever wondered why special characters (ä,ö €) that are stored into database are shown wrong when displaying them in webpage or looking them inside the database?
Have u ever made a database where you can't dump because data is shown correctly in page but wrong in database?
(i sure have)

I assume you know how to set up your database and php and you know how to query your database from php. Thease Instructions might refer to linux, but are not OS related.




What is CharSet?

The term charset, is short for “character set”. Charsets are identifiers used to describe a series of universal characters used in web and internet protocols such as HTML and Microsoft Windows.
Universal characters are used in many languages for encoding and for designating a font format for pages or to digitally represent text. A charset table or tables list the type of charsets and its standard. Unicode, ascii and iso are types of charsets that reference text or universal symbols or characters used in various languages and meta tags.


What is collation?
It tells the natural order of characters in your native language.
Collation is used with different charsets to tell what is the order of selected rows in selected charset.
(Does ä come before ö or å in the alphabets?)
Some collations also describe are uppercase chars selected before lowercase characters.
Last 2 character in the end of collation are _ci (case insensitive) or _cs (case sensitive).

Thats why there is multiple collations to choose from for same charset in mysql.

Keep it also in mind that even if you choose a right charset, different languages has different characters and thats why choosing wrong collation for example uTF8_bin for a Scandinavian would mean that Ä is same as A and thats why your search results could be in wrong order.
Collation do not affect inserted rows, only the order of selected values.

In phpmyadmin you can choose only collation to a table, but that changes charset as well.
example:
latin1_swedish_ci
Charset
Collation
Case Insensitive/Sensitive

Random info:
Mysql latin1 charset is in reality Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252) aka "ANSI" meaning it has the common european special characters included.
Most modern browsers also tell that they are using ISO-8559-1 but request is sent as Windows-1252, meaning that € you send from a form will insert to latin1 table quite nicely even thou the symbol is not part of official ISO-8559-1 standard.


How you build a website that uses PHP+Mysql and UTF-8 (or other, i use utf8 as an exaple here) as database charset?
Here it is in short.

Check that your mysql version is atleast mysql 4.1, if it's 4.0 or lower and you cant reconfiqure from shell, your stuck with the default charset (usually latin1 if server admin haven't changed it).
To determine your version, you can use the following sql query

select version();

Order of setup:
1.You make your database
2.Set one/all of the following to utf-8, lower setting (in this list) overrides the higher (i usually set it at table level, but that's just me)

default database value (this will be used for default for new databases)
database itself
table
column(s)


3.PHP:s default charset for a new mysql connection is latin1 ("Windows 1252") so you need to override it after making the connection to database with query:
mysql_query("SET NAMES utf8");
This also means that your database charset don't really matter (that mutch) as all data between database and server is converted to this charset. This only applys to charactes in Windows-1252 characters.

4.Also the usual default for Apache page charset is ISO-8559-1 so you need to change it before everything or queried data will not show up correctly.
header('Content-Type: text/html; charset=UTF-8');

5.You also need to put the html doctype in that matches your charset. If it's wrong the data will show up incorrectly.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />




So page would be in short

<?php

header('Content-Type: text/html; charset=UTF-8');
mysql_connect($server, $username, $pass);
mysql_select_db($db);
mysql_query("SET NAMES utf8");

?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>example</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
</body>
</html>