mysql> show create table t1; +-------+--------------------------------------------------------------------------------------- | Table | Create Table | +-------+-------------------------------------------------------------------------------------- | t1 | CREATE TABLE `t1` ( `info` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci | +-------+---------------------------------------------------------------------------------------- 1 row in set (0.00 sec) mysql> mysql> mysql> mysql> mysql> mysql> mysql> select length('你'); +---------------+ | length('你') | +---------------+ | 3 | +---------------+ 1 row in set (0.00 sec) mysql> select length('*'); +-------------+ | length('?') | +-------------+ | 4 | +-------------+ 1 row in set (0.00 sec) mysql> mysql> mysql> insert into t1 value('你'); Query OK, 1 row affected (0.05 sec) mysql> insert into t1 value('*'); ERROR 1366 (HY000): Incorrect string value: '\xF0\xA9\xB1\xBB' for column 'info' at row 1 mysql>
创建了一个编码为utf8的列info,插入byte长度分别为3位和4位的中文,可以发现插入4位中文是报错。同样我们看看emoji表情:
utf8可以直接转成utf8mb4,使用ALTER TABLE ... CONVERT TO CHARACTER SET ...语句,这是由于utf8是utf8mb4的子集。其他类型最好不要直接转,会出现问题,比如latin转utf8.
这篇文章描述了一种错误情况。
> mysql> select * from information_schema.CHARACTER_SETS; > +--------------------+----------------------+---------------------------------+--------+ > | CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN | > +--------------------+----------------------+---------------------------------+--------+ > | big5 | big5_chinese_ci | Big5 Traditional Chinese | 2 | > | dec8 | dec8_swedish_ci | DEC West European | 1 | > | cp850 | cp850_general_ci | DOS West European | 1 | > | hp8 | hp8_english_ci | HP West European | 1 | > | koi8r | koi8r_general_ci | KOI8-R Relcom Russian | 1 | > | latin1 | latin1_swedish_ci | cp1252 West European | 1 | > | latin2 | latin2_general_ci | ISO 8859-2 Central European | 1 | > | swe7 | swe7_swedish_ci | 7bit Swedish | 1 | > | ascii | ascii_general_ci | US ASCII | 1 | > | ujis | ujis_japanese_ci | EUC-JP Japanese | 3 | > | sjis | sjis_japanese_ci | Shift-JIS Japanese | 2 | > | hebrew | hebrew_general_ci | ISO 8859-8 Hebrew | 1 | > | tis620 | tis620_thai_ci | TIS620 Thai | 1 | > | euckr | euckr_korean_ci | EUC-KR Korean | 2 | > | koi8u | koi8u_general_ci | KOI8-U Ukrainian | 1 | > | gb2312 | gb2312_chinese_ci | GB2312 Simplified Chinese | 2 | > | greek | greek_general_ci | ISO 8859-7 Greek | 1 | > | cp1250 | cp1250_general_ci | Windows Central European | 1 | > | gbk | gbk_chinese_ci | GBK Simplified Chinese | 2 | > | latin5 | latin5_turkish_ci | ISO 8859-9 Turkish | 1 | > | armscii8 | armscii8_general_ci | ARMSCII-8 Armenian | 1 | > | utf8 | utf8_general_ci | UTF-8 Unicode | 3 | > | ucs2 | ucs2_general_ci | UCS-2 Unicode | 2 | > | cp866 | cp866_general_ci | DOS Russian | 1 | > | keybcs2 | keybcs2_general_ci | DOS Kamenicky Czech-Slovak | 1 | > | macce | macce_general_ci | Mac Central European | 1 | > | macroman | macroman_general_ci | Mac West European | 1 | > | cp852 | cp852_general_ci | DOS Central European | 1 | > | latin7 | latin7_general_ci | ISO 8859-13 Baltic | 1 | > | cp1251 | cp1251_general_ci | Windows Cyrillic | 1 | > | utf16 | utf16_general_ci | UTF-16 Unicode | 4 | > | utf16le | utf16le_general_ci | UTF-16LE Unicode | 4 | > | cp1256 | cp1256_general_ci | Windows Arabic | 1 | > | cp1257 | cp1257_general_ci | Windows Baltic | 1 | > | utf32 | utf32_general_ci | UTF-32 Unicode | 4 | > | binary | binary | Binary pseudo charset | 1 | > | geostd8 | geostd8_general_ci | GEOSTD8 Georgian | 1 | > | cp932 | cp932_japanese_ci | SJIS for Windows Japanese | 2 | > | eucjpms | eucjpms_japanese_ci | UJIS for Windows Japanese | 3 | > | gb18030 | gb18030_chinese_ci | China National Standard GB18030 | 4 | > | utf8mb4 | utf8mb4_0900_ai_ci | UTF-8 Unicode | 4 | > +--------------------+----------------------+---------------------------------+--------+
由于mysql中utf8的maxlen是3个byte,而正常的unicode使用的是2~4个byte,导致mysql的 不能显示emoji和一些生僻的中文。所以mysql又推出了uft8mb4,这个才是真正的utf8编码。不过在oracle接手mysql之后,默认的字符编码之后会改成utf8mb4。这里详细阐述了mysql utf8编码的坑以及应对方法。
[mysqld]
character_set_server=utf8mb4
参数组中:
官方文档:dms限制
mysql> show variables like 'character%'; +--------------------------+------------------------------------------------------+ | Variable_name | Value | +--------------------------+------------------------------------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | utf8mb4 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /usr/local/Cellar/mysql/8.0.12/share/mysql/charsets/ | +--------------------------+------------------------------------------------------+
mysql> select * from information_schema.SCHEMATA; +--------------+--------------------+----------------------------+------------------------+----------+ | CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH | +--------------+--------------------+----------------------------+------------------------+----------+ | def | mysql | utf8 | utf8_general_ci | NULL | | def | information_schema | utf8 | utf8_general_ci | NULL | | def | performance_schema | utf8mb4 | utf8mb4_0900_ai_ci | NULL | | def | sys | utf8mb4 | utf8mb4_0900_ai_ci | NULL | | def | mydb | utf8mb4 | utf8mb4_unicode_ci | NULL | | def | t1 | utf8mb4 | utf8mb4_0900_ai_ci | NULL | +--------------+--------------------+----------------------------+------------------------+----------+
mysql> show create table t1\G *************************** 1. row *************************** Table: t1 Create Table: CREATE TABLE `t1` ( `id` int(11) NOT NULL AUTO_INCREMENT, `description` varchar(50) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec)
mysql> show table status from t1\G *************************** 1. row *************************** Name: t3 Engine: InnoDB Version: 10 Row_format: Dynamic Rows: 0 Avg_row_length: 0 Data_length: 16384 Max_data_length: 0 Index_length: 0 Data_free: 0 Auto_increment: NULL Create_time: 2018-11-28 12:05:37 Update_time: NULL Check_time: NULL Collation: utf8mb4_0900_ai_ci Checksum: NULL Create_options: Comment: 1 row in set (0.15 sec)
这种情况只能看到Collation,Collation的概念是字符比较的规则,每种字符集都会有其默认的Collation,我们从information_schema.CHARACTER_SETS这个表中可以查询到相应的信息,一般我们设定好字符集之后,Collation会被默认该字符集的默认值。
mysql> select TABLE_SCHEMA,TABLE_NAME,TABLE_COLLATION from information_schema.tables where table_schema = 'mydb'; +--------------+---------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | +--------------+---------------------+--------------------+ | mydb | charset_test_latin1 | utf8mb4_0900_ai_ci | | mydb | student | utf8_general_ci | | mydb | t1 | utf8mb4_0900_ai_ci | | mydb | t2 | utf8_general_ci | | mydb | t3 | utf8_general_ci | | mydb | t4 | utf8_general_ci | | mydb | t5 | utf8mb4_unicode_ci | | mydb | t6 | utf8mb4_0900_ai_ci | | mydb | t8 | utf8mb4_unicode_ci | | mydb | vc | utf8mb4_unicode_ci | +--------------+---------------------+--------------------+
这种方法可以查看到一个数据库中所有表的TABLE_COLLATION,推导出对应使用什么类型的字符集。
mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,DATA_TYPE,CHARACTER_SET_NAME,COLLATION_NAME from information_schema.columns where TABLE_SCHEMA='mydb'; +--------------+---------------------+-------------+-----------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | DATA_TYPE | CHARACTER_SET_NAME | COLLATION_NAME | +--------------+---------------------+-------------+-----------+--------------------+--------------------+ | mydb | charset_test_latin1 | id | int | NULL | NULL | | mydb | charset_test_latin1 | char_col | varchar | utf8mb4 | utf8mb4_0900_ai_ci | | mydb | student | course | varchar | utf8 | utf8_general_ci | | mydb | student | mark | int | NULL | NULL | | mydb | student | name | varchar | utf8 | utf8_general_ci | | mydb | t1 | id | int | NULL | NULL | | mydb | t1 | description | varchar | utf8mb4 | utf8mb4_0900_ai_ci | | mydb | t2 | info | varchar | utf8 | utf8_general_ci | | mydb | t3 | info | varchar | utf8mb4 | utf8mb4_0900_ai_ci | | mydb | t4 | info | char | utf8mb4 | utf8mb4_0900_ai_ci | | mydb | t5 | info | char | utf8mb4 | utf8mb4_unicode_ci | | mydb | t6 | info | varchar | utf8mb4 | utf8mb4_0900_ai_ci | | mydb | t8 | id | int | NULL | NULL | | mydb | vc | v | varchar | utf8mb4 | utf8mb4_unicode_ci | | mydb | vc | c | char | utf8mb4 | utf8mb4_unicode_ci | +--------------+---------------------+-------------+-----------+--------------------+--------------------+
For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
For each column:
ALTER TABLE table_name CHANGE column_name column_name data_type CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
utf8是utf8mb4的子集,所以直接转换理论上不会有问题。当然也可以使用dump转换编码。
批量生成脚本:
use information_schema; SELECT concat("ALTER DATABASE `",table_schema,"` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;") as _sql FROM `TABLES` where table_schema like "yourDbName" group by table_schema; SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name,"` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as _sql FROM `TABLES` where table_schema like "yourDbName" group by table_schema, table_name; SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name, "` CHANGE `",column_name,"` `",column_name,"` ",data_type,"(",character_maximum_length,") CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as _sql FROM `COLUMNS` where table_schema like "yourDbName" and data_type in ('varchar'); SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name, "` CHANGE `",column_name,"` `",column_name,"` ",data_type," CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as _sql FROM `COLUMNS` where table_schema like "yourDbName" and data_type in ('text','tinytext','mediumtext','longtext');
出现这种报错主要有两种情况:
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
一个是length 大于3072 bytes,一个是大于1000 bytes。
mysql5.7中支持index key最大的长度是 767 bytes,在开启了innodb_large_prefix这个参数之后,max len 限制是3072 bytes。在5.7之前这个参数没有默认开启,5.7之后默认是开启的。8.0之后去掉了这个参数,默认就支持3072个字节。
所以在转换字符集过程中,如果一个列上有索引,由于之前的utf8的编码是3个bytes,utf8mb4是4个bytes。转换之后key的值可能会超过767或则3072,这个时候就是出现类似的报错。如果是MyISAM的引擎,是直接不能超过1000 bytes这个限制的。
这个时候的解决办法是如果是MyISAM的引擎,改成innodb引擎。
如果改成innodb还不行,只能缩小字段的大小。
set names utf8mb4;
相当于设置
不得不说,mysql这个3个byte的utf8是个巨坑,没有按照国际的标准来设计,不过之后肯定会改成utf8mb4为默认字符集。
https://dba.stackexchange.com/questions/8239/how-to-easily-convert-utf8-tables-to-utf8mb4-in-mysql-5-5
http://aprogrammers.blogspot.com/2014/12/utf8mb4-character-set-in-amazon-rds.html
https://oracle-base.com/articles/mysql/mysql-converting-table-character-sets-from-latin1-to-utf8#the-problem
版权声明:转自阿里云-https://yq.aliyun.com/articles/674741