iso10646/unicode/utf8

Source: 百度文库 (Baidu Wenku). Editor: 神马文学网. Time: 2024/04/30 01:42:52

http://www.surfchen.org/wiki.php/iso10646#wiki_h1
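ISO 10646/Unicode defines a character set that covers most of the characters in common use around the world and assigns a code to each of them. In other words, every character has a fixed, predefined code.
UTF-8, UTF-16 and UTF-32 each define an algorithm, and each one stores the ISO 10646/Unicode characters according to its own algorithm.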
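As a rough illustration of "one code, several storage algorithms", the sketch below stores the same code (U+6C5C) in three ways. The variable names are made up for this example, the UTF-8 bytes are written out literally, and the UTF-16 line is only valid for codes in 0000-FFFF outside the surrogate range.

```php
<?php
// U+6C5C, chosen here only as a sample code point.
$codePoint = 0x6C5C;

// UTF-32BE: the code stored directly as a 4-byte big-endian integer.
$utf32be = pack('N', $codePoint);

// UTF-16BE: a single 2-byte big-endian unit. This shortcut holds only for
// codes in 0000-FFFF that are not surrogates (an assumption of this sketch).
$utf16be = pack('n', $codePoint);

// UTF-8: three bytes for this code, written out literally.
$utf8 = "\xE6\xB1\x9C";

printf("UTF-8    %s\n", bin2hex($utf8));    // e6b19c
printf("UTF-16BE %s\n", bin2hex($utf16be)); // 6c5c
printf("UTF-32BE %s\n", bin2hex($utf32be)); // 00006c5c
```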
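UTF-8
The UTF-8 algorithm varies with the range the Unicode character falls in, mainly in the number of bytes used for storage. This is what keeps it compatible with ASCII's single-byte encoding. The details are as follows: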
0000-007F | 0xxxxxxx
0080-07FF | 110xxxxx 10xxxxxx
0800-FFFF | 1110xxxx 10xxxxxx 10xxxxxx
10000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
In the table above, field 1 is the range and field 2 is the algorithm used for encoding, though "template" may be the more accurate word.
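The range lookup in the table can be sketched in PHP roughly as follows. The function name `utf8_template_for` is hypothetical (it does not appear in the original article), and the sketch assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: returns the template from the table for a given
// code, or null when the code lies outside 0000-10FFFF.
function utf8_template_for(int $codePoint): ?string
{
    if ($codePoint < 0) {
        return null;
    }
    if ($codePoint <= 0x7F) {
        return '0xxxxxxx';
    }
    if ($codePoint <= 0x7FF) {
        return '110xxxxx 10xxxxxx';
    }
    if ($codePoint <= 0xFFFF) {
        return '1110xxxx 10xxxxxx 10xxxxxx';
    }
    if ($codePoint <= 0x10FFFF) {
        return '11110xxx 10xxxxxx 10xxxxxx 10xxxxxx';
    }
    return null;
}

echo utf8_template_for(0x41), "\n";   // 0xxxxxxx
echo utf8_template_for(0x6C5C), "\n"; // 1110xxxx 10xxxxxx 10xxxxxx
```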
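For example, the character 汜 has the Unicode code 6C5C in hexadecimal, which falls in the range 0800-FFFF, so the template 1110xxxx 10xxxxxx 10xxxxxx applies (that is, the character is encoded in three bytes). Converting the code to binary and substituting it into the template gives 1110[0110] 10[110001] 10[011100]. The digits in square brackets, joined together, are the binary representation of the character's Unicode code.
Below is a function I wrote, in PHP, that converts an HTML entity to UTF-8. One of the forms an HTML entity can take is &#N; where N is the Unicode code in decimal. Note the semicolon at the end.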
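The same substitution for 6C5C can be checked with shifts and masks. This is only a sketch of the three-byte case; the variable names are illustrative and not taken from the original article.

```php
<?php
$codePoint = 0x6C5C;

// Split the 16 payload bits into groups of 4, 6 and 6 bits, high to low.
$high = ($codePoint >> 12) & 0x0F; // 0110
$mid  = ($codePoint >> 6) & 0x3F;  // 110001
$low  = $codePoint & 0x3F;         // 011100

// Put the fixed prefixes 1110, 10 and 10 in front of each group.
$b1 = 0xE0 | $high;
$b2 = 0x80 | $mid;
$b3 = 0x80 | $low;
$bytes = chr($b1) . chr($b2) . chr($b3);

printf("%08b %08b %08b\n", $b1, $b2, $b3); // 11100110 10110001 10011100
echo bin2hex($bytes), "\n";                // e6b19c
```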
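function htmlentity2utf8($string) {
    // Only the first decimal entity in $string is handled, and only its UTF-8
    // bytes are returned. Input without an entity is returned unchanged.
    if (!preg_match('@&#(\d+);@', $string, $matches)) return $string;
    $he = (int)$matches[1];
    // Pick the byte templates for the range; "%" marks where the code's bits go.
    if ($he >= 0x0000 && $he <= 0x007f) {
        $template = array("0%");
    } elseif ($he >= 0x0080 && $he <= 0x07ff) {
        $template = array("110%", "10%");
    } elseif ($he >= 0x0800 && $he <= 0xffff) {
        $template = array("1110%", "10%", "10%");
    } elseif ($he >= 0x10000 && $he <= 0x10ffff) {
        $template = array("11110%", "10%", "10%", "10%");
    } else {
        return $string;
    }
    // Fill the templates from the last byte to the first.
    $template = array_reverse($template);
    $utf8 = '';
    $he_b = sprintf("%b", $he); // the code as a string of binary digits
    $offset = 0;
    foreach ($template as $t) {
        $t_len = strlen($t);
        $need_count = 9 - $t_len; // bits this byte carries: 8 minus the prefix length
        $offset -= $need_count;
        // Left-pad the binary string with zeros, then take this byte's bits from the right.
        $current_he = substr(sprintf("%0".abs($offset)."s", $he_b), $offset, $need_count);
        $tmp = sprintf("%0".$need_count."d", $current_he);
        $utf8 = chr(bindec(str_replace('%', $tmp, $t))).$utf8;
    }
    return $utf8;
}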
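The function above converts a single entity. A possible variant that replaces every decimal entity inside a longer string is sketched below. It uses bit operations instead of string templates, which is a different technique from the one in the article. Both function names are hypothetical, and rejecting the surrogate range D800-DFFF is an addition of this sketch. It assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: encodes one code as UTF-8 with shifts and masks.
// Returns null for values that cannot be encoded (out of range, surrogates).
function codepoint_to_utf8(int $cp): ?string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null;
    }
    if ($cp <= 0x7F) {
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replaces every &#N; in $string with its UTF-8 bytes.
// Entities that cannot be encoded are left as they are.
function decimal_entities_to_utf8(string $string): string
{
    return preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $utf8 = codepoint_to_utf8((int) $m[1]);
        return $utf8 === null ? $m[0] : $utf8;
    }, $string);
}

echo decimal_entities_to_utf8('a&#65;b'), "\n";           // aAb
echo bin2hex(decimal_entities_to_utf8('&#27740;')), "\n"; // e6b19c
```

Leaving unencodable entities in place mirrors the original function, which returns its input unchanged when the code is out of range.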