Dewen (德问): Converting full-width punctuation to half-width in PHP under UTF-8

20 October 2013 Categories: Open, Backend, Technology, Social Media

1. Converting full-width punctuation to half-width in PHP

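<?php
// Save this file as GBK/GB2312: the pattern below matches GBK bytes, not UTF-8.
$str = "０１２３ＡＢＣＤＦＷＳ＂，．？＜＞｛｝［］＊＆＾％＃＠！～（）＋－｜：；";
echo $str;
echo "\n";
// The /e modifier only exists in PHP versions before 7.0.
$str = preg_replace('/\xa3([\xa1-\xfe])/e', 'chr(ord("\1")-0x80)', $str);
echo $str;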

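This is code I found online. In GB2312/GBK, the full-width forms of the ASCII characters all have 0xA3 as their first byte, and subtracting 0x80 (128) from the second byte gives the code of the half-width character. The /e modifier means: when it is set, preg_replace() performs the normal backreference substitution in the replacement string, evaluates the result as PHP code, and uses the return value to replace the matched text. Note that /e was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; preg_replace_callback() is its replacement.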
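Since /e no longer exists in current PHP, the same byte arithmetic can be sketched with preg_replace_callback(). This is only an illustration under the same assumption as the snippet above (the input is a GBK/GB2312 byte string); the function name gbkFullToHalf is made up for this example.

```php
<?php
// Hypothetical helper (name invented for this sketch); expects raw GBK/GB2312 bytes.
function gbkFullToHalf($gbk)
{
    return preg_replace_callback(
        '/\xa3([\xa1-\xfe])/',
        function ($m) {
            // $m[1] is the second byte: 0xA1..0xFE minus 0x80 gives ASCII 0x21..0x7E.
            return chr(ord($m[1]) - 0x80);
        },
        $gbk
    );
}

// GBK bytes for the full-width characters A, B, comma and 1.
echo gbkFullToHalf("\xa3\xc1\xa3\xc2\xa3\xac\xa3\xb1"), "\n"; // prints AB,1
```

Like the original, this pattern is not anchored to character boundaries, so a 0xA3 that happens to be the second byte of another GBK character can also match. It is best treated as a direct port rather than a robust converter.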

Outside UTF-8 this approach works, but under UTF-8 it seems to have no effect. I am looking for an implementation that does the same job under UTF-8…
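The reason the pattern stops matching is that in UTF-8 the full-width forms are the code points U+FF01 to U+FF5E, each stored as three bytes, so a lone \xa3 lead byte never appears. One possible UTF-8 counterpart is sketched below: it matches that code point range with the /u modifier and subtracts 0xFEE0, the distance between the full-width block and ASCII. The function name utf8FullToHalf is an assumption of this sketch, and the file containing it must itself be saved as UTF-8.

```php
<?php
// Hypothetical helper (name invented for this sketch); expects a valid UTF-8 string.
// preg_replace_callback() with /u returns null if the input is not valid UTF-8.
function utf8FullToHalf($str)
{
    // U+3000 (ideographic space) lies outside FF01..FF5E, so it is mapped separately.
    $str = str_replace("\xE3\x80\x80", ' ', $str);

    return preg_replace_callback(
        '/[\x{FF01}-\x{FF5E}]/u',
        function ($m) {
            $b = $m[0]; // always a three-byte UTF-8 sequence in this range
            // Decode the three bytes back into a code point.
            $cp = ((ord($b[0]) & 0x0F) << 12)
                | ((ord($b[1]) & 0x3F) << 6)
                | (ord($b[2]) & 0x3F);
            return chr($cp - 0xFEE0); // U+FF01..U+FF5E -> 0x21..0x7E
        },
        $str
    );
}

echo utf8FullToHalf("ＡＢＣ，１２３！"), "\n"; // prints ABC,123!
```

Chinese punctuation outside that block, such as 。 or 、, is deliberately left alone here, since it has no one-to-one ASCII counterpart.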

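2. If the file is UTF-8, you can first convert the encoding with
$str = iconv('utf-8', 'gbk', $str);
and then run
$str = preg_replace('/\xa3([\xa1-\xfe])/e', 'chr(ord("\1")-0x80)', $str);
Convert the result back with iconv('gbk', 'utf-8', $str) if the rest of the script expects UTF-8.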

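Some characters can be represented in UTF-8 but not in GBK, so won't converting from UTF-8 to GBK lose characters? You can add //IGNORE, which makes iconv drop the characters it cannot represent instead of failing (those characters are still missing from the output):
$str = iconv('utf-8', 'gbk//IGNORE', $str);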
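Putting the round trip together might look like the sketch below. It is not the answer's exact code: instead of the unanchored \xa3 pattern it walks the GBK string one character at a time, using the GBK byte ranges listed further down in this post, so that a 0xA3 trail byte of another character (for example 埃, which is B0 A3 in GBK) is not mistaken for a lead byte. The function name utf8FullToHalfViaGbk is hypothetical, and the sketch assumes the iconv extension is available with GBK and //IGNORE support.

```php
<?php
// Hypothetical helper (name invented for this sketch): UTF-8 in, UTF-8 out.
function utf8FullToHalfViaGbk($utf8)
{
    $gbk = iconv('UTF-8', 'GBK//IGNORE', $utf8);
    if ($gbk === false) {
        return $utf8; // conversion failed; leave the input untouched
    }

    // Consume one GBK character per match so lead and trail bytes never get mixed up.
    $gbk = preg_replace_callback(
        '/[\x81-\xfe][\x40-\xfe]|[\x00-\x7f]/',
        function ($m) {
            $c = $m[0];
            if (strlen($c) === 2 && $c[0] === "\xa3" && ord($c[1]) >= 0xA1) {
                return chr(ord($c[1]) - 0x80); // full-width form -> ASCII
            }
            return $c; // everything else passes through unchanged
        },
        $gbk
    );

    return iconv('GBK', 'UTF-8', $gbk);
}

echo utf8FullToHalfViaGbk("中文，ＡＢＣ１２３"), "\n"; // prints 中文,ABC123
```

Any character that GBK cannot represent is silently dropped by //IGNORE, so this route only makes sense when the text is known to fit in GBK.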

Below are the regular-expression byte ranges for various encodings:
UTF8: [\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}
UTF16: [\x00-\xd7][\xe0-\xff]|[\xd8-\xdf][\x00-\xff]{2}
Big5: [\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])
GBK: [\x01-\x7f]|[\x81-\xfe][\x40-\xfe]
GB2312 Chinese characters: [\xb0-\xf7][\xa0-\xfe]
GB2312 punctuation and special symbols: \xa1[\xa2-\xfe]
GB2312 Roman numerals and list item numbers: \xa2([\xa1-\xaa]|[\xb1-\xbf]|[\xc0-\xdf]|[\xe0-\xe2]|[\xe5-\xee]|[\xf1-\xfc])
GB2312 full-width punctuation and full-width letters: \xa3[\xa1-\xfe]
GB2312 Japanese hiragana: \xa4[\xa1-\xf3]
GB2312 Japanese katakana: \xa5[\xa1-\xf6]
GB18030: [\x00-\x7f]|[\x81-\xfe][\x40-\xfe]|[\x81-\xfe][\x30-\x39][\x81-\xfe][\x30-\x39]

JIS: [\x20-\x7e]|[\x21-\x5f]|[\x21-\x7e]{2}
SJIS: [\x20-\x7e]|[\xa1-\xdf]|([\x81-\x9f]|[\xe0-\xef])([\x40-\x7e]|[\x80-\xfc])
SJIS full-width space: (?:\x81\x40)
SJIS full-width digits: (?:\x82[\x4f-\x58])
SJIS full-width uppercase letters: (?:\x82[\x60-\x79])
SJIS full-width lowercase letters: (?:\x82[\x81-\x9a])
SJIS full-width hiragana: (?:\x82[\x9f-\xf1])
SJIS full-width hiragana (extended): (?:\x82[\x9f-\xf1]|\x81[\x4a\x4b\x54\x55])
SJIS full-width katakana: (?:\x83[\x40-\x96])
SJIS full-width katakana (extended): (?:\x83[\x40-\x96]|\x81[\x45\x5b\x52\x53])

EUC_JP: [\x20-\x7e]|\x8e[\xa1-\xdf]|[\xa1-\xfe][\xa1-\xfe]|\x8f[\xa1-\xfe]{2}
EUC_JP punctuation and special characters: [\xa1-\xa2][\xa0-\xfe]
EUC_JP full-width digits: \xa3[\xb0-\xb9]
EUC_JP full-width uppercase letters: \xa3[\xc1-\xda]
EUC_JP full-width lowercase letters: \xa3[\xe1-\xfa]
EUC_JP full-width hiragana: \xa4[\xa1-\xf3]
EUC_JP full-width katakana: \xa3[\xb0-\xb9]|\xa3[\xc1-\xda]|\xa5[\xa1-\xf6][\xa3][\xb0-\xfa]|[\xa1][\xbc-\xbe]|[\xa1][\xdd]
EUC_JP full-width kanji: [\xb0-\xcf][\xa0-\xd3]|[\xd0-\xf4][\xa0-\xfe]|[\xB0-\xF3][\xA1-\xFE]|[\xF4][\xA1-\xA6]|[\xA4][\xA1-\xF3]|[\xA5][\xA1-\xF6]|[\xA1][\xBC-\xBE]
EUC_JP full-width space: (?:\xa1\xa1)

EUC half-width katakana: (?:\x8e[\xa6-\xdf])
Japanese half-width space: \x20
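As an illustration of how one of these ranges might be used, the sketch below wraps the UTF-8 row in anchors to test whether a whole byte string fits that shape. Two things here are my own assumptions rather than part of the list: the helper name looksLikeUtf8, and the lead-byte ranges narrowed to [\xc2-\xdf] and [\xf0-\xf4], which a stricter check would probably want.

```php
<?php
// Hypothetical helper (name invented for this sketch): true if every byte of
// $bytes belongs to a sequence matching the narrowed UTF-8 byte ranges.
function looksLikeUtf8($bytes)
{
    $char = '[\x01-\x7f]'
          . '|[\xc2-\xdf][\x80-\xbf]'
          . '|[\xe0-\xef][\x80-\xbf]{2}'
          . '|[\xf0-\xf4][\x80-\xbf]{3}';

    // \A and \z anchor to the true start and end of the string.
    return preg_match('/\A(?:' . $char . ')*\z/', $bytes) === 1;
}

var_dump(looksLikeUtf8("abc\xe4\xb8\xad")); // "abc" plus UTF-8 bytes of 中: bool(true)
var_dump(looksLikeUtf8("\xd6\xd0"));        // GBK bytes of 中: bool(false)
```

This remains a byte-range check, not a full validator: it still accepts some overlong and surrogate sequences. For real validation, preg_match('//u', $bytes), which returns false for invalid UTF-8, is likely the stricter choice.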

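3. I don't think it needs to be that complicated: just write a function that does the replacement directly. Regular expressions are also slower. This is how I did it: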
/**
 * Convert a string between half-width and full-width characters.
 * @param string $str  the string to convert
 * @param string $type 'TODBC': convert to full-width; 'TOSBC': convert to half-width
 * @return string the converted string
 */
function convertStrType($str, $type) {

    $dbc = array( // full-width
        '０', '１', '２', '３', '４',
        '５', '６', '７', '８', '９',
        'Ａ', 'Ｂ', 'Ｃ', 'Ｄ', 'Ｅ',
        'Ｆ', 'Ｇ', 'Ｈ', 'Ｉ', 'Ｊ',
        'Ｋ', 'Ｌ', 'Ｍ', 'Ｎ', 'Ｏ',
        'Ｐ', 'Ｑ', 'Ｒ', 'Ｓ', 'Ｔ',
        'Ｕ', 'Ｖ', 'Ｗ', 'Ｘ', 'Ｙ',
        'Ｚ', 'ａ', 'ｂ', 'ｃ', 'ｄ',
        'ｅ', 'ｆ', 'ｇ', 'ｈ', 'ｉ',
        'ｊ', 'ｋ', 'ｌ', 'ｍ', 'ｎ',
        'ｏ', 'ｐ', 'ｑ', 'ｒ', 'ｓ',
        'ｔ', 'ｕ', 'ｖ', 'ｗ', 'ｘ',
        'ｙ', 'ｚ', '－', '　', '：',
        '．', '，', '／', '％', '＃',
        '！', '＠', '＆', '（', '）',
        '＜', '＞', '＂', '＇', '？',
        '［', '］', '｛', '｝', '＼',
        '｜', '＋', '＝', '＿', '＾',
        '￥', '￣', '｀'
    );

    $sbc = array( // half-width
        '0', '1', '2', '3', '4',
        '5', '6', '7', '8', '9',
        'A', 'B', 'C', 'D', 'E',
        'F', 'G', 'H', 'I', 'J',
        'K', 'L', 'M', 'N', 'O',
        'P', 'Q', 'R', 'S', 'T',
        'U', 'V', 'W', 'X', 'Y',
        'Z', 'a', 'b', 'c', 'd',
        'e', 'f', 'g', 'h', 'i',
        'j', 'k', 'l', 'm', 'n',
        'o', 'p', 'q', 'r', 's',
        't', 'u', 'v', 'w', 'x',
        'y', 'z', '-', ' ', ':',
        '.', ',', '/', '%', '#',
        '!', '@', '&', '(', ')',
        '<', '>', '"', '\'', '?',
        '[', ']', '{', '}', '\\',
        '|', '+', '=', '_', '^',
        '¥', '~', '`'
    );

    if ($type == 'TODBC') {
        return str_replace($sbc, $dbc, $str); // half-width to full-width
    } elseif ($type == 'TOSBC') {
        return str_replace($dbc, $sbc, $str); // full-width to half-width
    } else {
        return $str;
    }
}
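If typing the table by hand feels error-prone, the same lookup idea could be generated instead. The sketch below is one way to do that for UTF-8 strings: it relies on each full-width form sitting exactly 0xFEE0 above its ASCII counterpart, and uses strtr() so that every position is translated once. The names buildWidthMap, toHalfWidth and toFullWidth are invented for this example, and unlike the table above it does not include ￥ or ￣.

```php
<?php
// Hypothetical helpers (names invented for this sketch); all strings are UTF-8.

// Build a full-width => half-width map for ASCII 0x21..0x7E plus the space.
function buildWidthMap()
{
    $map = array("\xE3\x80\x80" => ' '); // U+3000 ideographic space
    for ($ascii = 0x21; $ascii <= 0x7E; $ascii++) {
        $cp = $ascii + 0xFEE0; // U+FF01..U+FF5E
        // Encode the code point as a three-byte UTF-8 sequence.
        $full = chr(0xE0 | ($cp >> 12))
              . chr(0x80 | (($cp >> 6) & 0x3F))
              . chr(0x80 | ($cp & 0x3F));
        $map[$full] = chr($ascii);
    }
    return $map;
}

function toHalfWidth($str)
{
    return strtr($str, buildWidthMap());
}

function toFullWidth($str)
{
    // Flipping the map swaps keys and values, reversing the direction.
    return strtr($str, array_flip(buildWidthMap()));
}

echo toHalfWidth("Ｈｅｌｌｏ，　Ｗｏｒｌｄ！"), "\n"; // prints Hello, World!
echo toFullWidth("A1"), "\n";                          // prints Ａ１
```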
