Converting a Unicode escape sequence to a string in Python 3

Posted 2021-01-29 15:00:01

I am parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10, in the Bash CLI. When I use print(), I get output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How can I output the actual text itself in my application?

This is the code that generates the string:

import json
import requests

response = requests.get(url)
messages = json.loads(extract_json(response.text))

for k, v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

This is the function that returns the JSON from the HTML page:

def extract_json(input_):

    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

While searching for this issue I found a lot of information relating to Python 2, but Python 3 completely changed how strings, and especially Unicode, are handled in Python.

How can I convert the example string (\u05ea) to the character (ת) in Python 3?

Addendum:

Here is some information about message['body']:

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

Note that the last line does work as expected, but it has a few problems:

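Decoding the string with unicode-escape is the wrong thing to do, because Python's escapes differ from JSON's escapes for many characters. (Thanks, bobince.)
encode() relies on the default encoding, which is a bad thing. (Thanks, bobince.)
The encode() fails on some newer Unicode characters, such as \ud83d\ude03, with a UnicodeEncodeError "surrogates not allowed".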
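
To make the first and third points concrete, here is a minimal sketch. The variable names are illustrative and not from the original code; the sample body is the escaped pair mentioned above. It shows that unicode-escape leaves two lone surrogates where json.loads produces one character, and that unicode-escape reads raw bytes as Latin-1, so text that is already decoded gets mangled.

```python
import json

# Twelve ASCII characters: two escaped surrogates, as in the example above
body = r'\ud83d\ude03'

# unicode-escape decodes each \uXXXX on its own, leaving two lone surrogates
decoded = body.encode('ascii').decode('unicode-escape')
print(len(decoded))  # 2

# Lone surrogates cannot be encoded as UTF-8
try:
    decoded.encode('utf-8')
except UnicodeEncodeError as exc:
    print(exc.reason)  # surrogates not allowed

# The JSON parser combines the surrogate pair into a single code point
from_json = json.loads('"' + body + '"')
print(from_json == '\U0001F603')  # True

# unicode-escape treats non-escape bytes as Latin-1, so UTF-8 bytes are mangled
mangled = 'é'.encode('utf-8').decode('unicode-escape')
print(mangled == 'Ã©')  # True
```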

1 Answer

面试哥 2021-01-29

It looks like your input uses the backslash as an escape character. You should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
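
Applied to the extraction function from the question, the same idea might look like the sketch below. The name extract_json_unescaped and the sample page line are hypothetical; the line is only shaped like the one in the question's docstring, so the slicing may need adjusting for the real page.

```python
import json
import re


def extract_json_unescaped(page_text):
    """Return the unescaped JSON text from the first line containing 'foobar'."""
    for line in page_text.split('\n'):
        if 'foobar' in line:
            # Keep what sits between the first '"' and the trailing '"]'
            escaped = line[line.find('"') + 1:-2]
            # Remove one level of backslash escaping: \" -> " and \\ -> \
            return re.sub(r'\\(.)', r'\1', escaped)
    return None


# Hypothetical page line (an assumption, not taken from the real site)
sample_page = r'foobar = ["{\"body\":\"\\u05ea\\u05d4\"}"]'

messages = json.loads(extract_json_unescaped(sample_page))
print(messages['body'])
```

Unlike replacing only \" with ", the regular expression also turns \\ back into \, so the JSON parser sees a real \u05ea escape and decodes it itself.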

Don't use the 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
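
For reference, the arithmetic that turns the surrogate pair into that single code point can be sketched as follows. The variable names are illustrative. The last step uses the surrogatepass error handler to repair a string that already contains lone surrogates; that is a workaround for already-decoded text, not a substitute for parsing the JSON properly.

```python
# The pair from the JSON text "\ud83d\ude02"
high, low = 0xD83D, 0xDE02

# Standard UTF-16 surrogate pair decoding
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f602
print(chr(code_point) == '\U0001F602')  # True

# A string holding the two lone surrogates can be repaired by a UTF-16 round trip
broken = '\ud83d\ude02'
repaired = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
print(repaired == '\U0001F602')  # True
```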
