python - Convert HTML entities to Unicode and vice versa

Possible duplicates:

  • Convert XML/HTML entities into a Unicode string in Python
  • HTML entity codes to text

How do you convert HTML entities to Unicode, and vice versa, in Python?

hekevintran asked 2019-10-08T06:11:18Z
6 answers
88 votes

As for the "vice versa" (which I needed myself; that led me to this question, which did not help, and then to another site that had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

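This returns a plain byte string (str on Python 2, bytes on Python 3) in which every non-ASCII character is replaced by an XML (HTML) numeric character reference.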
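On Python 3 the same call works on an ordinary str, but the result is bytes, so a decode step is needed to get text back. A minimal sketch, with a sample string chosen only for illustration:

```python
# Sample text; the string itself is only an illustration.
text = "café €5 & more"

# encode() yields bytes on Python 3; every non-ASCII character becomes &#NNN;
encoded = text.encode("ascii", "xmlcharrefreplace")

# Decode back to str so the result can be used as ordinary text.
as_text = encoded.decode("ascii")
print(as_text)
```

Note that this only touches non-ASCII characters: ASCII markup characters such as & and < pass through unchanged, so they still need separate escaping.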

Isaac answered 2019-10-08T06:11:36Z
28 votes

You need to have BeautifulSoup. (The code below targets Python 2 and BeautifulSoup 3.)

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
hekevintran answered 2019-10-08T06:12:00Z
18 votes

Update for Python 2.7 and BeautifulSoup4

Unescape: Unicode HTML to Unicode with HTMLParser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape: Unicode HTML to Unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
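If bs4 is not available, named entities can also be produced with the standard library alone. The following is a rough Python 3 sketch, not what bs4 does internally: it relies on the `html.entities.codepoint2name` table, and the function name `escape_named` is made up for this example.

```python
from html.entities import codepoint2name

def escape_named(text):
    """Replace characters with named entities where the HTML 4 table has one."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append("&%s;" % name)     # e.g. U+00E9 -> &eacute;
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))  # no name known: numeric reference
        else:
            out.append(ch)                 # plain ASCII stays as it is
    return "".join(out)

escaped = escape_named(u"Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood")
print(escaped)
```

Because the table also contains &, <, > and the double quote, those characters are escaped as well; the apostrophe has no HTML 4 name and is left untouched.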
scharfmn answered 2019-10-08T06:12:47Z
7 votes

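As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote defaults to False in that function, so it is a good idea to pass the quote=True keyword argument along with the string. Even with quote=True, though, the function does not escape single quotes ("'"). (Because of these issues the function has been deprecated since version 3.2, and it was removed in 3.8.)

The suggested replacement for cgi.escape(s) is html.escape(s). (New in version 3.2.)

html.unescape(text) was also introduced in version 3.4.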

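So in Python 3.4 and later you can:

  • Use html.escape(text) to escape the markup-significant characters (&, <, > and quotes); chain .encode('ascii', 'xmlcharrefreplace').decode() to also turn non-ASCII characters into numeric character references.
  • Use html.unescape(text) to convert HTML entities back to their plain-text representation.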
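The two directions can be sketched as a round trip on Python 3.4+. The helper names `to_entities` and `from_entities` and the sample string are assumptions made for this example, not part of the standard library:

```python
import html

def to_entities(text):
    """Escape markup characters, then replace non-ASCII with &#NNN; references."""
    escaped = html.escape(text, quote=True)  # handles &, <, >, " and '
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_entities(text):
    """Turn named and numeric character references back into characters."""
    return html.unescape(text)

sample = "Curé & <b>'Grâce'</b>"
escaped = to_entities(sample)
restored = from_entities(escaped)
print(escaped)
print(restored == sample)
```

Calling html.escape first matters: the & in the original text is escaped before any numeric references are introduced, so the references themselves are not escaped a second time.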
AXO answered 2019-10-08T06:13:48Z
1 votes

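I use the following function to write unicode text ripped from an xls file into an html file while preserving the special characters found in the xls file: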

import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure ASCII data is written through unchanged
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape markup and replace non-ASCII characters
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

Hope this is useful to somebody.

Stephen Ellwood answered 2019-10-08T06:14:20Z
1 votes

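If someone like me is out there wondering why some entity numbers (codes) such as &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not decoded properly, the reason is that those positions are not printable characters in ISO-8859-1 (or in Unicode, where they are C1 control codes). The trademark and euro signs sit at those positions only in Windows-1252, which is often confused with ISO-8859-1.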

Also note that the default character set is utf-8 as of html5, whereas for html4 it is ISO-8859-1.

So we have to work around the problem somehow (find and replace those references first).
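One way to do that find-and-replace is to reinterpret references in the 128-159 range as Windows-1252 bytes. The sketch below is only an illustration: the function name `remap_cp1252_refs` is hypothetical, and it handles decimal references only, not hexadecimal ones.

```python
import html
import re

# Candidate decimal references; the exact 128-159 range is checked in repl().
_C1_DECIMAL_REF = re.compile(r"&#(1[2-5][0-9]);")

def remap_cp1252_refs(text):
    """Replace &#128;..&#159; with the character Windows-1252 assigns to that byte."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)          # outside the C1 range: leave as is
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)          # byte unassigned in Windows-1252
    return _C1_DECIMAL_REF.sub(repl, text)

fixed = remap_cp1252_refs("&#153; &#128; &#169;")
print(fixed)
print(html.unescape(fixed))
```

On Python 3.4+, html.unescape follows the HTML5 rules for invalid character references, which to my understanding already include this remapping, so the extra pass is mostly relevant for older decoders.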

Reference from the Mozilla documentation (a starting point):

[https://developer.mozilla.org/zh-CN/docs/Web/Guide/Localizations_and_character_encodings]

brucekaushik answered 2019-10-08T06:15:12Z
Translated from https://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa