mercurial-scm/hg-stable: mercurial/encoding.py comparison

comparison mercurial/encoding.py @ 27699:c8d3392f76e1

encoding: handle UTF-16 internal limit with fromutf8b (issue5031)

Default builds of Python have a Unicode type that isn't actually full Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP codepoints with surrogate escaping. Since our UTF-8b hack escaping uses a plane that overlaps with the UTF-16 escaping system, this gets extra complicated. In addition, unichr() for codepoints greater than U+FFFF may not work either.

This changes the code to reuse getutf8char to walk the byte string, so we only rely on Python for unpacking our U+DCxx characters.
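To make the overlap concrete, here is a small sketch (Python 3, not part of the changeset) of how UTF-16 splits a non-BMP code point into a surrogate pair. The helper name `utf16_units` is made up for illustration. The low surrogate of U+40000, the code point that "\xf1\x80\x80\x80" decodes to, is U+DC00, so it passes the old `ord(c) & 0xffff00 == 0xdc00` test even though it is not a UTF-8b escape.

```python
def utf16_units(cp):
    """Return the UTF-16 code units for code point cp (illustrative helper)."""
    if cp < 0x10000:
        # BMP code points are stored as a single unit
        return [cp]
    v = cp - 0x10000
    # high surrogate carries the top 10 bits, low surrogate the bottom 10
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


# U+40000 is what the bytes f1 80 80 80 decode to
units = utf16_units(0x40000)

# The old loop tested each unit like this; the low surrogate matches
looks_like_escape = [(u & 0xFFFF00) == 0xDC00 for u in units]
```

On a UTF-16 build the old loop would see these two units as separate characters, which is presumably why the new doctest with "\xf1\x80\x80\x80\x80" was added.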

author	Matt Mackall <mpm@selenic.com>
date	Thu, 07 Jan 2016 14:57:57 -0600
parents	c2effd1ecebf
children	ffa599f3f503

comparison


-:dad6404ccddb
+:c8d3392f76e1
 True
 >>> roundtrip("\\xef\\xbf\\xbd")
 True
 >>> roundtrip("\\xef\\xef\\xbf\\xbd")
 True
+>>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+True
 '''
 # fast path - look for uDxxx prefixes in s
 if "\xed" not in s:
     return s
-u = s.decode("utf-8")
+# We could do this with the unicode type but some Python builds
+# use UTF-16 internally (issue5031) which causes non-BMP code
+# points to be escaped. Instead, we use our handy getutf8char
+# helper again to walk the string without "decoding" it.
 r = ""
-for c in u:
-    if ord(c) & 0xffff00 == 0xdc00:
-        r += chr(ord(c) & 0xff)
-    else:
-        r += c.encode("utf-8")
+pos = 0
+l = len(s)
+while pos < l:
+    c = getutf8char(s, pos)
+    pos += len(c)
+    # unescape U+DCxx characters
+    if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+        c = chr(ord(c.decode("utf-8")) & 0xff)
+    r += c
 return r
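For readers who want to try the byte-walking approach outside Mercurial, the following is a rough Python 3 sketch of the new loop. It is not the code from this changeset: `next_utf8_char` is a hypothetical stand-in for `getutf8char` (whose body is not shown in this diff), the name `from_utf8b` and the lookup table are assumptions, and the `"surrogatepass"` error handler is used because Python 3's UTF-8 codec rejects encoded surrogates by default.

```python
# High nibble of the lead byte -> sequence length. 0 means "treat as one
# byte": either ASCII, or a stray continuation byte that fails validation.
_UTF8LEN = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 4]


def next_utf8_char(s, pos):
    """Hypothetical stand-in for getutf8char.

    Return the bytes of the UTF-8 sequence starting at s[pos]; raise
    UnicodeError if those bytes are not a valid sequence.
    """
    n = _UTF8LEN[s[pos] >> 4] or 1
    c = s[pos:pos + n]
    c.decode("utf-8", "surrogatepass")  # validation only
    return c


def from_utf8b(s):
    """Turn U+DCxx escapes in a UTF-8b byte string back into raw bytes."""
    # fast path: every U+DCxx escape starts with the byte 0xed
    if b"\xed" not in s:
        return s
    r = b""
    pos = 0
    while pos < len(s):
        c = next_utf8_char(s, pos)
        pos += len(c)
        # U+DC00..U+DCFF encode as ed b0 80 .. ed b3 bf
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            c = bytes([ord(c.decode("utf-8", "surrogatepass")) & 0xFF])
        r += c
    return r


# A 4-byte sequence followed by an escaped 0x80 byte (U+DC80)
restored = from_utf8b(b"\xf1\x80\x80\x80\xed\xb2\x80")
```

Because the loop compares encoded byte sequences rather than decoded characters, a 4-byte non-BMP sequence is copied through untouched and only the 3-byte U+DCxx encodings are unpacked.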
