mercurial-scm/hg: mercurial/encoding.py comparison

comparison mercurial/encoding.py @ 26879:a24b98f4e03c

encoding: re-escape U+DCxx characters in toutf8b input (issue4927)

This is the final missing piece in fully round-tripping random byte strings through UTF-8b. While this issue means that UTF-8 <-> UTF-8b isn't fully bijective, we don't expect to ever see U+DCxx codepoints in "real" UTF-8 data, so it should remain bijective in practice.

author	Matt Mackall <mpm@selenic.com>
date	Thu, 05 Nov 2015 17:30:10 -0600
parents	d7e83f106459
children	de5ae97ce9f4

-:d7e83f106459
+:a24b98f4e03c
     arbitrary bytes into an internal Unicode format that can be
     re-encoded back into the original. Here we are exposing the
     internal surrogate encoding as a UTF-8 string.)
     '''
-    if isinstance(s, localstr):
-        return s._utf8
+    if "\xed" not in s:
+        if isinstance(s, localstr):
+            return s._utf8
         try:
             s.decode('utf-8')
             return s
         except UnicodeDecodeError:
             pass
     r = ""
     pos = 0
     l = len(s)
     while pos < l:
         try:
             c = getutf8char(s, pos)
-            pos += len(c)
+            if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+                # have to re-escape existing U+DCxx characters
+                c = unichr(0xdc00 + ord(s[pos])).encode('utf-8')
+                pos += 1
+            else:
+                pos += len(c)
         except UnicodeDecodeError:
             c = unichr(0xdc00 + ord(s[pos])).encode('utf-8')
             pos += 1
         r += c
     return r
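
To see why the re-escape matters, here is a minimal Python 3 sketch of the patched logic. It is not the Mercurial code itself. The names `get_utf8_char`, `escape_byte`, `to_utf8b` and `from_utf8b` are hypothetical stand-ins, the `localstr` fast path is left out, and `from_utf8b` is an assumed inverse written only for this illustration. The `surrogatepass` error handler is used so that Python 3 accepts encoded surrogates the way the Python 2 codec in the diff does.

```python
# Sketch only: a Python 3 approximation of the patched toutf8b logic.
# All names here are hypothetical; Mercurial's localstr handling is omitted.

# Expected sequence length by the high nibble of the lead byte (0 = ASCII).
_UTF8LEN = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4]


def get_utf8_char(s: bytes, pos: int) -> bytes:
    """Return the UTF-8 sequence at pos, or raise UnicodeDecodeError."""
    n = _UTF8LEN[s[pos] >> 4]
    if n == 0:
        return s[pos:pos + 1]
    c = s[pos:pos + n]
    # Validate by decoding; surrogatepass lets encoded surrogates through.
    c.decode("utf-8", "surrogatepass")
    return c


def escape_byte(b: int) -> bytes:
    """Encode one raw byte as the UTF-8 form of U+DC00 + b."""
    return chr(0xDC00 + b).encode("utf-8", "surrogatepass")


def to_utf8b(s: bytes) -> bytes:
    # Fast path: without a 0xED byte there can be no encoded U+DCxx,
    # so valid UTF-8 can be returned unchanged.
    if b"\xed" not in s:
        try:
            s.decode("utf-8")
            return s
        except UnicodeDecodeError:
            pass
    out = bytearray()
    pos = 0
    while pos < len(s):
        try:
            c = get_utf8_char(s, pos)
            if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
                # Existing U+DCxx: escape its first byte and advance by one,
                # so the remaining bytes are escaped on later iterations.
                c = escape_byte(s[pos])
                pos += 1
            else:
                pos += len(c)
        except UnicodeDecodeError:
            c = escape_byte(s[pos])
            pos += 1
        out += c
    return bytes(out)


def from_utf8b(s: bytes) -> bytes:
    """Assumed inverse: map each U+DCxx back to the raw byte xx."""
    out = bytearray()
    for ch in s.decode("utf-8", "surrogatepass"):
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp & 0xFF)
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)
```

With this sketch, an input that already contains an encoded U+DC80 (`b"\xed\xb2\x80"`) is turned into three escapes, one per original byte, so `from_utf8b(to_utf8b(data))` gives back `data`. Without the re-escape branch, that sequence would pass through untouched and the inverse would collapse it to the single byte `0x80`.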
