mercurial-scm/hg-stable: mercurial/encoding.py comparison

comparison mercurial/encoding.py @ 26878:d7e83f106459

encoding: use getutf8char in toutf8b

This correctly avoids the ambiguity of U+FFFD already present in the input and similar confusion by working a character at a time.

author	Matt Mackall <mpm@selenic.com>
date	Thu, 05 Nov 2015 17:21:43 -0600
parents	cb467a9d7593
children	a24b98f4e03c

comparison


-:cb467a9d7593
+:d7e83f106459
     try:
         s.decode('utf-8')
         return s
     except UnicodeDecodeError:
-        # surrogate-encode any characters that don't round-trip
-        s2 = s.decode('utf-8', 'ignore').encode('utf-8')
-        r = ""
-        pos = 0
-        for c in s:
-            if s2[pos:pos + 1] == c:
-                r += c
-                pos += 1
-            else:
-                r += unichr(0xdc00 + ord(c)).encode('utf-8')
-        return r
+        pass
+    r = ""
+    pos = 0
+    l = len(s)
+    while pos < l:
+        try:
+            c = getutf8char(s, pos)
+            pos += len(c)
+        except UnicodeDecodeError:
+            c = unichr(0xdc00 + ord(s[pos])).encode('utf-8')
+            pos += 1
+        r += c
+    return r
 def fromutf8b(s):
 '''Given a UTF-8b string, return a local, possibly-binary string.
 return the original binary string. This
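
The character-at-a-time walk introduced by this change can be sketched as follows. This is a minimal Python 3 illustration, not the Mercurial code itself: the changeset targets Python 2 byte strings, and the body of getutf8char is not shown in this comparison, so `get_utf8_char` below is a hypothetical stand-in that guesses the sequence length from the lead byte and lets the strict UTF-8 decoder validate it. The names `get_utf8_char` and `to_utf8b` and the use of the `surrogatepass` error handler are assumptions made for the sketch.

```python
def get_utf8_char(s: bytes, pos: int) -> bytes:
    """Hypothetical stand-in for getutf8char.

    Return the well-formed UTF-8 sequence starting at s[pos], or raise
    UnicodeDecodeError if the bytes there do not form one.
    """
    lead = s[pos]
    if lead < 0x80:
        length = 1
    elif 0xC0 <= lead < 0xE0:
        length = 2
    elif 0xE0 <= lead < 0xF0:
        length = 3
    elif 0xF0 <= lead < 0xF8:
        length = 4
    else:
        length = 1  # stray continuation byte or invalid lead byte
    c = s[pos:pos + length]
    c.decode("utf-8")  # strict validation; raises on malformed input
    return c


def to_utf8b(s: bytes) -> bytes:
    """Sketch of the new toutf8b loop for Python 3 bytes."""
    try:
        s.decode("utf-8")
        return s  # already valid UTF-8, nothing to escape
    except UnicodeDecodeError:
        pass

    out = bytearray()
    pos = 0
    while pos < len(s):
        try:
            # Copy one complete, valid character unchanged.
            c = get_utf8_char(s, pos)
            pos += len(c)
        except UnicodeDecodeError:
            # Map the offending byte into U+DC00..U+DCFF and advance by
            # exactly one byte, as the loop in the diff does.
            c = chr(0xDC00 + s[pos]).encode("utf-8", "surrogatepass")
            pos += 1
        out += c
    return bytes(out)
```

Because each step consumes either one whole valid character or one invalid byte, a genuine U+FFFD in the input (bytes `EF BF BD`) is copied through untouched, and only bytes that fail validation are surrogate-encoded.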
