comparison mercurial/encoding.py @ 27699:c8d3392f76e1

encoding: handle UTF-16 internal limit with fromutf8b (issue5031)

Default builds of Python have a Unicode type that isn't actually full Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP codepoints with surrogate escaping. Since our UTF-8b hack escaping uses a plane that overlaps with the UTF-16 escaping system, this gets extra complicated. In addition, unichr() for codepoints greater than U+FFFF may not work either.

This changes the code to reuse getutf8char to walk the byte string, so we only rely on Python for unpacking our U+DCxx characters.
author Matt Mackall <mpm@selenic.com>
date Thu, 07 Jan 2016 14:57:57 -0600
parents c2effd1ecebf
children ffa599f3f503
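
The overlap described in the commit message can be illustrated with a small sketch. It is written for Python 3, where str is never stored as UTF-16, so the surrogate pair is computed by hand. The helper names utf16_units and old_escape_test are hypothetical and do not exist in Mercurial; the mask check mirrors the test in the replaced loop.

```python
def utf16_units(codepoint):
    """Return the UTF-16 code units used to store one Unicode code point."""
    if codepoint < 0x10000:
        return [codepoint]
    v = codepoint - 0x10000
    # high surrogate (U+D800..U+DBFF), then low surrogate (U+DC00..U+DFFF)
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def old_escape_test(unit):
    """The check the replaced loop applied to each element of the decoded string."""
    return (unit & 0xFFFF00) == 0xDC00


# The first four bytes of the new doctest input form a valid non-BMP character.
char = b"\xf1\x80\x80\x80".decode("utf-8")   # U+40000
units = utf16_units(ord(char))

print([hex(u) for u in units])      # ['0xd8c0', '0xdc00']
print(old_escape_test(units[1]))    # True
```

On a build that stores U+40000 as these two units, the low surrogate 0xDC00 passes the old check and would be taken for an escaped 0x00 byte, which is the failure the new doctest guards against.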
--- mercurial/encoding.py @ 27698:dad6404ccddb
+++ mercurial/encoding.py @ 27699:c8d3392f76e1
@@ -514,19 +514,29 @@
     True
     >>> roundtrip("\\xef\\xbf\\xbd")
     True
     >>> roundtrip("\\xef\\xef\\xbf\\xbd")
     True
+    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+    True
     '''
 
     # fast path - look for uDxxx prefixes in s
     if "\xed" not in s:
         return s
 
-    u = s.decode("utf-8")
+    # We could do this with the unicode type but some Python builds
+    # use UTF-16 internally (issue5031) which causes non-BMP code
+    # points to be escaped. Instead, we use our handy getutf8char
+    # helper again to walk the string without "decoding" it.
+
     r = ""
-    for c in u:
-        if ord(c) & 0xffff00 == 0xdc00:
-            r += chr(ord(c) & 0xff)
-        else:
-            r += c.encode("utf-8")
+    pos = 0
+    l = len(s)
+    while pos < l:
+        c = getutf8char(s, pos)
+        pos += len(c)
+        # unescape U+DCxx characters
+        if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+            c = chr(ord(c.decode("utf-8")) & 0xff)
+        r += c
     return r
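
For readers who want to try the byte-walking approach outside Mercurial, here is a minimal Python 3 sketch of the same idea. It is not the code from this changeset. next_utf8_char is a hypothetical stand-in for getutf8char that only slices by lead byte and does no validation, from_utf8b is an illustrative name, and the "surrogatepass" error handler is needed because Python 3's strict UTF-8 codec rejects encoded surrogates.

```python
# Sequence length implied by the high nibble of a lead byte.
# ASCII and stray continuation bytes are treated as single bytes.
_UTF8_LEN = [1] * 12 + [2, 2, 3, 4]


def next_utf8_char(s, pos):
    """Hypothetical stand-in for getutf8char: slice one encoded character."""
    return s[pos:pos + _UTF8_LEN[s[pos] >> 4]]


def from_utf8b(s):
    """Turn UTF-8b escapes (encoded U+DC00..U+DCFF) back into raw bytes."""
    # fast path: every escaped character starts with the byte 0xED
    if b"\xed" not in s:
        return s

    out = bytearray()
    pos = 0
    while pos < len(s):
        c = next_utf8_char(s, pos)
        pos += len(c)
        # b"\xed\xb0\x80" is U+DC00 and b"\xed\xb3\xbf" is U+DCFF
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            c = bytes([ord(c.decode("utf-8", "surrogatepass")) & 0xFF])
        out += c
    return bytes(out)


# U+40000 followed by an escaped 0x80 byte, as in the new doctest
escaped = b"\xf1\x80\x80\x80" + b"\xed\xb2\x80"
print(from_utf8b(escaped))   # b'\xf1\x80\x80\x80\x80'
```

Because the four-byte sequence is copied through as bytes and never becomes a Unicode string, the internal representation of non-BMP code points cannot affect the result. Only the three-byte escape is decoded.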