comparison mercurial/encoding.py @ 27699:c8d3392f76e1
encoding: handle UTF-16 internal limit with fromutf8b (issue5031)
Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping
uses a plane that overlaps with the UTF-16 escaping system, this gets
extra complicated. In addition, unichr() for codepoints greater than
U+FFFF may not work either.
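To make the overlap concrete, here is a small Python 3 sketch (the changeset itself targets Python 2). The helper names `utf16_units` and `old_escape_check` are made up for illustration; the second mirrors the `ord(c) & 0xffff00 == 0xdc00` test from the old code. Assuming a build that iterates over UTF-16 code units, the low half of a surrogate pair can pass that test and be mistaken for an escaped byte:

```python
def utf16_units(cp):
    """Split one code point into the UTF-16 code units a narrow build stores."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # high surrogate carries the top 10 bits, low surrogate the bottom 10
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def old_escape_check(unit):
    """Same mask as the old loop: true for anything in U+DC00..U+DCFF."""
    return (unit & 0xFFFF00) == 0xDC00


units = utf16_units(0x1F4A9)  # an arbitrary non-BMP code point
print([hex(u) for u in units])               # ['0xd83d', '0xdca9']
print([old_escape_check(u) for u in units])  # [False, True]
```

Only low surrogates that happen to land in U+DC00..U+DCFF collide this way, which is presumably why the problem shows up for some non-BMP input and not all of it.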
This changes the code to reuse getutf8char to walk the byte string, so we
only rely on Python for unpacking our U+DCxx characters.
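The "unpacking" step can be illustrated on a single character. The following is a Python 3 sketch, not the code from the changeset: `unescape_char` is a hypothetical name, and the `"surrogatepass"` error handler is needed because Python 3's UTF-8 codec rejects lone surrogates, which the Python 2 codec used by the original code accepts. It assumes `c` is exactly one well-formed UTF-8 sequence.

```python
# UTF-8 encodings of the first and last U+DCxx code points
LOW = "\udc00".encode("utf-8", "surrogatepass")   # b"\xed\xb0\x80"
HIGH = "\udcff".encode("utf-8", "surrogatepass")  # b"\xed\xb3\xbf"


def unescape_char(c):
    """Turn one encoded U+DCxx character back into its raw byte.

    Any other character is returned unchanged.
    """
    if LOW <= c <= HIGH:
        cp = ord(c.decode("utf-8", "surrogatepass"))
        return bytes([cp & 0xFF])  # keep only the low byte: U+DCxx -> 0xxx
    return c


print(unescape_char(b"\xed\xb2\x80"))  # b'\x80'
print(unescape_char(b"\xc3\xa9"))      # b'\xc3\xa9'
```

Because every code point in U+D000..U+DFFF encodes with a leading `0xED` byte, a byte string without `0xED` cannot contain an escape, which is what the fast path relies on.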
author | Matt Mackall <mpm@selenic.com> |
---|---|
date | Thu, 07 Jan 2016 14:57:57 -0600 |
parents | c2effd1ecebf |
children | ffa599f3f503 |
comparison
27698:dad6404ccddb | 27699:c8d3392f76e1 |
---|---|
514 True | 514 True |
515 >>> roundtrip("\\xef\\xbf\\xbd") | 515 >>> roundtrip("\\xef\\xbf\\xbd") |
516 True | 516 True |
517 >>> roundtrip("\\xef\\xef\\xbf\\xbd") | 517 >>> roundtrip("\\xef\\xef\\xbf\\xbd") |
518 True | 518 True |
| 519 >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80") |
| 520 True |
519 ''' | 521 ''' |
520 | 522 |
521 # fast path - look for uDxxx prefixes in s | 523 # fast path - look for uDxxx prefixes in s |
522 if "\xed" not in s: | 524 if "\xed" not in s: |
523 return s | 525 return s |
524 | 526 |
525 u = s.decode("utf-8") | 527 # We could do this with the unicode type but some Python builds |
| 528 # use UTF-16 internally (issue5031) which causes non-BMP code |
| 529 # points to be escaped. Instead, we use our handy getutf8char |
| 530 # helper again to walk the string without "decoding" it. |
| 531 |
526 r = "" | 532 r = "" |
527 for c in u: | 533 pos = 0 |
528 if ord(c) & 0xffff00 == 0xdc00: | 534 l = len(s) |
529 r += chr(ord(c) & 0xff) | 535 while pos < l: |
530 else: | 536 c = getutf8char(s, pos) |
531 r += c.encode("utf-8") | 537 pos += len(c) |
| 538 # unescape U+DCxx characters |
| 539 if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf": |
| 540 c = chr(ord(c.decode("utf-8")) & 0xff) |
| 541 r += c |
532 return r | 542 return r |
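
For readers on Python 3, the new loop in the right-hand column can be approximated as follows. This is a rough sketch under stated assumptions, not Mercurial's implementation: `utf8_char_at` is a hypothetical stand-in for `getutf8char` that only inspects the lead byte, `from_utf8b_sketch` is an invented name, and the input is assumed to be well-formed UTF-8b already.

```python
def utf8_char_at(s, pos):
    """Hypothetical stand-in for getutf8char: slice out one UTF-8 sequence."""
    lead = s[pos]
    if lead < 0x80:
        size = 1
    elif 0xC2 <= lead <= 0xDF:
        size = 2
    elif 0xE0 <= lead <= 0xEF:
        size = 3
    elif 0xF0 <= lead <= 0xF4:
        size = 4
    else:
        raise ValueError("not a UTF-8 lead byte at offset %d" % pos)
    c = s[pos:pos + size]
    if len(c) != size:
        raise ValueError("truncated sequence at offset %d" % pos)
    return c


def from_utf8b_sketch(s):
    """Undo U+DCxx escaping by walking bytes, never building a str of s."""
    # fast path: every U+Dxxx code point encodes with a 0xED lead byte
    if b"\xed" not in s:
        return s
    out = bytearray()
    pos = 0
    while pos < len(s):
        c = utf8_char_at(s, pos)
        pos += len(c)
        # unescape U+DCxx characters; only this 3-byte unit is ever decoded
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            c = bytes([ord(c.decode("utf-8", "surrogatepass")) & 0xFF])
        out += c
    return bytes(out)


# A valid 4-byte character followed by an escaped stray 0x80 byte,
# i.e. the escaped form of the byte string in the new doctest.
print(from_utf8b_sketch(b"\xf1\x80\x80\x80\xed\xb2\x80"))
# b'\xf1\x80\x80\x80\x80'
```

The non-BMP character passes through as an opaque 4-byte slice, so the width of the interpreter's internal Unicode representation never comes into play.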