mercurial-scm/hg-stable: mercurial/encoding.py comparison

comparison mercurial/encoding.py @ 50434:95acba2c29f6

encoding: avoid quadratic time complexity when json-encoding non-UTF8 strings

Apparently the code uses "+=" with a bytes object, which is linear-time, so the whole encoding is quadratic-time. This patch makes us use a bytearray object instead, which has an amortized constant-time append operation. The encoding is still not particularly fast, but at least a 10MB file takes tens of seconds, not many hours, to encode.

author	Arseniy Alekseyev <aalekseyev@janestreet.com>
date	Mon, 06 Mar 2023 11:27:57 +0000
parents	d44e3c45f0e4
children	18c8c18993f0
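
The complexity argument in the commit message can be illustrated with a small cost model. This is only a sketch: it assumes, as the message describes, that every "+=" on an immutable bytes object writes out the whole accumulated buffer again. The function names are hypothetical and are not part of Mercurial.

```python
def copied_bytes_immutable(n_chunks, chunk_len=1):
    """Model the bytes written when each `r += c` builds a new bytes object.

    Assumption: every append rewrites the entire accumulated buffer, so
    the total is chunk_len * (1 + 2 + ... + n_chunks), i.e. quadratic.
    """
    total = 0
    size = 0
    for _ in range(n_chunks):
        size += chunk_len
        total += size  # the whole new object is written out again
    return total


def build_with_bytearray(chunks):
    """Accumulate chunks in a mutable buffer, as the patch does."""
    r = bytearray()
    for c in chunks:
        r += c  # in-place extend: amortized constant time per byte
    return bytes(r)  # one final copy back to an immutable bytes object


if __name__ == "__main__":
    # 1000 one-byte appends already cost about half a million byte writes
    print(copied_bytes_immutable(1000))
    print(build_with_bytearray([b"a", b"bc"]))
```

Under this model the cost grows with the square of the number of appended chunks, which is consistent with the jump from "many hours" to "tens of seconds" reported for a 10MB input.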

comparison


-:bcf54837241d
+:95acba2c29f6
             return s
         except UnicodeDecodeError:
             pass
     s = pycompat.bytestr(s)
-    r = b""
+    r = bytearray()
     pos = 0
     l = len(s)
     while pos < l:
         try:
             c = getutf8char(s, pos)
             pos += len(c)
         except UnicodeDecodeError:
             c = unichr(0xDC00 + ord(s[pos])).encode('utf-8', _utf8strict)
             pos += 1
         r += c
-    return r
+    return bytes(r)
 def fromutf8b(s):
     # type: (bytes) -> bytes
     """Given a UTF-8b string, return a local, possibly-binary string.
     # use UTF-16 internally (issue5031) which causes non-BMP code
     # points to be escaped. Instead, we use our handy getutf8char
     # helper again to walk the string without "decoding" it.
     s = pycompat.bytestr(s)
-    r = b""
+    r = bytearray()
     pos = 0
     l = len(s)
     while pos < l:
         c = getutf8char(s, pos)
         pos += len(c)
         # unescape U+DCxx characters
         if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
             c = pycompat.bytechr(ord(c.decode("utf-8", _utf8strict)) & 0xFF)
         r += c
-    return r
+    return bytes(r)
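
The escaping loop patched above can be tried outside Mercurial with a minimal Python 3 sketch. It is not the project's implementation: next_utf8_char is a hypothetical stand-in for getutf8char, to_utf8b_sketch is a hypothetical name, and the "surrogatepass" error handler is assumed to play the role of _utf8strict. Unlike the real function, the sketch has no fast path for input that is already valid UTF-8.

```python
def next_utf8_char(s: bytes, pos: int) -> bytes:
    """Return the bytes of the single UTF-8 character starting at pos.

    Hypothetical stand-in for Mercurial's getutf8char. Raises
    UnicodeDecodeError if no valid character starts at pos.
    """
    last_error = None
    for width in (1, 2, 3, 4):
        chunk = s[pos:pos + width]
        try:
            chunk.decode("utf-8")  # strict: rejects invalid bytes and surrogates
            return chunk
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


def to_utf8b_sketch(s: bytes) -> bytes:
    """Escape every undecodable byte 0xNN as the UTF-8 form of U+DCNN."""
    r = bytearray()  # mutable buffer, so each append is amortized constant time
    pos = 0
    l = len(s)
    while pos < l:
        try:
            c = next_utf8_char(s, pos)
            pos += len(c)
        except UnicodeDecodeError:
            # assumption: "surrogatepass" stands in for _utf8strict here
            c = chr(0xDC00 + s[pos]).encode("utf-8", "surrogatepass")
            pos += 1
        r += c
    return bytes(r)  # callers still receive an immutable bytes object


if __name__ == "__main__":
    print(to_utf8b_sketch(b"a\xffb"))
```

Returning bytes(r) rather than the bytearray itself keeps the return type unchanged for callers, which matches the "+return bytes(r)" lines in the diff.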
