comparison mercurial/patch.py @ 37732:35632d392279

patch: implement a new worddiff algorithm The previous worddiff algorithm has many problems. The major problem is it does a "similarity check" that selects a subset of matched lines to do inline diffs. It is a bad idea because: - The "similarity check" is non-obvious to users. For example, a simple change from "long long x" to "int64_t x" will fail the similarity check and won't be diff-ed as expected. - Selecting "lines" to diff won't work as people expect if there are line wrapping changes. - It has a sad time complexity if lines do not match, could be O(N^2)-ish. There are other problems in implementation details. - Lines can match across distant hunks (if the next hunk does not have "-" lines). - "difflib" is slow. The solution would be removing the "similarity check", and just diff all words in a same hunk. So no content will be missed and everything will be diff-ed as expected. This is similar to what code review tool like Phabricator does. This diff implements the word diff algorithm as described above. It also avoids difflib to be faster. Note about colors: To be consistent, "changed inserted" parts and "purely insertion blocks" should have a same color, since they do not exist in the previous version. Instead of highlighting differences, this patch chooses to dim common parts. This is also more consistent with Phabricator or GitHub webpage. That said, the labels are defined in a way that people can still highlight changed parts and leave purely inserted/deleted hunks use the "non-highlighted" color. As one example, running: hg log -pr df50b87d8f736aff8dc281f816bddcd6f306930c mercurial/commands.py \ --config experimental.worddiff=1 --color=debug --config diff.unified=0 The previous algorithm outputs: [diff.file_a|--- a/mercurial/commands.py Fri Mar 09 15:53:41 2018 +0100] [diff.file_b|+++ b/mercurial/commands.py Sat Mar 10 12:33:19 2018 +0530] [diff.hunk|@@ -2039,1 +2039,4 @@] [diff.deleted|-][diff.deleted.highlight|@command('^forget',][diff.deleted| ][diff.deleted.highlight|walkopts,][diff.deleted| _('[OPTION]... FILE...'), inferrepo=True)] [diff.inserted|+@command(] [diff.inserted|+ '^forget',] [diff.inserted|+ walkopts + dryrunopts,] [diff.inserted|+ ][diff.inserted.highlight| ][diff.inserted| _('[OPTION]... FILE...'), inferrepo=True)] [diff.hunk|@@ -2074,1 +2077,3 @@] [diff.deleted|- rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.highlight| explicitonly=False)[0]] [diff.inserted|+ dryrun = opts.get(r'dry_run')] [diff.inserted|+ rejected = cmdutil.forget(ui, repo, m, prefix="",] [diff.inserted|+ explicitonly=False, dryrun=dryrun)[0]] The new algorithm outputs: [diff.file_a|--- a/mercurial/commands.py Fri Mar 09 15:53:41 2018 +0100] [diff.file_b|+++ b/mercurial/commands.py Sat Mar 10 12:33:19 2018 +0530] [diff.hunk|@@ -2039,1 +2039,4 @@] [diff.deleted|-][diff.deleted.unchanged|@command(][diff.deleted.unchanged|'^forget',][diff.deleted.unchanged| ][diff.deleted.changed|walkopts][diff.deleted.unchanged|,][diff.deleted.changed| ][diff.deleted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)] [diff.inserted|+][diff.inserted.unchanged|@command(] [diff.inserted|+][diff.inserted.changed| ][diff.inserted.unchanged|'^forget',] [diff.inserted|+][diff.inserted.changed| walkopts][diff.inserted.unchanged| ][diff.inserted.changed|+ dryrunopts][diff.inserted.unchanged|,] [diff.inserted|+][diff.inserted.changed| ][diff.inserted.unchanged|_('[OPTION]... FILE...'), inferrepo=True)] [diff.hunk|@@ -2074,1 +2077,3 @@] [diff.deleted|-][diff.deleted.unchanged| rejected = cmdutil.forget(ui, repo, m, prefix="",][diff.deleted.changed| ][diff.deleted.unchanged|explicitonly=False][diff.deleted.unchanged|)[0]] [diff.inserted|+][diff.inserted.changed| dryrun = opts.get(r'dry_run')] [diff.inserted|+][diff.inserted.unchanged| rejected = cmdutil.forget(ui, repo, m, prefix="",] [diff.inserted|+][diff.inserted.changed| ][diff.inserted.unchanged|explicitonly=False][diff.inserted.changed|, dryrun=dryrun][diff.inserted.unchanged|)[0]] Practically, when diffing a 8k line change, the time spent on worddiff reduces from 4 seconds to 0.14 seconds. Differential Revision: https://phab.mercurial-scm.org/D3212
author Jun Wu <quark@fb.com>
date Mon, 09 Apr 2018 15:58:30 -0700
parents 5471348921c1
children 5e67c20915a7
comparison
equal deleted inserted replaced
37731:5471348921c1 37732:35632d392279
48 48
49 stringio = util.stringio 49 stringio = util.stringio
50 50
51 gitre = re.compile(br'diff --git a/(.*) b/(.*)') 51 gitre = re.compile(br'diff --git a/(.*) b/(.*)')
52 tabsplitter = re.compile(br'(\t+|[^\t]+)') 52 tabsplitter = re.compile(br'(\t+|[^\t]+)')
53 _nonwordre = re.compile(br'([^a-zA-Z0-9_\x80-\xff])') 53 wordsplitter = re.compile(br'(\t+| +|[a-zA-Z0-9_\x80-\xff]+|'
54 '[^ \ta-zA-Z0-9_\x80-\xff])')
54 55
55 PatchError = error.PatchError 56 PatchError = error.PatchError
56 57
57 # public functions 58 # public functions
58 59
2502 if chompline != stripline: 2503 if chompline != stripline:
2503 yield (chompline[len(stripline):], 'diff.trailingwhitespace') 2504 yield (chompline[len(stripline):], 'diff.trailingwhitespace')
2504 if chompline != line: 2505 if chompline != line:
2505 yield (line[len(chompline):], '') 2506 yield (line[len(chompline):], '')
2506 2507
2508 def diffsinglehunkinline(hunklines):
2509 """yield tokens for a list of lines in a single hunk, with inline colors"""
2510 # prepare deleted, and inserted content
2511 a = ''
2512 b = ''
2513 for line in hunklines:
2514 if line[0] == '-':
2515 a += line[1:]
2516 elif line[0] == '+':
2517 b += line[1:]
2518 else:
2519 raise error.ProgrammingError('unexpected hunk line: %s' % line)
2520 # fast path: if either side is empty, use diffsinglehunk
2521 if not a or not b:
2522 for t in diffsinglehunk(hunklines):
2523 yield t
2524 return
2525 # re-split the content into words
2526 al = wordsplitter.findall(a)
2527 bl = wordsplitter.findall(b)
2528 # re-arrange the words to lines since the diff algorithm is line-based
2529 aln = [s if s == '\n' else s + '\n' for s in al]
2530 bln = [s if s == '\n' else s + '\n' for s in bl]
2531 an = ''.join(aln)
2532 bn = ''.join(bln)
2533 # run the diff algorithm, prepare atokens and btokens
2534 atokens = []
2535 btokens = []
2536 blocks = mdiff.allblocks(an, bn, lines1=aln, lines2=bln)
2537 for (a1, a2, b1, b2), btype in blocks:
2538 changed = btype == '!'
2539 for token in mdiff.splitnewlines(''.join(al[a1:a2])):
2540 atokens.append((changed, token))
2541 for token in mdiff.splitnewlines(''.join(bl[b1:b2])):
2542 btokens.append((changed, token))
2543
2544 # yield deleted tokens, then inserted ones
2545 for prefix, label, tokens in [('-', 'diff.deleted', atokens),
2546 ('+', 'diff.inserted', btokens)]:
2547 nextisnewline = True
2548 for changed, token in tokens:
2549 if nextisnewline:
2550 yield (prefix, label)
2551 nextisnewline = False
2552 # special handling line end
2553 isendofline = token.endswith('\n')
2554 if isendofline:
2555 chomp = token[:-1] # chomp
2556 token = chomp.rstrip() # detect spaces at the end
2557 endspaces = chomp[len(token):]
2558 # scan tabs
2559 for maybetab in tabsplitter.findall(token):
2560 if '\t' == maybetab[0]:
2561 currentlabel = 'diff.tab'
2562 else:
2563 if changed:
2564 currentlabel = label + '.changed'
2565 else:
2566 currentlabel = label + '.unchanged'
2567 yield (maybetab, currentlabel)
2568 if isendofline:
2569 if endspaces:
2570 yield (endspaces, 'diff.trailingwhitespace')
2571 yield ('\n', '')
2572 nextisnewline = True
2573
2507 def difflabel(func, *args, **kw): 2574 def difflabel(func, *args, **kw):
2508 '''yields 2-tuples of (output, label) based on the output of func()''' 2575 '''yields 2-tuples of (output, label) based on the output of func()'''
2576 if kw.get(r'opts') and kw[r'opts'].worddiff:
2577 dodiffhunk = diffsinglehunkinline
2578 else:
2579 dodiffhunk = diffsinglehunk
2509 headprefixes = [('diff', 'diff.diffline'), 2580 headprefixes = [('diff', 'diff.diffline'),
2510 ('copy', 'diff.extended'), 2581 ('copy', 'diff.extended'),
2511 ('rename', 'diff.extended'), 2582 ('rename', 'diff.extended'),
2512 ('old', 'diff.extended'), 2583 ('old', 'diff.extended'),
2513 ('new', 'diff.extended'), 2584 ('new', 'diff.extended'),
2523 2594
2524 # buffers a hunk, i.e. adjacent "-", "+" lines without other changes. 2595 # buffers a hunk, i.e. adjacent "-", "+" lines without other changes.
2525 hunkbuffer = [] 2596 hunkbuffer = []
2526 def consumehunkbuffer(): 2597 def consumehunkbuffer():
2527 if hunkbuffer: 2598 if hunkbuffer:
2528 for token in diffsinglehunk(hunkbuffer): 2599 for token in dodiffhunk(hunkbuffer):
2529 yield token 2600 yield token
2530 hunkbuffer[:] = [] 2601 hunkbuffer[:] = []
2531 2602
2532 for chunk in func(*args, **kw): 2603 for chunk in func(*args, **kw):
2533 lines = chunk.split('\n') 2604 lines = chunk.split('\n')