mercurial/worker.py @ 26117:4dc5b51f38fe
revlog: change generaldelta delta parent heuristic
The old generaldelta heuristic was "if p1 (or p2) was closer than the last full text,
use it, otherwise use prev". This was problematic when a repo contained multiple
branches that were very different. If commits to branch A were pushed, and the
last full text was branch B, it would generate a fulltext. Then if branch B was
pushed, it would generate another fulltext. The problem is that the last
fulltext (and delta'ing against `prev` in general) has no correlation with the
contents of the incoming revision, and therefore will always have degenerate
cases.
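To make the failure mode concrete, here is a rough sketch of the old selection
logic (illustrative only, not the actual revlog code; `lastfulltextrev` is a
hypothetical stand-in for the revision number of the last stored fulltext):

def _olddeltaparent(p1r, p2r, prevr, lastfulltextrev):
    # old heuristic: use a parent only if it is closer to the tip
    # (a higher revision number) than the last stored fulltext
    parentr = max(p1r, p2r)
    if parentr > lastfulltextrev:
        return parentr
    # otherwise delta against prev, whose content may have nothing
    # in common with the incoming revision (e.g. another branch)
    return prevr

With interleaved pushes to two divergent branches, prev and the last fulltext
keep landing on the other branch, so each push produces a giant delta and a
fresh fulltext.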
According to the blame, that algorithm was chosen to minimize the chain length.
Since there is already code that protects against long chains (the
delta-vs-fulltext code), and since that code has been improved since the
original generaldelta algorithm went in (2011), I believe the chain length
criterion will still be respected.
The new algorithm always diffs against p1 (or p2 if it's closer), unless the
resulting delta will fail the delta-vs-fulltext check, in which case we delta
against prev.
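A minimal sketch of the new selection, under the same caveat: `parentdeltasize`
and `textlen` are hypothetical stand-ins for the size of the candidate delta
and of the fulltext, and the "smaller than half a fulltext" rule below is only
an approximation of the real delta-vs-fulltext check (which also bounds the
chain length):

def _newdeltaparent(p1r, p2r, prevr, parentdeltasize, textlen):
    # new heuristic: always prefer a parent, picking whichever of
    # p1/p2 is closer to the tip
    parentr = max(p1r, p2r)
    # fall back to prev only when the parent delta would be rejected
    # by the delta-vs-fulltext check
    if parentdeltasize(parentr) < textlen // 2:
        return parentr
    return prevr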
Some before-and-after stats on manifest.d size:

internal large repo
  old heuristic - 2.0 GB
  new heuristic - 1.2 GB

mozilla-central
  old heuristic - 242 MB
  new heuristic - 261 MB
The regression in mozilla-central is due to the new heuristic choosing p2r as
the delta base when it's closer to the tip. Switching the algorithm to always
prefer p1r brings the size back down (242 MB). This is a result of the way in
which mozilla does merges and pushes, and the result could easily swing the
other direction in other repos (depending on whether they merge X into Y or Y
into X), but it will never be as degenerate as before.
A future patch will address the regression by introducing an optional, even
more aggressive delta heuristic which will knock the mozilla manifest size down
dramatically.
author    Durham Goode <durham@fb.com>
date      Sun, 30 Aug 2015 13:58:11 -0700
parents   d29859cfcfc2
children  c0501c26b05c
# worker.py - master-slave parallelism support
#
# Copyright 2013 Facebook, Inc.
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.

from __future__ import absolute_import

import errno
import multiprocessing
import os
import signal
import sys
import threading

from .i18n import _
from . import util

def countcpus():
    '''try to count the number of CPUs on the system'''
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return 1

def _numworkers(ui):
    s = ui.config('worker', 'numcpus')
    if s:
        try:
            n = int(s)
            if n >= 1:
                return n
        except ValueError:
            raise util.Abort(_('number of cpus must be an integer'))
    return min(max(countcpus(), 4), 32)

if os.name == 'posix':
    _startupcost = 0.01
else:
    _startupcost = 1e30

def worthwhile(ui, costperop, nops):
    '''try to determine whether the benefit of multiple processes can
    outweigh the cost of starting them'''
    linear = costperop * nops
    workers = _numworkers(ui)
    benefit = linear - (_startupcost * workers + linear / workers)
    return benefit >= 0.15

def worker(ui, costperarg, func, staticargs, args):
    '''run a function, possibly in parallel in multiple worker
    processes.

    returns a progress iterator

    costperarg - cost of a single task

    func - function to run

    staticargs - arguments to pass to every invocation of the function

    args - arguments to split into chunks, to pass to individual
    workers
    '''
    if worthwhile(ui, costperarg, len(args)):
        return _platformworker(ui, func, staticargs, args)
    return func(*staticargs + (args,))

def _posixworker(ui, func, staticargs, args):
    rfd, wfd = os.pipe()
    workers = _numworkers(ui)
    oldhandler = signal.getsignal(signal.SIGINT)
    # ignore SIGINT in the master while forking, so only workers see it
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    pids, problem = [], [0]
    for pargs in partition(args, workers):
        pid = os.fork()
        if pid == 0:
            signal.signal(signal.SIGINT, oldhandler)
            try:
                os.close(rfd)
                # report each (progress, item) pair back over the pipe
                for i, item in func(*(staticargs + (pargs,))):
                    os.write(wfd, '%d %s\n' % (i, item))
                os._exit(0)
            except KeyboardInterrupt:
                os._exit(255)
            # other exceptions are allowed to propagate, we rely
            # on lock.py's pid checks to avoid release callbacks
        pids.append(pid)
    pids.reverse()
    os.close(wfd)
    fp = os.fdopen(rfd, 'rb', 0)

    def killworkers():
        # if one worker bails, there's no good reason to wait for the rest
        for p in pids:
            try:
                os.kill(p, signal.SIGTERM)
            except OSError as err:
                if err.errno != errno.ESRCH:
                    raise

    def waitforworkers():
        for _pid in pids:
            st = _exitstatus(os.wait()[1])
            if st and not problem[0]:
                problem[0] = st
                killworkers()

    t = threading.Thread(target=waitforworkers)
    t.start()

    def cleanup():
        signal.signal(signal.SIGINT, oldhandler)
        t.join()
        status = problem[0]
        if status:
            if status < 0:
                os.kill(os.getpid(), -status)
            sys.exit(status)

    try:
        for line in fp:
            l = line.split(' ', 1)
            yield int(l[0]), l[1][:-1]
    except: # re-raises
        killworkers()
        cleanup()
        raise
    cleanup()

def _posixexitstatus(code):
    '''convert a posix exit status into the same form returned by
    os.spawnv

    returns None if the process was stopped instead of exiting'''
    if os.WIFEXITED(code):
        return os.WEXITSTATUS(code)
    elif os.WIFSIGNALED(code):
        return -os.WTERMSIG(code)

if os.name != 'nt':
    _platformworker = _posixworker
    _exitstatus = _posixexitstatus

def partition(lst, nslices):
    '''partition a list into N slices of equal size'''
    n = len(lst)
    chunk, slop = n / nslices, n % nslices
    end = 0
    for i in xrange(nslices):
        start = end
        end = start + chunk
        if slop:
            end += 1
            slop -= 1
        yield lst[start:end]
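For context, a hedged sketch of how a caller might drive this module;
`copytask` and its arguments are hypothetical, but the shapes match worker()'s
signature (func receives staticargs plus one chunk of args and yields
(progress, item) pairs, which the master reads back over the pipe):

def copytask(ui, dest, files):
    # runs in a forked worker; each yielded pair is written to the
    # pipe as '<int> <item>\n' and re-yielded by the master
    for i, name in enumerate(files):
        util.copyfile(name, os.path.join(dest, name))
        yield i, name

# with costperarg=0.001 and 10000 files, worthwhile() models the
# linear cost as 10.0s; for 4 workers the predicted saving is
# 10.0 - (0.01 * 4 + 10.0 / 4) = 7.46s >= 0.15s, so it forks
for progress, name in worker(ui, 0.001, copytask, (ui, dest), files):
    ui.progress('copying', progress, item=name)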