comparison hgext/fix.py @ 48178:f12a19d03d2c

fix: reduce number of tool executions

By grouping together (path, ctx) pairs according to the inputs they would
provide to fixer tools, we can deduplicate executions of fixer tools to
significantly reduce the amount of time spent running slow tools.

This change does not handle clean files in the working copy, which could still
be deduplicated against the files in the checked out commit. It's a little
harder to do that because the filerev is not available in the workingfilectx
(and it doesn't exist for added files).

Anecdotally, this change makes some real use cases at Google 10x faster. I
think we were originally hesitant to do this because the benefits weren't
obvious, and implementing it efficiently is kind of tricky. If we simply
memoized the formatter execution function, we would be keeping tons of file
content in memory.

Also included is a regression test for a corner case that I broke with my
first attempt at optimizing this code.

Differential Revision: https://phab.mercurial-scm.org/D11280
author Danny Hooper <hooper@google.com>
date Thu, 02 Sep 2021 14:08:45 -0700
parents 5ced12cfa41b
children 2f7caef017d9
comparison: 48177:066cdec8f74f vs 48178:f12a19d03d2c
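
The core idea described in the commit message, deduplicating fixer-tool
executions by grouping work items that would feed a tool identical input, can
be illustrated with a minimal standalone sketch. Everything here (run_tool,
the item shapes, run_tools_deduplicated) is a hypothetical model of the
approach, not fix.py's actual API:

    import collections

    def run_tools_deduplicated(items, run_tool):
        """items: iterable of (rev, path, content); run_tool: bytes -> bytes.

        Instead of running the tool once per (rev, path), group the items by
        the input the tool would see and run it once per distinct input.
        """
        groups = collections.defaultdict(list)
        for rev, path, content in items:
            groups[(path, content)].append(rev)
        results = {}
        for (path, content), revs in groups.items():
            fixed = run_tool(content)          # executed once per distinct input
            for rev in revs:
                results[(rev, path)] = fixed   # all revs share the same result
        return results
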
--- a/hgext/fix.py
+++ b/hgext/fix.py
@@ -282,24 +282,33 @@
     _prefetchfiles(repo, workqueue, basepaths)

     # There are no data dependencies between the workers fixing each file
     # revision, so we can use all available parallelism.
     def getfixes(items):
-        for rev, path in items:
-            ctx = repo[rev]
+        for srcrev, path, dstrevs in items:
+            ctx = repo[srcrev]
             olddata = ctx[path].data()
             metadata, newdata = fixfile(
-                ui, repo, opts, fixers, ctx, path, basepaths, basectxs[rev]
+                ui,
+                repo,
+                opts,
+                fixers,
+                ctx,
+                path,
+                basepaths,
+                basectxs[srcrev],
             )
-            # Don't waste memory/time passing unchanged content back, but
-            # produce one result per item either way.
-            yield (
-                rev,
-                path,
-                metadata,
-                newdata if newdata != olddata else None,
-            )
+            # We ungroup the work items now, because the code that consumes
+            # these results has to handle each dstrev separately, and in
+            # topological order. Because these are handled in topological
+            # order, it's important that we pass around references to
+            # "newdata" instead of copying it. Otherwise, we would be
+            # keeping more copies of file content in memory at a time than
+            # if we hadn't bothered to group/deduplicate the work items.
+            data = newdata if newdata != olddata else None
+            for dstrev in dstrevs:
+                yield (dstrev, path, metadata, data)

     results = worker.worker(
         ui, 1.0, getfixes, tuple(), workqueue, threadsafe=False
     )

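Why the rewritten getfixes() yields one tuple per dstrev while sharing a
single "data" object between them can be seen with a short plain-Python
sketch (hypothetical values, unrelated to any repository):

    # One 10 MiB blob stands in for the fixed file content.
    data = b"x" * (10 * 1024 * 1024)
    # One result tuple per destination revision, all referencing the same blob.
    results = [(dstrev, b"foo/bar.txt", {}, data) for dstrev in (1, 2, 3)]
    # No copies are made, so memory stays O(1) in the number of dstrevs;
    # copying the content per dstrev would undo the benefit of grouping.
    assert all(r[3] is data for r in results)
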
@@ -375,27 +384,36 @@
     }
     scmutil.cleanupnodes(repo, replacements, b'fix', fixphase=True)


 def getworkqueue(ui, repo, pats, opts, revstofix, basectxs):
-    """Constructs the list of files to be fixed at specific revisions
+    """Constructs a list of files to fix and which revisions each fix applies to

-    It is up to the caller how to consume the work items, and the only
-    dependence between them is that replacement revisions must be committed in
-    topological order. Each work item represents a file in the working copy or
-    in some revision that should be fixed and written back to the working copy
-    or into a replacement revision.
-
-    Work items for the same revision are grouped together, so that a worker
-    pool starting with the first N items in parallel is likely to finish the
-    first revision's work before other revisions. This can allow us to write
-    the result to disk and reduce memory footprint. At time of writing, the
-    partition strategy in worker.py seems favorable to this. We also sort the
-    items by ascending revision number to match the order in which we commit
-    the fixes later.
-    """
-    workqueue = []
+    To avoid duplicating work, there is usually only one work item for each file
+    revision that might need to be fixed. There can be multiple work items per
+    file revision if the same file needs to be fixed in multiple changesets with
+    different baserevs. Each work item also contains a list of changesets where
+    the file's data should be replaced with the fixed data. The work items for
+    earlier changesets come earlier in the work queue, to improve pipelining by
+    allowing the first changeset to be replaced while fixes are still being
+    computed for later changesets.
+
+    Also returned is a map from changesets to the count of work items that might
+    affect each changeset. This is used later to count when all of a changeset's
+    work items have been finished, without having to inspect the remaining work
+    queue in each worker subprocess.
+
+    The example work item (1, "foo/bar.txt", (1, 2, 3)) means that the data of
+    bar.txt should be read from revision 1, then fixed, and written back to
+    revisions 1, 2 and 3. Revision 1 is called the "srcrev" and the list of
+    revisions is called the "dstrevs". In practice the srcrev is always one of
+    the dstrevs, and we make that choice when constructing the work item so that
+    the choice can't be made inconsistently later on. The dstrevs should all
+    have the same file revision for the given path, so the choice of srcrev is
+    arbitrary. The wdirrev can be a dstrev and a srcrev.
+    """
+    dstrevmap = collections.defaultdict(list)
     numitems = collections.defaultdict(int)
     maxfilesize = ui.configbytes(b'fix', b'maxfilesize')
     for rev in sorted(revstofix):
         fixctx = repo[rev]
         match = scmutil.match(fixctx, pats, opts)
@@ -409,12 +427,25 @@
                 ui.warn(
                     _(b'ignoring file larger than %s: %s\n')
                     % (util.bytecount(maxfilesize), path)
                 )
                continue
-            workqueue.append((rev, path))
+            baserevs = tuple(ctx.rev() for ctx in basectxs[rev])
+            dstrevmap[(fctx.filerev(), baserevs, path)].append(rev)
             numitems[rev] += 1
+    workqueue = [
+        (min(dstrevs), path, dstrevs)
+        for (filerev, baserevs, path), dstrevs in dstrevmap.items()
+    ]
+    # Move work items for earlier changesets to the front of the queue, so we
+    # might be able to replace those changesets (in topological order) while
+    # we're still processing later work items. Note the min() in the previous
+    # expression, which means we don't need a custom comparator here. The path
+    # is also important in the sort order to make the output order stable. There
+    # are some situations where this doesn't help much, but some situations
+    # where it lets us buffer O(1) files instead of O(n) files.
+    workqueue.sort()
     return workqueue, numitems


 def getrevstofix(ui, repo, opts):
     """Returns the set of revision numbers that should be fixed"""
@@ -515,13 +546,13 @@
     if opts.get(b'whole'):
         # Base paths will never be fetched for line range determination.
         return {}

     basepaths = {}
-    for rev, path in workqueue:
-        fixctx = repo[rev]
-        for basectx in basectxs[rev]:
+    for srcrev, path, _dstrevs in workqueue:
+        fixctx = repo[srcrev]
+        for basectx in basectxs[srcrev]:
             basepath = copies.pathcopies(basectx, fixctx).get(path, path)
             if basepath in basectx:
                 basepaths[(basectx.rev(), fixctx.rev(), path)] = basepath
     return basepaths

@@ -640,14 +671,14 @@

 def _prefetchfiles(repo, workqueue, basepaths):
     toprefetch = set()

     # Prefetch the files that will be fixed.
-    for rev, path in workqueue:
-        if rev == wdirrev:
+    for srcrev, path, _dstrevs in workqueue:
+        if srcrev == wdirrev:
             continue
-        toprefetch.add((rev, path))
+        toprefetch.add((srcrev, path))

     # Prefetch the base contents for lineranges().
     for (baserev, fixrev, path), basepath in basepaths.items():
         toprefetch.add((baserev, basepath))
