Mercurial > public > mercurial-scm > hg-stable
comparison mercurial/branchmap.py @ 23785:cb99bacb9b4e
branchcache: introduce revbranchcache for caching of revision branch names
It is expensive to retrieve the branch name of a revision. Very expensive when
creating a changectx and calling .branch() every time - slightly less when
using changelog.branchinfo().
Now, to speed things up, provide a way to cache the results on disk in an
efficient format. Each branchname is assigned a number, and for each revision
we store the number of the corresponding branch name. The branch names are
stored in a dedicated file which is strictly append only.
Branch names are usually reused across several revisions, and the total list of
branch names will thus be so small that it is feasible to read the whole set of
names before using the cache. It will however do that it might be more
efficient to use the changelog for retrieving the branch info for a single
revision.
The revision entries are stored in another file. This file is usually append
only, but if the repository has been modified, the file will be truncated and
the relevant parts rewritten on demand.
The entries for each revision are 8 bytes each, and the whole revision file
will thus be 1/8 of 00changelog.i.
Each revision entry contains the first 4 bytes of the corresponding node hash.
This is used as a check sum that always is verified before the entry is used.
That check is relatively expensive but it makes sure history modification is
detected and handled correctly. It will also detect and handle most revision
file corruptions.
This is just a cache. A new format can always be introduced if other
requirements or ideas make that seem like a good idea. Rebuilding the cache is
not really more expensive than it was to run for example 'hg log -b branchname'
before this cache was introduced.
This new method is still unused but promise to make some operations several
times faster once it actually is used.
Abandoning Python 2.4 would make it possible to implement this more efficiently
by using struct classes and pack_into. The Python code could probably also be
micro optimized or it could be implemented very efficiently in C where it would
be easy to control the data access.
author | Mads Kiilerich <madski@unity3d.com> |
---|---|
date | Thu, 08 Jan 2015 00:01:03 +0100 |
parents | 9c3c3dc14a65 |
children | 7d63398fbfd1 |
comparison
equal
deleted
inserted
replaced
23784:8f7e839aaa70 | 23785:cb99bacb9b4e |
---|---|
7 | 7 |
8 from node import bin, hex, nullid, nullrev | 8 from node import bin, hex, nullid, nullrev |
9 import encoding | 9 import encoding |
10 import util | 10 import util |
11 import time | 11 import time |
12 from array import array | |
13 from struct import calcsize, pack, unpack | |
12 | 14 |
13 def _filename(repo): | 15 def _filename(repo): |
14 """name of a branchcache file for a given repo or repoview""" | 16 """name of a branchcache file for a given repo or repoview""" |
15 filename = "cache/branch2" | 17 filename = "cache/branch2" |
16 if repo.filtername: | 18 if repo.filtername: |
283 self.filteredhash = self._hashfiltered(repo) | 285 self.filteredhash = self._hashfiltered(repo) |
284 | 286 |
285 duration = time.time() - starttime | 287 duration = time.time() - starttime |
286 repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n', | 288 repo.ui.log('branchcache', 'updated %s branch cache in %.4f seconds\n', |
287 repo.filtername, duration) | 289 repo.filtername, duration) |
290 | |
291 # Revision branch info cache | |
292 | |
293 _rbcversion = '-v1' | |
294 _rbcnames = 'cache/rbc-names' + _rbcversion | |
295 _rbcrevs = 'cache/rbc-revs' + _rbcversion | |
296 # [4 byte hash prefix][4 byte branch name number with sign bit indicating open] | |
297 _rbcrecfmt = '>4sI' | |
298 _rbcrecsize = calcsize(_rbcrecfmt) | |
299 _rbcnodelen = 4 | |
300 _rbcbranchidxmask = 0x7fffffff | |
301 _rbccloseflag = 0x80000000 | |
302 | |
303 class revbranchcache(object): | |
304 """Persistent cache, mapping from revision number to branch name and close. | |
305 This is a low level cache, independent of filtering. | |
306 | |
307 Branch names are stored in rbc-names in internal encoding separated by 0. | |
308 rbc-names is append-only, and each branch name is only stored once and will | |
309 thus have a unique index. | |
310 | |
311 The branch info for each revision is stored in rbc-revs as constant size | |
312 records. The whole file is read into memory, but it is only 'parsed' on | |
313 demand. The file is usually append-only but will be truncated if repo | |
314 modification is detected. | |
315 The record for each revision contains the first 4 bytes of the | |
316 corresponding node hash, and the record is only used if it still matches. | |
317 Even a completely trashed rbc-revs fill thus still give the right result | |
318 while converging towards full recovery ... assuming no incorrectly matching | |
319 node hashes. | |
320 The record also contains 4 bytes where 31 bits contains the index of the | |
321 branch and the last bit indicate that it is a branch close commit. | |
322 The usage pattern for rbc-revs is thus somewhat similar to 00changelog.i | |
323 and will grow with it but be 1/8th of its size. | |
324 """ | |
325 | |
326 def __init__(self, repo): | |
327 assert repo.filtername is None | |
328 self._names = [] # branch names in local encoding with static index | |
329 self._rbcrevs = array('c') # structs of type _rbcrecfmt | |
330 self._rbcsnameslen = 0 | |
331 try: | |
332 bndata = repo.vfs.read(_rbcnames) | |
333 self._rbcsnameslen = len(bndata) # for verification before writing | |
334 self._names = [encoding.tolocal(bn) for bn in bndata.split('\0')] | |
335 except (IOError, OSError), inst: | |
336 repo.ui.debug("couldn't read revision branch cache names: %s\n" % | |
337 inst) | |
338 if self._names: | |
339 try: | |
340 data = repo.vfs.read(_rbcrevs) | |
341 self._rbcrevs.fromstring(data) | |
342 except (IOError, OSError), inst: | |
343 repo.ui.debug("couldn't read revision branch cache: %s\n" % | |
344 inst) | |
345 # remember number of good records on disk | |
346 self._rbcrevslen = min(len(self._rbcrevs) // _rbcrecsize, | |
347 len(repo.changelog)) | |
348 if self._rbcrevslen == 0: | |
349 self._names = [] | |
350 self._rbcnamescount = len(self._names) # number of good names on disk | |
351 self._namesreverse = dict((b, r) for r, b in enumerate(self._names)) | |
352 | |
353 def branchinfo(self, changelog, rev): | |
354 """Return branch name and close flag for rev, using and updating | |
355 persistent cache.""" | |
356 rbcrevidx = rev * _rbcrecsize | |
357 | |
358 # if requested rev is missing, add and populate all missing revs | |
359 if len(self._rbcrevs) < rbcrevidx + _rbcrecsize: | |
360 first = len(self._rbcrevs) // _rbcrecsize | |
361 self._rbcrevs.extend('\0' * (len(changelog) * _rbcrecsize - | |
362 len(self._rbcrevs))) | |
363 for r in xrange(first, len(changelog)): | |
364 self._branchinfo(r) | |
365 | |
366 # fast path: extract data from cache, use it if node is matching | |
367 reponode = changelog.node(rev)[:_rbcnodelen] | |
368 cachenode, branchidx = unpack( | |
369 _rbcrecfmt, buffer(self._rbcrevs, rbcrevidx, _rbcrecsize)) | |
370 close = bool(branchidx & _rbccloseflag) | |
371 if close: | |
372 branchidx &= _rbcbranchidxmask | |
373 if cachenode == reponode: | |
374 return self._names[branchidx], close | |
375 # fall back to slow path and make sure it will be written to disk | |
376 self._rbcrevslen = min(self._rbcrevslen, rev) | |
377 return self._branchinfo(rev) | |
378 | |
379 def _branchinfo(self, changelog, rev): | |
380 """Retrieve branch info from changelog and update _rbcrevs""" | |
381 b, close = changelog.branchinfo(rev) | |
382 if b in self._namesreverse: | |
383 branchidx = self._namesreverse[b] | |
384 else: | |
385 branchidx = len(self._names) | |
386 self._names.append(b) | |
387 self._namesreverse[b] = branchidx | |
388 reponode = changelog.node(rev) | |
389 if close: | |
390 branchidx |= _rbccloseflag | |
391 rbcrevidx = rev * _rbcrecsize | |
392 rec = array('c') | |
393 rec.fromstring(pack(_rbcrecfmt, reponode, branchidx)) | |
394 self._rbcrevs[rbcrevidx:rbcrevidx + _rbcrecsize] = rec | |
395 return b, close | |
396 | |
397 def write(self, repo): | |
398 """Save branch cache if it is dirty.""" | |
399 if self._rbcnamescount < len(self._names): | |
400 try: | |
401 if self._rbcnamescount != 0: | |
402 f = repo.vfs.open(_rbcnames, 'ab') | |
403 if f.tell() == self._rbcsnameslen: | |
404 f.write('\0') | |
405 else: | |
406 f.close() | |
407 self._rbcnamescount = 0 | |
408 self._rbcrevslen = 0 | |
409 if self._rbcnamescount == 0: | |
410 f = repo.vfs.open(_rbcnames, 'wb') | |
411 f.write('\0'.join(encoding.fromlocal(b) | |
412 for b in self._names[self._rbcnamescount:])) | |
413 self._rbcsnameslen = f.tell() | |
414 f.close() | |
415 except (IOError, OSError, util.Abort), inst: | |
416 repo.ui.debug("couldn't write revision branch cache names: " | |
417 "%s\n" % inst) | |
418 return | |
419 self._rbcnamescount = len(self._names) | |
420 | |
421 start = self._rbcrevslen * _rbcrecsize | |
422 if start != len(self._rbcrevs): | |
423 self._rbcrevslen = min(len(repo.changelog), | |
424 len(self._rbcrevs) // _rbcrecsize) | |
425 try: | |
426 f = repo.vfs.open(_rbcrevs, 'ab') | |
427 if f.tell() != start: | |
428 f.seek(start) | |
429 f.truncate() | |
430 end = self._rbcrevslen * _rbcrecsize | |
431 f.write(self._rbcrevs[start:end]) | |
432 f.close() | |
433 except (IOError, OSError, util.Abort), inst: | |
434 repo.ui.debug("couldn't write revision branch cache: %s\n" % | |
435 inst) | |
436 return |