Mercurial > public > mercurial-scm > hg
comparison mercurial/revlog.py @ 40627:e9293c5f8bb9
revlog: automatically read from opened file handles
The revlog reading code commonly opens a new file handle for
reading on demand. There is support for passing a file handle
to revlog.revision(). But it is marked as an internal argument.
When revlogs are written, we write() data as it is available. But
we don't flush() data until all revisions are written.
Putting these two traits together, it is possible for an in-process
revlog reader during active writes to trigger the opening of a new
file handle on a file with unflushed writes. The reader won't have
access to all "available" revlog data (as it hasn't been flushed).
And with the introduction of the previous patch, this can lead to
the revlog raising an error due to a partial read.
I witnessed this behavior when applying changegroup data (via
`hg pull`) before issue6006 was fixed via different means. Having
this and the previous patch in play would have helped cause errors
earlier rather than manifesting as hash verification failures.
While this has been a long-standing issue, I believe the relatively
new delta computation code has tickled it into being more common.
This is because the new delta computation code will compute deltas
in more scenarios. This can lead to revlog reading. While the delta
computation code is probably supposed to reuse file handles, it
appears it isn't doing so in all circumstances.
But the issue runs deeper than that. Theoretically, any code can
access revision data during revlog writes. It appears we were just
getting lucky that it wasn't. (The "add revision callback" passed to
addgroup() provides an avenue to do this.)
If I changed the revlog's behavior to not cache the full revision
text or to clear caches after revision insertion during addgroup(),
I was able to produce crashes 100% of the time when writing changelog
revisions. This is because changelog's add revision callback attempts
to resolve the revision data to access the changed files list. And
without the revision's fulltext being cached, we performed a revlog
read, which required opening a new file handle. This attempted to read
unflushed data, leading to a partial read and a crash.
This commit teaches the revlog to store the file handles used for
writing multiple revisions during addgroup(). It also teaches the
code for resolving a file handle when reading to use these handles,
if available. This ensures that *any* reads (regardless of their
source) use the active writing file handles, if available. These
file handles have access to the unflushed data because they wrote it.
This allows reads to complete without issue.
Differential Revision: https://phab.mercurial-scm.org/D5267
author | Gregory Szorc <gregory.szorc@gmail.com> |
---|---|
date | Tue, 13 Nov 2018 12:32:05 -0800 |
parents | 87a872555e90 |
children | 8947f49daaa8 |
comparison
equal
deleted
inserted
replaced
40626:87a872555e90 | 40627:e9293c5f8bb9 |
---|---|
373 | 373 |
374 # Make copy of flag processors so each revlog instance can support | 374 # Make copy of flag processors so each revlog instance can support |
375 # custom flags. | 375 # custom flags. |
376 self._flagprocessors = dict(_flagprocessors) | 376 self._flagprocessors = dict(_flagprocessors) |
377 | 377 |
378 # 2-tuple of file handles being used for active writing. | |
379 self._writinghandles = None | |
380 | |
378 mmapindexthreshold = None | 381 mmapindexthreshold = None |
379 v = REVLOG_DEFAULT_VERSION | 382 v = REVLOG_DEFAULT_VERSION |
380 opts = getattr(opener, 'options', None) | 383 opts = getattr(opener, 'options', None) |
381 if opts is not None: | 384 if opts is not None: |
382 if 'revlogv2' in opts: | 385 if 'revlogv2' in opts: |
503 return self.opener(self.datafile, mode=mode) | 506 return self.opener(self.datafile, mode=mode) |
504 | 507 |
505 @contextlib.contextmanager | 508 @contextlib.contextmanager |
506 def _datareadfp(self, existingfp=None): | 509 def _datareadfp(self, existingfp=None): |
507 """file object suitable to read data""" | 510 """file object suitable to read data""" |
511 # Use explicit file handle, if given. | |
508 if existingfp is not None: | 512 if existingfp is not None: |
509 yield existingfp | 513 yield existingfp |
514 | |
515 # Use a file handle being actively used for writes, if available. | |
516 # There is some danger to doing this because reads will seek the | |
517 # file. However, _writeentry() performs a SEEK_END before all writes, | |
518 # so we should be safe. | |
519 elif self._writinghandles: | |
520 if self._inline: | |
521 yield self._writinghandles[0] | |
522 else: | |
523 yield self._writinghandles[1] | |
524 | |
525 # Otherwise open a new file handle. | |
510 else: | 526 else: |
511 if self._inline: | 527 if self._inline: |
512 func = self._indexfp | 528 func = self._indexfp |
513 else: | 529 else: |
514 func = self._datafp | 530 func = self._datafp |
1748 tr.add(self.datafile, dataoff) | 1764 tr.add(self.datafile, dataoff) |
1749 | 1765 |
1750 if fp: | 1766 if fp: |
1751 fp.flush() | 1767 fp.flush() |
1752 fp.close() | 1768 fp.close() |
1769 # We can't use the cached file handle after close(). So prevent | |
1770 # its usage. | |
1771 self._writinghandles = None | |
1753 | 1772 |
1754 with self._indexfp('r') as ifh, self._datafp('w') as dfh: | 1773 with self._indexfp('r') as ifh, self._datafp('w') as dfh: |
1755 for r in self: | 1774 for r in self: |
1756 dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1]) | 1775 dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1]) |
1757 | 1776 |
1994 # platforms, Python or the platform itself can be buggy. Some versions | 2013 # platforms, Python or the platform itself can be buggy. Some versions |
1995 # of Solaris have been observed to not append at the end of the file | 2014 # of Solaris have been observed to not append at the end of the file |
1996 # if the file was seeked to before the end. See issue4943 for more. | 2015 # if the file was seeked to before the end. See issue4943 for more. |
1997 # | 2016 # |
1998 # We work around this issue by inserting a seek() before writing. | 2017 # We work around this issue by inserting a seek() before writing. |
1999 # Note: This is likely not necessary on Python 3. | 2018 # Note: This is likely not necessary on Python 3. However, because |
2019 # the file handle is reused for reads and may be seeked there, we need | |
2020 # to be careful before changing this. | |
2000 ifh.seek(0, os.SEEK_END) | 2021 ifh.seek(0, os.SEEK_END) |
2001 if dfh: | 2022 if dfh: |
2002 dfh.seek(0, os.SEEK_END) | 2023 dfh.seek(0, os.SEEK_END) |
2003 | 2024 |
2004 curr = len(self) - 1 | 2025 curr = len(self) - 1 |
2026 log, the rest are against the previous delta. | 2047 log, the rest are against the previous delta. |
2027 | 2048 |
2028 If ``addrevisioncb`` is defined, it will be called with arguments of | 2049 If ``addrevisioncb`` is defined, it will be called with arguments of |
2029 this revlog and the node that was added. | 2050 this revlog and the node that was added. |
2030 """ | 2051 """ |
2052 | |
2053 if self._writinghandles: | |
2054 raise error.ProgrammingError('cannot nest addgroup() calls') | |
2031 | 2055 |
2032 nodes = [] | 2056 nodes = [] |
2033 | 2057 |
2034 r = len(self) | 2058 r = len(self) |
2035 end = 0 | 2059 end = 0 |
2046 dfh = self._datafp("a+") | 2070 dfh = self._datafp("a+") |
2047 def flush(): | 2071 def flush(): |
2048 if dfh: | 2072 if dfh: |
2049 dfh.flush() | 2073 dfh.flush() |
2050 ifh.flush() | 2074 ifh.flush() |
2075 | |
2076 self._writinghandles = (ifh, dfh) | |
2077 | |
2051 try: | 2078 try: |
2052 deltacomputer = deltautil.deltacomputer(self) | 2079 deltacomputer = deltautil.deltacomputer(self) |
2053 # loop through our set of deltas | 2080 # loop through our set of deltas |
2054 for data in deltas: | 2081 for data in deltas: |
2055 node, p1, p2, linknode, deltabase, delta, flags = data | 2082 node, p1, p2, linknode, deltabase, delta, flags = data |
2107 # addrevision switched from inline to conventional | 2134 # addrevision switched from inline to conventional |
2108 # reopen the index | 2135 # reopen the index |
2109 ifh.close() | 2136 ifh.close() |
2110 dfh = self._datafp("a+") | 2137 dfh = self._datafp("a+") |
2111 ifh = self._indexfp("a+") | 2138 ifh = self._indexfp("a+") |
2139 self._writinghandles = (ifh, dfh) | |
2112 finally: | 2140 finally: |
2141 self._writinghandles = None | |
2142 | |
2113 if dfh: | 2143 if dfh: |
2114 dfh.close() | 2144 dfh.close() |
2115 ifh.close() | 2145 ifh.close() |
2116 | 2146 |
2117 return nodes | 2147 return nodes |