comparison mercurial/revlog.py @ 40627:e9293c5f8bb9

revlog: automatically read from opened file handles The revlog reading code commonly opens a new file handle for reading on demand. There is support for passing a file handle to revlog.revision(). But it is marked as an internal argument. When revlogs are written, we write() data as it is available. But we don't flush() data until all revisions are written. Putting these two traits together, it is possible for an in-process revlog reader during active writes to trigger the opening of a new file handle on a file with unflushed writes. The reader won't have access to all "available" revlog data (as it hasn't been flushed). And with the introduction of the previous patch, this can lead to the revlog raising an error due to a partial read. I witnessed this behavior when applying changegroup data (via `hg pull`) before issue6006 was fixed via different means. Having this and the previous patch in play would have helped cause errors earlier rather than manifesting as hash verification failures. While this has been a long-standing issue, I believe the relatively new delta computation code has tickled it into being more common. This is because the new delta computation code will compute deltas in more scenarios. This can lead to revlog reading. While the delta computation code is probably supposed to reuse file handles, it appears it isn't doing so in all circumstances. But the issue runs deeper than that. Theoretically, any code can access revision data during revlog writes. It appears we were just getting lucky that it wasn't. (The "add revision callback" passed to addgroup() provides an avenue to do this.) If I changed the revlog's behavior to not cache the full revision text or to clear caches after revision insertion during addgroup(), I was able to produce crashes 100% of the time when writing changelog revisions. This is because changelog's add revision callback attempts to resolve the revision data to access the changed files list. And without the revision's fulltext being cached, we performed a revlog read, which required opening a new file handle. This attempted to read unflushed data, leading to a partial read and a crash. This commit teaches the revlog to store the file handles used for writing multiple revisions during addgroup(). It also teaches the code for resolving a file handle when reading to use these handles, if available. This ensures that *any* reads (regardless of their source) use the active writing file handles, if available. These file handles have access to the unflushed data because they wrote it. This allows reads to complete without issue. Differential Revision: https://phab.mercurial-scm.org/D5267
author Gregory Szorc <gregory.szorc@gmail.com>
date Tue, 13 Nov 2018 12:32:05 -0800
parents 87a872555e90
children 8947f49daaa8
comparison
equal deleted inserted replaced
40626:87a872555e90 40627:e9293c5f8bb9
373 373
374 # Make copy of flag processors so each revlog instance can support 374 # Make copy of flag processors so each revlog instance can support
375 # custom flags. 375 # custom flags.
376 self._flagprocessors = dict(_flagprocessors) 376 self._flagprocessors = dict(_flagprocessors)
377 377
378 # 2-tuple of file handles being used for active writing.
379 self._writinghandles = None
380
378 mmapindexthreshold = None 381 mmapindexthreshold = None
379 v = REVLOG_DEFAULT_VERSION 382 v = REVLOG_DEFAULT_VERSION
380 opts = getattr(opener, 'options', None) 383 opts = getattr(opener, 'options', None)
381 if opts is not None: 384 if opts is not None:
382 if 'revlogv2' in opts: 385 if 'revlogv2' in opts:
503 return self.opener(self.datafile, mode=mode) 506 return self.opener(self.datafile, mode=mode)
504 507
505 @contextlib.contextmanager 508 @contextlib.contextmanager
506 def _datareadfp(self, existingfp=None): 509 def _datareadfp(self, existingfp=None):
507 """file object suitable to read data""" 510 """file object suitable to read data"""
511 # Use explicit file handle, if given.
508 if existingfp is not None: 512 if existingfp is not None:
509 yield existingfp 513 yield existingfp
514
515 # Use a file handle being actively used for writes, if available.
516 # There is some danger to doing this because reads will seek the
517 # file. However, _writeentry() performs a SEEK_END before all writes,
518 # so we should be safe.
519 elif self._writinghandles:
520 if self._inline:
521 yield self._writinghandles[0]
522 else:
523 yield self._writinghandles[1]
524
525 # Otherwise open a new file handle.
510 else: 526 else:
511 if self._inline: 527 if self._inline:
512 func = self._indexfp 528 func = self._indexfp
513 else: 529 else:
514 func = self._datafp 530 func = self._datafp
1748 tr.add(self.datafile, dataoff) 1764 tr.add(self.datafile, dataoff)
1749 1765
1750 if fp: 1766 if fp:
1751 fp.flush() 1767 fp.flush()
1752 fp.close() 1768 fp.close()
1769 # We can't use the cached file handle after close(). So prevent
1770 # its usage.
1771 self._writinghandles = None
1753 1772
1754 with self._indexfp('r') as ifh, self._datafp('w') as dfh: 1773 with self._indexfp('r') as ifh, self._datafp('w') as dfh:
1755 for r in self: 1774 for r in self:
1756 dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1]) 1775 dfh.write(self._getsegmentforrevs(r, r, df=ifh)[1])
1757 1776
1994 # platforms, Python or the platform itself can be buggy. Some versions 2013 # platforms, Python or the platform itself can be buggy. Some versions
1995 # of Solaris have been observed to not append at the end of the file 2014 # of Solaris have been observed to not append at the end of the file
1996 # if the file was seeked to before the end. See issue4943 for more. 2015 # if the file was seeked to before the end. See issue4943 for more.
1997 # 2016 #
1998 # We work around this issue by inserting a seek() before writing. 2017 # We work around this issue by inserting a seek() before writing.
1999 # Note: This is likely not necessary on Python 3. 2018 # Note: This is likely not necessary on Python 3. However, because
2019 # the file handle is reused for reads and may be seeked there, we need
2020 # to be careful before changing this.
2000 ifh.seek(0, os.SEEK_END) 2021 ifh.seek(0, os.SEEK_END)
2001 if dfh: 2022 if dfh:
2002 dfh.seek(0, os.SEEK_END) 2023 dfh.seek(0, os.SEEK_END)
2003 2024
2004 curr = len(self) - 1 2025 curr = len(self) - 1
2026 log, the rest are against the previous delta. 2047 log, the rest are against the previous delta.
2027 2048
2028 If ``addrevisioncb`` is defined, it will be called with arguments of 2049 If ``addrevisioncb`` is defined, it will be called with arguments of
2029 this revlog and the node that was added. 2050 this revlog and the node that was added.
2030 """ 2051 """
2052
2053 if self._writinghandles:
2054 raise error.ProgrammingError('cannot nest addgroup() calls')
2031 2055
2032 nodes = [] 2056 nodes = []
2033 2057
2034 r = len(self) 2058 r = len(self)
2035 end = 0 2059 end = 0
2046 dfh = self._datafp("a+") 2070 dfh = self._datafp("a+")
2047 def flush(): 2071 def flush():
2048 if dfh: 2072 if dfh:
2049 dfh.flush() 2073 dfh.flush()
2050 ifh.flush() 2074 ifh.flush()
2075
2076 self._writinghandles = (ifh, dfh)
2077
2051 try: 2078 try:
2052 deltacomputer = deltautil.deltacomputer(self) 2079 deltacomputer = deltautil.deltacomputer(self)
2053 # loop through our set of deltas 2080 # loop through our set of deltas
2054 for data in deltas: 2081 for data in deltas:
2055 node, p1, p2, linknode, deltabase, delta, flags = data 2082 node, p1, p2, linknode, deltabase, delta, flags = data
2107 # addrevision switched from inline to conventional 2134 # addrevision switched from inline to conventional
2108 # reopen the index 2135 # reopen the index
2109 ifh.close() 2136 ifh.close()
2110 dfh = self._datafp("a+") 2137 dfh = self._datafp("a+")
2111 ifh = self._indexfp("a+") 2138 ifh = self._indexfp("a+")
2139 self._writinghandles = (ifh, dfh)
2112 finally: 2140 finally:
2141 self._writinghandles = None
2142
2113 if dfh: 2143 if dfh:
2114 dfh.close() 2144 dfh.close()
2115 ifh.close() 2145 ifh.close()
2116 2146
2117 return nodes 2147 return nodes