--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/help/internals/linelog.txt Mon Jul 30 10:42:37 2018 -0400
@@ -0,0 +1,251 @@
+linelog is a storage format inspired by the "Interleaved deltas" idea. See
+https://en.wikipedia.org/wiki/Interleaved_deltas for its introduction.
+
+0. SCCS Weave
+
+ To understand what linelog is, first we have a quick look at a simplified
+ (with header removed) SCCS weave format, which is an implementation of the
+ "Interleaved deltas" idea.
+
+0.1 Basic SCCS Weave File Format
+
+ A SCCS weave file consists of plain text lines. Each line is either a
+ special instruction starting with "^A" or part of the content of the real
+ file the weave tracks. There are 3 important operations, where REV denotes
+ the revision number:
+
+ ^AI REV, marking the beginning of an insertion block introduced by REV
+ ^AD REV, marking the beginning of a deletion block introduced by REV
+ ^AE REV, marking the end of the block started by "^AI REV" or "^AD REV"
+
+ Note on revision numbers: For any two different revision numbers, one must
+ be an ancestor of the other to make them comparable. This enforces linear
+ history. Besides, the comparison functions (">=", "<") should be efficient.
+ This means, if revisions are strings like git or hg, an external map is
+ required to convert them into integers.
+
+ For example, to represent the following changes:
+
+ REV 1 | REV 2 | REV 3
+ ------+-------+-------
+ a | a | a
+ b | b | 2
+ c | 1 | c
+ | 2 |
+ | c |
+
+ A possible weave file looks like:
+
+ ^AI 1
+ a
+ ^AD 3
+ b
+ ^AI 2
+ 1
+ ^AE 3
+ 2
+ ^AE 2
+ c
+ ^AE 1
+
+ An "^AE" does not always match its nearest operation ("^AI" or "^AD"). In
+ the above example, "^AE 3" does not match the nearest "^AI 2" but "^AD 3".
+ Therefore we need some extra information for "^AE". The SCCS weave uses a
+ revision number. It could also be a boolean value about whether it is an
+ insertion or a deletion (see section 0.4).
+
+0.2 Checkout
+
+ The "checkout" operation is to retrieve file content at a given revision,
+ say X. It's doable by going through the file line by line and:
+
+ - If meet ^AI rev, and rev > X, find the corresponding ^AE and jump there
+ - If meet ^AD rev, and rev <= X, find the corresponding ^AE and jump there
+ - Ignore ^AE
+ - For normal lines, just output them
+
+0.3 Annotate
+
+ The "annotate" operation is to show extra metadata like the revision number
+ and the original line number a line comes from.
+
+ It's basically just a "Checkout". For the extra metadata, they can be stored
+ side by side with the line contents. Alternatively, we can infer the
+ revision number from "^AI"s.
+
+ Some SCM tools have to calculate diffs on the fly and thus are much slower
+ on this operation.
+
+0.4 Tree Structure
+
+ The word "interleaved" is used because "^AI" .. "^AE" and "^AD" .. "^AE"
+ blocks can be interleaved.
+
+ If we consider insertions and deletions separately, they can form tree
+ structures, respectively.
+
+ +--- ^AI 1 +--- ^AD 3
+ | +- ^AI 2 | +- ^AD 2
+ | | | |
+ | +- ^AE 2 | +- ^AE 2
+ | |
+ +--- ^AE 1 +--- ^AE 3
+
+ More specifically, it's possible to build a tree for all insertions, where
+ the tree node has the structure "(rev, startline, endline)". "startline" is
+ the line number of "^AI" and "endline" is the line number of the matched
+ "^AE". The tree will have these properties:
+
+ 1. child.rev > parent.rev
+ 2. child.startline > parent.startline
+ 3. child.endline < parent.endline
+
+ A similar tree for all deletions can also be built with the first property
+ changed to:
+
+ 1. child.rev < parent.rev
+
+0.5 Malformed Cases
+
+ The following cases are considered malformed in our implementation:
+
+ 1. Interleaved insertions, or interleaved deletions.
+ It can be rewritten to a non-interleaved tree structure.
+
+ ^AI/D x ^AI/D x
+ ^AI/D y -> ^AI/D y
+ ^AE x ^AE y
+ ^AE y ^AE x
+
+ 2. Nested insertions, where the inner one has a smaller revision number.
+ It can be rewritten to a non-nested form.
+
+ ^AI x + 1 ^AI x + 1
+ ^AI x -> ^AE x + 1
+ ^AE x ^AI x
+ ^AE x + 1 ^AE x
+
+ 3. Insertion or deletion inside another deletion, where the outer deletion
+ block has a smaller revision number.
+
+ ^AD x ^AD x
+ ^AI/D x + 1 -> ^AE x
+ ^AE x + 1 ^AI/D x + 1
+ ^AE x ^AE x
+
+ Some of them may be valid in other implementations for special purposes. For
+ example, to "revive" a previously deleted block in a newer revision.
+
+0.6 Cases Can Be Optimized
+
+ It's always better to get things nested. For example, the left is more
+ efficient than the right while they represent the same content:
+
+ +--- ^AD 2 +- ^AD 1
+ | +- ^AD 1 | LINE A
+ | | LINE A +- ^AE 1
+ | +- ^AE 1 +- ^AD 2
+ | LINE B | LINE B
+ +--- ^AE 2 +- ^AE 2
+
+ Our implementation sometimes generates the less efficient data. To always
+ get the optimal form, it requires extra code complexity that seems unworthy.
+
+0.7 Inefficiency
+
+ The file format can be slow because:
+
+ - Inserting a new line at position P requires rewriting all data after P.
+ - Finding "^AE" requires walking through the content (O(N), where N is the
+ number of lines between "^AI/D" and "^AE").
+
+1. Linelog
+
+ The linelog is a binary format that dedicates to speed up mercurial (or
+ git)'s "annotate" operation. It's designed to avoid issues mentioned in
+ section 0.7.
+
+1.1 Content Stored
+
+ Linelog is not another storage for file contents. It only stores line
+ numbers and corresponding revision numbers, instead of actual line content.
+ This is okay for the "annotate" operation because usually the external
+ source is fast to checkout the content of a file at a specific revision.
+
+ A typical SCCS weave is also fast on the "grep" operation, which needs
+ random accesses to line contents from different revisions of a file. This
+ can be slow with linelog's no-line-content design. However we could use
+ an extra map ((rev, line num) -> line content) to speed it up.
+
+ Note the revision numbers in linelog should be independent from mercurial
+ integer revision numbers. There should be some mapping between linelog rev
+ and hg hash stored side by side, to make the files reusable after being
+ copied to another machine.
+
+1.2 Basic Format
+
+ A linelog file consists of "instruction"s. An "instruction" can be either:
+
+ - JGE REV ADDR # jump to ADDR if rev >= REV
+ - JL REV ADDR # jump to ADDR if rev < REV
+ - LINE REV LINENUM # append the (LINENUM+1)-th line in revision REV
+
+ For example, here is the example linelog representing the same file with
+ 3 revisions mentioned in section 0.1:
+
+ SCCS | Linelog
+ Weave | Addr : Instruction
+ ------+------+-------------
+ ^AI 1 | 0 : JL 1 8
+ a | 1 : LINE 1 0
+ ^AD 3 | 2 : JGE 3 6
+ b | 3 : LINE 1 1
+ ^AI 2 | 4 : JL 2 7
+ 1 | 5 : LINE 2 2
+ ^AE 3 |
+ 2 | 6 : LINE 2 3
+ ^AE 2 |
+ c | 7 : LINE 1 2
+ ^AE 1 |
+ | 8 : END
+
+ This way, "find ^AE" is O(1) because we just jump there. And we can insert
+ new lines without rewriting most part of the file by appending new lines and
+ changing a single instruction to jump to them.
+
+ The current implementation uses 64 bits for an instruction: The opcode (JGE,
+ JL or LINE) takes 2 bits, REV takes 30 bits and ADDR or LINENUM takes 32
+ bits. It also stores the max revision number and buffer size at the first
+ 64 bits for quick access to these values.
+
+1.3 Comparing with Mercurial's revlog format
+
+ Apparently, linelog is very different from revlog: linelog stores rev and
+ line numbers, while revlog has line contents and other metadata (like
+ parents, flags). However, the revlog format could also be used to store rev
+ and line numbers. For example, to speed up the annotate operation, we could
+ also pre-calculate annotate results and just store them using the revlog
+ format.
+
+ Therefore, linelog is actually somehow similar to revlog, with the important
+ trade-off that it only supports linear history (mentioned in section 0.1).
+ Essentially, the differences are:
+
+ a) Linelog is full of deltas, while revlog could contain full file
+ contents sometimes. So linelog is smaller. Revlog could trade
+ reconstruction speed for file size - best case, revlog is as small as
+ linelog.
+ b) The interleaved delta structure allows skipping large portion of
+ uninteresting deltas so linelog's content reconstruction is faster than
+ the delta-only version of revlog (however it's possible to construct
+ a case where interleaved deltas degrade to plain deltas, so linelog
+ worst case would be delta-only revlog). Revlog could trade file size
+ for reconstruction speed.
+ c) Linelog implicitly maintains the order of all lines it stores. So it
+ could dump all the lines from all revisions, with a reasonable order.
+ While revlog could also dump all line additions, it requires extra
+ computation to figure out the order putting those lines - that's some
+ kind of "merge".
+
+ "c" makes "hg absorb" easier to implement and makes it possible to do
+ "annotate --deleted".