Next: Rolling checksum Up: The rsync algorithm Previous: The
problem
Suppose we have two general purpose computers
and
. Computer
has access to a file A and
has access to file
B, where A and B are ``similar''. There is a slow
communications link between
and
.
The rsync algorithm consists of the following steps:
- 1.
splits the file B into a series of
non-overlapping fixed-sized blocks of size S bytes1. The last block may be shorter than S
bytes.
- 2.
- For each of these blocks
calculates two checksums: a weak
``rolling'' 32-bit checksum (described below) and a strong 128-bit MD4
checksum.
- 3.
sends these checksums to
.
- 4.
searches through A to find all
blocks of length S bytes (at any offset, not just multiples of
S) that have the same weak and strong checksum as one of the blocks of
B. This can be done in a single pass very quickly using a special
property of the rolling checksum described below.
- 5.
sends
a sequence of
instructions for constructing a copy of A. Each instruction is either a
reference to a block of B, or literal data. Literal data is sent only
for those sections of A which did not match any of the blocks of
B.
The end result is that
gets a copy of A, but only the
pieces of A that are not found in B (plus a small amount of data
for checksums and block indexes) are sent over the link. The algorithm also only
requires one round trip, which minimises the impact of the link latency.
The most important details of the algorithm are the rolling checksum and the
associated multi-alternate search mechanism which allows the all-offsets
checksum search to proceed very quickly. These will be discussed in greater
detail below.
Next: Rolling checksum Up: The rsync algorithm Previous: The
problem
Andrew Tridgell
1998-11-09