Rcmp is a more flexible replacement for filecmp from the standard Python library.
The basic idea is that, depending on content, files don’t always have to be bitwise identical in order to be equivalent or “close enough” for many purposes, such as comparing the results of two builds. For example, some (broken) file formats embed a time stamp indicating when the file was produced, even though the file system already tracks this information. Build the same file twice and the two copies will initially appear to differ because of the embedded time stamp. Only when the irrelevant time stamp differences are ignored do the two files turn out to be otherwise the same.
Rcmp includes a flexible extension structure to allow for precisely these sorts of living and evolving comparisons.
Rcmp is capable of recursively descending into a number of different file types including:
In order to describe file locations which may extend beyond traditional file system paths, rcmp introduces an extended path naming scheme. Traditional paths are described using the traditional slash-separated list of names, as in /etc/hosts. Components which are contained in other files, like a file located within a tar archive, are described using a sequence of brace-encapsulated file format separators. So, for instance, a file named foo located within a gzip-compressed (.gz) tar archive named tarchive.tar would be described as tarchive.tar.gz{gzip}tarchive.tar{tar}foo. The two schemes can be combined, as with /home/rich/tarchive.tar.gz{gzip}tarchive.tar{tar}foo.
Items which are not in the file system proper are referred to internally as being “boxed”.
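The naming scheme above lends itself to mechanical splitting. The following is a minimal sketch, not rcmp's own parser: the function name `split_extended_path` is hypothetical, and the sketch assumes that braces never appear in ordinary file names.

```python
import re

# A brace-encapsulated format separator such as "{gzip}" or "{tar}".
_SEPARATOR = re.compile(r"\{([^{}]+)\}")


def split_extended_path(path):
    """Split an extended path into (file system path, [(format, member), ...]).

    Hypothetical helper for illustration; assumes braces never occur in
    ordinary file names.
    """
    # With one capture group, re.split yields
    # [head, format1, member1, format2, member2, ...].
    parts = _SEPARATOR.split(path)
    head = parts[0]
    boxes = list(zip(parts[1::2], parts[2::2]))
    return head, boxes


head, boxes = split_extended_path(
    "/home/rich/tarchive.tar.gz{gzip}tarchive.tar{tar}foo")
print(head)         # /home/rich/tarchive.tar.gz
print(boxes)        # [('gzip', 'tarchive.tar'), ('tar', 'foo')]
print(bool(boxes))  # True: the named item is "boxed"
```

A plain path such as /etc/hosts yields an empty list of boxes, which corresponds to an item that lives directly in the file system.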
Rcmp is both a library and a command line script for driving the library.
Things which can be compared are represented internally by instances of class Item. These can be items in the file system, like a file or directory, or in an archive, like an archive member.
This is used for caching the results from calls like stat and for holding content.
| Parameters: | name (string) – file system name |
|---|---|
Returns True if and only if we are “boxed”. That is, if we are not located directly in the file system but instead are encapsulated within some other file.
| Return type: | boolean |
|---|---|
Close any outstanding file descriptor if relevant.
The contents of the entire file, in memory.
| Return type: | bytearray or possibly an mmap’d section of file. |
|---|---|
Return device number from stat.
| Return type: | string |
|---|---|
Check for existence. Boxed items always exist. Unboxed items exist if they exist in the file system.
| Return type: | boolean |
|---|---|
If we have a file descriptor, return it. If not, then open one, cache it, and return it.
| Return type: | file |
|---|---|
Return the inode number from stat.
| Return type: | string |
|---|---|
Return True if and only if we represent a file system directory.
| Return type: | boolean |
|---|---|
Return True if and only if we represent a symbolic link.
| Return type: | boolean |
|---|---|
Return True if and only if we represent a regular file.
| Return type: | boolean |
|---|---|
Return a string representing the path to which the symbolic link points. This presumes that we are a symbolic link.
| Return type: | string |
|---|---|
Return our size. If we don’t already know it, look it up with stat and cache the result.
| Return type: | int |
|---|---|
If we have a statbuf, return it.
If not, then look one up, cache it, and return it.
| Return type: | statbuf |
|---|---|
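The look-up-once-then-cache behaviour described for the statbuf, size, and inode above can be sketched as follows. This is an illustration under stated assumptions rather than rcmp's implementation: the class name `CachedStatItem` is hypothetical, and the use of `os.lstat` (so that symbolic links are not followed) is an assumption.

```python
import os


class CachedStatItem:
    """Hypothetical stand-in for Item; not rcmp's actual class."""

    def __init__(self, name):
        self.name = name
        self._statbuf = None  # filled in on first use

    @property
    def stat(self):
        # Look the statbuf up once, cache it, and return the cached copy
        # on every later call. Using lstat (rather than stat) is an
        # assumption made so that symbolic links are not followed.
        if self._statbuf is None:
            self._statbuf = os.lstat(self.name)
        return self._statbuf

    @property
    def size(self):
        # Derived from the cached statbuf, so no extra system call is made.
        return self.stat.st_size

    @property
    def inode(self):
        return self.stat.st_ino
```

Because every derived value goes through the same cached statbuf, asking for the size, inode, and device of one item costs at most one system call.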
There is a global set of all instances of class Item stored in the singular class Items.
This exists primarily to prevent us from creating a duplicate Item for the same path name.
Note
The class itself is used directly as a global aggregator, a singleton. It is never instantiated.
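The idea of using a class itself as the singleton can be sketched like this. The names `PathRegistry`, `find_or_create`, and `reset` are hypothetical and chosen only for illustration; the point is that the shared state lives on the class, so no instance is ever needed.

```python
class PathRegistry:
    """Hypothetical registry; the class itself is the singleton."""

    _content = {}  # shared by all callers because it lives on the class

    @classmethod
    def find_or_create(cls, name, factory):
        # Reuse the existing object for this path name if there is one,
        # which prevents duplicates for the same path.
        if name not in cls._content:
            cls._content[name] = factory(name)
        return cls._content[name]

    @classmethod
    def reset(cls):
        # Forget everything, for example between two unrelated comparisons.
        cls._content = {}


first = PathRegistry.find_or_create("/etc/hosts", lambda name: {"name": name})
second = PathRegistry.find_or_create("/etc/hosts", lambda name: {"name": name})
print(first is second)  # True
```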
Returned to indicate an authoritative claim that the items are sufficiently identical. No further comparators need be tried.
Note
The class itself is used as a constant. It is never instantiated.
Returned to indicate an authoritative claim of difference. No further comparators need be tried.
Note
The class itself is used as a constant. It is never instantiated.
Represents a single comparison heuristic. This is an abstract class. It is intended solely to act as a base class for subclasses. It is never instantiated.
Subclasses based on Comparator implement individual heuristics for comparing items when applied to a Comparison. There are many Comparator subclasses included.
There are no instantiation variables nor properties.
Return True if and only if we apply to the given comparison.
| Return type: | boolean |
|---|---|
This is an abstract base class intended for things which are composed of other things: for instance, a directory or a file archive.
Compare our lists and return the result.
Represents a pair of objects to be compared.
An instance of Comparison comprises a pair of Item, a list of Comparator, and a method for applying the list of Comparator to the pair of Item and returning an answer.
If exit_asap is true, the first difference will end the comparison. If it is not true, the comparison will continue despite knowing that our aggregate result is that we are Different. This is useful for getting a complete list of all differences.
exit_asap=False is like “make -k” in the sense that it reports on all differences rather than stopping after the first.
| Parameters: |
|
|---|
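The effect of exit_asap on an aggregate comparison can be sketched as below. This is a simplified illustration, not rcmp's code: the function `compare_members` is hypothetical, and members are modelled as plain dictionaries mapping a member name to its bytes.

```python
def compare_members(left, right, exit_asap=False):
    """Compare two {member name: bytes} mappings.

    Returns (all_same, names_that_differ). Hypothetical helper used only
    to illustrate the exit_asap behaviour.
    """
    differences = []
    for name in sorted(set(left) | set(right)):
        # A member missing on one side shows up as None and so differs.
        if left.get(name) != right.get(name):
            differences.append(name)
            if exit_asap:
                break  # the aggregate answer is already known
    return (not differences), differences


old = {"a": b"1", "b": b"2", "c": b"3"}
new = {"a": b"x", "b": b"2", "c": b"y"}
print(compare_members(old, new, exit_asap=True))   # (False, ['a'])
print(compare_members(old, new, exit_asap=False))  # (False, ['a', 'c'])
```

Both calls reach the same aggregate answer; only the completeness of the list of differences changes.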
Compare our pair of Item.
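Run through our list of Comparator, calling each one in turn with our pair of Item. Each comparator is expected to return either an authoritative result (Same or Different) or a non-True value indicating that it has no authoritative result.
If no Comparator returns an authoritative result, then IndeterminateResult will be raised.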
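The loop described above can be sketched in a few lines. This is a simplified illustration rather than rcmp's implementation: the class names `Same` and `Different` for the two authoritative constants are assumed here, the three comparator functions are hypothetical, and items are modelled as bytes (or None for a missing item) instead of Item instances.

```python
class Same:
    """Authoritative "close enough". Used as a constant, never instantiated."""


class Different:
    """Authoritative difference. Used as a constant, never instantiated."""


class IndeterminateResult(Exception):
    """No comparator produced an authoritative answer."""


def different_if_missing(left, right):
    return Different if left is None or right is None else False


def same_if_both_empty(left, right):
    return Same if len(left) == 0 and len(right) == 0 else False


def same_if_bitwise_equal(left, right):
    return Same if left == right else False


def compare(left, right, comparators):
    """Try each comparator in turn; the first authoritative answer wins."""
    for comparator in comparators:
        result = comparator(left, right)
        if result is Same or result is Different:
            return result
        # Any non-True value means "no opinion": try the next heuristic.
    raise IndeterminateResult((left, right))


heuristics = [different_if_missing, same_if_both_empty, same_if_bitwise_equal]
print(compare(b"abc", b"abc", heuristics) is Same)     # True
print(compare(None, b"abc", heuristics) is Different)  # True
```

With only these three heuristics, two files with different content raise IndeterminateResult, which is why a catchall that always reports a difference is useful at the end of the list.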
A two-item list of the items to be compared.
Represents a pair of lists of path names to be compared - one from column a, one from column b, etc.
An instance of ComparisonList is very similar to a Comparison except that instead of a pair of Items, it comprises a pair of lists of path names.
| Parameters: | stuff (a (2-element) list of lists of string) – path names to be compared |
|---|---|
In all other ways, this class resembles Comparison.
Listed in default order of application:
Objects are different if either one is missing.
Objects with the same inode and device are identical.
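A minimal sketch of the inode-and-device check, assuming both paths exist in the file system. The function name `same_inode_and_device` is hypothetical, and following symbolic links via `os.stat` is a choice made for the sketch.

```python
import os


def same_inode_and_device(left_path, right_path):
    """Return True if both paths name the very same file system object."""
    left = os.stat(left_path)
    right = os.stat(right_path)
    # Inode numbers are only unique per device, so both must match.
    return (left.st_ino, left.st_dev) == (right.st_ino, right.st_dev)
```

This check needs no file contents at all, which is why it is worth trying early: hard links and repeated mentions of the same path are settled by two stat calls.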
Two files which are each empty are equal. In particular, we don’t need to open them or read them to make this determination.
Objects which are directories are special. They match if their contents match.
Verify the metadata of each member of an ar archive.
Objects which are bitwise identical are close enough.
Symlinks are equal if they point to the same place.
Files which differ only in that they have their paths buried in them aren’t really different.
(currently unused).
Elf files are different if any of the important sections are different.
Ar archive files are different if any of the important members are different.
Automake generated Makefiles have some nondeterminisms. They’re the same if they’re the same aside from that. (May also need to make some allowance for different tool sets later.)
When autoconf tests fail, there’s a line written to the config.log which exposes the name of the underlying temporary file. Since the name of this temporary file changes from build to build, it introduces a nondeterminism.
Note
I’d ignore config.log files, (and started to do exactly that), but it occurs to me that differences in autoconf configuration are quite likely to cause build differences. So I’ve been more surgical.
When “make config” is run in the kernel, it generates an auto.conf file which includes a time stamp. I think these files are important enough to merit more surgical checking. This comparator blots out the 4th line.
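Blotting out a single line before comparing can be sketched as follows. The helper names are hypothetical and the sample file content is invented; only the position of the time stamp on the 4th line is taken from the description above.

```python
def blot_line(text, line_number, filler="<blotted>"):
    """Return the lines of text with one (1-based) line replaced by filler."""
    lines = text.splitlines()
    if 1 <= line_number <= len(lines):
        lines[line_number - 1] = filler
    return lines


def same_apart_from_line(left, right, line_number=4):
    """True if the two texts match once the given line is blotted out."""
    return blot_line(left, line_number) == blot_line(right, line_number)


# Invented sample content; only the position of the time stamp matters here.
first = "#\n# generated file\n#\n# Mon May  6 10:00:00 2013\nCONFIG_FOO=y\n"
second = "#\n# generated file\n#\n# Tue May  7 11:30:00 2013\nCONFIG_FOO=y\n"
print(first == second)                      # False
print(same_apart_from_line(first, second))  # True
```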
Zip archive files are different if any of the members are different.
Tar archive files are different if any of the important members are different.
Note
This comparator must be called before GzipComparator in order to exploit the Python tarfile module’s ability to open compressed archives.
Gzip archives only have one member but the archive itself sadly includes a timestamp. You can see the timestamp using “gzip -l -v”.
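The effect of the embedded timestamp can be demonstrated with the standard gzip module. This is a sketch only, not rcmp's GzipComparator: the helper `gzip_payloads_match` is hypothetical, and the `mtime` argument to `gzip.compress` requires Python 3.8 or later.

```python
import gzip


def gzip_payloads_match(left_bytes, right_bytes):
    """Compare two gzip archives by their single decompressed member only."""
    return gzip.decompress(left_bytes) == gzip.decompress(right_bytes)


# The same payload compressed with two different embedded time stamps.
first = gzip.compress(b"hello, world\n", mtime=1)
second = gzip.compress(b"hello, world\n", mtime=2)

print(first == second)                     # False: the headers differ
print(gzip_payloads_match(first, second))  # True
```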
Verify the metadata of each member of a cpio archive.
Cpio archive files are different if any of the important members are different.
Objects which are bitwise identical after date blotting are close enough. But this should only be tried late.
Used as a catchall: simply returns Different.
Convert dates embedded in a string into innocuous constants of uniform length.
| Parameters: | input_string – input string |
|---|---|
| Return type: | string |
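Date blotting of this kind can be sketched with regular expressions. The two patterns below (ISO-style dates and HH:MM:SS clock times) are assumptions chosen for illustration; the set of date formats rcmp actually recognizes is not stated here. Each replacement has the same length as the text it replaces, so the result keeps a uniform length.

```python
import re

# Assumed formats for this sketch only.
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
_CLOCK_TIME = re.compile(r"\d{2}:\d{2}:\d{2}")


def blot_dates(input_string):
    """Replace embedded dates and times with constants of the same length."""
    blotted = _ISO_DATE.sub("YYYY-MM-DD", input_string)
    return _CLOCK_TIME.sub("HH:MM:SS", blotted)


first = "built 2013-05-06 10:00:00 by make"
second = "built 2013-05-07 11:30:00 by make"
print(blot_dates(first))                        # built YYYY-MM-DD HH:MM:SS by make
print(blot_dates(first) == blot_dates(second))  # True
```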
Given a list of file names to be ignored and a specific file name to check, return the first ignore pattern from the list that matches the file name.
| Parameters: |
|
|---|---|
| Return type: | string or False (Can be used as a predicate.) |
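A minimal sketch of such a lookup, assuming the ignore patterns are shell-style globs matched with the standard fnmatch module. Both the matching style and the function name `first_matching_ignore` are assumptions made for illustration.

```python
import fnmatch


def first_matching_ignore(ignores, fname):
    """Return the first pattern in ignores that matches fname, else False."""
    for pattern in ignores:
        # fnmatchcase compares case-sensitively on every platform.
        if fnmatch.fnmatchcase(fname, pattern):
            return pattern
    return False


print(first_matching_ignore(["*.pyc", "*.o"], "foo.o"))  # *.o
print(first_matching_ignore(["*.pyc", "*.o"], "foo.c"))  # False
```

Returning the pattern itself rather than True lets the caller report which rule caused a file to be skipped, while the result still works as a predicate.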
Raised when we can’t make any authoritative determination. At the top level, this is an error condition as this case indicates that we’ve failed to accomplish our job. Note that this is significantly different from the non-True value returned by Comparator subclasses to indicate that they have no authoritative result.
Rcmp uses the standard Python logging facility. The only non-obvious bits are the levels: definitive differences are logged at WARNING, definitive Sames at WARNING - 1, and indeterminate results at WARNING - 2. Lowering the threshold therefore yields linearly increasing volumes of logging information, with the information that is usually most important appearing first.
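The three levels translate into thresholds as sketched below. The logger name `rcmp_logging_sketch` and the helper `visible_at` are hypothetical and exist only for this illustration; the numeric levels follow from `logging.WARNING` in the standard library.

```python
import logging

LEVELS = {
    "different": logging.WARNING,          # definitive differences
    "same": logging.WARNING - 1,           # definitive Sames
    "indeterminate": logging.WARNING - 2,  # indeterminate results
}

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("rcmp_logging_sketch")


def visible_at(threshold):
    """Return the kinds of result that a given threshold lets through."""
    logger.setLevel(threshold)
    return [name for name, level in LEVELS.items()
            if logger.isEnabledFor(level)]


print(visible_at(logging.WARNING))      # ['different']
print(visible_at(logging.WARNING - 1))  # ['different', 'same']
print(visible_at(logging.WARNING - 2))  # ['different', 'same', 'indeterminate']
```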
Note
I keep thinking that it would be better to create an IgnoringComparator that simply returned Same. It would make much of the code much simpler. However, it would mean that we’d build entire trees in some cases and compare them all just to produce constants. This way we clip the tree.