Welcome to rcmp’s documentation!

RCMP

Rcmp is a more flexible replacement for filecmp from the standard Python library.

The basic idea is that, depending on content, files don’t always have to be bitwise identical in order to be equivalent or “close enough” for many purposes, such as comparing the results of two builds. For example, some (broken) file formats embed a time stamp recording when the file was produced, even though the file system already tracks this information. Build the same file twice and the two copies will initially appear to differ because of the embedded time stamp. Only when the irrelevant time stamp differences are ignored do the two files turn out to be otherwise the same.

Rcmp includes a flexible extension structure to allow for precisely these sorts of living and evolving comparisons.

Extended Path Names

Rcmp is capable of recursively descending into a number of different file types including:

  • file system directories
  • archival and aggregating types including:
  • compressed files including:

In order to describe file locations that may extend beyond traditional file system paths, rcmp introduces an extended path naming scheme. Traditional paths are described using the usual slash-separated list of names, for example /etc/hosts. Components that are contained in other files, like a file located within a tar archive, are described using a sequence of brace-enclosed file format separators. So, for instance, a file named foo located within a gzip-compressed (.gz) tar archive named tarchive.tar would be described as tarchive.tar.gz{gzip}tarchive.tar{tar}foo. These can be combined with ordinary paths, as in /home/rich/tarchive.tar.gz{gzip}tarchive.tar{tar}foo.

Items which are not in the file system proper are referred to internally as being “boxed”.
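
The naming scheme above can be illustrated with a small parser. This is only a sketch: split_extended_path and is_boxed are hypothetical helpers written for this example, not functions that rcmp is known to provide, and the regular expression assumes that format names never contain braces.

```python
import re

# Hypothetical helpers for illustration; not part of the rcmp API.
# A separator is a file format name enclosed in braces, such as {tar}.
_SEPARATOR = re.compile(r"\{([^{}]+)\}")


def split_extended_path(path):
    """Return (steps, leaf); steps is a list of (container, format) pairs."""
    # With one capture group, re.split yields:
    # container, format, container, format, ..., leaf
    pieces = _SEPARATOR.split(path)
    steps = list(zip(pieces[0:-1:2], pieces[1::2]))
    return steps, pieces[-1]


def is_boxed(path):
    """A path containing at least one {format} separator names a boxed item."""
    return bool(_SEPARATOR.search(path))


steps, leaf = split_extended_path(
    "/home/rich/tarchive.tar.gz{gzip}tarchive.tar{tar}foo")
```

Here steps lists each enclosing file together with the format used to open it, outermost first, and leaf is the innermost member name.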

Script Usage

Rcmp is both a library and a command line script for driving the library.

Class Architecture

class rcmp.Item(name)

Things which can be compared are represented internally by instances of class Item. These can be items in the file system, like a file or directory, or in an archive, like an archive member.

This is used for caching the results from calls like stat and for holding content.

Parameters: name (string) – file system name
boxed

Returns True if and only if we are “boxed”. That is, if we are not located directly in the file system but instead are encapsulated within some other file.

Return type: boolean
close()

Close any outstanding file descriptor if relevant.

content

The contents of the entire file, in memory.

Return type: bytearray, or possibly an mmap’d section of the file
device

Return device number from stat.

Return type: string
exists

Check for existence. Boxed items always exist. Unboxed items exist if they exist in the file system.

Return type: boolean
fd

If we have a file descriptor, return it. If not, then open one, cache it, and return it.

Return type: file
inode

Return the inode number from stat.

Return type: string
isdir

Return True if and only if we represent a file system directory.

Return type: boolean
islnk

Return True if and only if we represent a symbolic link.

Return type: boolean
isreg

Return True if and only if we represent a regular file.

Return type: boolean

Return a string representing the path to which the symbolic link points. This presumes that we are a symbolic link.

Return type: string
name

The name of this Item in the extended file system name space.

Return type: string
size

Return our size. If we don’t already know it, look it up in stat and cache the result.

Return type: int
stat

If we have a statbuf, return it.

If not, then look one up, cache it, and return it.

Return type: statbuf
class rcmp.Items

There is a global set of all instances of class Item stored in the singular class Items.

This exists primarily to prevent us from creating a duplicate Item for the same path name.

Note

The class itself is used directly as a global aggregator, that is, as a singleton. It is never instantiated.

classmethod delete(name)

Delete an Item from the set.

Parameters: name (string) – name of the Item to be deleted
classmethod find_or_create(name)

Look up an Item with name. If necessary, create it.

Parameters: name (string) – the name of the Item to look up
Return type: Item
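
The class-as-singleton pattern described above can be sketched as follows. This is a minimal illustration, not rcmp's implementation: the Item stand-in only remembers its name, and the _content dictionary is an assumed storage detail that the documentation does not specify.

```python
class Item:
    """Stand-in for rcmp.Item that only remembers its name."""

    def __init__(self, name):
        self.name = name


class Items:
    """Class-level registry; used directly and never instantiated."""

    _content = {}  # assumed storage detail: maps name -> Item

    @classmethod
    def find_or_create(cls, name):
        # Reuse the Item already registered under this name, if any;
        # otherwise create one and remember it.
        if name not in cls._content:
            cls._content[name] = Item(name)
        return cls._content[name]

    @classmethod
    def delete(cls, name):
        # Forget the Item registered under this name.
        del cls._content[name]


first = Items.find_or_create("/etc/hosts")
second = Items.find_or_create("/etc/hosts")
```

Because both lookups use the same name, first and second refer to the same object, which is what prevents duplicate Items for one path name.
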
class rcmp.Same

Returned to indicate an authoritative claim of sufficient identicality. No further comparators need be tried.

Note

The class itself is used as a constant. It is never instantiated.

class rcmp.Different

Returned to indicate an authoritative claim of difference. No further comparators need be tried.

Note

The class itself is used as a constant. It is never instantiated.

class rcmp.Comparator

Represents a single comparison heuristic. This is an abstract class. It is intended solely to act as a base class for subclasses. It is never instantiated.

Subclasses based on Comparator implement individual heuristics for comparing items when applied to a Comparison. There are many Comparator subclasses included.

There are no instantiation variables nor properties.

applies(comparison)

Return True if and only if we apply to the given comparison.

Return type: boolean
cmp(comparison)

Apply ourselves to the given Comparison.

If we can make an authoritative determination about whether the Items are alike, then return either Same or Different. If we can make no such determination, then return a non-True value.

Return type: Same, Different, or a non-True value
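
The applies/cmp protocol can be sketched with stand-in classes. Everything below is an assumption made for illustration: Same, Different, and Comparator are local stand-ins rather than imports from rcmp, SizeMismatchComparator is a hypothetical heuristic, the pair holds plain byte strings instead of Items, and instance methods are used although this page does not say how rcmp declares them.

```python
from types import SimpleNamespace


class Same:
    """Stand-in constant: an authoritative claim that the items are alike."""


class Different:
    """Stand-in constant: an authoritative claim that the items differ."""


class Comparator:
    """Stand-in base class exposing the two documented methods."""

    def applies(self, comparison):
        return False

    def cmp(self, comparison):
        return False


class SizeMismatchComparator(Comparator):
    """Hypothetical heuristic: contents of unequal length cannot match."""

    def applies(self, comparison):
        return True

    def cmp(self, comparison):
        left, right = comparison.pair
        if len(left) != len(right):
            return Different
        # Equal lengths prove nothing, so report an indeterminate result.
        return False


# A minimal object with a .pair attribute stands in for a Comparison.
comparison = SimpleNamespace(pair=(b"abc", b"abcd"))
verdict = SizeMismatchComparator().cmp(comparison)
```

The design point is that a heuristic only answers when it is sure. Here unequal lengths yield Different, while equal lengths yield a non-True value so that a later comparator can decide.
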
class rcmp.Aggregator(comparators=[])

This is an abstract base class intended for things which are composed of other things, for instance a directory or a file archive.

cmp(comparison)

Compare our lists and return the result.

class rcmp.Comparison(lname=u'', rname=u'', litem=False, ritem=False, comparators=False, ignores=[], exit_asap=False)

Represents a pair of objects to be compared.

An instance of Comparison comprises a pair of Item, a list of Comparator, and a method for applying the list of Comparator to the pair of Item and returning an answer.

If exit_asap is true, the first difference will end the comparison. If it is not true, the comparison will continue despite knowing that our aggregate result is that we are Different. This is useful for getting a complete list of all differences.

exit_asap=False is like “make -k” in the sense that it reports on all differences rather than stopping after the first.

Parameters:
  • lname (string) – path name of the first thing, (the leftmost one)
  • rname (string) – path name of the second thing, (the rightmost one)
  • comparators (list of Comparator) – list of comparators to be applied
  • ignores (list of strings) – wild card patterns of path names to be ignored
  • exit_asap (boolean) – exit as soon as possible
cmp()

Compare our pair of Item.

Run through our list of Comparator calling each one in turn with our pair of Item. Each comparator is expected to return either:

  • any non-True value (None, False, etc.): an indeterminate result, meaning that this particular comparator could make no authoritative determination and that the next comparator in the list should be tried
  • Same: an authoritative declaration that the items are sufficiently alike, so no further comparators need be tried
  • Different: an authoritative declaration that the items are insufficiently alike, so no further comparators need be tried

If no Comparator returns an authoritative result (Same or Different), then IndeterminateResult is raised.
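
The dispatch loop described above might look roughly like the following. This is a simplified sketch under stated assumptions: comparators are plain functions taking the pair, the pair holds byte strings rather than Items, and run_comparators, bitwise, and fail are hypothetical names. Same, Different, and IndeterminateResult are local stand-ins for the rcmp classes of the same names.

```python
class Same:
    """Stand-in constant: authoritative 'sufficiently alike'."""


class Different:
    """Stand-in constant: authoritative 'insufficiently alike'."""


class IndeterminateResult(Exception):
    """Stand-in for the exception raised when no comparator decides."""


def run_comparators(pair, comparators):
    """Try each comparator in order; the first true result is final."""
    for comparator in comparators:
        result = comparator(pair)
        if result:  # Same or Different; anything non-True means "next"
            return result
    raise IndeterminateResult("no comparator could decide")


def bitwise(pair):
    # Authoritative only when the contents are identical.
    return Same if pair[0] == pair[1] else False


def fail(pair):
    # Catchall placed last so that the chain always reaches a verdict.
    return Different


equal_verdict = run_comparators((b"a", b"a"), [bitwise, fail])
unequal_verdict = run_comparators((b"a", b"b"), [bitwise, fail])
```

Ending the list with a catchall such as fail guarantees a verdict. Without it, an undecided pair raises IndeterminateResult.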

pair

A two-item list of the items to be compared.

class rcmp.ComparisonList(stuff, comparators=False, ignores=[], exit_asap=False)

Represents a pair of lists of path names to be compared - one from column a, one from column b, etc.

An instance of ComparisonList is very similar to a Comparison except that instead of a pair of Items, it comprises a pair of lists of path names.

Parameters: stuff (a 2-element list of lists of string) – path names to be compared

In all other ways, this class resembles Comparison.

Comparators

Listed in default order of application:

class rcmp.NoSuchFileComparator

Objects are different if either one is missing.

class rcmp.InodeComparator

Objects with the same inode and device are identical.
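
The inode test can be sketched with the standard library alone. same_inode is a hypothetical helper, not rcmp's code; it assumes both paths exist in the file system, since boxed items have no inode of their own.

```python
import os
import tempfile


def same_inode(left_path, right_path):
    """True when both paths resolve to the same device and inode."""
    left = os.stat(left_path)
    right = os.stat(right_path)
    return (left.st_dev, left.st_ino) == (right.st_dev, right.st_ino)


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original")
    other = os.path.join(workdir, "other")
    for path in (original, other):
        with open(path, "w") as handle:
            handle.write("payload\n")
    self_match = same_inode(original, original)
    distinct_match = same_inode(original, other)
```

The check is cheap because it never reads file contents. The two separate files in the demonstration do not match here even though their contents are equal, so deciding that pair is left to a later comparator.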

class rcmp.EmptyFileComparator

Two files which are each empty are equal. In particular, we don’t need to open them or read them to make this determination.

class rcmp.DirComparator(comparators=[])

Objects which are directories are special. They match if their contents match.

class rcmp.ArMemberMetadataComparator

Verify the metadata of each member of an ar archive.

class rcmp.BitwiseComparator

Objects which are bitwise identical are close enough.

class rcmp.SymlinkComparator

Symlinks are equal if they point to the same place.

class rcmp.BuriedPathComparator

Files which differ only in that they have their paths buried in them aren’t really different.

(currently unused).

class rcmp.ElfComparator

Elf files are different if any of the important sections are different.

class rcmp.ArComparator(comparators=[])

Ar archive files are different if any of the important members are different.

class rcmp.AMComparator

Automake generated Makefiles have some nondeterminisms. They’re the same if they’re the same aside from that. (May also need to make some allowance for different tool sets later.)

class rcmp.ConfigLogComparator

When autoconf tests fail, there’s a line written to the config.log which exposes the name of the underlying temporary file. Since the name of this temporary file changes from build to build, it introduces a nondeterminism.

Note

I’d ignore config.log files, (and started to do exactly that), but it occurs to me that differences in autoconf configuration are quite likely to cause build differences. So I’ve been more surgical.

class rcmp.KernelConfComparator

When “make config” is run in the kernel, it generates an auto.conf file which includes a time stamp. I think these files are important enough to merit more surgical checking. This comparator blots out the 4th line.

class rcmp.ZipComparator(comparators=[])

Zip archive files are different if any of the members are different.
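
A member-wise comparison of zip archives could be sketched like this. build_zip and zip_members_match are hypothetical helpers that compare only member names and contents. Whether rcmp's ZipComparator also inspects member metadata is not stated here.

```python
import io
import zipfile


def build_zip(members, date_time):
    """Create an in-memory zip whose members all carry date_time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=date_time)
            archive.writestr(info, data)
    return buffer.getvalue()


def zip_members_match(left_bytes, right_bytes):
    """Compare two zip archives by member names and member contents."""
    with zipfile.ZipFile(io.BytesIO(left_bytes)) as left:
        with zipfile.ZipFile(io.BytesIO(right_bytes)) as right:
            if sorted(left.namelist()) != sorted(right.namelist()):
                return False
            return all(left.read(name) == right.read(name)
                       for name in left.namelist())


members = {"foo": b"hello\n", "bar": b"world\n"}
first = build_zip(members, (2020, 1, 1, 0, 0, 0))
second = build_zip(members, (2021, 6, 1, 12, 0, 0))
archives_bitwise_equal = first == second
members_equal = zip_members_match(first, second)
```

The two archives differ as byte strings because each member header records its modification time, yet every member's content is the same.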

class rcmp.TarComparator(comparators=[])

Tar archive files are different if any of the important members are different.

Note

Must be called before GzipComparator in order to exploit the Python tarfile module’s ability to open compressed archives.
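
The tarfile behaviour that this note relies on can be shown with the standard library. build_tar_gz and read_member are hypothetical helpers for this example. The sketch shows only that tarfile opens a gzip-compressed archive directly, not how rcmp orders its comparators internally.

```python
import io
import tarfile


def build_tar_gz(name, data):
    """Create an in-memory gzip-compressed tar archive with one member."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_member(archive_bytes, name):
    """Read one member; mode "r:*" lets tarfile detect the compression."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        return archive.extractfile(name).read()


payload = read_member(build_tar_gz("foo", b"hello\n"), "foo")
```

Because tarfile handles the decompression itself, a .tar.gz file can be treated as a single tar step instead of a gzip step followed by a tar step.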

class rcmp.GzipComparator(comparators=[])

Gzip archives only have one member but the archive itself sadly includes a timestamp. You can see the timestamp using “gzip -l -v”.
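
The effect of the embedded timestamp can be reproduced with the standard gzip module. gzip_bytes is a hypothetical helper; the sketch only demonstrates the nondeterminism and is not rcmp's comparison code.

```python
import gzip
import io


def gzip_bytes(data, mtime):
    """Compress data in memory, storing mtime in the gzip header."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as stream:
        stream.write(data)
    return buffer.getvalue()


first = gzip_bytes(b"same payload\n", mtime=1)
second = gzip_bytes(b"same payload\n", mtime=2)

# The compressed files differ only because of the header timestamp.
bitwise_equal = first == second
payload_equal = gzip.decompress(first) == gzip.decompress(second)
```

Comparing the decompressed member rather than the compressed bytes is one way to ignore the timestamp.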

class rcmp.CpioMemberMetadataComparator

Verify the metadata of each member of a cpio archive.

class rcmp.CpioComparator(comparators=[])

Cpio archive files are different if any of the important members are different.

class rcmp.DateBlotBitwiseComparator

Objects which are bitwise identical after date blotting are close enough. But this should only be tried late.

class rcmp.FailComparator

Used as a catchall: always returns Different.

Utilities

rcmp.date_blot(input_string)

Convert dates embedded in a string into innocuous constants of uniform length.

Parameters: input_string – input string
Return type: string
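
Date blotting can be sketched with regular expressions. The two patterns and the replacement constant below are assumptions chosen for illustration, and date_blot_sketch is a hypothetical name. The formats that rcmp.date_blot actually recognizes are not listed on this page.

```python
import re

# Assumed patterns; rcmp's own list of date formats may differ.
_DATE_PATTERNS = [
    # ISO-like: 2013-05-17 14:03:59 or 2013-05-17T14:03:59
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
    # ctime-like: Fri May 17 14:03:59 2013
    re.compile(
        r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
        r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) "
        r"[ \d]\d \d{2}:\d{2}:\d{2} \d{4}"
    ),
]

# Every recognized date becomes this constant, so the blotted strings
# have the same length whatever form the original date took.
_BLOT = "DATE-BLOTTED"


def date_blot_sketch(input_string):
    """Replace each recognized date with an innocuous constant."""
    for pattern in _DATE_PATTERNS:
        input_string = pattern.sub(_BLOT, input_string)
    return input_string


first = date_blot_sketch("built 2013-05-17 14:03:59 ok")
second = date_blot_sketch("built 2014-01-02 03:04:05 ok")
```

After blotting, two strings that differ only in their embedded dates compare equal.
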
rcmp.ignoring(ignores, fname)

Given a list of file names to be ignored and a specific file name to check, return the first ignore pattern from the list that matches the file name.

Parameters:
  • ignores (list of strings) – ignore patterns
  • fname (string) – file name to check
Return type: string or False (can be used as a predicate)
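
The documented behaviour can be sketched with the standard fnmatch module. ignoring_sketch is a hypothetical stand-in, and the use of fnmatch is an assumption. This page says only that the patterns are wild cards, not which matching rules rcmp applies.

```python
import fnmatch


def ignoring_sketch(ignores, fname):
    """Return the first pattern in ignores that matches fname, else False."""
    for pattern in ignores:
        # fnmatchcase avoids the platform-dependent case folding of fnmatch.
        if fnmatch.fnmatchcase(fname, pattern):
            return pattern
    return False


matched = ignoring_sketch(["*.o", "*.pyc"], "build/foo.pyc")
unmatched = ignoring_sketch(["*.o", "*.pyc"], "README")
```

Returning the matching pattern rather than True lets a caller both test the result and report which pattern caused the file to be skipped.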

Exceptions

exception rcmp.RcmpException

Base class for all rcmp exceptions

exception rcmp.IndeterminateResult

Raised when we can’t make any authoritative determination. At the top level, this is an error condition as this case indicates that we’ve failed to accomplish our job. Note that this is significantly different from the non-True value returned by Comparator subclasses to indicate that they have no authoritative result.

Logging strategy:

Rcmp uses the standard Python logging facility. The only non-obvious point is the choice of levels: definitive differences are logged at WARNING, definitive Sames at WARNING - 1, and indeterminate results at WARNING - 2. Raising the verbosity therefore adds information one level at a time, with the usually more important information first.
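
The level arithmetic can be illustrated as follows. The constant names and the logger name are hypothetical, since this page does not say which logger rcmp writes to; only the numeric levels come from the description above.

```python
import logging

# Levels as described above; the constant names are illustrative only.
DIFFERENCES = logging.WARNING         # definitive differences
SAMES = logging.WARNING - 1           # definitive Sames
INDETERMINATES = logging.WARNING - 2  # indeterminate results

# Hypothetical logger name used only for this sketch.
logger = logging.getLogger("rcmp-logging-sketch")

# Lowering the threshold by one step adds the Sames while still
# suppressing indeterminate results.
logger.setLevel(SAMES)

shows_differences = logger.isEnabledFor(DIFFERENCES)
shows_sames = logger.isEnabledFor(SAMES)
shows_indeterminates = logger.isEnabledFor(INDETERMINATES)
```

Each step down in threshold reveals exactly one more category of message.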

Note

I keep thinking that it would be better to create an IgnoringComparator that simply returned Same. It would make much of the code much simpler. However, it would mean that we’d build entire trees in some cases and compare them all just to produce constants. This way we clip the tree.
