A manager for storing generated files.
A bucket where we store files with the same hash sums.
Warning
Bucket is not thread-safe!
Buckets store ‘source’ files and their representations. A representation is simply another file, optionally marked with a ‘suffix’. This is meant to be used for cases like a certain office document (the ‘source’ file) for which different converted representations (for instance an HTML or PDF version) might be stored.
For each source file there can be an arbitrary number of representations, as long as each representation provides a different ‘suffix’. The Bucket does not introspect the files and makes no assumptions about the file-type or format. So, you could store a PDF representation with an ‘xhtml’ suffix if you like.
The ‘suffix’ for a representation is a simple string and can be chosen by the user. Normally, you would choose something like ‘pdf’ for a PDF version of a certain source file.
Each bucket can hold several source files and knows which representations belong to which source file.
To make a distinction between different sources inside the same bucket, the bucket manages ‘markers’ which normally are simple stringified numbers, one for each source and the representations connected to it. You should, however, make no assumptions about the marker, except that it is a string.
Currently, you can store as many source files in a bucket as the maximum integer number can address.
Create the default dirs for this bucket.
This method is called when instantiating a bucket.
You should therefore be aware that constructing a bucket will try to modify the file system.
Get the paths of all source files stored in this bucket.
Returns a generator of paths.
Get current source num.
Get the cached result for path.
Get the path of a result file stored with the given marker and suffix.
If the path does not exist, None is returned.
Get the path of a source file that equals the file stored in path.
Returns a tuple (path, marker) or (None, None) if the source cannot be found.
Set current source num.
Store file in result_path as representation of source in src_path.
Optionally store this result marked with a certain suffix string.
The result has to be a path to a single file.
If suffix is given, the representation will be stored marked with the suffix in order to be able to distinguish this representation from possible others.
If the source file given by src_path already exists in the bucket, the file in result_path will be stored as a representation of the already existing source file.
We determine whether an identical source file already exists in the bucket by comparing the given file in src_path byte-wise with the source files already stored in the bucket.
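The byte-wise comparison could be sketched as follows. This is an illustration only; the helper name and chunked-reading approach are assumptions, not part of the Bucket API:

```python
import os

def files_identical(path1, path2, chunk_size=8192):
    """Compare two files byte-wise, reading in chunks to avoid
    loading large files into memory at once."""
    # A cheap size check lets us bail out early on obvious mismatches.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted: they are identical
                return True
```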
A cache manager.
This cache manager caches processed files and their sources. It uses hashes and buckets to find paths of cached files quickly.
Overall, it maps input files to output files. The cache manager is useful when the computation of an output file is expensive but must be repeated often.
A sample application is to cache converted office files: as the conversion is expensive, we can store its results in the cache manager and retrieve them much more quickly any time we want. See cachemanager.txt for more information.
It also checks for hash collisions: if two input files give the same hash, they will be handled correctly.
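The core idea, digest-based lookup plus byte-wise collision handling, can be sketched as a toy in-memory model. The class and method names here are invented for illustration; the real CacheManager works on the filesystem:

```python
import hashlib

class ToyCache:
    """Toy sketch: map source contents to results via MD5 digests,
    telling same-digest sources apart by comparing their raw bytes."""

    def __init__(self):
        # digest -> list of (source_bytes, {suffix: result})
        self._buckets = {}

    def register(self, source_bytes, result, suffix='default'):
        digest = hashlib.md5(source_bytes).hexdigest()
        bucket = self._buckets.setdefault(digest, [])
        for num, (src, results) in enumerate(bucket):
            if src == source_bytes:  # byte-wise collision check
                results[suffix] = result
                return '%s_%d' % (digest, num + 1)
        bucket.append((source_bytes, {suffix: result}))
        # Markers combine the digest with a per-bucket source number.
        return '%s_%d' % (digest, len(bucket))

    def get(self, source_bytes, suffix='default'):
        digest = hashlib.md5(source_bytes).hexdigest()
        for src, results in self._buckets.get(digest, []):
            if src == source_bytes:
                return results.get(suffix)
        return None
```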
Check whether the file in path, or the file identified by marker and suffix, is already cached.
This is a convenience method for easy checking of caching state for certain files. You can also get the information by using other API methods of CacheManager.
You must give exactly one of path or marker, not both.
The suffix parameter is optional.
Returns True or False.
Return all source documents.
Get a bucket in which a source with ‘hash_digest’ would be stored.
Note
This call creates the appropriate bucket in filesystem if it does not exist already!
Get a bucket in which the source given by path would be stored.
Note
This call creates the appropriate bucket in filesystem if it does not exist already!
Check, whether the file in path is already cached.
Returns the path of cached file or None. Only ‘result’ files are looked up and returned, not sources.
This method does not modify the filesystem if an appropriate bucket does not yet exist.
Check whether a bucket exists for marker and suffix.
Returns the path to a file represented by that marker or None.
A bucket exists if a doc was already registered that returned that marker on registration.
The bucket might contain a representation of type suffix. If so, the path to the file is returned; otherwise None.
Get the hash of a file stored in path.
Currently we compute the MD5 digest.
Note for derived classes: the hash digest computed by this method should contain only characters that can easily be processed as path elements in URLs. For instance, slashes (which can occur in Base64-encoded strings) could make things difficult.
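A hex digest, as produced by hashlib, satisfies this requirement, while a raw Base64 digest may not. The snippet below illustrates the difference; the sample data matches the ‘Just a dummy file.’ source used in the doctests:

```python
import base64
import hashlib

data = b'Just a dummy file.'

# Hex digests use only the characters [0-9a-f] and are therefore
# safe as path elements and in URLs.
hex_digest = hashlib.md5(data).hexdigest()

# Base64-encoded digests may contain '/' and '+', which are awkward
# in filesystem paths and URLs.
b64_digest = base64.b64encode(hashlib.md5(data).digest()).decode('ascii')
```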
Prepare the cache dir, create dirs, etc.
Store the file found in to_cache in a bucket, as a representation of the source in source_path.
If suffix is not None the representation will be stored under the suffix name. A suffix is only a name and the cache manager makes no assumptions about file types or similar.
Returns a marker string which can be used in conjunction with the appropriate cache manager methods to retrieve the file later on.
The suffix used internally in buckets.
A cache manager tries to cache converted files, so that already converted documents do not have to be converted again.
A cache manager expects a cache_dir parameter where it can store the cached files. If this parameter is set to None no caching will be performed at all:
>>> from ulif.openoffice.cachemanager import CacheManager
>>> cm = CacheManager(cache_dir=None)
If we pass a path that already exists and is a file, the cache manager will complain but still be constructed.
If we pass a path that does not exist, it will be created:
>>> ls('home')
>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> ls('home')
d mycachedir
The cache manager can register files, look for already created conversions and pass them back if found.
We look up a certain document which, as the cache is still empty, cannot be found. We create a dummy file for this purpose:
>>> import os
>>> open('dummysource.doc', 'w').write('Just a dummy file.')
>>> docsource = os.path.abspath('dummysource.doc')
>>> docsource_contents = open(docsource, 'r').read()
>>> cm.contains(docsource, suffix='pdf')
False
We can also pass the file contents as argument:
>>> #cm.contains(extension='pdf', data=docsource_contents)
>>> #False
The cache is based on MD5 sums of source files. Source documents are not stored.
We can pass level to the constructor, if we want a directory level different from 1:
>>> cm = CacheManager(cache_dir='home/mycachedir', level=2)
>>> cm.level
2
This will result in a different organization of all the cached files and directories inside the caching directory. See the section below to learn more about this purely internal feature.
Setting the level after creation of a cache manager is not recommended.
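As the directory listings below show, at level 1 a digest such as ‘08867237840fabae77b838e9c9226eb2’ lives under a subdirectory named after its first two characters. A plausible sketch of how a level-n layout could split the digest follows; the exact splitting scheme for levels above 1 is an assumption, the real implementation may differ:

```python
import os

def bucket_path(cache_dir, digest, level=1):
    """Sketch: peel off `level` two-character prefixes of the digest
    as nested subdirectories, then append the full digest."""
    parts = [digest[2 * i:2 * i + 2] for i in range(level)]
    return os.path.join(cache_dir, *(parts + [digest]))
```

At level 1 this reproduces the layout shown in the listings below, e.g. ``home/mycachedir/08/08867237840fabae77b838e9c9226eb2``.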
We can register conversion results with the cache manager, which will then be available later on.
To demonstrate this, we create a dummy source file and a dummy converted file:
>>> import os
>>> open('dummysource.doc', 'w').write('Just a dummy file.')
>>> docsource = os.path.abspath('dummysource.doc')
>>> docsource_contents = open(docsource, 'r').read()
>>> open('dummyresult.pdf', 'w').write('I am not a real PDF.')
>>> pdfresult = os.path.abspath('dummyresult.pdf')
Now we can create a cache manager and register our stuff:
>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> cm.registerDoc(source_path=docsource,
... to_cache=pdfresult)
'08867237840fabae77b838e9c9226eb2_1'
The string we get back here is a unique marker we can use to identify the uploaded file (see also usage of markers below).
This will create the needed directories inside the cache dir and store the source and result files in them:
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/')
- data
d results
d sources
The ‘data’ file contains some pickled management infos.
While in ‘sources’ all sources with the same hash are stored, the ‘results’ dir contains all results belonging to a certain source:
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
- source_1
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
- result_1_default
We can, however, also store a file with a certain ‘suffix’ in order to cache several results for one source. For example, we might want to cache a PDF and an HTML version of the same file.
To do so, we have to provide a suffix on doc registration:
>>> cm.registerDoc(source_path=docsource,
... to_cache=pdfresult,
... suffix='pdf')
'08867237840fabae77b838e9c9226eb2_1'
We get back the marker of the source file we used. It’s the same as above. Actually, we now have several files stored in the bucket. First, the source file, which we store to be able to compare upcoming docs with it:
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
- source_1
Then, we store the result file:
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
- result_1__pdf
- result_1_default
The cache manager notices that the delivered source was the same as the first time, and so it only stored the new result with the suffix in its name.
This becomes more obvious when we register a certain result file as an HTML result:
>>> cm.registerDoc(source_path=docsource,
... to_cache=pdfresult,
... suffix='html')
'08867237840fabae77b838e9c9226eb2_1'
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
- source_1
>>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
- result_1__html
- result_1__pdf
- result_1_default
It is up to the caller to choose any suffix she likes.
When we want to get the result for some input file, we can do so:
>>> cm.getCachedFile(docsource)
'/sample-buildout/home/mycachedir/.../results/result_1_default'
>>> cm.getCachedFile(docsource, suffix='pdf')
'/sample-buildout/home/mycachedir/.../results/result_1__pdf'
>>> cm.getCachedFile(docsource, suffix='html')
'/sample-buildout/home/mycachedir/.../results/result_1__html'
If a file was not cached yet, we will get None:
>>> cm.getCachedFile(docsource, suffix='blah') is None
True
The cache manager relies heavily on hash (MD5) digests to find a cached document quickly. However, hash collisions can occur.
We create a cache manager with a trivial hash algorithm to see this:
>>> from ulif.openoffice.cachemanager import CacheManager
>>> class NotHashingCacheManager(CacheManager):
... def getHash(self, path=None):
... return 'somefakedhash'
>>> cm_dir = 'home/newcachedir'
>>> cm = NotHashingCacheManager(cache_dir=cm_dir)
We create two sources to store:
>>> import os
>>> open('dummysource1.doc', 'w').write('Just a dummy file.')
>>> open('dummysource2.doc', 'w').write('Another dummy file.')
>>> docsource1 = os.path.abspath('dummysource1.doc')
>>> docsource2 = os.path.abspath('dummysource2.doc')
Now we create some dummy result files and register both pairs of them:
>>> open('dummyresult1.pdf', 'w').write('Fake result 1')
>>> open('dummyresult1.html', 'w').write('Fake result 2')
>>> open('dummyresult2.pdf', 'w').write('Fake result 3')
>>> open('dummyresult2.html', 'w').write('Fake result 4')
>>> result1 = os.path.abspath('dummyresult1.pdf')
>>> result2 = os.path.abspath('dummyresult1.html')
>>> result3 = os.path.abspath('dummyresult2.pdf')
>>> result4 = os.path.abspath('dummyresult2.html')
>>> m1 = cm.registerDoc(source_path=docsource1,
... to_cache=result1,
... suffix='pdf')
>>> m2 = cm.registerDoc(source_path=docsource1,
... to_cache=result2,
... suffix='html')
>>> m3 = cm.registerDoc(source_path=docsource2,
... to_cache=result3,
... suffix='pdf')
>>> m4 = cm.registerDoc(source_path=docsource2,
... to_cache=result4,
... suffix='html')
All these sources give the same hash and are therefore stored in the same bucket:
>>> ls(cm_dir, 'so', 'somefakedhash', 'sources')
- source_1
- source_2
>>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_1')
Just a dummy file.
>>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_2')
Another dummy file.
All results are connected to their respective source via a number in the filename:
>>> ls(cm_dir, 'so', 'somefakedhash', 'results')
- result_1__html
- result_1__pdf
- result_2__html
- result_2__pdf
>>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_1__pdf')
Fake result 1
>>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_2__pdf')
Fake result 3
We can use unique markers to distinguish between different files in a bucket. The markers are handed out by the cache manager. Actually, we already got such markers: they were returned when registering the files above:
>>> m1
'somefakedhash_1'
>>> m2
'somefakedhash_1'
>>> m3
'somefakedhash_2'
Using these markers we can get cached files back directly:
>>> #cached_file_info = cm.getFileFromMarker(m1)
>>> #cached_file_info.filename
>>> #'result1'
>>> #cached_file_info.source_filename
>>> #'docsource1'
>>> #cached_file_info.path
>>> #'/.../so/somefakedhash/results/result_1_pdf'
If a marker is not valid, i.e. it is not linked with a file, we will get None:
>>> #cm.getFileFromMarker('blah') is None
>>> #True
A cache manager can list all source files stored in it.
>>> cm = CacheManager(cache_dir='home/mycachedir')
>>> [x for x in cm.getAllSources()]
['/.../mycachedir/08/08867237840fabae77b838e9c9226eb2/sources/source_1']