A loader is a way to import a dataset into OpenSpending. A loader reads one or more dataset sources and creates Entry, Entity and Classifier objects in the database. You can use whatever Python can read, for example csv, json, xls or xml files. We also provide helper functions to read data directly from Google Docs.
Note that we are working on a way to let users import csv files through the OpenSpending web interface.
In this section you will write a demo loader to import a simple csv file that provides all the data we need. The minimal set of data you need to import spending entries is (named with their internal keys): a unique ``name``, the ``amount``, the ``time`` of the transaction, the spender (``from``), the recipient (``to``) and the ``currency``.
Not all of this information has to be present in your data source. A common case is that the recipient of all spending entries is the general public ("society"), or that the ``currency`` is always the same; in such cases you can hardcode the values in your data loader.
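For example, if the currency is the same for every entry, a small helper (purely illustrative, not part of the loader API, and written for Python 3) can merge such constants into each source row before it is processed:

```python
DATASET_CONSTANTS = {'currency': 'EUR'}  # the same for every entry in this dataset

def with_constants(row):
    """Merge dataset-wide constant values into a single source row."""
    merged = dict(row)               # copy, so the caller's dict is untouched
    merged.update(DATASET_CONSTANTS)
    return merged

row = with_constants({'id': 'demo-sp-001', 'amount': '1200000'})
```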
The demo source file is a csv file that contains one spending Entry per line. You know that the currency of all entries is EUR. The file contains the following columns:
Here is your data (download demoloader.csv):
id | spender_id | spender_name | recipient_id | recipient_name | region | sector | date | amount |
---|---|---|---|---|---|---|---|---|
demo-sp-001 | dfes | Department for Education | dtlr | Department for the Regions | North Yorkshire | Social Protection | 2010-01-01 | 1200000 |
demo-sp-002 | dfes | Department for Education | dtlr | Department for the Regions | North Yorkshire | Education | 2010-02-01 | 800000 |
demo-sp-003 | dfes | Department for Education | society | General Public | North Yorkshire | Education | 2011-03-01 | 500000 |
demo-sp-004 | dtlr | Department for the Regions | society | General Public | Hartlepool | Health | 2011-04-01 | 1400000 |
demo-sp-005 | dtlr | Department for the Regions | dfes | Department for Education | Hartlepool | Health | 2010-05-01 | 260000 |
demo-sp-006 | dtlr | Department for the Regions | dfes | Department for Education | Hartlepool | Social Protection | 2011-06-01 | 1150000 |
First you create a Loader. This is a class that lets you create Entry, Entity and Classifier objects in an OpenSpending site easily, quickly and reliably, without deep knowledge of the underlying data structures.
def make_loader():
    '''return a loader for our demo dataset'''
    loader = Loader(dataset_name=u'demodata',
                    unique_keys=['name'],
                    label=u'Demo data set',
                    description="A dataset to show how to write a data loader")
    return loader
At the end you return the instantiated loader to use it later. Internally the loader automatically creates a wdmmg.model.Dataset for you.
>>> loader = make_loader()
>>> loader.dataset.name
u'demodata'
See the Loader API for the available methods and the options to create it.
Now you can read the csv file. To do that, create a function that takes the filename as an argument (so you don't need to write a second function to read a second, similar file; you can use the same function multiple times). This is especially useful if you want to write tests for your data loader.
def read_data(filename):
    '''read a file with data for our demo dataset
    and return a list of dicts, one for every row.'''
    csv_file = open(filename)
    rows = csv.DictReader(csv_file)
    # convert all strings to unicode
    converted = []
    for row in rows:
        for key in row:
            row[key] = unicode(row[key], 'utf-8')
        converted.append(row)
    return converted
Note
If you want to work with massive amounts of data, you will want to use generator functions in many places and yield individual rows for performance reasons.
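For illustration, a generator-based reader might look like the following sketch (written for Python 3, where csv already yields text, and fed from an in-memory stream instead of a file):

```python
import csv
import io

def iter_data(stream):
    """Yield one dict per csv row instead of building a full list."""
    for row in csv.DictReader(stream):
        yield row

demo_csv = io.StringIO("id,amount\ndemo-sp-001,1200000\ndemo-sp-002,800000\n")
first_row = next(iter_data(demo_csv))  # rows are produced lazily, one at a time
```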
Now you have a list with one dict for every row:
>>> filename = '../wdmmg/tests/demoloader.csv'
>>> rows = read_data(filename)
>>> first_row = rows[0]
>>> first_row
{'amount': '1200000',
'date': '2010-01-01',
'id': 'demo-sp-001',
'recipient_id': 'dtlr',
'recipient_name': 'Department for the Regions',
'region': 'North Yorkshire',
'sector': 'Social Protection',
'spender_id': 'dfes',
'spender_name': 'Department for Education'}
Now that you have the dictionaries with the data you can save them to the database. You have to refine the data a bit first: the DictReader returns dicts whose keys are the column headers and whose values are all strings, but the loader expects typed values (for example a float for the amount and a time structure for the date), as well as Entity and Classifier objects for the remaining columns. The following function shows how to create them.
So let's prepare the data and save the Entries, Entities and Classifiers.
def save_entry(loader, row):
    '''prepare a ``row`` and save the *entry*, the *entities*
    and the *classifiers* in the database.
    '''
    # This will hold the data we save as the entry later
    entry_data = {}

    # 1. The name of the *entry* is the id
    entry_data['name'] = row['id']

    # 2. convert the amount to a float and add it to entry_data
    amount = float(row['amount'])
    entry_data['amount'] = amount

    # 3. convert the date into the time structure we need and add it
    #    as 'time'
    date = row['date']
    timestructure = for_datestrings(date)
    entry_data['time'] = timestructure

    # 4. Turn the spender and the recipient into wdmmg.model.Entity
    #    objects and add them as 'from' and 'to'
    spender_id = row['spender_id']
    spender_name = row['spender_name']
    spender_entity = loader.create_entity(spender_id, spender_name)
    entry_data['from'] = spender_entity

    recipient_id = row['recipient_id']
    recipient_name = row['recipient_name']
    recipient_entity = loader.create_entity(recipient_id, recipient_name)
    entry_data['to'] = recipient_entity

    # 5. create wdmmg.model.Classifier objects for region and sector...
    region_name = row['region']
    region_classifier = loader.create_classifier(region_name, u'demotaxonomy')
    sector_name = row['sector']
    sector_classifier = loader.create_classifier(sector_name, u'demotaxonomy')

    # ... and classify the entry. The necessary information will be
    # added to entry_data
    loader.classify_entry(entry_data, region_classifier, name=u'region')
    loader.classify_entry(entry_data, sector_classifier, name=u'sector')

    # 6. save the entry
    loader.create_entry(**entry_data)
Sweet. The system can read your data and you can make a loader and save entries and other data. So you’re nearly there.
Most of the data you save is pretty standard. Amount, sender, recipient, currency and date are generally understandable and used across all datasets. But for the special data in the demo dataset, region and sector, you need to describe how users should interpret them. The system also needs to know how a user should be able to analyze the data.
To describe the kind of data you store in the Entries use Loader.create_dimension().
def create_dimensions(loader):
    '''Describe which extra data we save beside the common data.'''
    loader.create_dimension(
        'sector', label='Sector',
        description=('The sectors the money is spent in, as described in '
                     'catalog FOO published by the department BAR. '
                     'It should be mostly accurate, but it is possible '
                     'that the money was repurposed by the recipient.'))
    loader.create_dimension(
        'region', label='Region',
        description=('The money was appropriated to be used in '
                     'the named region. It will have been given to a '
                     'number of regional bureaus. However, some of these '
                     'regional bureaus have special functions that are '
                     'not related to the region.'))
To wire it all together, write a load function that finds the source file and performs all the steps above. It's important to call Loader.compute_aggregates() at the end; this computes additional aggregated information in the database.
def load():
    '''load the demo data into Open Spending'''
    directory = os.path.dirname(__file__)
    filename = os.path.join(directory, 'demoloader.csv')
    loader = make_loader()
    create_dimensions(loader)
    rows = read_data(filename)
    for row in rows:
        save_entry(loader, row)
    loader.compute_aggregates()
The complete loader can be downloaded here.
Loaders are normally used on the command line with the command:
paster load <loader_name>
To add your loader to the load command, add an entry point for wdmmg.load to the setup.py of the python package you created. In this case the demo loader is located in wdmmg/tests/demoloader.py and the function is load:
setup(...
    entry_points="""
    [wdmmg.load]
    demo = wdmmg.tests.demoloader:load
    """
)
If you write a loader that is used by other people you should also write tests to make sure everything works as expected. There might be changes in newer versions of the source file or in newer versions of OpenSpending that will go unnoticed if there are no tests.
You can find a set of tests in test_demoloader.py
'''These are the tests for the demoloader that is used in
the loader howto in /doc/loader.rst
'''
from unittest import TestCase

from wdmmg.model import init_serverside_js, Repository
from wdmmg.model.mongo import db, drop_db
from wdmmg.lib.loader import Loader


class TestDemoLoader(TestCase):

    def setUp(self):
        # create the collections so we have the
        # default indexes when we start
        Repository().delete_all()
        self.db = db()
        self.db.create_collection('entry')
        self.db.create_collection('entity')
        # upload server side javascript
        init_serverside_js()
        # fixme: we have to ease the tests for new loaders

    def tearDown(self):
        Repository().delete_all()
        drop_db()

    def test_create_loader(self):
        from wdmmg.tests.demoloader import make_loader
        loader = make_loader()
        self.assertTrue(isinstance(loader, Loader))
        self.assertEqual(loader.dataset.name, 'demodata')

    def test_read_data(self):
        from os.path import dirname, join
        from wdmmg.tests.demoloader import read_data
        directory = dirname(__file__)
        filename = join(directory, 'demoloader.csv')
        rows = read_data(filename)
        self.assertEqual(len(rows), 6)
        self.assertEqual(rows[0],
                         {'amount': '1200000',
                          'date': '2010-01-01',
                          'id': 'demo-sp-001',
                          'recipient_id': 'dtlr',
                          'recipient_name': 'Department for the Regions',
                          'region': 'North Yorkshire',
                          'sector': 'Social Protection',
                          'spender_id': 'dfes',
                          'spender_name': 'Department for Education'})

    def test_complete_run(self):
        from wdmmg.model import Entry
        from wdmmg.tests.demoloader import load
        load()
        # test that we can find one of our entries
        self.assertTrue(Entry.find_one({'dataset.name': 'demodata'}))
Often you have entries that name the society (the general public) as the spender or recipient. In that case, do not use create_entity(); use Loader.get_default_society() instead.
The names of Entries must be unique within the dataset, the names of Classifiers unique across all classifiers in the taxonomy, and the names of Entities unique within an OpenSpending installation.
Most of the time this will be no problem: you often have unique ids, like journal numbers, for the Entries. If you do not have them, you will need to construct unique names yourself.
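If your source has no natural id, one possible approach (an assumption on our part, not something the loader prescribes) is to derive a deterministic name from the row contents:

```python
import hashlib

def derive_name(row):
    """Build a stable entry name by hashing the sorted row contents.

    Identical rows always get the same name; any change in a value
    produces a different one. (Hypothetical helper, not loader API.)
    """
    raw = '|'.join('%s=%s' % (key, row[key]) for key in sorted(row))
    return 'entry-' + hashlib.sha1(raw.encode('utf-8')).hexdigest()[:12]

row = {'spender_id': 'dfes', 'amount': '1200000', 'date': '2010-01-01'}
name = derive_name(row)
```

Note that exact duplicate rows would collide under this scheme, so you may still need to append a counter if your source can contain them.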
If you have a hierarchy, say:
foo = loader.create_entity('foo', "Department Foo")
bar = loader.create_entity('bar', "Subdepartment Bar")
bureau = loader.create_entity('extraordinary-affairs', "Bureau for ...")

# now we have to add cross references:
foo['subdepartment'] = bar
foo.save()

bar['department'] = foo
bar['bureau'] = bureau
bar.save()

bureau['subdepartment'] = bar
bureau.save()
Note the explicit save() calls: when you change an Entity, Entry or Classifier after it has been created, you need to save it explicitly.
Alternatively you can use only 'parent' and 'child' as keys (instead of 'department', 'subdepartment' and 'bureau').
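To illustrate with plain dicts (the Entity objects above support item assignment the same way; this sketch leaves the loader API out entirely):

```python
# Generic naming: 'parent'/'child' instead of domain-specific keys.
foo = {'name': 'foo', 'label': "Department Foo"}
bar = {'name': 'bar', 'label': "Subdepartment Bar"}

foo['child'] = bar   # instead of foo['subdepartment']
bar['parent'] = foo  # instead of bar['department']
```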