Post Office Directory Parser (podparser)

This document refers to version 0.4

The podparser is a tool for parsing Scotland’s Post Office directories.

Introduction

The Scottish Post Office directories are annual directories, from the period 1773 to 1911, that include an alphabetical list of a town’s or county’s inhabitants. The directories have been digitised by the National Library of Scotland and made available in XML format. The podparser attempts to parse the XML and determine the forename, surname, occupation and address(es) of each entry. Furthermore, each address location is geocoded using the Google Geocoding API.

Currently only the General Directory section of the directories are parsed.

Dependencies

Installation

$ pip install podparser

or

$ easy_install podparser

POD set up

Scottish Post Office directories can be found at the Internet Archive. Select the ‘All files: HTTPS’ link of a valid directory (example: https://ia601600.us.archive.org/1/items/postofficeann194041edin/). It is this URL that the podfetcher uses to set up the pod. Note: the parser currently only parses the General Directory section of town directories.

To create the pod (with a slow internet connection this can take a long time):

$ cd </path/to/pod>
$ podfetch -d <url>

If successful, this will fetch a metadata file and a djvu file containing all pages in the pod. A new djvu XML file is then generated for each page in the pod in a new directory.

Usage

Input

The parser expects the input files in the format and file structure of Scottish Post Office directories djvu XML files. The parent directory should contain a metadata XML file ending in _meta.xml containing the following values:

<metadata>
  <volume></volume>
  <publisher></publisher>
</metadata>

The POD pages are expected in a child directory whose name ends in _djvu_xml. Each file contains a single POD page whose page number is contained in the filename. See https://github.com/gmh04/podparser/tree/master/test/example/example_djvu_xml for an example. The following is the XML format of the file

<OBJECT>
  <PARAM name="PAGE" value="postofficeannal188182gla_0116.xml"/>
  <LINE>Auld, John, grocer and victualler, 25 Duke street ;</LINE>
  <LINE>house, 4 Burrell's lane.</LINE>
  <LINE>Auld, John, painter and paperhanger, S9 Bath street.</LINE>
  <LINE>Auld, John (of David Auld &amp; Sons), house, 13</LINE>
  ...
</OBJECT>

The parser can be used as a command-line application or envoked as a library call within a python script.

Command Line

The command-line application parses the Post Offices directories from XML and optionally commits the entries to a database. For example, the following parses a single directory page (note paths to files are full paths):

$ podparser -p </path/to/pod.xml>

The next example parses a range of directory pages:

$ podparser -d </path/to/pods> -s 110 -e 115

Below is an example that will commit the parse result to a database:

$ podparser -p </path/to/pod.xml> -D mydb -W mydbpass -c

For a full list of parser command-line options see help options:

$ podparser --help

Python Library

The following example demonstrates envoking the parser and retrieving the results from within a python script.

from podparser.parser import Parser

p = Parser(config='/path/to/conf', dir_path='/path/to/pod')
dir = p.run_parser()
for page in dir.pages:
    for entry in page.entries:
        # do something with the entry
        print entry

Post Office directories can contain many pages, leading to parse times of many hours. In cases where many pages are being parsed it makes more sense to use a callback to process the results after the parsing of each page. This means if the process is killed before finishing, it can be restarted from the point of failure. The next example demonstrates the use of a callback.

from podparser.parser import Parser

def read_page(directory, page):
    for entry in page.entries:
        # do something with the entry
        print entry

p = parser.Parser(config='/path/to/conf', dir_path='/path/to/pod')
p.run_parser(read_page)

Output

The parser prints out the parse results to the terminal. The following is an example of a single entry:

  | Auld                 | John                 | grocer and victualler | G | 25 Duke street ; house, 4 Burrell's lane.
> | 4 Burrell's Lane, Glasgow, Scotland                          | 55.860516 : -4.238328  | GEOMETRIC_CENTER     | derived    (Burrell's Ln)
> | 25 Duke Street, Glasgow, Scotland                            | 55.860185 : -4.238551  | RANGE_INTERPOLATED   | derived    (Duke St)

The first row is the entry details:

1 Surname  
2 Forename  
3 Occupation  
4 Occupation Category see UK Standard Industrial Classification
5 Address(es)  

Any following row (starting with ‘>’) are locations that the parser has found in the address column:

1 Address  
2 LatLon  
3 Accuracy see location_type in Google Geocoding API results
4 type raw, derived or explicit(A raw type is an address query request as found in the address column. A derived type is constructed used pattern matching, see Streets config. An explicit location is a hard coded latlon defined in streets.xml)

Stats

Statistics of the parse are collected and a summary is displayed after each page. For multiple page parses, this summary is for all pages parsed and not the last:

Total Entries Number of processed entries after fixing line wrapping.
Rejected If an entry has less than 3 columns or the name contains a stop word, the entry is not processed.
No geo tag Google has returned no geotag for the entry.
Bad geo tag Google has returned an accuracy of APPROXIMATE, see Google Geocoding API results.
Exact Tags The percentage of good tags where the search address matches the result address.
Professions Number of entries with a profession entry
No Category Number of entries with a profession but no category.

Problems

The parser will alert the user when there is a problem with an entry:

No geo tag No valid location could be found in the address column.
Poor Geo tag There is no address in the entry with a geo tag better than APPROXIMATE, see location_type in Google Geocoding API results
No profession category Entry has a profession but no pattern is matched in Professions config.
Inexact tag In parentheses after the type column is the address returned by the google geocoding service. If the address returned does not match the query, it is marked as inexact with three asterixes.
Rejected If an entry has less than 3 columns or contains a stop word, the entry is not processed.

Database

Currently only Postgis is supported. The schema can be found in </path/to/site-packages>/podparser/etc.

Config

A number of XML files exist to help the parser improve the quality of the results.

Global

global.xml contains replace elements to fix Optical Character Recogintion(OCR) errors and misspellings for all entry fields. E.g.:

<?xml version="1.0" encoding="UTF-8"?>
<global>

  <replaces>
    <replace>
      <pattern>Eando'ph</pattern>
      <value>Randolph</value>
    </replace>
    <replace>
      <pattern>Eobert</pattern>
      <value>Robert</value>
    </replace>
    ...
  </replaces>
</global>

Names

In addition to containing replace elements to fix OCR errors and misspellings for name fields, names.xml contains stop words. A stop word is a character string where if found in the forename or surname, the entry will be rejected. Stop words in names can be used for identifying commercial entries:

<?xml version="1.0" encoding="UTF-8"?>
<names>
  <stopWords>
    <word>Association</word>
    <word>Insurance</word>
    ...
  </stopWords>

  ...
  <replace>
    <pattern>Jobn</pattern>
    <value>John</value>
  </replace>
  ...
</names>

Professions

In addition to containing replace elements to fix OCR errors and misspellings for the profession field, professions.xml contains elements for indentifying professional category:

<?xml version="1.0" encoding="UTF-8"?>

<professions>
  <replaces>
    <replace>
      <pattern>bookfeller</pattern>
      <value>bookseller</value>
    </replace>
    ...
  </replaces>
  <categories>
    <category>
      <name>Agriculture, forestry and fishing</name>
      <code>A</code>
      <list>
        <pattern>cowfeeder</pattern>
        <pattern>dairy</pattern>
        <pattern>farmer</pattern>
        <pattern>game dealer</pattern>
      </list>
    </category>
  </categories>
</professions>

Addresses

addresses.xml contains replace elements to fix OCR errors and misspellings for the address field. E.g.:

<?xml version="1.0" encoding="UTF-8"?>
<addresses>
  <replaces>
    <replace>
      <pattern>Caftle</pattern>
      <value>Castle</value>
    </replace>
    <replace>
      <pattern>Calton-hiil</pattern>
      <value>Calton hill</value>
    </replace>
    ...
  </replaces>
</addresses>

Streets

streets.xml helps the parser improve google geoencoding by cleaning the address character string sent to google (derived address) and providing a mechanism for specifying the modern street name. For example the following provides a means of finding alternative spelling for the same street:

<addresses>
  <address>
    <pattern>st james' terrace</pattern>
    <pattern>st. james terrace</pattern>
    <street>St James' Terrace</street>
  </address>
</addresses>

The next example shows how by providing a town element, a modern street name can be defined:

<address>
  <pattern>alexander street</pattern>
  <street>Alexander Street</street>
  <town>
    <name>Glasgow</name>
    <modern_name>Brechin Street</modern_name>
  </town>
</address>

Alternatively, latlon co-ordinates can be given. This is useful is google doesn’t find the address:

<address>
  <pattern>alexander street</pattern>
  <street>Alexander Street</street>
  <town>
    <name>Glasgow</name>
    <latlon>55.864210 -4.281235</latlon>
  </town>
</address>

Furthermore, areas withing particular towns can have the same street name but different modern names or latlon co-ordinates:

<address>
  <pattern>albert road</pattern>
  <street>Albert Road</street>
  <town>
    <name>Glasgow</name>
    <area>
      <name>Crosshill</name>
    </area>
    <area>
      <name>Langside</name>
      <modern_name>Dowanside Road</modern_name>
    </area>
    <area>
      <name>Pollockshields</name>
      <latlon>55.864210 -4.281235</latlon>
    </area>
  </town>
</address>

If both town and area level location details are defined, the area details take precence. A full example of streets can be found at github.

API

Parser

class podparser.parser.Parser(config, dir_path, start=0, end=9999, encoder_key=None, client_id=None, verbose=False, pre_post_office=False, db=None, commit=False)

Post office directory parser.

config - The full path to the parser configuration files.
directory - The full path to either an individual POD file or the POD directory.
start - Start directory page to be parsed, only applies to for directory parse. If no start page given start from 0.
end - End directory page to be parsed, only applies to for directory parse. If no end page given parse until last.
encoder_key - Google premium private key
client_id - Google premium client identifier
verbose - Print detailed output
pre_post_office - parse williamson’s directory?
db - PODconnection instance
commit - commit results to database?
run_parser(callback=None)

Kick off parser.

Returns Directory instance

Directory

class podparser.directory.Directory(path)

Post Office Directory

path - full path to the post directory

country = 'Scotland'

Post Office Directory country, default Scotland.

pages = []

List of parsed Pages.

Page

class podparser.directory.Page(path, number)

Represents a single page in the POD.

path - full path to the post directory number - Directory page number.

entries = []

List of parsed Entries.

number = None

Directory page number

Entry

class podparser.directory.Entry(line=None)

A single POD entry.

line - the raw line being parsed into an entry.

forename = None

Occupant’s forname.

surname = None

Occupant’s surname.

profession = ''

Occupant’s profession.

locations = []

List of Locations successfully geotagged.

Location

class podparser.geo.encoder.Location(address, town, point, accuracy, type=None, found_address=None, found_locality=None)

Stores location information related to an address

address = None

Address used in google search.

point = None

The latlon returned by google for address.

PodConnection

Testing

The parser unit tests can be run with

$ python -m unittest test.tests

Indices and tables