refextract

Small library for extracting references used in scholarly communication.

Originally exported from Invenio https://github.com/inveniosoftware/invenio.

Installation

pip install refextract

Usage

To get structured info from a publication reference:

from refextract import extract_journal_reference
reference = extract_journal_reference("J.Phys.,A39,13445")
print(reference)
{
    'extra_ibids': [],
    'is_ibid': False,
    'misc_txt': u'',
    'page': u'13445',
    'title': u'J. Phys.',
    'type': 'JOURNAL',
    'volume': u'A39',
    'year': ''
 }

To extract references from a publication full-text PDF:

from refextract import extract_references_from_file
reference = extract_references_from_file("some/fulltext/1503.07589v1.pdf")
print(reference)
{
    'references': [
            {'author': [u'F. Englert and R. Brout'],
             'doi': [u'10.1103/PhysRevLett.13.321'],
             'journal_page': [u'321'],
             'journal_reference': ['Phys.Rev.Lett.,13,1964'],
             'journal_title': [u'Phys.Rev.Lett.'],
             'journal_volume': [u'13'],
             'journal_year': [u'1964'],
             'linemarker': [u'1'],
             'title': [u'Broken symmetry and the mass of gauge vector mesons'],
             'year': [u'1964']}, ...
       ],
    'stats': {
          'author': 15,
          'date': '2016-01-12 10:52:58',
          'doi': 1,
          'misc': 0,
          'old_stats_str': '0-1-1-15-0-1-0',
          'reportnum': 1,
          'status': 0,
          'title': 1,
          'url': 0,
          'version': u'0.1.0.dev20150722'
    }
}
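
Once the result dictionary is available, plain dictionary access is enough to pull out individual fields. The sketch below runs on a hand-written sample shaped like the output above, so it does not need refextract installed. `collect_dois` is a hypothetical helper, not part of refextract, and it assumes every reference stores its fields as lists, as the printed output suggests.

```python
def collect_dois(result):
    """Return all DOIs in an extraction result, in order, without duplicates."""
    seen = set()
    dois = []
    for ref in result.get('references', []):
        # Each field is a list of strings; 'doi' may be missing entirely.
        for doi in ref.get('doi', []):
            if doi not in seen:
                seen.add(doi)
                dois.append(doi)
    return dois


# Hand-written sample mirroring the structure printed above.
sample = {
    'references': [
        {'author': ['F. Englert and R. Brout'],
         'doi': ['10.1103/PhysRevLett.13.321'],
         'journal_reference': ['Phys.Rev.Lett.,13,1964'],
         'linemarker': ['1']},
        {'linemarker': ['2']},  # a reference without a DOI
    ],
    'stats': {'doi': 1, 'status': 0},
}

print(collect_dois(sample))
```

The same loop applies unchanged to the dictionary returned by extract_references_from_url, since both calls are shown returning the same structure.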

You can also extract directly from a URL:

from refextract import extract_references_from_url
reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
print(reference)
{
    'references': [
            {'author': [u'F. Englert and R. Brout'],
             'doi': [u'10.1103/PhysRevLett.13.321'],
             'journal_page': [u'321'],
             'journal_reference': ['Phys.Rev.Lett.,13,1964'],
             'journal_title': [u'Phys.Rev.Lett.'],
             'journal_volume': [u'13'],
             'journal_year': [u'1964'],
             'linemarker': [u'1'],
             'title': [u'Broken symmetry and the mass of gauge vector mesons'],
             'year': [u'1964']}, ...
       ],
    'stats': {
          'author': 15,
          'date': '2016-01-12 10:52:58',
          'doi': 1,
          'misc': 0,
          'old_stats_str': '0-1-1-15-0-1-0',
          'reportnum': 1,
          'status': 0,
          'title': 1,
          'url': 0,
          'version': u'0.1.0.dev20150722'
    }
}

API

Refextract.

refextract.extract_journal_reference(line, override_kbs_files=None)

Extract the journal reference from a string.

Extracts the journal reference from the given string and parses it for specific journal information such as title, volume, year and page.

refextract.extract_references_from_file(path, recid=None, reference_format='{title} {volume} ({year}) {page}', linker_callback=None, override_kbs_files=None)

Extract references from a local PDF file.

The first parameter is the path to the file. It raises FullTextNotAvailable if the file does not exist.

The standard reference format is: {title} {volume} ({year}) {page}.

You can change it by passing reference_format, for example:

>>> extract_references_from_file(path, reference_format="{title},{volume},{page}")
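
The placeholders in reference_format look like Python str.format fields. Assuming the template is filled that way (the mechanism is an assumption, not something this README states), the effect of a custom format can be previewed without running an extraction. `render_reference` and `DEFAULT_FORMAT` are hypothetical names used only for this sketch.

```python
# The standard format quoted above.
DEFAULT_FORMAT = "{title} {volume} ({year}) {page}"


def render_reference(fields, reference_format=DEFAULT_FORMAT):
    """Fill a reference_format template from a dict of journal fields."""
    # Keys that the template does not mention are simply ignored.
    return reference_format.format(**fields)


fields = {'title': 'Phys.Rev.Lett.', 'volume': '13', 'year': '1964', 'page': '321'}

print(render_reference(fields))                             # Phys.Rev.Lett. 13 (1964) 321
print(render_reference(fields, "{title},{volume},{page}"))  # Phys.Rev.Lett.,13,321
```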

If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.
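
The callback signature is not spelled out here. The sketch below assumes the callback receives one reference element as a mutable dictionary (shaped like the extract_journal_reference output in the Usage section) and annotates it in place. That signature, the `recid` key written into the element, and the `KNOWN_RECORDS` lookup table are all assumptions made for illustration.

```python
# Hypothetical lookup table: (title, volume, page) -> local record id.
KNOWN_RECORDS = {('J. Phys.', 'A39', '13445'): 42}


def link_to_record(element):
    """Assumed callback shape: take one element dict and annotate it in place."""
    if element.get('type') != 'JOURNAL':
        return
    key = (element.get('title'), element.get('volume'), element.get('page'))
    if key in KNOWN_RECORDS:
        element['recid'] = KNOWN_RECORDS[key]


# Stand-in for the extractor calling the callback on every element found.
elements = [
    {'type': 'JOURNAL', 'title': 'J. Phys.', 'volume': 'A39', 'page': '13445', 'year': ''},
    {'type': 'JOURNAL', 'title': 'Phys.Rev.Lett.', 'volume': '13', 'page': '321', 'year': '1964'},
]
for element in elements:
    link_to_record(element)

print([element.get('recid') for element in elements])
```

Under these assumptions the function would be passed as extract_references_from_file(path, linker_callback=link_to_record).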

To override the knowledge bases (KBs) for journal names etc., use override_kbs_files:

>>> extract_references_from_file(path, override_kbs_files={'journals': 'my/path/to.kb'})

Returns a dictionary with extracted references and stats.

refextract.extract_references_from_string(source, is_only_references=True, recid=None, reference_format='{title} {volume} ({year}) {page}', linker_callback=None, override_kbs_files=None)

Extract references from a raw string.

The first parameter is the string to extract references from.

If the string does not contain only references, improve accuracy by specifying is_only_references=False.

The standard reference format is: {title} {volume} ({year}) {page}.

You can change it by passing reference_format, for example:

>>> extract_references_from_string(source, reference_format="{title},{volume},{page}")

If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.

To override KBs for journal names etc., use override_kbs_files:

>>> extract_references_from_string(source, override_kbs_files={'journals': 'my/path/to.kb'})

refextract.extract_references_from_url(url, headers=None, chunk_size=1024, **kwargs)

Extract references from the PDF specified in the URL.

The first parameter is the URL of the PDF file.

The standard reference format is: {title} {volume} ({year}) {page}.

You can change it by passing reference_format, for example:

>>> extract_references_from_url(url, reference_format="{title},{volume},{page}")

If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.

To override KBs for journal names etc., use override_kbs_files:

>>> extract_references_from_url(url, override_kbs_files={'journals': 'my/path/to.kb'})

It raises FullTextNotAvailable if the URL returns a 404.

Changes

Version 0.1.0 (2016-01-12)

Authors