refextract¶
Small library for extracting references used in scholarly communication.
- Free software: GPLv2
- Documentation: http://pythonhosted.org/refextract/
Originally exported from Invenio https://github.com/inveniosoftware/invenio.
Installation¶
pip install refextract
Usage¶
To get structured info from a publication reference:
from refextract import extract_journal_reference
reference = extract_journal_reference("J.Phys.,A39,13445")
print(reference)
{
'extra_ibids': [],
'is_ibid': False,
'misc_txt': u'',
'page': u'13445',
'title': u'J. Phys.',
'type': 'JOURNAL',
'volume': u'A39',
'year': ''
}
To extract references from a publication full-text PDF:
from refextract import extract_references_from_file
reference = extract_references_from_file("some/fulltext/1503.07589v1.pdf")
print(reference)
{
'references': [
{'author': [u'F. Englert and R. Brout'],
'doi': [u'10.1103/PhysRevLett.13.321'],
'journal_page': [u'321'],
'journal_reference': ['Phys.Rev.Lett.,13,1964'],
'journal_title': [u'Phys.Rev.Lett.'],
'journal_volume': [u'13'],
'journal_year': [u'1964'],
'linemarker': [u'1'],
'title': [u'Broken symmetry and the mass of gauge vector mesons'],
'year': [u'1964']}, ...
],
'stats': {
'author': 15,
'date': '2016-01-12 10:52:58',
'doi': 1,
'misc': 0,
'old_stats_str': '0-1-1-15-0-1-0',
'reportnum': 1,
'status': 0,
'title': 1,
'url': 0,
'version': u'0.1.0.dev20150722'
}
}
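The returned dictionary can be walked like any nested Python structure. The sketch below does not call the library: it uses a hand-copied, truncated sample of the output shown above, and `collect_dois` is a hypothetical helper name, not part of refextract.

```python
# Sample shaped like the output shown above (truncated to one reference).
result = {
    "references": [
        {
            "author": ["F. Englert and R. Brout"],
            "doi": ["10.1103/PhysRevLett.13.321"],
            "journal_title": ["Phys.Rev.Lett."],
            "journal_volume": ["13"],
            "journal_year": ["1964"],
            "journal_page": ["321"],
            "linemarker": ["1"],
        },
    ],
    "stats": {"author": 15, "doi": 1, "status": 0},
}


def collect_dois(extraction):
    """Return every DOI found across all references, in order."""
    dois = []
    for ref in extraction.get("references", []):
        # Each field holds a list; a reference may carry no DOI at all.
        dois.extend(ref.get("doi", []))
    return dois


print(collect_dois(result))
```

Using `.get` with a default keeps the helper tolerant of references that lack a given field.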
You can also extract directly from a URL:
from refextract import extract_references_from_url
reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
print(reference)
{
'references': [
{'author': [u'F. Englert and R. Brout'],
'doi': [u'10.1103/PhysRevLett.13.321'],
'journal_page': [u'321'],
'journal_reference': ['Phys.Rev.Lett.,13,1964'],
'journal_title': [u'Phys.Rev.Lett.'],
'journal_volume': [u'13'],
'journal_year': [u'1964'],
'linemarker': [u'1'],
'title': [u'Broken symmetry and the mass of gauge vector mesons'],
'year': [u'1964']}, ...
],
'stats': {
'author': 15,
'date': '2016-01-12 10:52:58',
'doi': 1,
'misc': 0,
'old_stats_str': '0-1-1-15-0-1-0',
'reportnum': 1,
'status': 0,
'title': 1,
'url': 0,
'version': u'0.1.0.dev20150722'
}
}
API¶
Refextract.
refextract.extract_journal_reference(line, override_kbs_files=None)¶
Extract the journal reference from string.
Extracts the journal reference from a string and parses it for specific journal information.
refextract.extract_references_from_file(path, recid=None, reference_format='{title} {volume} ({year}) {page}', linker_callback=None, override_kbs_files=None)¶
Extract references from a local pdf file.
The first parameter is the path to the file. It raises FullTextNotAvailable if the file does not exist.
The standard reference format is: {title} {volume} ({year}) {page}.
E.g. you can change that by passing the reference_format:
>>> extract_references_from_file(path, reference_format="{title},{volume},{page}")
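To illustrate what the two format strings above would produce, here is a minimal sketch that fills the placeholders with plain `str.format`. It assumes the library expands the named fields in the same way, which the documentation does not state explicitly. The field values are taken from the sample output in the Usage section.

```python
# Assumption: placeholders are filled the way str.format fills named fields.
fields = {
    "title": "Phys.Rev.Lett.",
    "volume": "13",
    "year": "1964",
    "page": "321",
}

default_format = "{title} {volume} ({year}) {page}"
custom_format = "{title},{volume},{page}"

default_ref = default_format.format(**fields)
# Fields not named in the format string (here: year) are simply ignored.
custom_ref = custom_format.format(**fields)

print(default_ref)  # Phys.Rev.Lett. 13 (1964) 321
print(custom_ref)   # Phys.Rev.Lett.,13,321
```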
If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.
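A linker callback could look like the sketch below. The callback signature is not documented here, so this assumes it receives one dictionary per reference element, with keys like those returned by extract_journal_reference, and returns an identifier for the linked resource. `KNOWN_RECORDS` and `link_to_record` are hypothetical names introduced for illustration only.

```python
# Hypothetical lookup table mapping (title, volume, page) to a record id.
KNOWN_RECORDS = {
    ("J. Phys.", "A39", "13445"): 1234,
}


def link_to_record(element):
    """Return a record id for a parsed reference element, or None.

    Assumption: the callback is called with one dict per reference
    element, shaped like the extract_journal_reference output.
    """
    key = (element.get("title"), element.get("volume"), element.get("page"))
    return KNOWN_RECORDS.get(key)


# Stand-in element copied from the extract_journal_reference example.
element = {
    "type": "JOURNAL",
    "title": "J. Phys.",
    "volume": "A39",
    "page": "13445",
    "year": "",
}
print(link_to_record(element))
```

Under these assumptions it would be passed as `linker_callback=link_to_record`.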
To override KBs for journal names etc., use override_kbs_files:
>>> extract_references_from_file(path, override_kbs_files={'journals': 'my/path/to.kb'})
Returns a dictionary with extracted references and stats.
refextract.extract_references_from_string(source, is_only_references=True, recid=None, reference_format='{title} {volume} ({year}) {page}', linker_callback=None, override_kbs_files=None)¶
Extract references from a raw string.
The first parameter is the string to extract references from.
If the string does not contain only references, improve accuracy by specifying is_only_references=False.
The standard reference format is: {title} {volume} ({year}) {page}.
E.g. you can change that by passing the reference_format:
>>> extract_references_from_string(source, reference_format="{title},{volume},{page}")
If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.
To override KBs for journal names etc., use override_kbs_files:
>>> extract_references_from_string(source, override_kbs_files={'journals': 'my/path/to.kb'})
refextract.extract_references_from_url(url, headers=None, chunk_size=1024, **kwargs)¶
Extract references from the pdf specified in the url.
The first parameter is the URL of the pdf file.
The standard reference format is: {title} {volume} ({year}) {page}.
E.g. you can change that by passing the reference_format:
>>> extract_references_from_url(url, reference_format="{title},{volume},{page}")
If you want to also link each reference to some other resource (like a record), you can provide a linker_callback function to be executed for every reference element found.
To override KBs for journal names etc., use override_kbs_files:
>>> extract_references_from_url(url, override_kbs_files={'journals': 'my/path/to.kb'})
It raises FullTextNotAvailable if the url gives a 404.
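Callers that process many documents may want to treat a missing full text as a skippable case rather than a crash. The sketch below is one way to do that. `extract_or_none` is a hypothetical helper, and both the extractor function and the exception class are passed in as arguments because this sketch makes no assumption about where FullTextNotAvailable is importable from. The stand-ins at the bottom exist only so the example runs without refextract or network access.

```python
def extract_or_none(extractor, source, not_available_error, **kwargs):
    """Call extractor(source, **kwargs); return None if the full text is unavailable.

    extractor would be e.g. extract_references_from_url, and
    not_available_error the FullTextNotAvailable exception class.
    """
    try:
        return extractor(source, **kwargs)
    except not_available_error:
        return None


# Stand-ins (not part of refextract) so the sketch is runnable on its own.
class FakeNotAvailable(Exception):
    pass


def fake_extractor(url, **kwargs):
    if url.endswith("missing.pdf"):
        raise FakeNotAvailable(url)
    return {"references": [], "stats": {}}


missing = extract_or_none(fake_extractor, "http://example.org/missing.pdf", FakeNotAvailable)
found = extract_or_none(fake_extractor, "http://example.org/ok.pdf", FakeNotAvailable)
print(missing)
print(found)
```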
Changes¶
Version 0.1.0 (2016-01-12)
- Initial export from Invenio Software <https://github.com/inveniosoftware/invenio>
- Restructured into stripped down, standalone version
Authors¶
- Alessio Deiana <alessio.deiana@cern.ch>
- Christopher Hayward <christopher.james.hayward@cern.ch>
- Federico Poli <federico.poli@cern.ch>
- Gerrit Rindermann <Gerrit.Rindermann@cern.ch>
- Graham R. Armstrong <graham.richard.armstrong@cern.ch>
- Grzegorz Szpura <grzegorz.szpura@cern.ch>
- Jan Aage Lavik <jan.age.lavik@cern.ch>
- Javier Martin Montull <javier.martin.montull@cern.ch>
- Samuele Kaplun <samuele.kaplun@cern.ch>
- Jiri Kuncar <jiri.kuncar@cern.ch>
- Lars Holm Nielsen <lars.holm.nielsen@cern.ch>
- Tibor Simko <tibor.simko@cern.ch>