Methods for parsing bibliographic datasets.
dfr | Methods for parsing JSTOR Data-for-Research datasets. |
mallet | Reader for output from topic modeling with MALLET. |
pubmed | Methods for working with PubMed data are still under development. Please use with care. |
wos | Reader for Web of Science field-tagged bibliographic data. |
Each file reader provides methods to parse bibliographic data from a scholarly database (e.g. Web of Science or PubMed), resulting in a list of Paper instances containing as many as possible of the following keys (missing values are set to None):
Field | Type | Description |
---|---|---|
aulast | list | Authors’ surnames, as a list. |
auinit | list | Authors’ initials, as a list. |
institution | dict | Institutions with which the authors are affiliated. |
atitle | str | Article title. |
jtitle | str | Journal title or abbreviated title. |
volume | str | Journal volume number. |
issue | str | Journal issue number. |
spage | str | Starting page of article in journal. |
epage | str | Ending page of article in journal. |
date | int | Date of publication. |
abstract | str | Abstract text. |
These keys correspond to the metadata entries in the databases of organizations such as the International DOI Foundation and its Registration Agencies, such as CrossRef and DataCite.
In addition, Paper instances will contain keys with information relevant to the networks of interest for Tethne including:
Field | Type | Description |
---|---|---|
citations | list | List of minimal Paper instances for cited references. |
ayjid | str | First author’s name (last, fi), publication year, and journal. |
doi | str | Digital Object Identifier. |
pmid | str | PubMed ID. |
wosid | str | Web of Science UT fieldtag. |
Missing data here also results in the above keys being set to None.
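As a rough illustration of the record layout described in the two tables above, the sketch below uses a plain dict as a stand-in for a Paper instance. The `FIELDS` list and `empty_record()` helper are hypothetical names introduced here, the values are invented, and the real Paper class may differ in how its keys are accessed.

```python
# Illustrative only: a plain dict standing in for a Paper instance.
# The key names come from the two tables above; the values are made up.
FIELDS = [
    "aulast", "auinit", "institution", "atitle", "jtitle", "volume",
    "issue", "spage", "epage", "date", "abstract",
    "citations", "ayjid", "doi", "pmid", "wosid",
]


def empty_record():
    """Return a record with every documented key set to None."""
    return dict.fromkeys(FIELDS)


record = empty_record()
record.update({
    "aulast": ["DOE", "ROE"],
    "auinit": ["J", "R"],
    "atitle": "An example article",
    "jtitle": "Journal of Examples",
    "date": 1999,
})

# Keys the source did not supply are still present, with value None.
missing = sorted(k for k, v in record.items() if v is None)
print(missing)
```

Checking for None, rather than for a missing key, is the pattern this layout suggests when a reader could not fill in a field.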
Combines two lists (P1 and P2) of Paper instances into a single list, and attempts to merge papers with matching fields. Where there are conflicts, values from the Paper in P1 are preferred.
Parameters : | P1 : list
P2 : list
fields : list
|
---|---|
Returns : | combined : list
|
Examples
>>> import tethne.readers as rd
>>> P1 = rd.wos.read("/Path/to/data1.txt")
>>> P2 = rd.dfr.read("/Path/to/DfR")
>>> papers = rd.merge(P1, P2, ['ayjid'])
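The merge behaviour described above (match on the given fields, prefer P1 on conflict) can be sketched with plain dicts. This is a minimal sketch and not Tethne's implementation: `merge_sketch` is a hypothetical name, records are assumed to be dicts, and treating a None match field as "cannot match" is an assumption made here.

```python
def merge_sketch(P1, P2, fields):
    """Combine two lists of dict records, merging records that agree on `fields`.

    Hypothetical stand-in for the behaviour described above, not Tethne's code.
    """
    def key(paper):
        values = tuple(paper.get(f) for f in fields)
        # Assumption: a record missing any match field cannot be matched.
        return None if any(v is None for v in values) else values

    combined = [dict(p) for p in P1]  # copy so the inputs stay untouched
    index = {}
    for p in combined:
        k = key(p)
        if k is not None:
            index.setdefault(k, p)

    for p2 in P2:
        match = index.get(key(p2))
        if match is None:
            # No counterpart in P1: keep the P2 record as-is.
            combined.append(dict(p2))
            continue
        for field, value in p2.items():
            # Fill gaps only; on conflict the P1 value is kept.
            if match.get(field) is None:
                match[field] = value
    return combined


P1 = [{"ayjid": "DOE J 1999 J EXAMPLES", "atitle": "From WoS", "abstract": None}]
P2 = [
    {"ayjid": "DOE J 1999 J EXAMPLES", "atitle": "From DfR", "abstract": "Text."},
    {"ayjid": "ROE R 2001 J EXAMPLES", "atitle": "Only in DfR", "abstract": None},
]
papers = merge_sketch(P1, P2, ["ayjid"])
```

In this example the first record keeps its P1 title but gains the abstract from P2, and the unmatched P2 record is appended unchanged.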
Methods for parsing JSTOR Data-for-Research datasets.
ngrams(datapath[, N, ignore_hash, ...]) | Yields N-grams from a JSTOR DfR dataset. |
read(datapath) | Yields Paper instances from a JSTOR DfR package. |
Convenience function for generating a list of Paper from a directory of JSTOR DfR datasets.
Parameters : | path : string
|
---|---|
Returns : | papers : list
|
Raises : | IOError :
|
Examples
>>> import tethne.readers as rd
>>> papers = rd.dfr.from_dir("/Path/to/datadir")
Yields N-grams from a JSTOR DfR dataset.
Parameters : | datapath : string
N : string
ignore_hash : bool
apply_stoplist : bool
|
---|---|
Returns : | ngrams : dict
|
Examples
>>> import tethne.readers as rd
>>> trigrams = rd.dfr.ngrams("/Path/to/DfR", N='tri')
Yields Paper instances from a JSTOR DfR package.
Each Paper is tagged with an accession id for this read/conversion.
Parameters : | filepath : string
|
---|---|
Returns : | papers : list
|
Examples
>>> import tethne.readers as rd
>>> papers = rd.dfr.read("/Path/to/DfR")
Reader for output from topic modeling with MALLET.
Parse results from LDA modeling with MALLET.
MALLET’s LDA topic modeling algorithm produces a collection of output files. read() takes the topic-document and (sparse) word-topic matrices, as tab-separated value files, along with a metadata file that maps each MALLET document id to a Paper, using the metadata_key.
Parameters : | top_doc : string
word_top : string
topic_keys : string
Z : int
metadata : string (optional)
|
---|---|
Returns : | ldamodel : LDAModel |
Generates Paper objects from Mallet output.
Each Paper is assigned a topic vector.
Parameters : | top_doc : string
word_top : string
topic_keys : string
Z : int
metadata : string (optional)
|
---|---|
Returns : | papers : list
|
Methods for working with PubMed data are still under development. Please use with care.
read(filepath) | Given a file with PubMed XML, return a list of Paper instances. |
Given a file with PubMed XML, return a list of Paper instances.
See the following links regarding possible structures of the XML:
* http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/citations/v2/citationtags.html#2Articlewithmorethan10authors%28listthefirst10andaddetal%29
* http://dtd.nlm.nih.gov/publishing/
Each Paper is tagged with an accession id for this read/conversion.
Usage
>>> import tethne.readers as rd
>>> papers = rd.pubmed.read("/Path/to/PubMedData.xml")
Parameters : | filepath : string
|
---|---|
Returns : | meta_list : list
|
Reader for Web of Science field-tagged bibliographic data.
Tethne parses Web of Science field-tagged data into a list of Paper objects. This is a two-step process: data are first parsed into a list of dictionaries with field-tags as keys, and then each dictionary is converted to a Paper.
The method readers.wos.read() performs both steps, readers.wos.parse() and readers.wos.convert(), in sequence. This is the preferred (simplest) approach in most cases.
>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/savedrecs.txt")
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>
Alternatively, if you have many data files saved in the same directory, you can use readers.wos.from_dir():
>>> papers = rd.wos.from_dir("/Path/to")
Use the two-step approach if you need to access fields not included in Paper, or if you wish to perform some intermediate manipulation on the raw parsed data.
First import the readers.wos module:
>>> import tethne.readers as rd
Then parse the WoS data to a list of field-tagged dictionaries using readers.wos.parse():
>>> wos_list = rd.wos.parse("/Path/to/savedrecs.txt")
>>> wos_list[0].keys()
['EM', '', 'CL', 'AB', 'WC', 'GA', 'DI', 'IS', 'DE', 'VL', 'CY', 'AU', 'JI',
'AF', 'CR', 'DT', 'TC', 'EP', 'CT', 'PG', 'PU', 'PI', 'RP', 'J9', 'PT',
'LA', 'UT', 'PY', 'ID', 'SI', 'PA', 'SO', 'Z9', 'PD', 'TI', 'SC', 'BP',
'C1', 'NR', 'RI', 'ER', 'SN']
Convert those field-tagged dictionaries to Paper objects using readers.wos.convert():
>>> papers = rd.wos.convert(wos_list)
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>
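To make the two steps above concrete, here is a minimal sketch that works on an in-memory string instead of a file. It assumes the usual field-tagged layout (a two-character tag at the start of a line, continuation lines indented with three spaces, `ER` closing each record). `parse_sketch` and `convert_sketch` are hypothetical names and are not Tethne's parse() and convert(); in particular, the real parser may keep keys such as `ER` that this sketch drops.

```python
SAMPLE = """\
FN Thomson Reuters Web of Knowledge
VR 1.0
PT J
AU Doe, J
   Roe, R
TI An example article
SO JOURNAL OF EXAMPLES
PY 1999
ER

PT J
AU Poe, E
TI A second example
PY 2001
ER

EF
"""


def parse_sketch(text):
    """Step 1: turn field-tagged text into a list of {tag: [values]} dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("   "):
            # Continuation line: belongs to the most recent tag.
            if tag is not None:
                current[tag].append(line.strip())
            continue
        tag, value = line[:2], line[2:].strip()
        if tag in ("FN", "VR", "EF"):
            tag = None  # file-level header/footer, not part of a record
            continue
        if tag == "ER":
            records.append(current)  # end of record
            current, tag = {}, None
            continue
        current.setdefault(tag, []).append(value)
    return records


def convert_sketch(records):
    """Step 2: map tagged dicts onto a few of the documented Paper keys."""
    papers = []
    for rec in records:
        authors = [a.split(",") for a in rec.get("AU", [])]
        papers.append({
            "aulast": [a[0].strip() for a in authors],
            "auinit": [a[1].strip() if len(a) > 1 else "" for a in authors],
            "atitle": " ".join(rec["TI"]) if "TI" in rec else None,
            "jtitle": " ".join(rec["SO"]) if "SO" in rec else None,
            "date": int(rec["PY"][0]) if "PY" in rec else None,
        })
    return papers


wos_list = parse_sketch(SAMPLE)
papers = convert_sketch(wos_list)
```

Keeping the intermediate `wos_list` around is what makes the two-step route useful: any tag that the conversion step ignores is still available there.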
convert(wos_data) | Convert parsed field-tagged data to Paper instances. |
from_dir(path) | Convenience function for generating a list of Paper from a directory of Web of Science field-tagged data files. |
parse(filepath) | Parse Web of Science field-tagged data. |
read(datapath) | Yields a list of Paper instances from a Web of Science data file. |
Convert parsed field-tagged data to Paper instances.
Convert a dictionary or list of dictionaries with keys from the Web of Science field tags into a Paper instance or list of Paper instances, the standard for Tethne.
Each Paper is tagged with an accession id for this conversion.
Parameters : | wos_data : list
|
---|---|
Returns : | papers : list
|
Notes
Need to handle author name anomalies (case, blank spaces, etc.) that may make the same author appear to be two different authors in NetworkX; this is important for any graph with authors as nodes.
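One possible way to address the anomalies mentioned in this note is to normalize names before they are used as node identifiers. The sketch below is an assumption about what such a step could look like, not something Tethne is documented to do; `normalize_author` is a hypothetical helper.

```python
import re


def normalize_author(last, initials):
    """Return a (surname, initials) pair with case and spacing made uniform.

    Hypothetical helper: collapses whitespace, strips punctuation from the
    initials, and upper-cases both parts.
    """
    last = re.sub(r"\s+", " ", last).strip().upper()
    initials = re.sub(r"[^A-Za-z]", "", initials).upper()
    return (last, initials)


# Three spellings of the same author collapse to a single node key.
variants = [("Doe", "J."), (" doe ", "j"), ("DOE", "J")]
nodes = {normalize_author(last, init) for last, init in variants}
```

This does not resolve genuinely ambiguous cases, such as two different authors who share a surname and initials.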
Examples
>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
>>> papers = rd.wos.convert(wos_list)
Convenience function for generating a list of Paper from a directory of Web of Science field-tagged data files.
Parameters : | path : string
|
---|---|
Returns : | papers : list
|
Raises : | IOError :
|
Examples
>>> import tethne.readers as rd
>>> papers = rd.wos.from_dir("/Path/to/datadir")
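The directory-level convenience function can be pictured as a loop over a per-file reader. The sketch below is an assumption about that general shape, not Tethne's code: `from_dir_sketch`, its `read_one` callback, and the `.txt` suffix filter are all hypothetical.

```python
import os
import tempfile


def from_dir_sketch(path, read_one, suffix=".txt"):
    """Apply `read_one` to every matching file in `path` and pool the results.

    Hypothetical stand-in: `read_one` takes a file path and returns a list.
    """
    if not os.path.isdir(path):
        raise IOError("No such directory: {0}".format(path))
    papers = []
    for name in sorted(os.listdir(path)):
        if name.endswith(suffix):
            papers.extend(read_one(os.path.join(path, name)))
    return papers


# Demonstration with a throwaway directory and a dummy per-file reader
# that returns the file name instead of parsed records.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt", "notes.csv"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder\n")
    papers = from_dir_sketch(tmp, lambda p: [os.path.basename(p)])
```

Raising IOError for a missing directory mirrors the Raises entry above, although the exact conditions under which Tethne raises it are not stated in this section.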
Parse Web of Science field-tagged data.
Parameters : | filepath : string
|
---|---|
Returns : | wos_list : list
|
Raises : | KeyError : Key value which needs to be converted to an ‘int’ is not present.
AttributeError :
IOError : File at filepath not found, not readable, or empty.
|
Notes
Unknown keys: RI, OI, Z9
Examples
>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
Yields a list of Paper instances from a Web of Science data file.
Parameters : | datapath : string
|
---|---|
Returns : | papers : list
|
Examples
>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/data.txt")