readers Package

Methods for parsing bibliographic datasets.

dfr Methods for parsing JSTOR Data-for-Research datasets.
mallet Reader for output from topic modeling with MALLET.
pubmed Methods for working with PubMed data are still under development. Please use with care.
wos Reader for Web of Science field-tagged bibliographic data.

Each file reader provides methods to parse bibliographic data from a scholarly database (e.g. Web of Science or PubMed), resulting in a list of Paper instances containing as many as possible of the following keys (missing values are set to None):

Field        Type  Description
aulast       list  Authors’ surnames, as a list.
auinit       list  Authors’ initials, as a list.
institution  dict  Institutions with which the authors are affiliated.
atitle       str   Article title.
jtitle       str   Journal title or abbreviated title.
volume       str   Journal volume number.
issue        str   Journal issue number.
spage        str   Starting page of article in journal.
epage        str   Ending page of article in journal.
date         int   Date of publication.
abstract     str   Abstract of the article.

These keys correspond to the metadata entries in the databases of organizations such as the International DOI Foundation and its Registration Agencies, for example CrossRef and DataCite.

In addition, Paper instances will contain keys with information relevant to the networks of interest for Tethne including:

Field      Type  Description
citations  list  List of minimal Paper instances for cited references.
ayjid      str   First author’s name (last, fi), publication year, and journal.
doi        str   Digital Object Identifier.
pmid       str   PubMed ID.
wosid      str   Web of Science UT field tag.

Missing data here also results in the above keys being set to None.
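
As a rough illustration of the key layout described above, the sketch below uses a plain dict as a stand-in for a Paper instance. The names PAPER_FIELDS, empty_paper and make_ayjid are hypothetical, and the exact string layout of ayjid is an assumption; only the key names and the None defaults come from the tables above.

```python
# Illustrative only: a plain dict stands in for a Paper instance.
PAPER_FIELDS = [
    "aulast", "auinit", "institution", "atitle", "jtitle", "volume",
    "issue", "spage", "epage", "date", "abstract",
    "citations", "ayjid", "doi", "pmid", "wosid",
]


def empty_paper():
    """Return a record with every documented key present and set to None."""
    return {field: None for field in PAPER_FIELDS}


def make_ayjid(aulast, auinit, date, jtitle):
    """Build an author-year-journal identifier from the first author.

    The string layout used here is an assumption. Returns None if any
    component is missing.
    """
    if not aulast or not auinit or date is None or not jtitle:
        return None
    return "{0} {1} {2} {3}".format(aulast[0], auinit[0], date, jtitle).upper()


paper = empty_paper()
paper.update(aulast=["Smith", "Jones"], auinit=["J", "K"],
             date=2010, jtitle="Nature")
paper["ayjid"] = make_ayjid(paper["aulast"], paper["auinit"],
                            paper["date"], paper["jtitle"])
```

Keys that were never filled in, such as doi or pmid in this example, stay None, which mirrors the behaviour described for missing data.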

exception tethne.readers.DataError(value)[source]

Bases: exceptions.Exception

tethne.readers.merge(P1, P2, fields=['ayjid'])[source]

Combines two lists (P1 and P2) of Paper instances into a single list, and attempts to merge papers with matching fields. Where there are conflicts, values from the Paper in P1 are preferred.

Parameters :

P1 : list

A list of Paper instances.

P2 : list

A list of Paper instances.

fields : list

Fields used to identify matching Paper instances.

Returns :

combined : list

A list of Paper instances.

Examples

>>> import tethne.readers as rd
>>> P1 = rd.wos.read("/Path/to/data1.txt")
>>> P2 = rd.dfr.read("/Path/to/DfR")
>>> papers = rd.merge(P1, P2, ['ayjid'])
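
The merge behaviour described above can be sketched with plain dicts standing in for Paper instances. This is not Tethne's implementation: merge_sketch is a hypothetical name, and treating a match on any one of the listed fields as sufficient is an assumption. The sketch only shows the stated rule that values from P1 win when both records carry a value.

```python
def merge_sketch(P1, P2, fields=("ayjid",)):
    """Combine two lists of dict records, merging records that share a field.

    Values from P1 win on conflict; keys that are None (or absent) in the
    P1 record are filled from the matching P2 record.
    """
    combined = [dict(p) for p in P1]          # copy, so inputs are not mutated
    for p2 in P2:
        match = None
        for p1 in combined:
            if any(p1.get(f) is not None and p1.get(f) == p2.get(f)
                   for f in fields):
                match = p1
                break
        if match is None:
            combined.append(dict(p2))         # no counterpart: keep as-is
            continue
        for key, value in p2.items():
            if match.get(key) is None:        # only fill gaps, never overwrite
                match[key] = value
    return combined


P1 = [{"ayjid": "SMITH J 2010 NATURE", "atitle": "From WoS", "abstract": None}]
P2 = [
    {"ayjid": "SMITH J 2010 NATURE", "atitle": "From DfR", "abstract": "Some text."},
    {"ayjid": "DOE A 1999 SCIENCE", "atitle": "Only in DfR", "abstract": None},
]
merged = merge_sketch(P1, P2)
```

Here the first record keeps its title from P1 but gains the abstract from P2, and the unmatched P2 record is appended unchanged.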

dfr Module

Methods for parsing JSTOR Data-for-Research datasets.

ngrams(datapath[, N, ignore_hash, ...]) Yields N-grams from a JSTOR DfR dataset.
read(datapath) Returns a list of Paper instances from a JSTOR DfR package.
tethne.readers.dfr.from_dir(path)[source]

Convenience function for generating a list of Paper instances from a directory of JSTOR DfR datasets.

Parameters :

path : string

Path to directory containing DfR dataset directories.

Returns :

papers : list

A list of Paper objects.

Raises :

IOError :

Invalid path.

Examples

>>> import tethne.readers as rd
>>> papers = rd.dfr.from_dir("/Path/to/datadir")
tethne.readers.dfr.ngrams(datapath, N='bi', ignore_hash=True, apply_stoplist=False)[source]

Yields N-grams from a JSTOR DfR dataset.

Parameters :

datapath : string

Path to unzipped JSTOR DfR folder containing N-grams (e.g. ‘bigrams’).

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

apply_stoplist : bool

If True, will exclude all N-grams that contain words in the NLTK stoplist.

Returns :

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> import tethne.readers as rd
>>> trigrams = rd.dfr.ngrams("/Path/to/DfR", N='tri')
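
The effect of ignore_hash and apply_stoplist on the returned structure can be illustrated with a small stand-alone filter. This is only a sketch: filter_ngrams is a hypothetical helper, the five-word STOPLIST is a stand-in for the NLTK stoplist, and the DOI is made up. Only the shape of the mapping (DOI to a list of (N-gram, frequency) tuples) comes from the description above.

```python
# Hypothetical stand-in for the NLTK stoplist.
STOPLIST = {"the", "of", "and", "a", "in"}


def filter_ngrams(ngrams, ignore_hash=True, apply_stoplist=False,
                  stoplist=STOPLIST):
    """Filter a {doi: [(ngram, frequency), ...]} mapping.

    ignore_hash drops N-grams containing '#'; apply_stoplist drops N-grams
    in which any word appears in the stoplist.
    """
    filtered = {}
    for doi, grams in ngrams.items():
        kept = []
        for gram, freq in grams:
            if ignore_hash and "#" in gram:
                continue
            if apply_stoplist and any(word.lower() in stoplist
                                      for word in gram.split()):
                continue
            kept.append((gram, freq))
        filtered[doi] = kept
    return filtered


raw_bigrams = {
    "10.2307/0000001": [("natural selection", 4), ("of the", 9), ("## species", 2)],
}
no_hash = filter_ngrams(raw_bigrams)
no_stopwords = filter_ngrams(raw_bigrams, apply_stoplist=True)
```
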
tethne.readers.dfr.read(datapath)[source]

Returns a list of Paper instances from a JSTOR DfR package.

Each Paper is tagged with an accession id for this read/conversion.

Parameters :

datapath : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

Returns :

papers : list

A list of Paper objects.

Examples

>>> import tethne.readers as rd
>>> papers = rd.dfr.read("/Path/to/DfR")
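
For readers curious how a citations XML file might map onto Paper keys, here is a minimal sketch using the standard library. The element names in SAMPLE_XML are assumptions for illustration, not a documented JSTOR schema, and read_citations and text_or_none are hypothetical helpers rather than part of Tethne.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The element names are assumptions for illustration,
# not a documented JSTOR schema.
SAMPLE_XML = """<citations>
  <article id="10.2307/0000001">
    <doi>10.2307/0000001</doi>
    <title>An Example Article</title>
    <author>Jane Smith</author>
    <author>Karl Jones</author>
    <journaltitle>Example Journal</journaltitle>
    <volume>12</volume>
    <issue>3</issue>
  </article>
</citations>"""


def text_or_none(element, tag):
    """Return the stripped text of a child element, or None if it is absent."""
    child = element.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip()


def read_citations(xml_text):
    """Turn each <article> element into a dict keyed like a Paper."""
    records = []
    for article in ET.fromstring(xml_text).iter("article"):
        authors = [a.text.strip() for a in article.findall("author") if a.text]
        records.append({
            "doi": text_or_none(article, "doi"),
            "atitle": text_or_none(article, "title"),
            "jtitle": text_or_none(article, "journaltitle"),
            "volume": text_or_none(article, "volume"),
            "issue": text_or_none(article, "issue"),
            # Assumes "Given Surname" order; real name handling is harder.
            "aulast": [name.split()[-1] for name in authors] or None,
            "abstract": text_or_none(article, "abstract"),
        })
    return records


dfr_records = read_citations(SAMPLE_XML)
```

The missing abstract element yields None, in line with the convention that absent values are set to None.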

mallet Module

Reader for output from topic modeling with MALLET.

tethne.readers.mallet.load(top_doc, word_top, topic_keys, Z, metadata=None, metadata_key='doi')[source]

Parse results from LDA modeling with MALLET.

MALLET’s LDA topic modeling algorithm produces a collection of output files. This function takes the topic-document and (sparse) word-topic matrices, as tab-separated value files, along with a metadata file that maps each MALLET document id to a Paper, using the metadata_key.

Parameters :

top_doc : string

Path to topic-document datafile generated with --output-doc-topics.

word_top : string

Path to word-topic datafile generated with --word-topic-counts-file.

topic_keys : string

Path to topic-keys datafile generated with --output-topic-keys.

Z : int

Number of topics.

metadata : string (optional)

Path to tab-separated metadata file with IDs and Paper keys.

Returns :

ldamodel : LDAModel

tethne.readers.mallet.read(top_doc, word_top, topic_keys, Z, metadata=None, metadata_key='doi')[source]

Generates Paper objects from Mallet output.

Each Paper is assigned a topic vector.

Parameters :

top_doc : string

Path to topic-document datafile generated with --output-doc-topics.

word_top : string

Path to word-topic datafile generated with --word-topic-counts-file.

topic_keys : string

Path to topic-keys datafile generated with --output-topic-keys.

Z : int

Number of topics.

metadata : string (optional)

Path to tab-separated metadata file with IDs and Paper keys.

Returns :

papers : list

A list of Paper instances.
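
To make the idea of a per-document topic vector concrete, the sketch below parses a few lines of a tab-separated topic-document file into dense vectors of length Z. It assumes a layout of document index, document name, then alternating topic and proportion columns; the layout may differ between MALLET versions, so treat this as an assumption. parse_doc_topics and the sample lines are hypothetical.

```python
def parse_doc_topics(lines, Z):
    """Return {document name: [proportion for topic 0 .. Z-1]}.

    Assumed layout per line (tab-separated):
        doc_index, doc_name, topic, proportion, topic, proportion, ...
    Topics that are not listed for a document get a proportion of 0.0.
    """
    vectors = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and the header
            continue
        parts = line.split("\t")
        name, rest = parts[1], parts[2:]
        vector = [0.0] * Z
        for topic, proportion in zip(rest[0::2], rest[1::2]):
            vector[int(topic)] = float(proportion)
        vectors[name] = vector
    return vectors


sample_lines = [
    "#doc name topic proportion ...",
    "0\tdoc_a\t1\t0.75\t0\t0.25",
    "1\tdoc_b\t0\t0.5\t2\t0.5",
]
topic_vectors = parse_doc_topics(sample_lines, Z=3)
```

A metadata file would then map names such as doc_a to a Paper key like doi, which is the role the metadata_key parameter plays above.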

pubmed Module

Methods for working with PubMed data are still under development. Please use with care.

read(filepath) Given a file with PubMed XML, return a list of Paper instances.
tethne.readers.pubmed.read(filepath)[source]

Given a file with PubMed XML, return a list of Paper instances.

See the following hyperlinks regarding possible structures of XML:

* http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/citations/v2/citationtags.html#2Articlewithmorethan10authors%28listthefirst10andaddetal%29
* http://dtd.nlm.nih.gov/publishing/

Each Paper is tagged with an accession id for this read/conversion.

Usage

>>> import tethne.readers as rd
>>> papers = rd.pubmed.read("/Path/to/PubMedData.xml")
Parameters :

filepath : string

Path to PubMed XML file.

Returns :

meta_list : list

A list of Paper instances.
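
As a hedged illustration of mapping NLM-style article XML onto Paper keys, the sketch below reads a tiny hand-written sample with the standard library. The sample document, the read_articles helper and the choice of elements are assumptions made for this example; the XML that the reader actually expects may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample using NLM/JATS-style element names.
SAMPLE_ARTICLES = """<articles>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmid">00000000</article-id>
        <title-group><article-title>An Example Title</article-title></title-group>
        <contrib-group>
          <contrib contrib-type="author">
            <name><surname>Smith</surname><given-names>Jane Q</given-names></name>
          </contrib>
        </contrib-group>
      </article-meta>
    </front>
  </article>
</articles>"""


def read_articles(xml_text):
    """Turn each <article> element into a dict keyed like a Paper."""
    records = []
    for article in ET.fromstring(xml_text).iter("article"):
        pmid = article.find(".//article-id[@pub-id-type='pmid']")
        title = article.find(".//article-title")
        surnames = [s.text for s in article.iter("surname") if s.text]
        given = [g.text for g in article.iter("given-names") if g.text]
        records.append({
            "pmid": pmid.text if pmid is not None else None,
            "atitle": title.text if title is not None else None,
            "aulast": surnames or None,
            # Initials are the first letter of each given name.
            "auinit": ["".join(part[0] for part in g.split()) for g in given] or None,
            "doi": None,   # not present in the sample, so left as None
        })
    return records


pubmed_records = read_articles(SAMPLE_ARTICLES)
```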

wos Module

Reader for Web of Science field-tagged bibliographic data.

Tethne parses Web of Science field-tagged data into a list of Paper objects. This is a two-step process: data are first parsed into a list of dictionaries with field-tags as keys, and then each dictionary is converted to a Paper. readers.wos.read() performs both steps in sequence.

One-step Parsing

The method readers.wos.read() performs both readers.wos.parse() and readers.wos.convert(). This is the preferred (simplest) approach in most cases.

>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/savedrecs.txt")
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>

Alternatively, if you have many data files saved in the same directory, you can use readers.wos.from_dir():

>>> papers = rd.wos.from_dir("/Path/to")

Two-step Parsing

Use the two-step approach if you need to access fields not included in Paper, or if you wish to perform some intermediate manipulation on the raw parsed data.

First import the readers package:

>>> import tethne.readers as rd

Then parse the WoS data to a list of field-tagged dictionaries using readers.wos.parse():

>>> wos_list = rd.wos.parse("/Path/to/savedrecs.txt")
>>> wos_list[0].keys()
['EM', '', 'CL', 'AB', 'WC', 'GA', 'DI', 'IS', 'DE', 'VL', 'CY', 'AU', 'JI', 
 'AF', 'CR', 'DT', 'TC', 'EP', 'CT', 'PG', 'PU', 'PI', 'RP', 'J9', 'PT', 
 'LA', 'UT', 'PY', 'ID', 'SI', 'PA', 'SO', 'Z9', 'PD', 'TI', 'SC', 'BP', 
 'C1', 'NR', 'RI', 'ER', 'SN']

Convert those field-tagged dictionaries to Paper objects using readers.wos.convert():

>>> papers = rd.wos.convert(wos_list)
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>
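
The kind of intermediate manipulation that the two-step approach allows can be sketched on plain dictionaries. The records below are hypothetical and much smaller than real parse() output; select_records is an illustrative helper, and treating PY as an integer is an assumption. In practice the filtered list would be passed on to convert().

```python
# Hypothetical parsed records; real parse() output carries many more tags.
sample_wos_list = [
    {"TI": "First title", "PY": 2008, "DT": "Article", "AU": ["Smith, J"]},
    {"TI": "Second title", "PY": 2012, "DT": "Review", "AU": ["Jones, K"]},
    {"TI": "Third title", "PY": 2013, "DT": "Article", "AU": ["Doe, A"]},
]


def select_records(records, doc_type="Article", min_year=2010):
    """Keep records of one document type published in or after min_year."""
    return [
        record for record in records
        if record.get("DT") == doc_type and (record.get("PY") or 0) >= min_year
    ]


selected = select_records(sample_wos_list)
```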

Methods

convert(wos_data) Convert parsed field-tagged data to Paper instances.
from_dir(path) Convenience function for generating a list of Paper instances from a directory of Web of Science field-tagged data files.
parse(filepath) Parse Web of Science field-tagged data.
read(datapath) Returns a list of Paper instances from a Web of Science data file.
exception tethne.readers.wos.DataError[source]

Bases: exceptions.Exception

tethne.readers.wos.convert(wos_data)[source]

Convert parsed field-tagged data to Paper instances.

Convert a dictionary or list of dictionaries with keys from the Web of Science field tags into a Paper instance or list of Paper instances, the standard for Tethne.

Each Paper is tagged with an accession id for this conversion.

Parameters :

wos_data : list

A list of dictionaries with keys from the WoS field tags.

Returns :

papers : list

A list of Paper instances.

Notes

Need to handle author name anomalies (case, blank spaces, etc.) that may make the same author appear to be two different authors in NetworkX; this is important for any graph with authors as nodes.
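
One possible way to reduce such anomalies is to normalize case and whitespace before names are used as node identifiers. The sketch below is an illustration of that idea, not what convert() currently does; normalize_author is a hypothetical helper.

```python
def normalize_author(surname, initials):
    """Return an (upper-cased surname, initials) pair with stray characters removed.

    Runs of whitespace in the surname collapse to single spaces, and
    anything in the initials that is not a letter (periods, spaces) is dropped.
    """
    surname = " ".join(surname.split()).upper()
    initials = "".join(ch for ch in initials if ch.isalpha()).upper()
    return surname, initials


variants = [("Smith", "J. "), ("SMITH ", "j"), (" smith", "J")]
normalized = {normalize_author(surname, initials) for surname, initials in variants}
```

All three variants collapse to a single key, so they would map to one author node rather than three.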

Examples

>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
>>> papers = rd.wos.convert(wos_list)
tethne.readers.wos.from_dir(path)[source]

Convenience function for generating a list of Paper instances from a directory of Web of Science field-tagged data files.

Parameters :

path : string

Path to directory of field-tagged data files.

Returns :

papers : list

A list of Paper objects.

Raises :

IOError :

Invalid path.

Examples

>>> import tethne.readers as rd
>>> papers = rd.wos.from_dir("/Path/to/datadir")        
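
A directory-level convenience function of this kind can be sketched as a loop over files that delegates to a single-file reader. The version below is an assumption-laden illustration: from_dir_sketch and read_one are hypothetical names, and filtering on a .txt extension is a guess rather than documented behaviour. It does follow the documented contract of raising IOError for an invalid path.

```python
import os
import tempfile


def from_dir_sketch(path, read_one):
    """Collect records from every .txt file in a directory.

    read_one is any callable that takes a file path and returns a list.
    """
    if not os.path.isdir(path):
        raise IOError("Invalid path: {0}".format(path))
    papers = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path) and name.endswith(".txt"):
            papers.extend(read_one(full_path))
    return papers


def demo():
    """Run the sketch against a temporary directory with placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.txt", "a.txt", "notes.md"):
            with open(os.path.join(tmp, name), "w") as handle:
                handle.write("placeholder")
        # The stand-in reader just records which file it was given.
        return from_dir_sketch(tmp, lambda p: [{"source": os.path.basename(p)}])
```
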
tethne.readers.wos.parse(filepath)[source]

Parse Web of Science field-tagged data.

Parameters :

filepath : string

Filepath to the Web of Science plain text file.

Returns :

wos_list : list

A list of dictionaries, one per paper from the Web of Science, with keys from docs/fieldtags.txt as encountered in the file. Most values are strings; the exceptions are defined by the list_keys and int_keys variables.

Raises :

KeyError : A key whose value needs to be converted to an ‘int’ is not present.

AttributeError :

IOError : File at filepath not found, not readable, or empty.

Notes

Unknown keys: RI, OI, Z9

Examples

>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
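
To show what parsing field-tagged data involves, here is a minimal stand-alone sketch. It assumes the usual plain-text layout in which each line starts with a two-character tag, continuation lines are indented, ER ends a record and EF ends the file. The contents of LIST_KEYS and INT_KEYS are hypothetical stand-ins for the list_keys and int_keys variables mentioned above, and parse_field_tagged is not Tethne's parse().

```python
LIST_KEYS = {"AU", "CR"}   # hypothetical: tags whose values stay as lists
INT_KEYS = {"PY"}          # hypothetical: tags whose values become ints


def finish(raw):
    """Convert {tag: [line values]} into the final record dictionary."""
    record = {}
    for tag, values in raw.items():
        if tag in LIST_KEYS:
            record[tag] = values
        elif tag in INT_KEYS:
            record[tag] = int(values[0])
        else:
            record[tag] = " ".join(values)   # rejoin wrapped lines
    return record


def parse_field_tagged(text):
    """Parse field-tagged text into a list of dictionaries, one per record."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        prefix, value = line[:2], line[3:].strip()
        if prefix == "EF":                   # end of file
            break
        if prefix in ("FN", "VR"):           # file header, not part of a record
            tag = None
            continue
        if prefix == "ER":                   # end of record
            records.append(finish(current))
            current, tag = {}, None
            continue
        if prefix.strip():                   # a new tag starts here
            tag = prefix
        if tag is not None:                  # otherwise a continuation line
            current.setdefault(tag, []).append(value)
    return records


SAMPLE_TEXT = "\n".join([
    "FN Example Source",
    "VR 1.0",
    "PT J",
    "AU Smith, J",
    "   Jones, K",
    "TI An example title that",
    "   wraps onto a second line",
    "PY 2010",
    "ER",
    "",
    "PT J",
    "AU Doe, A",
    "TI Another title",
    "PY 2012",
    "ER",
    "",
    "EF",
])
parsed_records = parse_field_tagged(SAMPLE_TEXT)
```
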
tethne.readers.wos.read(datapath)[source]

Returns a list of Paper instances from a Web of Science data file.

Parameters :

datapath : string

Filepath to the Web of Science field-tagged data file.

Returns :

papers : list

A list of Paper instances.

Examples

>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/data.txt")
