`readers` Package

Methods for parsing bibliographic datasets.

`dfr`	Methods for parsing JSTOR Data-for-Research datasets.
`mallet`	Reader for output from topic modeling with MALLET.
`pubmed`	Methods for working with PubMed data are still under development. Please use with care.
`wos`	Reader for Web of Science field-tagged bibliographic data.

Each file reader provides methods to parse bibliographic data from a scholarly database (e.g. Web of Science or PubMed), resulting in a list of Paper instances containing as many as possible of the following keys (missing values are set to None):

Field	Type	Description
aulast	list	Authors’ surnames, as a list.
auinit	list	Authors’ initials, as a list.
institution	dict	Institutions with which the authors are affiliated.
atitle	str	Article title.
jtitle	str	Journal title or abbreviated title.
volume	str	Journal volume number.
issue	str	Journal issue number.
spage	str	Starting page of article in journal.
epage	str	Ending page of article in journal.
date	int	Date of publication.
abstract	str	Article abstract.

These keys correspond to the metadata entries in the databases of organizations such as the International DOI Foundation and its Registration Agencies, including CrossRef and DataCite.

In addition, Paper instances will contain keys with information relevant to the networks of interest for Tethne including:

Field	Type	Description
citations	list	List of minimal `Paper` instances for cited references.
ayjid	str	First author’s name (last, fi), publication year, and journal.
doi	str	Digital Object Identifier.
pmid	str	PubMed ID.
wosid	str	Web of Science UT fieldtag.

Missing data here also results in the above keys being set to None.
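As a rough illustration of the record shape described above, the sketch below builds a plain dict in which every documented key defaults to None. `PAPER_FIELDS` and `make_paper` are hypothetical names used only for this example; they are not part of Tethne, and the real Paper class may behave differently.

```python
# Hypothetical stand-in for a Paper record: a dict whose documented keys
# all default to None. This is not Tethne's actual Paper class.
PAPER_FIELDS = [
    "aulast", "auinit", "institution", "atitle", "jtitle", "volume",
    "issue", "spage", "epage", "date", "abstract",
    "citations", "ayjid", "doi", "pmid", "wosid",
]


def make_paper(**values):
    """Return a dict with every documented key; missing values are None."""
    unknown = set(values) - set(PAPER_FIELDS)
    if unknown:
        raise KeyError("Unknown field(s): %s" % ", ".join(sorted(unknown)))
    paper = dict.fromkeys(PAPER_FIELDS)  # every key starts as None
    paper.update(values)
    return paper


paper = make_paper(aulast=["DOE"], auinit=["J"], atitle="An example title", date=2001)
print(paper["atitle"], paper["doi"])
```

Rejecting unknown keys is a design choice of this sketch only; it keeps typos in field names from silently creating new keys.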

exception tethne.readers.DataError(value)

Bases: exceptions.Exception

tethne.readers.merge(P1, P2, fields=['ayjid'])

Combines two lists (P1 and P2) of Paper instances into a single list, and attempts to merge papers with matching fields. Where there are conflicts, values from the Paper in P1 are preferred.

Parameters :

P1 : list

A list of Paper instances.

P2 : list

A list of Paper instances.

fields : list

Fields used to identify matching Paper instances.

Returns :

combined : list

A list of Paper instances.

Examples

>>> import tethne.readers as rd
>>> P1 = rd.wos.read("/Path/to/data1.txt")
>>> P2 = rd.dfr.read("/Path/to/DfR")
>>> papers = rd.merge(P1, P2, ['ayjid'])
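The merging behavior described above (match on the given fields, prefer values from P1 on conflict) can be sketched with plain dicts. This is a minimal sketch under stated assumptions, not Tethne's implementation: `merge_records` is a hypothetical helper, records are assumed to match only when every field in `fields` is present and equal, and duplicates within P2 itself are not merged.

```python
def merge_records(P1, P2, fields=("ayjid",)):
    """Merge two lists of dict records, preferring values from P1.

    Hypothetical sketch: records match when every field in `fields` is
    non-None and equal in both records.
    """
    def key(record):
        values = tuple(record.get(f) for f in fields)
        return None if None in values else values

    combined = [dict(p) for p in P1]        # copy so the inputs are untouched
    index = {}
    for record in combined:
        k = key(record)
        if k is not None:
            index.setdefault(k, record)     # first P1 record wins for a key

    for p2 in P2:
        match = index.get(key(p2))
        if match is None:
            combined.append(dict(p2))       # no counterpart in P1
            continue
        for field, value in p2.items():
            if match.get(field) is None:    # fill gaps only; P1 wins conflicts
                match[field] = value
    return combined


P1 = [{"ayjid": "DOE J 2001 NATURE", "atitle": "From WoS", "doi": None}]
P2 = [
    {"ayjid": "DOE J 2001 NATURE", "atitle": "From JSTOR", "doi": "10.1000/example"},
    {"ayjid": "ROE R 1999 SCIENCE", "atitle": "Only in P2", "doi": None},
]
papers = merge_records(P1, P2)
print(len(papers), papers[0]["atitle"], papers[0]["doi"])
```

The identifier strings and DOI in the sample data are made up for illustration.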

`dfr` Module

Methods for parsing JSTOR Data-for-Research datasets.

`ngrams`(datapath[, N, ignore_hash, ...])	Yields N-grams from a JSTOR DfR dataset.
`read`(datapath)	Returns a list of `Paper` instances from a JSTOR DfR package.

tethne.readers.dfr.from_dir(path)

Convenience function for generating a list of Paper instances from a directory of JSTOR DfR datasets.

Parameters :

path : string

Path to directory containing DfR dataset directories.

Returns :

papers : list

A list of Paper objects.

Raises :

IOError :

Invalid path.

Examples

>>> import tethne.readers as rd
>>> papers = rd.dfr.from_dir("/Path/to/datadir")

tethne.readers.dfr.ngrams(datapath, N='bi', ignore_hash=True, apply_stoplist=False)

Yields N-grams from a JSTOR DfR dataset.

Parameters :

datapath : string

Path to unzipped JSTOR DfR folder containing N-grams (e.g. ‘bigrams’).

N : string

‘uni’, ‘bi’, ‘tri’, or ‘quad’

ignore_hash : bool

If True, will exclude all N-grams that contain the hash ‘#’ character.

apply_stoplist : bool

If True, will exclude all N-grams that contain words in the NLTK stoplist.

Returns :

ngrams : dict

Keys are paper DOIs, values are lists of (Ngram, frequency) tuples.

Examples

>>> import tethne.readers as rd
>>> trigrams = rd.dfr.ngrams("/Path/to/DfR", N='tri')
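To make the effect of `ignore_hash` and `apply_stoplist` concrete, here is a small sketch of the filtering step applied to a dict shaped like the documented return value. `filter_ngrams` and the tiny `STOPLIST` are hypothetical stand-ins (the documented function uses the NLTK stoplist), and the DOI and counts are invented sample data.

```python
# Tiny illustrative stoplist; the documented function uses the NLTK stoplist.
STOPLIST = {"the", "of", "and", "in", "a"}


def filter_ngrams(ngrams, ignore_hash=True, apply_stoplist=False, stoplist=STOPLIST):
    """Filter a {doi: [(ngram, frequency), ...]} mapping."""
    filtered = {}
    for doi, grams in ngrams.items():
        kept = []
        for gram, freq in grams:
            if ignore_hash and "#" in gram:
                continue                    # drop N-grams containing '#'
            if apply_stoplist and any(w in stoplist for w in gram.lower().split()):
                continue                    # drop N-grams containing a stopword
            kept.append((gram, freq))
        filtered[doi] = kept
    return filtered


raw = {"10.2307/0000001": [("natural selection", 12), ("of the", 40), ("page #", 3)]}
default_filtered = filter_ngrams(raw)
strict_filtered = filter_ngrams(raw, apply_stoplist=True)
print(default_filtered)
print(strict_filtered)
```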

tethne.readers.dfr.read(datapath)

Returns a list of Paper instances from a JSTOR DfR package.

Each Paper is tagged with an accession id for this read/conversion.

Parameters :

datapath : string

Filepath to unzipped JSTOR DfR folder containing a citations.XML file.

Returns :

papers : list

A list of Paper objects.

Examples

>>> import tethne.readers as rd
>>> papers = rd.dfr.read("/Path/to/DfR")

`mallet` Module

Reader for output from topic modeling with MALLET.

tethne.readers.mallet.load(top_doc, word_top, topic_keys, Z, metadata=None, metadata_key='doi')

Parse results from LDA modeling with MALLET.

MALLET’s LDA topic modeling algorithm produces a collection of output files. read() takes the topic-document and (sparse) word-topic matrices, as tab-separated value files, along with a metadata file that maps each MALLET document id to a Paper, using the metadata_key.

Parameters :

top_doc : string

Path to topic-document datafile generated with --output-doc-topics.

word_top : string

Path to word-topic datafile generated with --word-topic-counts-file.

topic_keys : string

Path to topic-keys datafile generated with --output-topic-keys.

Z : int

Number of topics.

metadata : string (optional)

Path to tab-separated metadata file with IDs and Paper keys.

Returns :

ldamodel : LDAModel
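The role of the metadata file (mapping each MALLET document id to a Paper via `metadata_key`) can be sketched as follows. The file layout is an assumption made for this example: two tab-separated columns per row, the document id followed by the identifying value, with no header row. `read_metadata` is a hypothetical helper, not a Tethne function, and the ids and DOIs are invented.

```python
import csv
import io


def read_metadata(handle, metadata_key="doi"):
    """Map each document id to a {metadata_key: value} dict.

    Assumes (hypothetically) two tab-separated columns per row:
    document id, then the identifying value.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue                        # skip blank or malformed rows
        doc_id, value = row[0], row[1]
        mapping[doc_id] = {metadata_key: value}
    return mapping


# In-memory stand-in for a metadata file on disk.
sample = io.StringIO("0\t10.1000/a\n1\t10.1000/b\n")
metadata = read_metadata(sample)
print(metadata)
```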

tethne.readers.mallet.read(top_doc, word_top, topic_keys, Z, metadata=None, metadata_key='doi')

Generates Paper objects from MALLET output.

Each Paper is assigned a topic vector.

Parameters :

top_doc : string

Path to topic-document datafile generated with --output-doc-topics.

word_top : string

Path to word-topic datafile generated with --word-topic-counts-file.

topic_keys : string

Path to topic-keys datafile generated with --output-topic-keys.

Z : int

Number of topics.

metadata : string (optional)

Path to tab-separated metadata file with IDs and Paper keys.

Returns :

papers : list

A list of Paper instances.

`pubmed` Module

Methods for working with PubMed data are still under development. Please use with care.

`read`(filepath)	Given a file with PubMed XML, return a list of `Paper` instances.

tethne.readers.pubmed.read(filepath)

Given a file with PubMed XML, return a list of Paper instances.

See the following hyperlinks regarding possible structures of XML:

* http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/citations/v2/citationtags.html#2Articlewithmorethan10authors%28listthefirst10andaddetal%29
* http://dtd.nlm.nih.gov/publishing/

Each Paper is tagged with an accession id for this read/conversion.

Usage

>>> import tethne.readers as rd
>>> papers = rd.pubmed.read("/Path/to/PubMedData.xml")

Parameters :

filepath : string

Path to PubMed XML file.

Returns :

meta_list : list

A list of Paper instances.
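For orientation, the sketch below pulls a title and identifiers out of a small XML fragment using the standard library. The element names (`article`, `article-id` with a `pub-id-type` attribute, `article-title`) follow the NLM publishing tag set as an assumption; the structure that this reader actually expects may differ, and `parse_articles` is a hypothetical helper with invented sample values.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment in the style of the NLM publishing tag set.
SAMPLE = """<articles>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmid">12345678</article-id>
        <article-id pub-id-type="doi">10.1000/example</article-id>
        <title-group><article-title>An example title</article-title></title-group>
      </article-meta>
    </front>
  </article>
</articles>"""


def parse_articles(xml_text):
    """Return one dict per <article>, with missing values left as None."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("article"):
        record = {"atitle": None, "pmid": None, "doi": None}
        title = article.find(".//article-title")
        if title is not None:
            # itertext() also collects text nested in inline markup.
            record["atitle"] = "".join(title.itertext()).strip()
        for id_node in article.iter("article-id"):
            id_type = id_node.get("pub-id-type")
            if id_type in ("pmid", "doi"):
                record[id_type] = (id_node.text or "").strip()
        records.append(record)
    return records


records = parse_articles(SAMPLE)
print(records)
```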

`wos` Module

Reader for Web of Science field-tagged bibliographic data.

Tethne parses Web of Science field-tagged data into a list of Paper objects. This is a two-step process: data are first parsed into a list of dictionaries with field-tags as keys, and then each dictionary is converted to a Paper. readers.wos.read() performs both steps in sequence.

One-step Parsing

The method readers.wos.read() performs both readers.wos.parse() and readers.wos.convert(). This is the preferred (simplest) approach in most cases.

>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/savedrecs.txt")
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>

Alternatively, if you have many data files saved in the same directory, you can use readers.wos.from_dir():

>>> papers = rd.wos.from_dir("/Path/to")

Two-step Parsing

Use the two-step approach if you need to access fields not included in Paper, or if you wish to perform some intermediate manipulation on the raw parsed data.

First import the readers package:

>>> import tethne.readers as rd

Then parse the WoS data to a list of field-tagged dictionaries using readers.wos.parse():

>>> wos_list = rd.wos.parse("/Path/to/savedrecs.txt")
>>> wos_list[0].keys()
['EM', '', 'CL', 'AB', 'WC', 'GA', 'DI', 'IS', 'DE', 'VL', 'CY', 'AU', 'JI', 
 'AF', 'CR', 'DT', 'TC', 'EP', 'CT', 'PG', 'PU', 'PI', 'RP', 'J9', 'PT', 
 'LA', 'UT', 'PY', 'ID', 'SI', 'PA', 'SO', 'Z9', 'PD', 'TI', 'SC', 'BP', 
 'C1', 'NR', 'RI', 'ER', 'SN']

Convert those field-tagged dictionaries to Paper objects using readers.wos.convert():

>>> papers = rd.wos.convert(wos_list)
>>> papers[0]
<tethne.data.Paper instance at 0x101b575a8>
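One possible use of the intermediate step is to filter the parsed dictionaries before converting them. The sketch below works on hand-written records shaped like the parse output; the records, the `keep` helper, and the assumption that `PY` holds an integer year are all illustrative rather than taken from Tethne.

```python
# Hypothetical parsed records, shaped like the output of the parse step.
wos_list = [
    {"TI": "First title", "PY": 1998, "DT": "Article"},
    {"TI": "Second title", "PY": 2005, "DT": "Review"},
    {"TI": "Third title", "PY": 2010, "DT": "Article"},
]


def keep(record, doc_type="Article", min_year=2000):
    """Keep records of the given document type published in or after min_year."""
    year = record.get("PY") or 0            # treat a missing year as 0
    return record.get("DT") == doc_type and year >= min_year


selected = [record for record in wos_list if keep(record)]
print([record["TI"] for record in selected])
```

The filtered list could then be passed to the conversion step in place of the full list.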

Methods

`convert`(wos_data)	Convert parsed field-tagged data to `Paper` instances.
`from_dir`(path)	Convenience function for generating a list of `Paper` instances from a directory of Web of Science field-tagged data files.
`parse`(filepath)	Parse Web of Science field-tagged data.
`read`(datapath)	Returns a list of `Paper` instances from a Web of Science data file.

exception tethne.readers.wos.DataError

Bases: exceptions.Exception

tethne.readers.wos.convert(wos_data)

Convert parsed field-tagged data to Paper instances.

Convert a dictionary or list of dictionaries with keys from the Web of Science field tags into a Paper instance or list of Paper instances, the standard for Tethne.

Each Paper is tagged with an accession id for this conversion.

Parameters :

wos_data : list

A list of dictionaries with keys from the WoS field tags.

Returns :

papers : list

A list of Paper instances.

Notes

Author name anomalies (case, blank spaces, etc.) still need to be handled: they may make the same author appear to be two different authors in NetworkX, which matters for any graph with authors as nodes.

Examples

>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
>>> papers = rd.wos.convert(wos_list)
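The conversion step, together with the author-name concern raised in the Notes, can be sketched roughly as follows. The tag-to-key mapping in `FIELD_MAP`, the assumption that `AU` entries look like "Surname, Initials", and both helper names are illustrative guesses, not Tethne's actual logic.

```python
def normalize_author(name):
    """Split 'Surname, Initials' into an upper-cased (surname, initials) pair."""
    last, _, rest = name.partition(",")
    surname = " ".join(last.split()).upper()            # collapse stray blanks
    initials = "".join(ch for ch in rest if ch.isalpha()).upper()
    return surname, initials


# Assumed mapping from a handful of WoS field tags to Paper keys.
FIELD_MAP = {
    "TI": "atitle", "SO": "jtitle", "VL": "volume", "IS": "issue",
    "BP": "spage", "EP": "epage", "PY": "date", "DI": "doi", "UT": "wosid",
}


def convert_record(wos_record):
    """Convert one field-tagged dict to a Paper-like dict (missing -> None)."""
    paper = {key: wos_record.get(tag) for tag, key in FIELD_MAP.items()}
    authors = [normalize_author(a) for a in wos_record.get("AU", [])]
    paper["aulast"] = [last for last, _ in authors]
    paper["auinit"] = [init for _, init in authors]
    return paper


record = {"AU": ["Doe, J", " doe ,  j."], "TI": "An example title", "PY": 2001}
paper = convert_record(record)
print(paper["aulast"], paper["auinit"], paper["atitle"], paper["doi"])
```

Normalizing case and whitespace before building author nodes means the two spellings in the sample collapse to the same (surname, initials) pair.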

tethne.readers.wos.from_dir(path)

Convenience function for generating a list of Paper instances from a directory of Web of Science field-tagged data files.

Parameters :

path : string

Path to directory of field-tagged data files.

Returns :

papers : list

A list of Paper objects.

Raises :

IOError :

Invalid path.

Examples

>>> import tethne.readers as rd
>>> papers = rd.wos.from_dir("/Path/to/datadir")        
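A directory-level convenience function of this kind can be approximated as below. This is a sketch under assumptions: `read_from_dir` and `fake_read` are hypothetical, files are selected by a `.txt` suffix, and a missing directory raises IOError to mirror the documented Raises section.

```python
import os
import tempfile


def read_from_dir(path, read_one, suffix=".txt"):
    """Apply `read_one` to each matching file in `path` and pool the results."""
    if not os.path.isdir(path):
        raise IOError("Invalid path: %s" % path)
    papers = []
    for name in sorted(os.listdir(path)):   # sorted for a stable order
        if name.endswith(suffix):
            papers.extend(read_one(os.path.join(path, name)))
    return papers


def fake_read(filepath):
    # Stand-in for a real reader: returns one record per file.
    return [{"source": os.path.basename(filepath)}]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    papers = read_from_dir(tmp, fake_read)

print([p["source"] for p in papers])
```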

tethne.readers.wos.parse(filepath)

Parse Web of Science field-tagged data.

Parameters :

filepath : string

Filepath to the Web of Science plain text file.

Returns :

wos_list : list

A list of dictionaries, each associated with a paper from the Web of Science, with keys from docs/fieldtags.txt as encountered in the file. Most values are strings; the exceptions are defined by the list_keys and int_keys variables.

Raises :

KeyError : A key whose value needs to be converted to an ‘int’ is not present.

AttributeError :

IOError : File at filepath not found, not readable, or empty.

Notes

Unknown keys: RI, OI, Z9

Examples

>>> import tethne.readers as rd
>>> wos_list = rd.wos.parse("/Path/to/data.txt")
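To show what parsing field-tagged data involves, the sketch below reads a small in-memory sample into a list of dictionaries. It assumes the usual field-tagged layout (a two-character tag, a space, then the value; indented continuation lines; `ER` closing each record; `EF` closing the file). `parse_field_tagged` and its default `list_keys` and `int_keys` values are hypothetical choices for this example, not the ones Tethne uses.

```python
SAMPLE = """FN Example Export
VR 1.0
PT J
AU Doe, J
   Roe, R
TI An example
   title
PY 2001
ER

EF"""


def parse_field_tagged(text, list_keys=("AU", "CR"), int_keys=("PY",)):
    """Parse field-tagged text into a list of {tag: value} dictionaries."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, value = line[:2], line[3:].strip()
        if head in ("FN", "VR"):            # file header lines
            tag = None
            continue
        if head == "ER":                    # end of record
            records.append(current)
            current, tag = {}, None
            continue
        if head == "EF":                    # end of file
            break
        if head.strip():                    # a new tag starts here
            tag = head
            current[tag] = [value]
        elif tag is not None:               # continuation of the previous tag
            current[tag].append(value)
    for record in records:
        for key, values in record.items():
            if key in list_keys:
                continue                    # keep multi-valued fields as lists
            joined = " ".join(values)
            record[key] = int(joined) if key in int_keys else joined
    return records


records = parse_field_tagged(SAMPLE)
print(records)
```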

tethne.readers.wos.read(datapath)

Returns a list of Paper instances from a Web of Science data file.

Parameters :

datapath : string

Filepath to the Web of Science field-tagged data file.

Returns :

papers : list

A list of Paper instances.

Examples

>>> import tethne.readers as rd
>>> papers = rd.wos.read("/Path/to/data.txt")
