copyvios Package

class earwigbot.wiki.copyvios.CopyvioMixIn(site)[source]

EarwigBot: Wiki Toolset: Copyright Violation MixIn

This is a mixin that provides two public methods, copyvio_check() and copyvio_compare(). The former checks the page for copyright violations using a search engine API, and the latter compares the page against a given URL. Credentials for the search engine API are stored in the Site's config.

copyvio_check(min_confidence=0.75, max_queries=15, max_time=-1, no_searches=False, no_links=False, short_circuit=True)[source]

Check the page for copyright violations.

Returns a CopyvioCheckResult object with information on the results of the check.

min_confidence is the minimum amount of confidence we must have in the similarity between a source text and the article in order for us to consider it a suspected violation. This is a number between 0 and 1.

max_queries is the maximum number of search engine queries to make in a given check; we will never exceed it.

max_time can be set to prevent copyvio checks from taking longer than a set amount of time (generally around a minute), which can be useful if checks are called through a web server with timeouts. We will stop checking new URLs as soon as this limit is reached.

Setting no_searches to True will cause only URLs in the wikitext of the page to be checked; no search engine queries will be made. Setting no_links to True will cause the opposite: URLs in the wikitext will be ignored, and only search engine queries will be made. Setting both of these to True is pointless.

Normally, the checker will short-circuit if it finds a URL that meets min_confidence, skipping any remaining URLs and web queries. Setting short_circuit to False prevents this.

Raises CopyvioCheckError or subclasses (UnknownSearchEngineError, SearchQueryError, ...) on errors.
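
A minimal usage sketch, assuming a configured Bot instance; the working directory and page title are placeholders, not part of this API:

    from earwigbot import bot

    b = bot.Bot(".")  # "." is a placeholder for the bot's working directory
    site = b.wiki.get_site()
    page = site.get_page("Example article")

    result = page.copyvio_check(min_confidence=0.75, max_queries=10)
    if result.violation:
        print("Likely violation of", result.url, "with confidence", result.confidence)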

copyvio_compare(url, min_confidence=0.75, max_time=30)[source]

Check the page like copyvio_check() against a specific URL.

This is essentially a reduced version of copyvio_check() - a copyvio comparison is made using Markov chains and the result is returned in a CopyvioCheckResult object - but without using a search engine, since the suspected “violated” URL is supplied from the start.

Its primary use is to generate a result when the URL is retrieved from a cache, like the one used in EarwigBot’s Tool Labs site. After a search is done, the resulting URL is stored in a cache for 72 hours so future checks against that page will not require another set of time- and money-consuming search engine queries. However, the comparison itself (which includes the article’s and the source’s content) cannot be stored for data retention reasons, so a fresh comparison is made using this function.

Since no searching is done, neither UnknownSearchEngineError nor SearchQueryError will be raised.
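
For example, reusing the page object from the sketch above and comparing it against a single suspected source (the URL is a placeholder):

    result = page.copyvio_compare("http://example.com/article", min_confidence=0.75)
    print(result.violation, result.confidence)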

earwigbot.wiki.copyvios.globalize(num_workers=8)

Cause all copyvio checks to be done by one global set of workers.

This is useful when checks are being done through a web interface where large numbers of simultaneous requests could be problematic. The global workers are spawned when the function is called, run continuously, and intelligently handle multiple checks.

This function is not thread-safe and should only be called when no checks are being done. It has no effect if it has already been called.

earwigbot.wiki.copyvios.localize()

Return to using page-specific workers for copyvio checks.

This disables changes made by globalize(), including stopping the global worker threads.

This function is not thread-safe and should only be called when no checks are being done.
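
A sketch of the intended pattern in a long-running web service; the request-handling code is a placeholder:

    from earwigbot.wiki.copyvios import globalize, localize

    globalize(num_workers=8)  # spawn the shared workers before serving requests
    try:
        ...  # serve requests; each copyvio check reuses the global workers
    finally:
        localize()  # stop the global workers when shutting down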

exclusions Module

class earwigbot.wiki.copyvios.exclusions.ExclusionsDB(sitesdb, dbfile, logger)[source]

EarwigBot: Wiki Toolset: Exclusions Database Manager

Controls the exclusions.db file, which stores URLs excluded from copyright violation checks on account of being known mirrors, for example.

check(sitename, url)[source]

Check whether a given URL is in the exclusions database.

Return True if the URL is in the database, or False otherwise.

get_mirror_hints(sitename, try_mobile=True)[source]

Return a list of strings that indicate the existence of a mirror.

The source parser checks for the presence of these strings inside of certain HTML tag attributes ("href" and "src").

sync(sitename, force=False)[source]

Update the database if it hasn’t been updated recently.

This updates the exclusions database for the site sitename and “all”.

Site-specific lists are considered stale after 48 hours; global lists after 12 hours.
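
In practice the toolset builds and uses this object itself; a hypothetical direct-use sketch, in which the sitesdb argument (here the bot's sites database), the file name, and the sitename are all assumptions:

    import logging

    from earwigbot import bot
    from earwigbot.wiki.copyvios.exclusions import ExclusionsDB

    b = bot.Bot(".")  # placeholder working directory
    db = ExclusionsDB(b.wiki, "exclusions.db", logging.getLogger("exclusions"))
    db.sync("enwiki")  # refresh if stale (48h for site lists, 12h for global)
    if db.check("enwiki", "http://some-mirror.example/page"):
        pass  # known mirror or excluded URL: skip it during a check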

markov Module

class earwigbot.wiki.copyvios.markov.MarkovChain(text)[source]

Bases: object

Implements a basic ngram Markov chain of words.

END = -2
START = -1
degree = 5
class earwigbot.wiki.copyvios.markov.MarkovChainIntersection(mc1, mc2)[source]

Bases: earwigbot.wiki.copyvios.markov.MarkovChain

Implements the intersection of two chains (i.e., their shared nodes).
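
A sketch of how the chains fit together; the confidence arithmetic itself is an internal detail of the checker and may vary by version:

    from earwigbot.wiki.copyvios.markov import MarkovChain, MarkovChainIntersection

    article = MarkovChain("The quick brown fox jumps over the lazy dog.")
    source = MarkovChain("A quick brown fox leaps over a lazy dog.")

    # The intersection holds the nodes the two chains share; the checker
    # derives its confidence score from the size of this intersection
    # relative to the size of the article's chain.
    delta = MarkovChainIntersection(article, source)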

parsers Module

class earwigbot.wiki.copyvios.parsers.ArticleTextParser(text, args=None)[source]

Bases: earwigbot.wiki.copyvios.parsers._BaseTextParser

A parser that can strip and chunk wikicode article text.

TEMPLATE_MERGE_THRESHOLD = 35
TYPE = 'Article'
chunk(nltk_dir, max_chunks, min_query=8, max_query=128)[source]

Convert the clean article text into a list of web-searchable chunks.

No more than max_chunks chunks will be returned. Each chunk will only be a sentence or two long at most (no more than max_query). The idea is to return a sample of the article text rather than the whole, so we’ll pick and choose from parts of it, especially if the article is large and max_chunks is low, so we don’t end up searching for just the first paragraph.

This is implemented using nltk (http://nltk.org/). A base directory (nltk_dir) is required to store nltk’s punctuation database. This is typically located in the bot’s working directory.

get_links()[source]

Return a list of all external links in the article.

The list is restricted to things that we suspect we can parse: i.e., those with schemes of http and https.

strip()[source]

Clean the page’s raw text by removing templates and formatting.

Return the page’s text with all HTML and wikicode formatting removed, including templates, tables, and references. It retains punctuation (spacing, paragraphs, periods, commas, (semi)colons, parentheses, quotes), original capitalization, and so forth. HTML entities are replaced by their Unicode equivalents.

The actual stripping is handled by mwparserfromhell.
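
Putting the parser's methods together; the wikitext here is a toy example, and "nltk_data" is a placeholder for the directory holding nltk's punctuation database:

    from earwigbot.wiki.copyvios.parsers import ArticleTextParser

    wikitext = "'''Example''' is a [[term]].<ref>http://example.com/ref</ref>"
    parser = ArticleTextParser(wikitext)
    clean = parser.strip()                 # plain text: markup and refs removed
    links = parser.get_links()             # http/https external links
    chunks = parser.chunk("nltk_data", 5)  # up to five searchable samples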

earwigbot.wiki.copyvios.parsers.get_parser(content_type)[source]

Return the parser most able to handle a given content type, or None.

result Module

class earwigbot.wiki.copyvios.result.CopyvioSource(workspace, url, headers=None, timeout=5, parser_args=None)[source]

EarwigBot: Wiki Toolset: Copyvio Source

A class that represents a single possible source of a copyright violation, i.e., a URL.

Attributes:

  • url: the URL of the source
  • confidence: the confidence of a violation, between 0 and 1
  • chains: a 2-tuple of the source chain and the delta chain
  • skipped: whether this URL was skipped during the check
  • excluded: whether this URL was in the exclusions list
finish_work()[source]

Mark this source as finished.

join(until)[source]

Block until this violation result is filled out.

skip()[source]

Deactivate this source without filling in the relevant data.

start_work()[source]

Mark this source as being worked on right now.

update(confidence, source_chain, delta_chain)[source]

Fill out the confidence and chain information inside this source.

class earwigbot.wiki.copyvios.result.CopyvioCheckResult(violation, sources, queries, check_time, article_chain, possible_miss)[source]

EarwigBot: Wiki Toolset: Copyvio Check Result

A class holding information about the results of a copyvio check.

Attributes:

  • violation: True if this is a violation, else False
  • sources: a list of CopyvioSources, sorted by confidence
  • best: the best matching CopyvioSource, or None
  • confidence: the best matching source’s confidence, or 0
  • url: the best matching source’s URL, or None
  • queries: the number of queries used to reach a result
  • time: the amount of time the check took to complete
  • article_chain: the MarkovChain of the article text
  • possible_miss: whether some URLs might have been missed
best[source]

The best known source, or None if no sources exist.

confidence[source]

The confidence of the best source, or 0 if no sources exist.

get_log_message(title)[source]

Build a relevant log message for this copyvio check result.

url[source]

The URL of the best source, or None if no sources exist.
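
For example, inspecting a finished check, where result and page come from the copyvio_check() sketch earlier:

    print(result.violation, result.confidence, result.url)
    for source in result.sources:  # sorted by confidence
        print(source.url, source.confidence, source.skipped, source.excluded)
    print(result.get_log_message(page.title))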

search Module

class earwigbot.wiki.copyvios.search.BaseSearchEngine(cred, opener)[source]

Bases: object

Base class for a simple search engine interface.

name = 'Base'
search(query)[source]

Use this engine to search for query.

Not implemented in this base class; overridden in subclasses.
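
A minimal sketch of a custom engine; the class name and search logic are hypothetical, and cred and opener are supplied by the toolset when it instantiates the engine:

    from earwigbot.wiki.copyvios.search import BaseSearchEngine

    class MySearchEngine(BaseSearchEngine):
        name = "MyEngine"

        def search(self, query):
            # Query the backing API and return a ranked list of result URLs.
            # A real engine would issue an HTTP request using the opener
            # passed to the constructor.
            return []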

class earwigbot.wiki.copyvios.search.YahooBOSSSearchEngine(cred, opener)[source]

Bases: earwigbot.wiki.copyvios.search.BaseSearchEngine

A search engine interface with Yahoo! BOSS.

name = 'Yahoo! BOSS'
search(query)[source]

Do a Yahoo! BOSS web search for query.

Returns a list of URLs, no more than five, ranked by relevance (as determined by Yahoo). Raises SearchQueryError on errors.
