EarwigBot: Wiki Toolset: Copyright Violation MixIn
This is a mixin that provides two public methods, copyvio_check() and copyvio_compare(). The former checks the page for copyright violations using a search engine API, and the latter compares the page against a given URL. Credentials for the search engine API are stored in the Site's config.
Check the page for copyright violations.
Returns a CopyvioCheckResult object with information on the results of the check.
min_confidence is the minimum amount of confidence we must have in the similarity between a source text and the article in order for us to consider it a suspected violation. This is a number between 0 and 1.
max_queries is self-explanatory; we will never make more than this number of queries in a given check.
max_time can be set to prevent copyvio checks from taking longer than a set amount of time (generally around a minute), which can be useful if checks are called through a web server with timeouts. We will stop checking new URLs as soon as this limit is reached.
Setting no_searches to True will cause only URLs in the wikitext of the page to be checked; no search engine queries will be made. Setting no_links to True will cause the opposite to happen: URLs in the wikitext will be ignored, and only search engine queries will be made. Setting both of these to True is pointless.
Normally, the checker will short-circuit if it finds a URL that meets min_confidence, skipping any remaining URLs and web queries; setting short_circuit to False will prevent this.
Raises CopyvioCheckError or subclasses (UnknownSearchEngineError, SearchQueryError, ...) on errors.
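For orientation, here is a minimal usage sketch. The site name, page title, and thresholds are placeholders, and the result attributes read at the end (violation, confidence, url, queries) are assumptions about the result object described above rather than an exhaustive list:

    # Sketch: running a copyvio check. Names and thresholds are placeholders;
    # search engine credentials come from the site's stored config.
    from earwigbot import bot

    my_bot = bot.Bot(".")                    # path to the bot's working directory
    site = my_bot.wiki.get_site("enwiki")    # placeholder site name
    page = site.get_page("Example article")  # placeholder page title

    result = page.copyvio_check(min_confidence=0.75, max_queries=15, max_time=60)
    if result.violation:
        print("Possible violation: %s (%.0f%% confidence)"
              % (result.url, result.confidence * 100))
    else:
        print("No violation found after %d queries" % result.queries)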
Check the page like copyvio_check() against a specific URL.
This is essentially a reduced version of copyvio_check() - a copyvio comparison is made using Markov chains and the result is returned in a CopyvioCheckResult object - but without using a search engine, since the suspected “violated” URL is supplied from the start.
Its primary use is to generate a result when the URL is retrieved from a cache, like the one used in EarwigBot’s Tool Labs site. After a search is done, the resulting URL is stored in a cache for 72 hours so future checks against that page will not require another set of time-and-money-consuming search engine queries. However, the comparison itself (which includes the article’s and the source’s content) cannot be stored for data retention reasons, so a fresh comparison is made using this function.
Since no searching is done, neither UnknownSearchEngineError nor SearchQueryError will be raised.
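A sketch of a comparison against a cached URL; the URL is a placeholder, and the assumption that min_confidence is accepted here mirrors the description of copyvio_check() above:

    # Sketch: re-checking a page against a URL pulled from a cache; "page" is
    # obtained as in the copyvio_check sketch above.
    cached_url = "http://example.com/mirrored-article"   # placeholder URL
    result = page.copyvio_compare(cached_url, min_confidence=0.75)
    print(result.violation, result.confidence)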
Cause all copyvio checks to be done by one global set of workers.
This is useful when checks are being done through a web interface where large numbers of simultaneous requests could be problematic. The global workers are spawned when the function is called, run continuously, and intelligently handle multiple checks.
This function is not thread-safe and should only be called when no checks are being done. It has no effect if it has already been called.
Return to using page-specific workers for copyvio checks.
This disables changes made by globalize(), including stopping the global worker threads.
This function is not thread-safe and should only be called when no checks are being done.
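A rough sketch of how a long-running web service might use the two calls; the import location of globalize() and localize() is an assumption here, so adjust it to wherever they live in your installation:

    # Sketch: one global pool of copyvio workers for a long-running service.
    from earwigbot.wiki.copyvios import globalize, localize  # assumed import path

    globalize()        # spawn the shared worker threads before any checks start
    try:
        serve_requests()   # hypothetical loop that triggers copyvio checks
    finally:
        localize()     # stop the global workers; checks revert to per-page workers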
EarwigBot: Wiki Toolset: Exclusions Database Manager
Controls the exclusions.db file, which stores URLs excluded from copyright violation checks because they are known mirrors, for example.
Check whether a given URL is in the exclusions database.
Return True if the URL is in the database, or False otherwise.
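Conceptually the lookup is a simple membership test against the database. The sketch below uses a hypothetical table layout and is not the manager's real schema or API:

    # Conceptual sketch of an exclusions lookup. The "exclusions" table and
    # "url_pattern" column are hypothetical, not the real exclusions.db schema.
    import sqlite3

    def is_excluded(db_path, url):
        with sqlite3.connect(db_path) as conn:
            row = conn.execute(
                "SELECT 1 FROM exclusions WHERE ? LIKE url_pattern || '%'",
                (url,)).fetchone()
        return row is not None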
Bases: object
Implements a basic ngram Markov chain of words.
Bases: earwigbot.wiki.copyvios.markov.MarkovChain
Implements the intersection of two chains (i.e., their shared nodes).
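To illustrate the idea rather than the library's internal representation, here is a toy sketch of word n-gram chains and their intersection, from which a similarity ratio can be derived:

    # Toy illustration of n-gram chains and their shared nodes; not the
    # actual MarkovChain/MarkovChainIntersection implementation.
    from collections import Counter

    def ngram_chain(text, n=3):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    article = ngram_chain("the quick brown fox jumps over the lazy dog")
    source = ngram_chain("a quick brown fox jumps over a sleeping dog")

    shared = article & source                     # intersection of the two chains
    confidence = sum(shared.values()) / float(sum(article.values()) or 1)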
Bases: earwigbot.wiki.copyvios.parsers._BaseTextParser
A parser that can strip and chunk wikicode article text.
Convert the clean article text into a list of web-searchable chunks.
No more than max_chunks chunks will be returned. Each chunk will only be a sentence or two long at most (no more than max_query). The idea is to return a sample of the article text rather than the whole, so we'll pick and choose from parts of it, especially if the article is large and max_chunks is low, so we don't end up searching for just the first paragraph.
This is implemented using nltk (http://nltk.org/). A base directory (nltk_dir) is required to store nltk’s punctuation database. This is typically located in the bot’s working directory.
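A simplified sketch of that kind of sentence-level sampling with nltk's punkt tokenizer; the selection strategy here is only an approximation of the behavior described above:

    # Simplified sketch of sentence chunking with nltk; not the parser's
    # actual selection logic.
    import nltk

    def sample_chunks(clean_text, max_chunks=5, nltk_dir="nltk_data"):
        nltk.data.path.append(nltk_dir)                      # punkt data location
        nltk.download("punkt", download_dir=nltk_dir, quiet=True)
        sentences = nltk.tokenize.sent_tokenize(clean_text)
        if len(sentences) <= max_chunks:
            return sentences
        step = len(sentences) // max_chunks                  # spread the sample out
        return [sentences[i * step] for i in range(max_chunks)]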
Return a list of all external links in the article.
The list is restricted to things that we suspect we can parse, i.e., those with http or https schemes.
Clean the page’s raw text by removing templates and formatting.
Return the page’s text with all HTML and wikicode formatting removed, including templates, tables, and references. It retains punctuation (spacing, paragraphs, periods, commas, (semi)colons, parentheses, quotes), original capitalization, and so forth. HTML entities are replaced by their Unicode equivalents.
The actual stripping is handled by mwparserfromhell.
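For reference, the underlying mwparserfromhell calls look roughly like this; it is a simplified sketch of the stripping and link-gathering steps, not the parser's full cleanup logic:

    # Sketch: stripping wikicode and collecting parseable external links.
    import mwparserfromhell

    wikitext = "'''Example''' is a [[term]] used on [http://example.com the web]."
    code = mwparserfromhell.parse(wikitext)

    clean = code.strip_code()        # remove templates, tags, and other markup
    links = [str(link.url) for link in code.filter_external_links()
             if str(link.url).startswith(("http://", "https://"))]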
EarwigBot: Wiki Toolset: Copyvio Source
A class that represents a single possible source of a copyright violation, i.e., a URL.
Attributes:
EarwigBot: Wiki Toolset: Copyvio Check Result
A class holding information about the results of a copyvio check.
Attributes:
Bases: object
Base class for a simple search engine interface.
Bases: earwigbot.wiki.copyvios.search.BaseSearchEngine
A search engine interface with Yahoo! BOSS.
Do a Yahoo! BOSS web search for query.
Returns a list of URLs, no more than five, ranked by relevance (as determined by Yahoo). Raises SearchQueryError on errors.
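As a sketch of what an engine looks like under this interface, a hypothetical subclass is shown below; only the search() contract comes from the description above, and the backend call is a stand-in:

    # Hypothetical engine implementing the search() contract described above.
    from earwigbot.wiki.copyvios.search import BaseSearchEngine

    class MySearchEngine(BaseSearchEngine):
        def search(self, query):
            # Query whatever backend is available, normalize to a ranked URL
            # list, and raise SearchQueryError on failure, per the contract.
            hits = self._backend_lookup(query)   # hypothetical helper
            return [hit["url"] for hit in hits[:5]]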