Module email_extractor


Web Data Extractor: extracts email addresses by crawling a site
Copyright (C) 2011 KATHURIA Pulkit
Contact: pulkit@jaist.ac.jp

Contributors:
    Open Source Sitemap Generator sitemap_gen by Vladimir Toncar
    http://toncar.cz/opensource/sitemap_gen.html

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

Classes
  MyHTMLParser
Functions
  getPage(url)
  joinUrls(baseUrl, newUrl)
  getRobotParser(startUrl)
  getUrlToProcess(pageMap)
  parsePages(startUrl, maxUrls, blockExtensions)
  grab_email(text)
  urltext(url)
  crawl_site(url, limit)
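
This page lists only the function signatures, not their bodies. As a rough illustration of what two of the helpers named above might do, here is a minimal sketch using the Python 3 standard library (the original 2011 module may target Python 2). The names grab_email_sketch, join_urls_sketch and EMAIL_RE are hypothetical, and the regular expression is an assumption, not the pattern the module actually uses.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: deliberately simple, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def grab_email_sketch(text):
    """Return the unique email-like strings in text, in first-seen order."""
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


def join_urls_sketch(base_url, new_url):
    """Resolve new_url against base_url and drop any #fragment."""
    return urljoin(base_url, new_url).split("#", 1)[0]
```

For example, grab_email_sketch("Contact: pulkit@jaist.ac.jp") returns ['pulkit@jaist.ac.jp'], and join_urls_sketch("http://example.com/a/b.html", "../c.html#top") returns 'http://example.com/c.html'. A crawler in the spirit of crawl_site(url, limit) would presumably apply the first helper to each fetched page and the second to each link it discovers, but that wiring is not shown on this page.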
Variables
  __package__ = None