Wikipedia database dumps are huge and downloading them for a lot of languages and keeping them up to date is tedious work and I pity every system administrator who does this on a regular basis. In the spirit of “Shut up! Or I will replace you with a very small shell script” i have good news for everyone!
Just use wp-download ...
You might want to figure out how you should use wp-download, which can be achieved easily by asking it for help:
$ wp-download --help
Usage: wp-download [options] DIR
Options:
-h, --help show this help message and exit
-q, --quiet do not generate output (only report errors)
-v, --verbose generate verbose output
-c FILE, --config=FILE
load configuration from FILE [default:
/home/$USER/.wpdownloadrc]
Logging:
Specify log file handling.
--log-file=FILE write logs to FILE
--log-file-level=LOG_FILE_LEVEL
set log level (DEBUG, INFO, WARNING, ERROR) [default:
INFO]
Download:
Change download behaviour
--force Force download of all files.
--resume Resume partial downloads.
--timeout=TIMEOUT Set timeout for download in seconds [default 30 s]
--retries=RETRIES Set number of download attempts [default 3]
The program is pretty easy to use. You create a directory where you want to place newly downloaded dump files and configure the files and languages you want to download within a configuration file. So let’s configure wp-download:
$ cp /usr/local/share/doc/wp-download/examples/wpdownloadrc.sample ~/.wpdownloadrc
$ $EDITOR ~/.wpdownloadrc
You configure wp-download to your wishes by editing/creating a suitable configuration file which you place at ~/.wpdownloadrc or specify with the -c [FILE] option.
The configuration file has a couple of sections, but only [Files] and [Languages] are relevant. (for you!?) You might have guessed that you configure files to download within [Files] and languages within [Languages].
If, for example, you want to download the redirect, category and pages-article files for Swahili edit the sections like this:
# wpdownloadrc.sample
...
[Files]
redirect = True
category = True
pages-articles = True
all-other-files = False
# or-commented = True
...
[Languages]
...
#sv = True
sw = True
#szl = True
...
To download all files simply call wp-download with the target directory:
$ wp-download -v /path/to/wikipedia/dumps
swwiki-20090821-redirect.sql.gz [***********] 100% Time: 00:00:00 1.42 M/s
swwiki-20090821-category.sql.gz [***********] 100% Time: 00:00:00 2.23 M/s
...
...
The downloaded files will be placed in appropriate directories within /path/to/wikipedia/dumps. The created directory structure looks like this:
../dumps/
de/
20090618/
20090710/
20090728/
20090810/
en/
20090728/
20090810/
...
...
zh/
20090728/
20090810/
The downloader will download the most recent dumps for each language so that it easy to always have the most recent dumps around by executing wp-download multiple times.
If you want more information about what is going on add the -v or --verbose option:
$ wp-download -v /path/to/wikipedia/dumps
Read configuration from: '/home/$USER/.wpdownloadrc'
Set timeout to 30s
Processing language: sw
Creating directory: /path/to/wikipedia/dumps/sw/20090821
Latest dump for (sw) is from Friday 21 August 2009
swwiki-20090821-redirect.sql.gz [***********] 100% Time: 00:00:00 1.39 M/s
swwiki-20090821-category.sql.gz [***********] 100% Time: 00:00:00 2.40 M/s
...
...
As it might happen that an ongoing download is interrupted you can tell wp-download to resume interrupted downloads by calling it with the --resume option:
$ wp-download -v /path/to/wikipedia/dumps
Read configuration from: '/home/$USER/.wpdownloadrc'
Set timeout to 30s
Processing language: sw
Creating directory: /path/to/wikipedia/dumps/sw/20090821
Latest dump for (sw) is from Friday 21 August 2009
Skip: swwiki-20090821-redirect.sql.gz
Skip: swwiki-20090821-category.sql.gz
Resume: swwiki-20090821-pages-articles.xml.bz2
swwiki-20090821-pages-articles.xml.bz2 [****] 100% Time: 00:00:00 3.19 M/s
...
...
Installation <install>
Usage <usage>