Turns a WordPress export XML file into a Python dictionary.
Note
taking coffee for donations :)
>>> from web2py_utils import wordpress2py
Retrieve a python dict that represents the wordpress database:
>>> data = wordpress2py.word2py(open('/path/to/wordpress.2009-11-30.xml', 'r'))
Insert the data into the web2py DAL using a schema:
>>> ids_inserted = wordpress2py.schema_migrate(db, schema, '/path/to/wordpress.2009-11-30.xml')
Alternatively, use the data dictionary to write a custom migration function.
The dictionary layout is documented in the word2py function. A schema has the following layout:
{
    '<DATA TABLE>': {
        '<DATA COLUMN>': '<DAL TABLE>/<DAL FIELD>',
    },
    '<DATA TABLE>': {
        '<PYTHON EXEC doit>': {
            '<DATA COLUMN>': '<DAL TABLE>/<DAL FIELD>',
        }
    }
}
'<DATA TABLE>': {
    'categories',
    'tags',
    'posts',
    'comments',
    'post_categories',
    'post_tags',
}
'<DATA COLUMN>': {
    'categories' ->
        name
        slug
        parent
    'tags' ->
        name
        slug
    'posts' ->
        id
        title
        slug
        status
        type
        link
        pub_date
        description
        content
        post_date
        post_date_gmt
        categories -> list of strings (categories slug)
        tags -> list of strings (tags slug)
    'comments' ->
        id
        author
        author_email
        author_url
        author_ip
        date
        date_gmt
        content
        approved
}
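To make the '<DAL TABLE>/<DAL FIELD>' notation concrete, here is a minimal sketch of how one exported record could be translated with such a mapping. It is not the module's implementation: `apply_mapping` and the sample record are hypothetical names made up for illustration.

```python
def apply_mapping(record, mapping):
    """Translate one exported record into {dal_table: {dal_field: value}}."""
    rows = {}
    for column, target in mapping.items():
        # Each target has the form '<DAL TABLE>/<DAL FIELD>'.
        table, field = target.split('/', 1)
        rows.setdefault(table, {})[field] = record.get(column)
    return rows


# Hypothetical record, shaped like one entry of the 'categories' data table.
category = {'name': 'News', 'slug': 'news', 'parent': ''}
mapping = {'name': 'category/title', 'parent': 'category/parent'}

rows = apply_mapping(category, mapping)
print(rows)
```

Columns that are not named in the mapping (here `slug`) are simply left out of the result.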
Warning
IMPORTANT
The '<PYTHON EXEC doit>' key must be valid Python code. It must set a variable named doit to either True or False. If doit is True, the records in the corresponding dict are inserted into the database. This can be recursive.
The exec environment has access to these variables:
data <dict> (this is the data from WordPress)
data['title']   # sample for posts
data['parent']  # sample for categories
The keys of data match the options for '<DATA COLUMN>'.
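The doit mechanism can be pictured with the following sketch. It only assumes what is stated above (the code is executed with a `data` dict available and must assign `doit`). The helper name `should_insert` is hypothetical, and the module may evaluate the code differently.

```python
def should_insert(code, data):
    """Run a doit snippet and return the value it assigns to doit."""
    env = {'data': data}      # the only variable the snippet is given
    exec(code, env)           # the snippet is expected to assign env['doit']
    return bool(env.get('doit', False))


condition = 'doit = True if data["type"] == "post" else False'
print(should_insert(condition, {'type': 'post', 'title': 'Hello'}))
print(should_insert(condition, {'type': 'page', 'title': 'About'}))
```

Because the key is passed to exec, it runs as arbitrary Python code, so a schema should only come from a source you trust.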
The module comes with two example schemas
wordpress2py.default_mengu_blog_schema:
{
    'categories': {
        'name': 'category/title',
    },
    'posts': {
        'doit = True if data["type"] == "post" else False': {
            'title': 'post/title',
            'content': 'post/body',
            'post_date': 'post/dateline',
        },
        'doit = True if data["type"] == "page" else False': {
            'title': 'page/title',
            'content': 'page/content',
        },
    },
    'comments': {
        'id_post': 'comment/post_id',
        'author': 'comment/name',
        'author_email': 'comment/email',
        'content': 'comment/comment',
        'date': 'comment/dateline',
    },
    'post_categories': {
        'id_category': 'relations/category',
        'id_post': 'relations/post',
    },
}
Note
Take a good look at the 'doit' keys; this is how to use the exec environment effectively. They give you some control over how your data goes into the database, for example when you have separate tables for different post types.
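As a further illustration, a doit key can test any column of the record, not just its type. The schema below is a hypothetical sketch, not one shipped with the module. It assumes that published posts carry the status value 'publish' and reuses the `post/title` and `post/body` targets from the example above.

```python
# Hypothetical schema: only published posts are routed to the 'post' table.
# The 'publish' status value and the target field names are assumptions.
published_only_schema = {
    'posts': {
        'doit = data["type"] == "post" and data["status"] == "publish"': {
            'title': 'post/title',
            'content': 'post/body',
        },
    },
}

# Quick check of the condition against two made-up records.
condition = next(iter(published_only_schema['posts']))
results = []
for record in ({'type': 'post', 'status': 'publish'},
               {'type': 'post', 'status': 'draft'}):
    env = {'data': record}
    exec(condition, env)
    results.append(env['doit'])
print(results)
```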
wordpress2py.default_schema:
{
    'categories': {
        'name': 'category/title',
        'parent': 'category/parent',
    },
    'tags': {
        'name': 'tag/title',
    },
    'posts': {
        'title': 'post/title',
        'slug': 'post/slug',
        'status': 'post/status',
        'type': 'post/type',
        'post_date': 'post/pub_date',
        'content': 'post/content',
    },
    'comments': {
        'id_post': 'comment/id_post',
        'author': 'comment/author',
        'author_email': 'comment/email',
        'author_url': 'comment/site',
        'date': 'comment/posted_on',
        'approved': 'comment/approved',
        'content': 'comment/content',
    },
    'post_categories': {
        'id_category': 'category_relations/category',
        'id_post': 'category_relations/post',
    },
    'post_tags': {
        'id_tag': 'tag_relations/tag',
        'id_post': 'tag_relations/post',
    },
}
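Before running a migration it can help to list which DAL tables and fields a schema writes to, so that the matching tables can be defined first. The sketch below does this for an excerpt of the default schema. `dal_targets` is a hypothetical helper, and treating every dict value as a nested doit mapping is an assumption based on the layout described earlier.

```python
def dal_targets(schema):
    """Collect {dal_table: set(fields)} from a schema, descending into doit dicts."""
    targets = {}

    def walk(mapping):
        for key, value in mapping.items():
            if isinstance(value, dict):
                # A '<PYTHON EXEC doit>' key: its value is another mapping.
                walk(value)
            else:
                table, field = value.split('/', 1)
                targets.setdefault(table, set()).add(field)

    for mapping in schema.values():
        walk(mapping)
    return targets


# Excerpt of the default schema shown above.
schema_excerpt = {
    'tags': {'name': 'tag/title'},
    'post_tags': {
        'id_tag': 'tag_relations/tag',
        'id_post': 'tag_relations/post',
    },
}
targets = dal_targets(schema_excerpt)
for table in sorted(targets):
    print(table, sorted(targets[table]))
```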
This is an example of a custom migration script that exports to mengu blog.
It is included as a full reference in case you have more complex needs. However, the schema approach works well and is very versatile:
from web2py_utils.wordpress2py import word2py

def custom_migrate_to_mengu_database(db):
    # Parse the export and close the file once parsing is done.
    with open('wordpress_export.xml', 'r') as export_file:
        data = word2py(export_file)

    # Map category slugs to the ids of the inserted category rows.
    category_ids = {}
    for c in data['categories']:
        category_ids[c['slug']] = db.category.insert(title=c['name'])

    for post in data['posts']:
        if post['type'] == 'post':
            post_id = db.post.insert(
                title=post['title'],
                body=post['content'],
                dateline=post['pub_date'],
            )
            # post['categories'] lists category slugs (see <DATA COLUMN>).
            for slug in post['categories']:
                db.relations.insert(
                    post=post_id,
                    category=category_ids[slug],
                )
            for c in post['comments']:
                db.comment.insert(
                    post_id=post_id,
                    name=c['author'],
                    email=c['author_email'],
                    comment=c['content'],
                    dateline=c['date'],
                )
        elif post['type'] == 'page':
            db.page.insert(
                title=post['title'],
                content=post['content'],
            )
Requires elementtree
Returns a Python dictionary representing the WordPress blog. Certain metadata may be missing.
Content is sorted based on the arrangement of the data in the XML file.
Dict structure:
# -> means a list, or array.
db {
    title
    link
    description
    pub_date
    language
    categories ->
        name
        slug
        parent
        description (if available)
    tags ->
        name
        slug
    posts ->
        id
        title
        slug
        status
        type
        link
        pub_date
        description
        content
        post_date
        post_date_gmt
        categories -> flat array
        tags -> flat array
        comments ->
            id
            author
            author_email
            author_url
            author_ip
            date
            date_gmt
            content
            approved
}
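A short sketch of walking this structure follows. The sample dictionary is made up and only imitates the layout; real data comes from word2py. It assumes, as the custom migration script above does, that each post carries its own comments list. The value types (for example ids as strings) are also assumptions.

```python
# Made-up sample in the layout described above; real data comes from word2py().
blog = {
    'title': 'Example blog',
    'categories': [{'name': 'News', 'slug': 'news', 'parent': ''}],
    'tags': [{'name': 'Python', 'slug': 'python'}],
    'posts': [
        {
            'id': '1', 'title': 'Hello', 'type': 'post', 'status': 'publish',
            'categories': ['news'], 'tags': ['python'],
            'comments': [{'id': '10', 'author': 'Ann', 'approved': '1'}],
        },
        {
            'id': '2', 'title': 'About', 'type': 'page', 'status': 'publish',
            'categories': [], 'tags': [], 'comments': [],
        },
    ],
}

# Map category slugs back to their display names.
category_names = {c['slug']: c['name'] for c in blog['categories']}

# One summary row per post: type, title, category names, comment count.
summary = []
for post in blog['posts']:
    names = [category_names.get(slug, slug) for slug in post['categories']]
    summary.append((post['type'], post['title'], names, len(post['comments'])))

for row in summary:
    print(row)
```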