Hi, I’m Ryan McCue. You may remember me from such projects as the REST API.
I’m here today to talk about something a bit different: the WordPress Importer. The WordPress Importer is key to a tonne of different workflows, and is one of the most used plugins on the repo.
Unfortunately, the Importer is also a bit unloved. After getting immensely frustrated at the Importer, I figured it was probably time we throw some attention at it. I’ve been working on fixing this with a new and improved Importer!
If you’re interested in checking out this new version of the Importer, grab it from GitHub. It’s still some way from release, but the vast majority of functionality is already in place. The plan is to eventually replace the existing Importer with this new version.
The key to these Importer improvements is rewriting the core processing, taking experience with the current Importer and building to fix those specific problems. This means fixing a whole raft of problems:
Way less memory usage: Testing shows memory usage to import a 41MB WXR file is down from 132MB to 19MB (less than half the actual file size!). This means no more splitting files just to get them to import!
Faster parser: By using a streaming XML parser, we process data as we go, which is much more scalable than the current approach. Content can begin being imported as soon as the file is read, rather than waiting for pre-processing.
Resumable parsing: By storing more in the database instead of variables, we can quit and resume imports on-the-go.
Partial imports: Rethinking the deduplication approach allows better partial imports, such as when you’re updating a production site from staging.
Better CLI: Treating the CLI as a first-class citizen means a better experience for those doing imports on a daily basis, and better code quality and reusability.
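To illustrate the streaming idea from the list above, here is a minimal sketch in Python rather than the Importer's own PHP. It is not the plugin's code: the sample document, the `stream_items` helper, and the simplified un-namespaced tags are assumptions made for illustration. It shows how a pull-style parser can hand over one item at a time and release it afterwards, instead of building the whole tree first.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for an export file (real WXR uses namespaced tags).
SAMPLE_WXR = b"""<rss><channel>
<title>Example site</title>
<item><title>Hello world</title><post_id>1</post_id></item>
<item><title>Second post</title><post_id>2</post_id></item>
</channel></rss>"""


def stream_items(source):
    """Yield one dict per <item> as soon as its closing tag is read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        yield {child.tag: child.text for child in elem}
        # Drop the item's children and text once the caller is done with it,
        # so already-imported items are not kept around in full.
        elem.clear()


posts = list(stream_items(io.BytesIO(SAMPLE_WXR)))
```

Because `stream_items` is a generator, the caller can start importing the first post before the rest of the file has been read.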
Curious as to how all of this is done? Read on!
The importer is largely unchanged from last week, with the exception of a few UI changes:
#341: Progress for posts/attachments/menu items is now shown correctly (in %d of %d format)
#342: The debug view (showing the raw data) now uses print_r through a special chars filter
#340: UI now has full-sentence strings to communicate how the process works and when the import is done, and Refresh/Abort buttons are shown above and below the progress.
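As a rough illustration of the first two changes, the sketch below uses Python analogues rather than the plugin's PHP: `pprint.pformat` stands in for print_r and `html.escape` stands in for the special chars filter. The helper names `format_progress` and `debug_dump` are hypothetical.

```python
import html
import pprint


def format_progress(done, total):
    """Render progress in the '%d of %d' style."""
    return "%d of %d" % (done, total)


def debug_dump(data):
    """Pretty-print raw data, then escape it so it is safe to show in HTML."""
    return html.escape(pprint.pformat(data))


line = format_progress(3, 12)
dump = debug_dump({"title": "<b>Hello</b> & welcome"})
```

Escaping after formatting matters here: imported content can contain markup, and the debug view should display it rather than render it.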
I’ve also had the chance to run it against a large number of import files, including ones sent to me by generous volunteers who read some of my previous weekly updates (props @tieptoep). No catastrophes, yet!
Obviously, it’s still a work in progress, but I’m now willing to risk a public beta. The usual disclaimers (please don’t use this on a production site) apply.
Although I’m not aware of any other plugins that build on the WXR importer through its API, I nevertheless generated some PHPDoc API documentation using phpDocumentor 2, which might be handy if you decide to hook into this or reuse its components.
I’d love to hear your feedback on the interface, on the general experience using this importer, and any errors or warnings that you encounter. Thanks!
Hey everybody — I have good news and bad news.
Good news: I’ve finished porting all the individual import steps to the cron model and now have a mostly working frontend UI (largely unchanged from the previous iteration of the importer) that utilizes it.
As of this evening, the cron model is able to parse, process, and finish importing two test XML files from the core unit tests (small-export.xml). The test case, which uses exactly the same assertions as the core unit test, passes all 193 assertions. (Update: an errorless import of the wptest.io sample data has been done.)
Image: WordPress cron import in progress
Image: A completed cron import
Bad news: I wanted to tag a version and release a download today, but I’ve decided not to do so due to the unconfirmed stability of the importer. As some astute observers noted last week, storing the temporary data in the options table can blow out caches. Although I’ve attempted to mitigate this (see  and this reference from a few years back on Core Trac), I still need to test this against some real systems before I release it and break your server.
Those who are very interested can always check out the source in Subversion. I will post a comment under this post if a download is made available before my next weekly update.
Although an overhaul of the XML parser, as suggested in the comments on last week’s post, is probably necessary to avoid memory and caching issues, my first priority was to finish the migration of processes to the cron tasks. As soon as I can post a working importer, I will immediately turn my attention to the XML parsing step.
Following last week’s update about the WP_Importer_Cron approach to writing importers and running import jobs, I’ve been steadily transitioning code from the current non-stateful, single-execution plugin to a stateful, step-wise process (#327).
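The difference between a single-execution script and a stateful, step-wise process can be sketched as follows. This is a Python illustration under stated assumptions, not the WP_Importer_Cron code: the step names, the `run_next_step` helper, and the plain dict standing in for a persistent options store are all hypothetical.

```python
import json

STEPS = ["users", "categories", "posts"]  # hypothetical step names


def run_next_step(store, key="import_state"):
    """Run exactly one step, then save progress so a later call can resume."""
    state = json.loads(store.get(key, '{"index": 0, "log": []}'))
    if state["index"] >= len(STEPS):
        return False  # nothing left to do
    step = STEPS[state["index"]]
    state["log"].append("imported " + step)  # stand-in for the real work
    state["index"] += 1
    store[key] = json.dumps(state)  # persist state between runs
    return True


options = {}  # stand-in for a persistent options table
while run_next_step(options):
    pass
final_state = json.loads(options["import_state"])
```

Each call reads its position from the store instead of from local variables, so the loop could just as well be a series of separate scheduled runs.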
At the same time, I needed to separate presentation from logic/backend processing (#331) — something that @otto42 also recommended — in two ways:
Removing the echo statements that were used by the WXR importer (example) and changing them to WP_Error objects (example of fatal error; of non-fatal warning)
Handling uploads and UI choices in a separate class
Why must this be done now? Well, asynchronous tasks differ from PHP scripts directly responding to a browser request — we can’t depend on having access to submitted $_POST data, nor can we directly pipe output to the user. This change would also make it easier to understand what the code is doing from reading it, and to test programmatically.
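A minimal sketch of that separation, in Python rather than PHP: the processing code records problems as objects and leaves it to the caller to decide how, or whether, to show them. `ImportIssue` is a hypothetical stand-in loosely analogous to a WP_Error, and `import_post` and its validation rules are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImportIssue:
    """Hypothetical error record, loosely analogous to a WP_Error."""
    code: str
    message: str
    fatal: bool = False


def import_post(post, issues):
    """Import one post, recording problems in `issues` instead of printing."""
    if "author" not in post:
        # Fatal: the post cannot be imported at all.
        issues.append(ImportIssue("missing_author", "Post has no author.", fatal=True))
        return None
    title = post.get("title")
    if not title:
        # Non-fatal: warn and carry on with a placeholder.
        issues.append(ImportIssue("missing_title", "Post has no title; using a placeholder."))
        title = "(untitled)"
    return {"author": post["author"], "title": title}


issues = []
ok = import_post({"author": "ryan", "title": ""}, issues)
failed = import_post({"title": "Orphan"}, issues)
```

Since nothing is written to output, the same function can run inside a background task and be checked directly in a test.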
One dilemma I’ve encountered: how best to store the parsed import XML file. Since each step of the import (users, categories, plugins, etc) runs separately, we must…
store all of the parsed data in variables, which are serialized into an option between runs (obviously, a huge amount of data for which this may not be the most robust or efficient method);
re-parse the XML on each run (currently, parsers handle all parts of the XML at once, which means unnecessarily duplicated effort and time);
modify the parsers to parse only part of the XML at a time; or
split the XML file into chunks based on their contents (authors, categories, etc) and then feed only partial chunks to the parser at a time.
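As one way to picture the third option, here is a small Python sketch of a parser that extracts only one kind of record per run. It is an illustration under assumptions, not a proposal for the actual parser: the `parse_section` helper, the sample document, and its simplified tag names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for an export file; real WXR tags differ.
EXPORT = b"""<export>
<author><login>admin</login></author>
<category><slug>news</slug></category>
<category><slug>updates</slug></category>
<item><title>Hello world</title></item>
</export>"""


def parse_section(xml_bytes, wanted_tag):
    """Return only the records of one kind, ignoring everything else."""
    records = []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == wanted_tag:
            records.append({child.tag: child.text for child in elem})
            elem.clear()  # release the record once it has been copied out
    return records


categories = parse_section(EXPORT, "category")
authors = parse_section(EXPORT, "author")
```

The trade-off is that each run still scans the file from the start, but it only builds records for the step in progress, so nothing large has to be serialized between runs.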
Any thoughts? Solving this problem could also help the plugin deal with large XML files that we used to need to break up by hand before importing. (The Tumblr importer doesn’t have the same problem because there is no massive amount of data being uploaded at the beginning.)
I haven’t yet finished transitioning all the steps; I’m afraid it won’t be possible to use this just yet. Before next Monday, I should have a downloadable plugin that’s safe to try.
If you have a Tumblr blog, can you help test an updated version of the importer? It uses their OAuth API, which requires you to create an application. It’s simple and the plugin walks you through it. Here’s the ZIP file to the beta version. You can report bugs on #22422.
Edit: The plugin has been released, the beta is over: https://wordpress.org/extend/plugins/tumblr-importer/