WordPress Importer Redux

Hi, I’m Ryan McCue. You may remember me from such projects as the REST API.

I’m here today to talk about something a bit different: the WordPress Importer. The WordPress Importer is key to a tonne of different workflows, and is one of the most used plugins on the repo.

Unfortunately, the Importer is also a bit unloved. After getting immensely frustrated at the Importer, I figured it was probably time we throw some attention at it. I’ve been working on fixing this with a new and improved Importer!

If you’re interested in checking out this new version of the Importer, grab it from GitHub. It’s still some way from release, but the vast majority of functionality is already in place. The plan is to eventually replace the existing Importer with this new version.

The key to these Importer improvements is rewriting the core processing, taking experience with the current Importer and building to fix those specific problems. This means fixing and improving a whole raft of problems:

  • Way less memory usage: Testing shows memory usage to import a 41MB WXR file is down from 132MB to 19MB (less than half the actual file size!). This means no more splitting files just to get them to import!
  • Faster parser: By using a streaming XML parser, we process data as we go, which is much more scalable than the current approach. Content can begin being imported as soon as the file is read, rather than waiting for pre-processing.
  • Resumable parsing: By storing more in the database instead of variables, we can quit and resume imports on-the-go.
  • Partial imports: Rethinking the deduplication approach allows better partial imports, such as when you’re updating a production site from staging.
  • Better CLI: Treating the CLI as a first-class citizen means a better experience for those doing imports on a daily basis, and better code quality and reusability.

Curious as to how all of this is done? Read on!

Continue reading

#importers

Migration update: try this importer

Hey everyone,

The importer is largely unchanged from last week, with the exception of a few UI changes:

  • #341: Progress for posts/attachments/menu items is now shown correctly (in %d of %d format)
  • #342: The debug view (showing the raw data) now uses print_r through a special chars filter
  • #340: UI now has full-sentence strings to communicate how the process works and when the import is done, and Refresh/Abort buttons are shown above and below the progress.

An import of a WordPress WXR file in progress

A completed WordPress import

I’ve also had the chance to run it against a large number of import files, including ones sent to me by generous volunteers who read some of my previous weekly updates (props @tieptoep). No catastrophes, yet!

Obviously, it’s still a work in progress, but I’m now willing to risk a public beta. The usual disclaimers (please don’t use this on a production site) apply.

Although I’m not aware of any other plugins that build on the WXR importer through its API, I nevertheless generated some PHPDoc API documentation using phpDocumentor 2, which might be handy if you decide to hook into this or reuse its components.

I’d love to hear your feedback on the interface, on the general experience using this importer, and any errors or warnings that you encounter. Thanks!

#importers, #migration-portability, #weekly-update

Migration update: cron importer part 2

Hey everybody — I have good news and bad news.

Good news: I’ve finished porting all the individual import steps to the cron model and now have a mostly working frontend UI (largely unchanged from the previous iteration of the importer) that utilizes it.

As of this evening, the cron model is able to parse, process, and finish importing two test XML files from the core unit tests (valid-wxr-1.1.xml and small-export.xml). The test case, which uses exactly the same assertions as the core unit test, passes all 193 assertions. (Update: an errorless import of the wptest.io sample data has been done.)

WordPress import in progress

WordPress cron import in progress

A completed cron import

A completed cron import

Bad news: I wanted to tag a version and release a download today, but I’ve decided not to do so due to the unconfirmed stability of the importer. As some astute observers noted last week, storing the temporary data in the options table can blow out caches. Although I’ve attempted to mitigate this (see [2180] and this reference from a few years back on Core Trac), I still need to test this against some real systems before I release it and break your server.

Those who are very interested can always check out the source in Subversion. I will post a comment under this post if a download is made available before my next weekly update.

Although an overhaul of the XML parser, as suggested in the comments on last week’s post, is probably necessary to avoid memory and caching issues, my first priority was to finish the migration of processes to the cron tasks. As soon as I can post a working importer, I will immediately turn my attention to the XML parsing step.

#importers, #migration-portability, #weekly-update

Migration update: cron importer

Following last week’s update about the WP_Importer_Cron approach to writing importers and running import jobs, I’ve been steadily transitioning code from the current non-stateful, single-execution plugin to a stateful, step-wise process (#327).

At the same time, I needed to separate presentation from logic/backend processing (#331) — something that @otto42 also recommended — in two ways:

  • Removing direct printf(), echo statements that were used by the WXR importer (example)
    and changing them to WP_Error objects (example of fatal error; of non-fatal warning)
  • Handling uploads and UI choices in a separate class

Why must this be done now? Well, asynchronous tasks differ from PHP scripts directly responding to a browser request — we can’t depend on having access to submitted $_POST data, nor can we directly pipe output to the user. This change would also make it easier to understand what the code is doing from reading it, and to test programmatically.

One dilemma I’ve encountered: how best to store the parsed import XML file. Since each step of the import (users, categories, plugins, etc) runs separately, we must…

  1. store all of the parsed data in variables, which are serialized into an option between runs
    (obviously, a huge amount of data for which this may not be the most robust or efficient method);
  2. re-parse the XML on each run
    (currently, parsers handle all parts of the XML at once, which means unnecessarily duplicated effort and time);
  3. modify the parsers to parse only part of the XML at a time; or
  4. split the XML file into chunks based on their contents (authors, categories, etc) and then feed only partial chunks to the parser at a time.

Any thoughts? Solving this problem could also help the plugin deal with large XML files that we used to need to break up by hand before importing. (The Tumblr importer doesn’t have the same problem because there is no massive amount of data being uploaded at the beginning.)

I haven’t yet finished transitioning all the steps; I’m afraid it won’t be possible to use this just yet. Before next Monday, I should have a downloadable plugin that’s safe to try.

#importers, #migration-portability, #weekly-update

Antsy for 3.6 to start and need a…

Antsy for 3.6 to start and need a project? Who wants to make an official importer for the new Twitter archives? Would think we’ll want to add that into the importers list. Would suggest importer auto-assign “status” or “aside” post format (or make it an option in the plugin to choose format). Who’s in? I volunteer to ux review and test. 🙂
http://thenextweb.com/twitter/2012/12/16/twitter-has-started-rolling-out-the-option-to-download-all-your-tweets/

#importers, #plugin, #twitter

If you have a Tumblr blog can you…

If you have a Tumblr blog, can you help test an updated version of the importer? It uses their OAuth API, which requires you to create an application. It’s simple and the plugin walks you through it. Here’s the ZIP file to the beta version. You can report bugs on #22422.

Edit: The plugin has been released, the beta is over: https://wordpress.org/extend/plugins/tumblr-importer/

#importers