WordPress Importer Redux

Hi, I’m Ryan McCue. You may remember me from such projects as the REST API.

I’m here today to talk about something a bit different: the WordPress Importer. The WordPress Importer is key to a tonne of different workflows, and is one of the most used plugins on the repo.

Unfortunately, the Importer is also a bit unloved. After getting immensely frustrated at the Importer, I figured it was probably time we throw some attention at it. I’ve been working on fixing this with a new and improved Importer!

If you’re interested in checking out this new version of the Importer, grab it from GitHub. It’s still some way from release, but the vast majority of functionality is already in place. The plan is to eventually replace the existing Importer with this new version.

The key to these Importer improvements is rewriting the core processing, using experience with the current Importer to target its specific problems. This means fixing and improving a whole raft of issues:

  • Way less memory usage: Testing shows memory usage to import a 41MB WXR file is down from 132MB to 19MB (less than half the actual file size!). This means no more splitting files just to get them to import!
  • Faster parser: By using a streaming XML parser, we process data as we go, which is much more scalable than the current approach. Content can begin being imported as soon as the file is read, rather than waiting for pre-processing.
  • Resumable parsing: By storing state in the database instead of in PHP variables, we can stop and resume imports on the go.
  • Partial imports: Rethinking the deduplication approach allows better partial imports, such as when you’re updating a production site from staging.
  • Better CLI: Treating the CLI as a first-class citizen means a better experience for those doing imports on a daily basis, and better code quality and reusability.

Curious as to how all of this is done? Read on!

Parsing

One problem with the Importer stands out immediately: resource usage. Importing any large file takes time (which is somewhat understandable) and a lot of memory. My initial benchmarks indicate that memory usage typically has a lower bound of roughly 3x the size of the file being imported. In addition, the importer is split into disparate stages: parse, import, and post-process. Importing in stages is not a problem in itself, but here it requires a lot of memory, as the entire file needs to be loaded into memory, potentially multiple times. This huge memory usage forces workarounds; the most common of these is splitting a WXR file into multiple smaller files.

The problems with the Importer mirror other well-known problems with parsing XML. The easiest way to parse an XML file is to simply load the whole thing into memory, transform it into some data that’s easier to work with, then do your tasks on that data. This is the way most people parse XML, and it works fine for small files, but it’s not a scalable solution.
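For illustration, the naive in-memory approach looks something like this. (This is a sketch, not the Importer’s actual code; the inline XML is a tiny stand-in for a real WXR export, which declares the same `wp` namespace on its `<rss>` root.)

```php
<?php
// A tiny stand-in for a WXR export file.
$wxr = <<<XML
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><title>Hello world!</title><wp:post_id>1</wp:post_id></item>
    <item><title>Sample Page</title><wp:post_id>2</wp:post_id></item>
  </channel>
</rss>
XML;

// Tree parsing: the whole document becomes an in-memory object graph,
// so memory usage scales with (a multiple of) the file size.
$doc = simplexml_load_string($wxr);

$titles = [];
foreach ($doc->channel->item as $item) {
    // Read the namespaced <wp:post_id> element for each item.
    $wp = $item->children('http://wordpress.org/export/1.2/');
    $titles[(int) $wp->post_id] = (string) $item->title;
}
```

On a 41MB export, that `simplexml_load_string` call alone is where the multi-hundred-megabyte memory bills come from.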

Thankfully, this is a solved problem. The key is moving from a tree parser – which takes a whole file and returns some data – to a pull (streaming) parser – which takes a file and gives you a cursor to move through it. Pull parsers are typically a bit harder to work with, but thankfully PHP allows us to use a hybrid approach (XMLReader and DOMDocument share an internal representation).
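A sketch of what that hybrid looks like (hypothetical code, not the Importer’s actual implementation): XMLReader streams through the document with a cursor, and only when the cursor reaches an `<item>` do we expand that one node into a DOM subtree we can query comfortably.

```php
<?php
// The same stand-in WXR document as before.
$wxr = <<<XML
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><title>Hello world!</title><wp:post_id>1</wp:post_id></item>
    <item><title>Sample Page</title><wp:post_id>2</wp:post_id></item>
  </channel>
</rss>
XML;

$reader = new XMLReader();
$reader->xml($wxr); // with a real export this would be XMLReader::open($file)

$titles = [];
while ($reader->read()) {
    // Skip everything until the cursor sits on an <item> start element.
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'item') {
        continue;
    }
    // XMLReader and DOMDocument share a libxml representation, so
    // expanding just this one node into a DOM subtree is cheap.
    $doc  = new DOMDocument();
    $node = $doc->appendChild($doc->importNode($reader->expand(), true));
    $item = simplexml_import_dom($node);

    $wp = $item->children('http://wordpress.org/export/1.2/');
    $titles[(int) $wp->post_id] = (string) $item->title;

    // $doc goes out of scope here: only one item is ever held in memory.
}
$reader->close();
```

Each iteration would hand its `$item` straight to the import step, then drop it, which is where the constant-memory behaviour comes from.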

Why use a tree parser to start with? Tree parsers have two key benefits: they’re typically much easier to use, and you get all of the data up front. Having the full picture of the data makes it easier to work with, and typically means you can do cross-referencing quite easily (e.g. posts with terms). This usually lets you eliminate a post-processing stage; however, the Importer needs to do post-processing anyway (thanks to media URLs and IDs changing), so we don’t gain anything there. In addition, since we’re also producing the XML from the WXR export, we can optimise the export format for single-pass parsing, which will help reduce post-processing. The ease of use of the parser is only a concern for the plugin maintainers, and the hybrid approach lets us use the pull parser in a much easier way.

Switching to a hybrid pull parser changes the structure of the importer to two stages: a combined parse/import stage, and post-processing. As we move through the file, we import each item as we get to it, then forget about it and move on. This means we only ever need the current item (post, term, etc.) in memory at any one time, so memory usage is massively reduced. In my testing, this took the import of a 41MB WXR file from 132MB of memory down to 19MB, with a slight speed boost too.

Processing

Once we’ve got our data out of XML and into an internal representation, we need to do some processing. Right now, the processing is suboptimal, as it wasn’t designed or built for large imports and sites.

Checking whether posts have already been imported is very slow. Each post being imported calls post_exists, which runs a direct database query of SELECT ID FROM wp_posts WHERE post_date = %s AND post_title = %s AND post_content = %s. None of these fields are indexed, which makes this query slow on sites with lots of posts (for example, a site you’ve just imported into). It also completely ignores the GUID field, which is designed precisely for deduplication in the first place. A much more performant solution is to load a map of GUID => ID ahead of time, and use the GUID to deduplicate from there. This increases memory usage, but insignificantly, as it’s only one integer and one small string per post. (This has potential problems if posts are being created while importing; we can lock the site in maintenance mode, or the user can turn this “prefill” option off if desired.)
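A sketch of the prefill idea, with hypothetical function names. In the Importer the rows would come from a single `SELECT ID, guid FROM wp_posts` query via `$wpdb`; here a plain array stands in for that result set.

```php
<?php
// Build a GUID => ID map once, up front, instead of querying per post.
// $rows mimics the result of: SELECT ID, guid FROM wp_posts
function wxr_build_guid_map(array $rows): array {
    $map = [];
    foreach ($rows as $row) {
        $map[$row['guid']] = (int) $row['ID'];
    }
    return $map;
}

// O(1) duplicate check per imported post, with no database round-trip.
// Returns the existing post ID, or null if the post hasn't been imported.
function wxr_post_exists(array $guid_map, string $guid): ?int {
    return $guid_map[$guid] ?? null;
}
```

At one integer and one short string per row, even a site with hundreds of thousands of posts fits this map in a few tens of megabytes.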

The importer also currently holds a lot of state in memory. Any posts that need their parent changed, authors remapped, or URLs search-and-replaced are tracked in in-memory maps. This increases memory usage, and also means the import isn’t resumable. Instead, we can store much of this data in post metadata, which allows imports to be resumed, as well as much easier debugging when a post’s data isn’t quite what you expect. This also offloads the memory usage to the database, which is much better suited to managing this data.
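As a sketch of the idea (the meta key and function names are illustrative, not the Importer’s actual API): during import, a post whose parent hasn’t arrived yet records the original parent ID in its own postmeta, e.g. `update_post_meta( $new_id, '_wxr_import_parent', $original_parent_id )`, and post-processing resolves whatever it can. The resolution step is then a pure lookup over data that lives in the database:

```php
<?php
// $pending mimics postmeta rows written during import:
//   new post ID => original (pre-import) parent ID.
// $mapping is the original ID => new ID map built as posts are imported,
// itself recoverable from the GUID map, so a crashed import can resume.
function wxr_remap_parents(array $pending, array $mapping): array {
    $updates = [];
    foreach ($pending as $new_id => $original_parent) {
        if (isset($mapping[$original_parent])) {
            // In the importer: wp_update_post() and delete the meta row.
            $updates[$new_id] = $mapping[$original_parent];
        }
        // Unresolved parents simply stay in postmeta for a later run,
        // which is also handy when debugging a half-finished import.
    }
    return $updates;
}
```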

Media

Right now, media importing has a few problems. It’s currently not idempotent (repeatable with the same result), so you can’t re-run or resume an import; the core attachment fetching is slow; and it’s built into the main importing process.

On import, the GUID of the attachment is changed to the new attachment URL. In the past, the GUID has been used by both WordPress core and plugins as the attachment URL, but this is now regarded as bad practice and is avoided wherever possible. The Importer, however, doesn’t reflect this change in attitude, as it still rewrites the GUID to avoid references to the old URL. This makes repeating an import (such as resuming an import, or doing a partial update) impossible, as there’s no longer an ID to cross-reference between the two sites. Removing the GUID change means we can easily check whether an attachment has already been imported. The downside is that plugins still using the GUID may break; this needs to be fixed in the plugins in question, as the GUID is already an unreliable reference to the image.

As each attachment is imported into WordPress, we also sideload the image at the same time, which blocks the importing process. This means you’re waiting on an image to download when you could be running the rest of the import. Instead, we can create the media posts as we go, but leave the downloading to post-processing. During post-processing, we can then use an HTTP tool or library that supports parallel downloads to fetch the images simultaneously and process them as they come in. This massively reduces the time taken for media importing. Decoupling the downloading also means we can handle downloads with a different tool, or even use existing local files if we already have a copy of the uploads directory (!!!).
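One possible shape for the parallel fetch stage, sketched with PHP’s curl_multi API (the function name and details are hypothetical; the Importer could equally use the bundled Requests library’s `request_multiple()`):

```php
<?php
// Download a set of media URLs in parallel during post-processing,
// instead of blocking the import on each file in turn.
// Returns an array of response bodies keyed like the input array.
function wxr_fetch_parallel(array $urls): array {
    $multi   = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers at once; each completes independently.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi); // wait for activity rather than spin
        }
    } while ($active && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $bodies;
}
```

A real implementation would stream each response to disk and hand completed files to `wp_insert_attachment` metadata generation as they arrive, rather than buffering bodies in memory.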

CLI

The importer was never built with anything except the admin UI in mind. This causes problems with things like CLI importing, where wp-cli needs to reinvent parts of the importer in order to use it nicely. The new importer is instead built with the CLI as a first-class citizen, allowing a much nicer import process.

One of the components of this is building logging support into the importer from the start. Typically, users will want lots of information about the import, but for those who frequently run the importer, it can be a pain to sort out the interesting messages from the regular logging. This is fixed by having a configurable logging level, which also allows detailed debugging if desired.
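A minimal sketch of what leveled logging might look like (the class and level names are illustrative, not the Importer’s actual API): messages below the configured minimum level are dropped, so daily CLI users can run quietly by default and flip to debug when something goes wrong.

```php
<?php
// A tiny leveled logger: anything below the configured minimum is dropped.
class WXR_Logger {
    const LEVELS = ['debug' => 0, 'info' => 1, 'warning' => 2, 'error' => 3];

    private $min_level;
    public  $lines = [];

    public function __construct(string $min_level = 'info') {
        $this->min_level = self::LEVELS[$min_level];
    }

    public function log(string $level, string $message): void {
        if (self::LEVELS[$level] < $this->min_level) {
            return; // filtered out: below the configured verbosity
        }
        $this->lines[] = strtoupper($level) . ': ' . $message;
    }
}
```

With `new WXR_Logger('warning')`, a per-post “Imported post 42” info message disappears while a failed media download still surfaces as an error.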

Let’s Go!

The rewritten Importer already fixes all of these issues, along with other long-standing bugs.

The plan for the Importer is to restore some of the features which didn’t make the cut along the road to this rewrite. Importantly, it currently doesn’t contain any admin UI, with only the CLI interface available. After the final pieces of compatibility are sorted out, the plan is to replace the existing Importer plugin with the rewrite, and ship a version 1 of the Importer. (Did you know the current stable version is only 0.6?)

I’d love to have your help improving this! If there’s anything you’ve ever wanted from the Importer, now’s the time to say so. Testing your export files with the new Importer to detect any discrepancies would be super helpful too, as due to the nature of the rewrite, it’s entirely possible that something’s been missed along the way.

Hopefully you care about the Importer as much as I do. Let’s make it great together.

#importers