WordPress Importer Redux

Hi, I’m Ryan McCue. You may remember me from such projects as the REST APIREST API The REST API is an acronym for the RESTful Application Program Interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data. It is how the front end of an application (think “phone app” or “website”) can communicate with the data store (think “database” or “file system”) https://developer.wordpress.org/rest-api/..

I’m here today to talk about something a bit different: the WordPress Importer. The WordPress Importer is key to a tonne of different workflows, and is one of the most used plugins on the repo.

Unfortunately, the Importer is also a bit unloved. After getting immensely frustrated at the Importer, I figured it was probably time we throw some attention at it. I’ve been working on fixing this with a new and improved Importer!

If you’re interested in checking out this new version of the Importer, grab it from GitHub. It’s still some way from releaseRelease A release is the distribution of the final version of an application. A software release may be either public or private and generally constitutes the initial or new generation of a new or upgraded application. A release is preceded by the distribution of alpha and then beta versions of the software., but the vast majority of functionality is already in place. The plan is to eventually replace the existing Importer with this new version.

The key to these Importer improvements is rewriting the coreCore Core is the set of software required to run WordPress. The Core Development Team builds WordPress. processing, taking experience with the current Importer and building to fix those specific problems. This means fixing and improving a whole raft of problems:

  • Way less memory usage: Testing shows memory usage to import a 41MB WXR file is down from 132MB to 19MB (less than half the actual file size!). This means no more splitting files just to get them to import!
  • Faster parser: By using a streaming XML parser, we process data as we go, which is much more scalable than the current approach. Content can begin being imported as soon as the file is read, rather than waiting for pre-processing.
  • Resumable parsing: By storing more in the database instead of variables, we can quit and resume imports on-the-go.
  • Partial imports: Rethinking the deduplication approach allows better partial imports, such as when you’re updating a production siteProduction Site A production site is a live site online meant to be viewed by your visitors, as opposed to a site that is staged for development or testing. from staging.
  • Better CLICLI Command Line Interface. Terminal (Bash) in Mac, Command Prompt in Windows, or WP-CLI for WordPress.: Treating the CLI as a first-class citizen means a better experience for those doing imports on a daily basis, and better code quality and reusability.

Curious as to how all of this is done? Read on!

Continue reading

#importers

Migration update: try this importer

Hey everyone,

The importer is largely unchanged from last week, with the exception of a few UIUI User interface changes:

  • #341: Progress for posts/attachments/menu items is now shown correctly (in %d of %d format)
  • #342: The debug view (showing the raw data) now uses print_r through a special chars filterFilter Filters are one of the two types of Hooks https://codex.wordpress.org/Plugin_API/Hooks. They provide a way for functions to modify data of other functions. They are the counterpart to Actions. Unlike Actions, filters are meant to work in an isolated manner, and should never have side effects such as affecting global variables and output.
  • #340: UI now has full-sentence strings to communicate how the process works and when the import is done, and Refresh/Abort buttons are shown above and below the progress.

An import of a WordPress WXR file in progress

A completed WordPress import

I’ve also had the chance to run it against a large number of import files, including ones sent to me by generous volunteers who read some of my previous weekly updates (props @tieptoep). No catastrophes, yet!

Obviously, it’s still a work in progress, but I’m now willing to risk a public betaBeta A pre-release of software that is given out to a large group of users to trial under real conditions. Beta versions have gone through alpha testing in-house and are generally fairly close in look, feel and function to the final product; however, design changes often occur as part of the process.. The usual disclaimers (please don’t use this on a production siteProduction Site A production site is a live site online meant to be viewed by your visitors, as opposed to a site that is staged for development or testing.) apply.

Although I’m not aware of any other plugins that build on the WXR importer through its APIAPI An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways., I nevertheless generated some PHPDoc API documentation using phpDocumentor 2, which might be handy if you decide to hook into this or reuse its components.

I’d love to hear your feedback on the interface, on the general experience using this importer, and any errors or warnings that you encounter. Thanks!

#importers, #migration-portability, #weekly-update

Migration update: cron importer part 2

Hey everybody — I have good news and bad news.

Good news: I’ve finished porting all the individual import steps to the cron model and now have a mostly working frontend UIUI User interface (largely unchanged from the previous iteration of the importer) that utilizes it.

As of this evening, the cron model is able to parse, process, and finish importing two test XML files from the coreCore Core is the set of software required to run WordPress. The Core Development Team builds WordPress. unit tests (valid-wxr-1.1.xml and small-export.xml). The test case, which uses exactly the same assertions as the core unit testunit test Code written to test a small piece of code or functionality within a larger application. Everything from themes to WordPress core have a series of unit tests. Also see regression., passes all 193 assertions. (Update: an errorless import of the wptest.io sample data has been done.)

WordPress import in progress

WordPress cron import in progress

A completed cron import

A completed cron import

Bad news: I wanted to tagtag A directory in Subversion. WordPress uses tags to store a single snapshot of a version (3.6, 3.6.1, etc.), the common convention of tags in version control systems. (Not to be confused with post tags.) a version and releaseRelease A release is the distribution of the final version of an application. A software release may be either public or private and generally constitutes the initial or new generation of a new or upgraded application. A release is preceded by the distribution of alpha and then beta versions of the software. a download today, but I’ve decided not to do so due to the unconfirmed stability of the importer. As some astute observers noted last week, storing the temporary data in the options table can blow out caches. Although I’ve attempted to mitigate this (see [2180] and this reference from a few years back on Core Trac), I still need to test this against some real systems before I release it and break your server.

Those who are very interested can always check out the source in Subversion. I will post a comment under this post if a download is made available before my next weekly update.

Although an overhaul of the XML parser, as suggested in the comments on last week’s post, is probably necessary to avoid memory and caching issues, my first priority was to finish the migrationMigration Moving the code, database and media files for a website site from one server to another. Most typically done when changing hosting companies. of processes to the cron tasks. As soon as I can post a working importer, I will immediately turn my attention to the XML parsing step.

#importers, #migration-portability, #weekly-update

Migration update: cron importer

Following last week’s update about the WP_Importer_Cron approach to writing importers and running import jobs, I’ve been steadily transitioning code from the current non-stateful, single-execution pluginPlugin A plugin is a piece of software containing a group of functions that can be added to a WordPress website. They can extend functionality or add new features to your WordPress websites. WordPress plugins are written in the PHP programming language and integrate seamlessly with WordPress. These can be free in the WordPress.org Plugin Directory https://wordpress.org/plugins/ or can be cost-based plugin from a third-party to a stateful, step-wise process (#327).

At the same time, I needed to separate presentation from logic/backend processing (#331) — something that @otto42 also recommended — in two ways:

  • Removing direct printf(), echo statements that were used by the WXR importer (example)
    and changing them to WP_Error objects (example of fatal error; of non-fatal warning)
  • Handling uploads and UIUI User interface choices in a separate class

Why must this be done now? Well, asynchronous tasks differ from PHPPHP The web scripting language in which WordPress is primarily architected. WordPress requires PHP 5.6.20 scripts directly responding to a browser request — we can’t depend on having access to submitted $_POST data, nor can we directly pipe output to the user. This change would also makemake A collection of P2 blogs at make.wordpress.org, which are the home to a number of contributor groups, including core development (make/core, formerly "wpdevel"), the UI working group (make/ui), translators (make/polyglots), the theme reviewers (make/themes), resources for plugin authors (make/plugins), and the accessibility working group (make/accessibility). it easier to understand what the code is doing from reading it, and to test programmatically.

One dilemma I’ve encountered: how best to store the parsed import XML file. Since each step of the import (users, categories, plugins, etc) runs separately, we must…

  1. store all of the parsed data in variables, which are serialized into an option between runs
    (obviously, a huge amount of data for which this may not be the most robust or efficient method);
  2. re-parse the XML on each run
    (currently, parsers handle all parts of the XML at once, which means unnecessarily duplicated effort and time);
  3. modify the parsers to parse only part of the XML at a time; or
  4. split the XML file into chunks based on their contents (authors, categories, etc) and then feed only partial chunks to the parser at a time.

Any thoughts? Solving this problem could also help the plugin deal with large XML files that we used to need to break up by hand before importing. (The Tumblr importer doesn’t have the same problem because there is no massive amount of data being uploaded at the beginning.)

I haven’t yet finished transitioning all the steps; I’m afraid it won’t be possible to use this just yet. Before next Monday, I should have a downloadable plugin that’s safe to try.

#importers, #migration-portability, #weekly-update

Antsy for 3.6 to start and need a…

Antsy for 3.6 to start and need a project? Who wants to makemake A collection of P2 blogs at make.wordpress.org, which are the home to a number of contributor groups, including core development (make/core, formerly "wpdevel"), the UI working group (make/ui), translators (make/polyglots), the theme reviewers (make/themes), resources for plugin authors (make/plugins), and the accessibility working group (make/accessibility). an official importer for the new Twitter archives? Would think we’ll want to add that into the importers list. Would suggest importer auto-assign “status” or “aside” post format (or make it an option in the pluginPlugin A plugin is a piece of software containing a group of functions that can be added to a WordPress website. They can extend functionality or add new features to your WordPress websites. WordPress plugins are written in the PHP programming language and integrate seamlessly with WordPress. These can be free in the WordPress.org Plugin Directory https://wordpress.org/plugins/ or can be cost-based plugin from a third-party to choose format). Who’s in? I volunteer to ux review and test. 🙂
http://thenextweb.com/twitter/2012/12/16/twitter-has-started-rolling-out-the-option-to-download-all-your-tweets/

#importers, #plugin, #twitter

If you have a Tumblr blog can you…

If you have a Tumblr blogblog (versus network, site), can you help test an updated version of the importer? It uses their OAuth APIAPI An API or Application Programming Interface is a software intermediary that allows programs to interact with each other and share data in limited, clearly defined ways., which requires you to create an application. It’s simple and the pluginPlugin A plugin is a piece of software containing a group of functions that can be added to a WordPress website. They can extend functionality or add new features to your WordPress websites. WordPress plugins are written in the PHP programming language and integrate seamlessly with WordPress. These can be free in the WordPress.org Plugin Directory https://wordpress.org/plugins/ or can be cost-based plugin from a third-party walks you through it. Here’s the ZIP file to the beta version. You can report bugs on #22422.

Edit: The plugin has been released, the beta is over: https://wordpress.org/extend/plugins/tumblr-importer/

#importers