Make WordPress Core

Tagged: importers

  • Frederick Ding 5:16 am on August 6, 2013
    Tags: importers

    Migration update: try this importer 

    Hey everyone,

    The importer is largely unchanged from last week, with the exception of a few UI changes:

    • #341: Progress for posts/attachments/menu items is now shown correctly (in %d of %d format)
    • #342: The debug view (showing the raw data) now uses print_r through a special chars filter
    • #340: UI now has full-sentence strings to communicate how the process works and when the import is done, and Refresh/Abort buttons are shown above and below the progress.
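The first two changes can be illustrated with a short sketch. It is written in Python rather than the plugin's PHP, and both function names are hypothetical: `format_progress` mirrors the "%d of %d" display, and `render_debug` escapes a pretty-printed dump in roughly the way that passing `print_r` output through a special-chars filter would.

```python
import html
import pprint


def format_progress(label: str, done: int, total: int) -> str:
    """Return a progress string in "%d of %d" form."""
    return "%s: %d of %d" % (label, done, total)


def render_debug(data: object) -> str:
    """Pretty-print raw data, then escape HTML special characters so the
    dump can be shown safely inside an admin page."""
    return "<pre>" + html.escape(pprint.pformat(data)) + "</pre>"
```

Escaping after formatting matters here: imported post content routinely contains markup, and an unescaped dump would be rendered by the browser instead of displayed.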

    An import of a WordPress WXR file in progress

    A completed WordPress import

    I’ve also had the chance to run it against a large number of import files, including ones sent to me by generous volunteers who read some of my previous weekly updates (props @tieptoep). No catastrophes, yet!

    Obviously, it’s still a work in progress, but I’m now willing to risk a public beta. The usual disclaimers (please don’t use this on a production site) apply.

    Although I’m not aware of any other plugins that build on the WXR importer through its API, I nevertheless generated some PHPDoc API documentation using phpDocumentor 2, which might be handy if you decide to hook into this or reuse its components.

    I’d love to hear your feedback on the interface, on the general experience using this importer, and any errors or warnings that you encounter. Thanks!

    • thisislawatts 9:45 am on August 7, 2013

      Really excited about this plugin; its development seems to coincide with my needing to transfer 6 WordPress blogs (~600 posts each) into 2. So you can only imagine my joy!

      I have downloaded the beta above, but I don’t seem to get anything beyond the initial ‘Import Status’, which shows how many posts it has to import. When I click ‘Refresh’, I get -> http://thisis.la/pix/of-course.png. I am not sure what it’s doing here. I am running this on XAMPP, so could it be an issue with cron not running? If you could point me towards a useful debugging point I will tell you every

      Then a second thought on the UI: these blogs that I am importing have a stack of users, which typically results in -> http://thisis.la/pix/wordpress-import-assign.png. That is super boring to fill in, so I created a quick JS snippet to go through and autofill those values: https://gist.github.com/thisislawatts/6163831. It would be great if those usernames wrapped in ()’s were also wrapped in a span, to spare my messy regex.


      • Frederick Ding 7:13 pm on August 7, 2013

        Thanks for trying it out! I expected there to be some problems, so let’s see if we can figure out what’s going on :)

        Since you mentioned using XAMPP, I imagine it to be this scenario: On local installations (especially with hostnames that do not resolve in DNS), cron theoretically falls back to “alternative cron” mode. I haven’t yet tested with alternative cron; most likely it’s redirecting the admin user right after clicking on “Refresh”, and that fails to pass on the nonce. I think this is what’s happening. Let me try it out on a local install and see what happens.

        The JavaScript for prefilling users is a good idea, although it can also be done server-side. (It currently uses wp_dropdown_users(), which can pre-select a user.) Aside from this, I’d be fine with including the usernames in a data-* attribute; it’d make this a lot easier.

        Edit: I’ve added a new Trac ticket for this enhancement in GSoC #346; it’s related to previous requests that were punted (#8455, #16148).

    • TCBarrett 2:21 pm on August 14, 2013

      This looks great. One of the problems with exporting is that you cannot export a single post type with attachments; it has to be all or nothing. Does this solve that problem? Or is there an equivalent project?

  • Frederick Ding 3:33 am on July 30, 2013
    Tags: importers

    Migration update: cron importer part 2 

    Hey everybody — I have good news and bad news.

    Good news: I’ve finished porting all the individual import steps to the cron model and now have a mostly working frontend UI (largely unchanged from the previous iteration of the importer) that utilizes it.

    As of this evening, the cron model is able to parse, process, and finish importing two test XML files from the core unit tests (valid-wxr-1.1.xml and small-export.xml). The test case, which uses exactly the same assertions as the core unit test, passes all 193 assertions. (Update: an errorless import of the wptest.io sample data has been done.)

    WordPress cron import in progress

    A completed cron import

    Bad news: I wanted to tag a version and release a download today, but I’ve decided not to do so due to the unconfirmed stability of the importer. As some astute observers noted last week, storing the temporary data in the options table can blow out caches. Although I’ve attempted to mitigate this (see [2180] and this reference from a few years back on Core Trac), I still need to test this against some real systems before I release it and break your server.

    Those who are very interested can always check out the source in Subversion. I will post a comment under this post if a download is made available before my next weekly update.

    Although an overhaul of the XML parser, as suggested in the comments on last week’s post, is probably necessary to avoid memory and caching issues, my first priority was to finish the migration of processes to the cron tasks. As soon as I can post a working importer, I will immediately turn my attention to the XML parsing step.

  • Frederick Ding 12:03 am on July 23, 2013
    Tags: importers

    Migration update: cron importer 

    Following last week’s update about the WP_Importer_Cron approach to writing importers and running import jobs, I’ve been steadily transitioning code from the current non-stateful, single-execution plugin to a stateful, step-wise process (#327).

    At the same time, I needed to separate presentation from logic/backend processing (#331) — something that @otto42 also recommended — in two ways:

    • Removing direct printf(), echo statements that were used by the WXR importer (example)
      and changing them to WP_Error objects (example of fatal error; of non-fatal warning)
    • Handling uploads and UI choices in a separate class

    Why must this be done now? Well, asynchronous tasks differ from PHP scripts directly responding to a browser request — we can’t depend on having access to submitted $_POST data, nor can we directly pipe output to the user. This change would also make it easier to understand what the code is doing from reading it, and to test programmatically.
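As a rough illustration of that separation (a Python sketch, not the plugin's PHP; `ImportIssue` is a hypothetical stand-in for `WP_Error`, and every name below is invented for the example), the backend step returns a result object and a separate function turns it into text:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImportIssue:
    """Hypothetical stand-in for a WP_Error: a code, a message, a severity."""
    code: str
    message: str
    fatal: bool = False


@dataclass
class StepResult:
    imported: int = 0
    issues: List[ImportIssue] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        """True when no fatal issue was recorded."""
        return not any(issue.fatal for issue in self.issues)


def import_categories(categories: list) -> StepResult:
    """Backend step: prints nothing, only returns a result for later display."""
    result = StepResult()
    for cat in categories:
        if not cat.get("slug"):
            result.issues.append(
                ImportIssue("missing_slug", "Category without a slug was skipped")
            )
            continue
        result.imported += 1
    return result


def render(result: StepResult) -> str:
    """Presentation layer: turns a result into text for an admin page."""
    lines = ["Imported %d categories" % result.imported]
    lines += ["Warning: " + i.message for i in result.issues if not i.fatal]
    lines += ["Error: " + i.message for i in result.issues if i.fatal]
    return "\n".join(lines)
```

Because `import_categories` never writes output, it can run inside a cron task with no browser attached, and a test can assert on the returned object directly.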

    One dilemma I’ve encountered: how best to store the parsed import XML file. Since each step of the import (users, categories, plugins, etc) runs separately, we must…

    1. store all of the parsed data in variables, which are serialized into an option between runs
      (obviously, a huge amount of data for which this may not be the most robust or efficient method);
    2. re-parse the XML on each run
      (currently, parsers handle all parts of the XML at once, which means unnecessarily duplicated effort and time);
    3. modify the parsers to parse only part of the XML at a time; or
    4. split the XML file into chunks based on their contents (authors, categories, etc) and then feed only partial chunks to the parser at a time.

    Any thoughts? Solving this problem could also help the plugin deal with large XML files that we used to need to break up by hand before importing. (The Tumblr importer doesn’t have the same problem because there is no massive amount of data being uploaded at the beginning.)

    I haven’t yet finished transitioning all the steps; I’m afraid it won’t be possible to use this just yet. Before next Monday, I should have a downloadable plugin that’s safe to try.

    • dllh 1:34 am on July 23, 2013

      You’ll want to be careful about trying to store data in options. For example, in an environment that’s using memcached, which has a size limit, you can blow up sites by trying to store too much data in options, so it’s not necessarily a matter only of efficiency.
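A quick guard against that limit might look like the sketch below. It is in Python for illustration only, uses JSON as a stand-in for PHP's `serialize()`, and assumes memcached's default 1 MB item size, which individual deployments can change; the function names are hypothetical.

```python
import json

# memcached's default maximum item size is 1 MB; deployments can change it.
DEFAULT_ITEM_LIMIT = 1024 * 1024


def serialized_size(value) -> int:
    """Size in bytes of the value once serialized (JSON here, standing in
    for PHP's serialize())."""
    return len(json.dumps(value).encode("utf-8"))


def fits_in_object_cache(value, limit: int = DEFAULT_ITEM_LIMIT) -> bool:
    """True if the serialized value is below the assumed cache item limit."""
    return serialized_size(value) < limit
```

A check like this only tells you when the data is too big; it does not make option 1 safe, since a parsed export of any real size would likely fail it.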

      Also, if you use options, I imagine you also have to use something like an extra tracking option to know which parts of the import you’ve handled. This is just begging for race conditions.

      I wonder if there’s anything you can do using FormData to split the file into chunks client-side and assemble them server-side. I’m not sure what browser support is like, and I worry it’d be pretty brittle. Just thinking aloud.

      • Frederick Ding 1:44 am on July 23, 2013

        You’re completely right about the caching implications — I’ve all but eliminated possibility #1.

        I was thinking about the possibilities of client-side file splitting, too, even though it’s not the most robust or reliable way. Last week, I looked briefly at https://github.com/blueimp/jQuery-File-Upload which supports chunking/multipart — but that’s not quite what we’d need, is it? The browser would need to operate along XML node boundaries, not just file size. (I’d be intrigued if it’s possible to utilize a browser’s native DOM engine to do that…)

    • Ryan McCue 1:50 am on July 23, 2013

      So, with regards to XML parsing options: splitting the XML file is something that absolutely should not be done. You can do it, but only if you use proper XML serialization/deserialization, which is likely going to take up a chunk of time.

      Reparsing is a bit of a pain too, since the XML parsing usually takes up the largest amount of time there. The best bet with regards to memory is to use a SAX-style parser which streams the XML, whereas a DOM-style parser will read it all at once. SimpleXML is a DOM-style parser, so you should avoid that for performance, whereas xml_parse is SAX-based.
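To make the SAX-versus-DOM difference concrete, here is a minimal sketch in Python (not PHP). It uses Python's `xml.parsers.expat`, which plays a role similar to `xml_parse`: the document is fed in small chunks and handlers fire as elements open, so no tree is built. The `count_items` helper is hypothetical.

```python
import xml.parsers.expat


def count_items(stream, chunk_size: int = 64 * 1024) -> int:
    """Count <item> elements while reading a binary stream in small chunks.

    Only the current chunk is held in memory; no tree is ever built.
    """
    count = 0

    def start_element(name, attrs):
        nonlocal count
        if name == "item":
            count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # more data will follow
    parser.Parse(b"", True)         # signal the end of the document
    return count
```

A chunk boundary may fall in the middle of a tag; the parser buffers the partial token itself, which is what makes this style suitable for very large exports.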

      I’m biased as the lead developer of it, but you could use SimplePie, which a) is built into core, b) has built-in caching (via the transients API, but you’d probably want file-based caching here) and c) natively parses RSS (given that’s its job), which WXR is based on. This handles picking a parser all internally, although it doesn’t use a stream parser due to internal implementation details, so you may want to stay away from it for that. I’m relatively certain I could write a parser (in `WXR_Parser_*` form) in a few hours, so it should be easy to do.

      At the least, I’d take a look into how SimplePie handles caching. Basically, it’s a giant serialized array, usually saved to a file (although in WP, this uses transients).

      (If you do want to take the full SimplePie route, I’d love to help, most likely in a consulting role. It’d simplify the `WXR_Parser` classes significantly by moving the XML parsing out of those.)

      I’ve spent a fair bit of time messing with XML parsers, so feel free to pick my brain on this! :)

      • Frederick Ding 2:51 am on July 23, 2013

        I should have remembered that you’re an expert on XML!

        I’m not familiar with the differences between SAX/pull/DOM, but if I’m reading this right, then pull — with a way to persist state — would be the most appropriate model to follow. (In PHP, XMLReader appears to be a pull parser that’s enabled by default since PHP 5.1.2 — so I’d just need to write a WXR_Parser_* to take advantage of it.)

        Thanks for pointing me in this direction!

        Edit: Actually, scratch my eagerness to use XMLReader. It appears to be weakly documented and most of what I can see involves using SimpleXML/DOM on nodes…

        • Ryan McCue 6:43 am on July 23, 2013

          XMLReader or SAX are the ones I’d go for. There’s an IBM DeveloperWorks article that might help with that.

          The way I’d handle it is to keep state of where you currently are in the stack. WXR_Parser_XML uses this concept to handle it, but uses the SAX parser instead. Conceptually, the way it keeps track of the position is the way you’d want to do it (although there’s a bunch of cases it doesn’t handle).

          WXR_Parser_XML isn’t the best though, since all this data is loaded into memory and then pushed to the database later. Although it means you have cross dependencies, I’d consider inserting posts/etc as you go, rather than bunching them all up. This is a pretty fundamental rework of the internal parsing API, but it’s one that you’ll need for this sort of thing.

          Personally, I’d create two objects (a parser and a data handler) and use dependency injection to ensure that you still have a weak coupling between the two.
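One way to picture the two-object design is the Python sketch below (illustrative only; it is not the existing `WXR_Parser_XML`, and all class names are hypothetical). The parser tracks its position with a stack of element names and hands each finished item to whatever handler was injected, instead of collecting everything in memory first.

```python
import xml.sax
from typing import List, Protocol


class DataHandler(Protocol):
    def handle_post(self, post: dict) -> None: ...


class StreamingParser(xml.sax.ContentHandler):
    """Tracks its position with a stack of element names and passes each
    finished <item> to the injected handler instead of collecting them."""

    def __init__(self, handler: DataHandler) -> None:
        super().__init__()
        self.handler = handler
        self.stack: List[str] = []
        self.current: dict = {}
        self.text: List[str] = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []
        if name == "item":
            self.current = {}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self.stack.pop()
        if self.stack and self.stack[-1] == "item":
            # Direct children of <item> become fields of the current post.
            self.current[name] = "".join(self.text).strip()
        elif name == "item":
            self.handler.handle_post(self.current)


class ListHandler:
    """Test double; a real handler would insert into the database."""

    def __init__(self) -> None:
        self.posts: List[dict] = []

    def handle_post(self, post: dict) -> None:
        self.posts.append(post)


def parse(stream, handler: DataHandler) -> None:
    """Stream-parse the document, delegating every post to the handler."""
    xml.sax.parse(stream, StreamingParser(handler))
```

The parser knows nothing about storage, so swapping `ListHandler` for a database-backed handler requires no change to the parsing code, and the parser can be tested without a database.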

          (As a note, SimplePie also uses the SAX parser, but loads all the data into memory because it has to. This is a case where it’s a much better idea to use XMLReader directly.)

          Regarding XMLReader documentation, are there any specific examples? I’m happy to assist with specifics here, ping me at my email (see footer) if you’d like.

        • Ryan McCue 6:51 am on July 23, 2013

          Also, the reason people are using XMLReader with SimpleXML/DOM is because they prefer the SimpleXML/DOM API, and the performance using the hybrid is better than straight SimpleXML. Personally, I’d stick with the one rather than switching, because there are some performance concerns with that.

      • Ryan McCue 1:55 am on July 23, 2013

        (What you really want here is a resumable SAX/pull parser, with some way to persist state across processes. With `xml_parse`, you should be able to feed in data as you stream it in, but you’ll still need to keep the data in memory since I don’t know of any streamable serialization in PHP. Please note that if you do use your own code here, there are security considerations that will need to be discussed. Your mentors should be able to help with that.)
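A simplified take on persisting state across runs is sketched below in Python. Note what it is not: it does not suspend and resume the parser itself. Each run streams the file again from the start and cheaply skips the items recorded as done, so the only state saved between processes is a counter. The state-file layout and all function names are assumptions made for the example.

```python
import json
import os
import xml.etree.ElementTree as ET


def load_state(path: str) -> dict:
    """Read the saved progress, or start from zero if none exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"items_done": 0}


def save_state(path: str, state: dict) -> None:
    with open(path, "w") as fh:
        json.dump(state, fh)


def run_batch(xml_path: str, state_path: str, batch_size: int, handle) -> bool:
    """Process up to batch_size new <item> elements, then record progress.

    Returns True once the whole file has been processed.
    """
    state = load_state(state_path)
    seen = 0
    handled = 0
    finished = True
    with open(xml_path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag != "item":
                continue
            seen += 1
            if seen > state["items_done"]:
                if handled == batch_size:
                    finished = False  # more items remain for a later run
                    break
                handle(elem.findtext("title"))
                handled += 1
            elem.clear()  # drop the item's children once we are done with it
    state["items_done"] += handled
    save_state(state_path, state)
    return finished
```

Each cron run would call `run_batch` once; the import is complete when it returns `True`. The repeated scan is the price paid for keeping the persisted state tiny.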

  • Jen 11:00 pm on December 16, 2012
    Tags: importers, twitter

    Antsy for 3.6 to start and need a project? Who wants to make an official importer for the new Twitter archives? Would think we’ll want to add that into the importers list. Would suggest the importer auto-assign the “status” or “aside” post format (or make it an option in the plugin to choose the format). Who’s in? I volunteer to UX review and test. :)

    • Aaron Brazell 11:05 pm on December 16, 2012

      I already was planning on doing this as a plugin, and I’ve been quiet for a while. I can do this. But… I need to have the archives available to me, and my account doesn’t have it yet.

      • Jane Wells 11:26 pm on December 16, 2012

        Mine either, but I’ll see if I can wrangle one we can use.

      • Ryan Duff 11:26 pm on December 16, 2012

        Or as soon as someone gets and volunteers a copy of their archive. From the post it doesn’t seem there’s an API, but a set of HTML pages plus XML or JSON files (pick your poison).

        Also, from what I’ve heard the files are monthly, so if you’ve been on Twitter for 4 years you’d be looking at 48 JSON files to handle with an importer.

        Of course that’s all based off what I’ve read. I don’t have it yet. Just something to chew on.

        • Aaron Brazell 11:32 pm on December 16, 2012

          Yeah it’ll be an interesting challenge but I wanna see what the data looks like first. If they turn it on for me, I’ve got 6 years of archives which would be a good stress test too.

          • Andrew Nacin 11:33 pm on December 16, 2012

            Could handle the zip that they send you as-is. In fact, I imagine that would be the best approach for users.

            • Aaron Brazell 2:54 pm on December 17, 2012

              I feel like we need to be able to parse the HTML, CSV and JSON in any zip file if we approach it that way. From the user perspective, I think you’re right but I sure hope the CSV and HTML are decent enough.

        • Samuel Wood (Otto) 11:38 pm on December 16, 2012

          It’ll just be a matter of iterating the json. Simple stupid, for the most part.
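Iterating the JSON straight out of the zip, as suggested earlier in the thread, could look roughly like this Python sketch. Nobody here had seen the archive yet, so the layout is purely an assumption: one `.json` member per month, each holding an array of tweet objects. The `iter_tweets` name is hypothetical.

```python
import json
import zipfile
from typing import Iterator


def iter_tweets(archive) -> Iterator[dict]:
    """Yield tweet dicts from every .json member of the archive, in name order.

    Assumes each member holds a JSON array of tweet objects; the real
    archive layout may differ.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue  # skip HTML, CSV and other non-JSON members
            with zf.open(name) as fh:
                for tweet in json.load(fh):
                    yield tweet
```

Reading one member at a time keeps memory use bounded by the largest monthly file, and the user never has to unpack the zip by hand.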

    • Samuel Wood (Otto) 11:06 pm on December 16, 2012

      If they actually roll it out to people unchanged, then it should be fairly trivial. When I get it on my account, I’ll let you know.

    • Phil Erb 11:25 pm on December 16, 2012

      When archives are available to me, I’d love to help test.

    • Andrew Nacin 11:36 pm on December 16, 2012

      We now have an API for importers, which means we can add this to wp-admin/import.php the moment it is done.

      When anyone gets access to their zip, please share it (or at least a sample so we can learn the format).

    • Aaron D. Campbell 12:29 am on December 17, 2012

      Looks like my account has the option. I’m requesting an archive and will post it somewhere to use.

    • Myatu 6:53 am on December 17, 2012

      Wouldn’t it be more sensible to have this as a 3rd party plugin, rather than having to maintain more bloat?

      • Peter Westwood 10:40 am on December 17, 2012

        It will be a plugin anyway, not in the core download – all the importers are plugins now.

        • Myatu 6:21 am on December 18, 2012

          Actually forgot about that! Shows how often I’ve used that feature. I stand corrected :)

      • Jane Wells 11:39 am on December 17, 2012

        As @westi states, all importers are plugins, not core code. I didn’t specify that in my post, since I took it as a given that core developers know that.

    • Simon Wheatley 8:34 am on December 17, 2012

      Note that the current Twitter IDs overflow (?) and corrupt if you convert them to integers on a 32-bit system. That one has got me before. (Apologies if that’s teaching everyone to suck eggs.)
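The usual way around this is to treat IDs as strings throughout. The Python sketch below is illustrative: `as_int32` merely emulates a wrapping signed 32-bit integer to show the corruption (PHP's actual behaviour on a 32-bit build may differ), and `tweet_id` prefers the `id_str` field that Twitter's JSON provides alongside `id`. Whether the archive files include `id_str` is an assumption.

```python
INT32_MAX = 2**31 - 1


def as_int32(value: int) -> int:
    """Emulate storing a value in a signed 32-bit integer (wraps around)."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def tweet_id(tweet: dict) -> str:
    """Return the ID as a string, preferring the string field when present."""
    return tweet.get("id_str") or str(tweet["id"])
```

Storing the string form, for example as post meta, also avoids any precision loss if the value passes through a float on the way.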

      Also, would you mind putting in a filter for the post data before save… I’d prefer to store tweets in a custom post type. Ta! :)
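The kind of hook being asked for can be sketched as follows, in Python for illustration. The `add_filter` and `apply_filters` functions imitate the WordPress functions of the same names in a very reduced form, and the hook name `twitter_import_post_data` is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

_filters: Dict[str, List[Callable[[dict], dict]]] = defaultdict(list)


def add_filter(name: str, callback: Callable[[dict], dict]) -> None:
    """Register a callback to run when the named filter is applied."""
    _filters[name].append(callback)


def apply_filters(name: str, value: dict) -> dict:
    """Pass the value through every callback registered for the filter."""
    for callback in _filters[name]:
        value = callback(value)
    return value


def build_post(tweet: dict) -> dict:
    """Build the post data for a tweet, then let site code adjust it."""
    post = {
        "post_type": "post",
        "post_format": "status",
        "post_content": tweet["text"],
    }
    # Hypothetical hook name; runs just before the post would be saved.
    return apply_filters("twitter_import_post_data", post)
```

A site that wants tweets in a custom post type would register one callback that changes `post_type`, leaving the importer itself untouched.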

    • Andrew Nacin 6:49 pm on December 17, 2012

      I’ve outlined a potential plan for such an importer on a Trac ticket: https://core.trac.wordpress.org/ticket/22981. If you want to continue to discuss the idea, feel free to do so here. Implementation can occur on the ticket. (This is a plugin, but an official importer is also a core priority, hence the use of core trac.)

      I’ve also uploaded a tweet archive contributed by @chadhuber to the ticket. It does contain sane json.

    • Beau Lebens 9:23 pm on December 17, 2012

      In case it helps, I already made one, packaged in here: https://wordpress.org/extend/plugins/keyring-social-importers/

  • Andrew Nacin 11:14 pm on November 21, 2012
    Tags: importers   

    If you have a Tumblr blog, can you help test an updated version of the importer? It uses their OAuth API, which requires you to create an application. It’s simple and the plugin walks you through it. Here’s the ZIP file to the beta version. You can report bugs on #22422.

    Edit: The plugin has been released, the beta is over: https://wordpress.org/extend/plugins/tumblr-importer/
