WordPress.org

Make WordPress Core

Tagged: importers

  • Ryan McCue 1:27 am on November 18, 2015 Permalink
    Tags: importers   

    WordPress Importer Redux 

    Hi, I’m Ryan McCue. You may remember me from such projects as the REST API.

    I’m here today to talk about something a bit different: the WordPress Importer. The WordPress Importer is key to a tonne of different workflows, and is one of the most used plugins on the repo.

    Unfortunately, the Importer is also a bit unloved. After getting immensely frustrated at the Importer, I figured it was probably time we throw some attention at it. I’ve been working on fixing this with a new and improved Importer!

    If you’re interested in checking out this new version of the Importer, grab it from GitHub. It’s still some way from release, but the vast majority of functionality is already in place. The plan is to eventually replace the existing Importer with this new version.

    The key to these Importer improvements is a rewrite of the core processing, drawing on experience with the current Importer to fix its specific problems. This brings a whole raft of improvements:

    • Way less memory usage: Testing shows memory usage to import a 41MB WXR file is down from 132MB to 19MB (less than half the actual file size!). This means no more splitting files just to get them to import!
    • Faster parser: By using a streaming XML parser, we process data as we go, which is much more scalable than the current approach. Content can begin being imported as soon as the file is read, rather than waiting for pre-processing.
    • Resumable parsing: By storing more in the database instead of variables, we can quit and resume imports on-the-go.
    • Partial imports: Rethinking the deduplication approach allows better partial imports, such as when you’re updating a production site from staging.
    • Better CLI: Treating the CLI as a first-class citizen means a better experience for those doing imports on a daily basis, and better code quality and reusability.
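    As a rough illustration of the streaming and resumable ideas in the list above (written in Python rather than the Importer’s PHP, and not taken from its code), the sketch below reads `<item>` elements one at a time and can skip items a previous run already handled. The `stream_items` function, the `skip` checkpoint, and the sample data are all assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_items(source, skip=0):
    """Yield (index, title) for each <item>, reading the XML incrementally.

    `skip` is a hypothetical checkpoint: the number of items that a
    previous, interrupted run already imported.
    """
    index = 0
    # iterparse hands over each element as soon as its closing tag is read,
    # so processing can start before the whole file has been parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        if index >= skip:
            yield index, elem.findtext("title")
        index += 1
        # Free the item's children once handled so memory stays small.
        elem.clear()


sample = b"""<rss><channel>
<item><title>First</title></item>
<item><title>Second</title></item>
<item><title>Third</title></item>
</channel></rss>"""

# A fresh import sees every item; a resumed one starts after the checkpoint.
print(list(stream_items(io.BytesIO(sample))))
print(list(stream_items(io.BytesIO(sample), skip=2)))
```

    In a real importer the checkpoint would live somewhere persistent, such as the database, which is what makes quitting and resuming possible.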

    Curious as to how all of this is done? Read on!


     
    • pavot 1:36 am on November 18, 2015 Permalink | Log in to Reply

      I can’t express how happy I am about this progress!
      Thank you very much.

    • Jon Brown 1:54 am on November 18, 2015 Permalink | Log in to Reply

      Ditto. This is awesome. Having spent a lot of hours last weekend trying to debug a massive wp.com WXR/XML export (split in over 20 parts and failing on files 14-16)… this is so timely. Looking forward to testing it.

      Curious if the pull parser works with a folder full of xml files the way the WP importer did? Since as I said above, what .com dumped was a couple dozen xml files and I don’t have a convenient way to glue them back together.

      Huge thank you for attending to a long neglected corner of WordPress!

      • Ryan McCue 1:56 am on November 18, 2015 Permalink | Log in to Reply

        Curious if the pull parser works with a folder full of xml files the way the WP importer did?

        You should be able to work with split files by just importing them one after another. (You should also be able to import them in any order, theoretically.) This is something I haven’t put a lot of testing into yet though, so give it a shot and let me know!

        In the future, hopefully we’ll never need to split files again. 🙂

      • Brent Toderash 6:05 am on November 19, 2015 Permalink | Log in to Reply

        If you’re up for an adventure, on the command line you could do something like

        cat $(ls *.xml -t) > complete.xml

        to stitch the files back together in date order, then try importing the whole thing at once 🙂
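        One caveat with plain concatenation: the combined file ends up with one XML declaration and one root element per part, which a strict XML parser will generally reject. A hedged alternative, sketched in Python with a hypothetical `merge_wxr` helper, keeps the first part’s `<channel>` and appends the `<item>` elements from the later parts. It assumes every part repeats the same channel header and that all parts fit in memory, neither of which has been checked against real WordPress.com exports.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def merge_wxr(paths, out_path):
    """Merge several export parts into one well-formed file.

    Keeps the first file's <channel> header and appends the <item>
    elements of every later file to it. Returns the total item count.
    """
    first, *rest = paths
    # Re-use the first part's namespace prefixes (wp:, dc:, ...) on output.
    for _event, (prefix, uri) in ET.iterparse(first, events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    tree = ET.parse(first)
    channel = tree.getroot().find("channel")
    for path in rest:
        other = ET.parse(path).getroot().find("channel")
        # Only the posts are copied; the header is assumed to repeat.
        channel.extend(other.findall("item"))
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return len(channel.findall("item"))


# Tiny demonstration with two made-up parts.
part = "<rss version='2.0'><channel><title>Demo</title>{}</channel></rss>"
bodies = [
    "<item><title>A</title></item>",
    "<item><title>B</title></item><item><title>C</title></item>",
]
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n, body in enumerate(bodies):
        path = os.path.join(tmp, f"part-{n}.xml")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(part.format(body))
        paths.append(path)
    merged = os.path.join(tmp, "complete.xml")
    total = merge_wxr(paths, merged)
    titles = [t.text for t in ET.parse(merged).getroot().iter("title")]
print(total, titles)
```

        The caller decides the order of `paths`, so the parts can be sorted by name or date before merging.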

    • Sakin Shrestha 2:12 am on November 18, 2015 Permalink | Log in to Reply

      Wow this is awesome… Thanks a lot 🙂

    • nicholas_io 2:13 am on November 18, 2015 Permalink | Log in to Reply

      This is Awesome! The WordPress importer as it currently stands is far from being good. I’ll be happy testing it out

    • Mike 2:20 am on November 18, 2015 Permalink | Log in to Reply

      Great stuff Ryan. Really nice performance improvements. I’ve often run into memory limits during import.

      As an aside, has there been discussion around offline import, most specifically around media items? Such a possibility would allow for sites to be exported wholly together with all of their content items and either kept for archive or imported more easily. Such a site could have been mocked up on an unreachable development server.

      • Ryan McCue 2:24 am on November 18, 2015 Permalink | Log in to Reply

        Thanks Mike!

        We have an issue filed for that one. Deferring the actual image imports would bring a few gains: offline imports, parallel downloads, or static content imports (i.e. image files in a zip).

        You can pass 'fetch_attachments' => false (also the default internally, but not in the CLI) into the importer class in __construct for this already. 🙂

    • Ryan Markel 2:25 am on November 18, 2015 Permalink | Log in to Reply

      Hey, Ryan.

      This is other Ryan. 🙂 We’ve been dealing with super-large imports on WordPress.com VIP for some time now and I’ve been dealing with them for a year-plus. I’d love to chat with you at some point about the kinds of things we have run into in these large imports, and about edgier cases like site merges that the importer can currently struggle with.

      Some of what we are doing on WordPress.com is almost certainly not practical for movement to core, but we can probably provide some valuable insight on larger imports and problems that are encountered there.

      • Ryan McCue 2:49 am on November 18, 2015 Permalink | Log in to Reply

        Hey other Ryan! Let’s chat; one of the big use cases that led to me rewriting this was huge VIP-level imports, so it’s definitely something I’m conscious of. I’d love to have your and VIP’s feedback 🙂

    • Ipstenu (Mika Epstein) 2:43 am on November 18, 2015 Permalink | Log in to Reply

      This will be a game changer for the many people who want to export from wordpress.com and import to self-hosted where they either can’t afford or aren’t able to use the migration package.

      You, sir, deserve all the pizza bites.

    • Matthew Eppelsheimer 3:15 am on November 18, 2015 Permalink | Log in to Reply

      This is so, so, great Ryan!

      Looking forward to testing this and seeing if our pet bugs in the 0.6 importer to fix “someday” still exist. I’ll bet not…

    • JakePT 3:38 am on November 18, 2015 Permalink | Log in to Reply

      My biggest frustration is the inability to import Posts only and import images at the same time. Images only come across if the entire site is exported. This is very annoying when trying to transfer a client’s blog between sites without bringing across menus and random plugin post types. I can understand the challenge since they’re only img tags and not necessarily attached to the post, but it’s definitely my biggest hangup with the importer.

    • bishless 4:59 am on November 18, 2015 Permalink | Log in to Reply

      This is great news! Thanks for your work, Ryan!

    • jeffmcneill 5:43 am on November 18, 2015 Permalink | Log in to Reply

      I’ve run into a ton of problems with importing, usually a memory or parsing issue. I’ve had to eject a lot of content from several site migrations because of this. That you are working on this critical tool for WordPress means you should be applauded, and lauded. Thank you!

    • DeBAAT 6:26 am on November 18, 2015 Permalink | Log in to Reply

      Reading the responses, you’ve definitely hit a soft spot!
      Thanks already for picking this challenge.

      I myself did have some issues with the importer lately as I was migrating some content from dev to prod.
      And this actually had to do with the GUID, so please take real good care when using this.
      I mean, a post developed on site x.dev typically has x.dev in the guid. When I try to import it into x.com, I would like to have an option whether or not to ‘translate’ the GUID from x.dev to x.com.

      The same goes for the GUID of an image. In some cases, the image may have been uploaded already in the x.com site before. So how do you discriminate between the two images?
      Hope you’ll get it working nicely.

      Reading your article, I thought you mentioned something about changing the exporter as well. I’m not sure however, so are there any plans to work on the exporter as well?

      • Ryan McCue 6:47 am on November 18, 2015 Permalink | Log in to Reply

        The main thing with the GUID is that it shouldn’t ever change. The new importer actually uses it more, so if you change the GUIDs, later imports will reimport the same data again. That said, @westonruter’s pull request would let you write complex logic to map this back and forth, and I do know it’s a common use case. Happy to discuss further on a GH issue 🙂

        Specifically on images, there’s a flag in the options in the new importer called update_attachment_guids – the old importer always changes the GUID to the new filename, but this rewrite avoids that by default, as it makes deduplication on future imports impossible. (The reason the old importer did this is because originally the GUID contained the filename, so that’s occasionally used by older code to find the image URL.)
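        To illustrate why the GUID has to stay stable, here is a minimal Python sketch of GUID-based deduplication. It is not the importer’s actual logic; `import_posts`, the `seen_guids` set, and the sample posts are hypothetical stand-ins.

```python
def import_posts(posts, seen_guids):
    """Import only posts whose GUID has not been seen before.

    `seen_guids` stands in for whatever persistent store an importer
    would use; here it is a plain set that the caller keeps around.
    """
    imported = []
    for post in posts:
        guid = post["guid"]
        if guid in seen_guids:
            continue  # already imported by an earlier run
        seen_guids.add(guid)
        imported.append(post["title"])
    return imported


seen = set()
staging = [
    {"guid": "http://x.dev/?p=1", "title": "Hello"},
    {"guid": "http://x.dev/?p=2", "title": "About"},
]
first_run = import_posts(staging, seen)
# Re-importing the same export is a no-op because the GUIDs match.
second_run = import_posts(staging, seen)
# Rewriting the GUIDs (x.dev -> x.com) defeats the check.
rewritten = [{**p, "guid": p["guid"].replace("x.dev", "x.com")} for p in staging]
third_run = import_posts(rewritten, seen)
print(first_run, second_run, third_run)
```

        The third run imports both posts again, which is the duplicate-import problem that changing GUIDs would cause.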

        There’s no real plans to change the exporter right now, but if it turns out we can improve import efficiency by changing data in exports (i.e. avoid post-processing by being careful about the element order), we’ll likely do that.

        Thanks for the feedback!

    • Ahmad Awais 6:38 am on November 18, 2015 Permalink | Log in to Reply

      I’ve had my fair share of WTF Moments™ with WordPress importer. So, believe me when I say this, I am glad someone is finally working on improving that. It’s great news. Going to test it this weekend.

    • Sam Hotchkiss 6:51 am on November 18, 2015 Permalink | Log in to Reply

      You’re amazing, Ryan, well done.

    • Per Soderlind 6:53 am on November 18, 2015 Permalink | Log in to Reply

      Excellent Ryan, can’t wait to test it 🙂

    • Dreb Bits 7:30 am on November 18, 2015 Permalink | Log in to Reply

      Truly amazing! Will find time this weekend to test this baby! And thanks a lot for initiating the movement to improve the WordPress importer 🙂

    • Ajay 7:52 am on November 18, 2015 Permalink | Log in to Reply

      This is awesome news. It’s been a long time coming. Are you planning to update the Exporter as well?

      • Ryan McCue 7:54 am on November 18, 2015 Permalink | Log in to Reply

        No major plans there yet, but if there’s improvements to be made there, I’ll be sure to take a look. 🙂

    • Morten Rand-Hendriksen 8:00 am on November 18, 2015 Permalink | Log in to Reply

      Thank you Ryan. This is a sorely needed upgrade.

    • connectr 8:22 am on November 18, 2015 Permalink | Log in to Reply

      Ooh, goodie! It always baffled me why such an important tool received so little love. The GSoC project looked hopeful, but then not much came from it. Super duper happy it’s back on the radar 🙂

    • Mario Peshev 8:37 am on November 18, 2015 Permalink | Log in to Reply

      Thanks for the great work Ryan – given the state of the original Importer for years now, that’s incredible news.

    • Omaar Osmaan 8:38 am on November 18, 2015 Permalink | Log in to Reply

      So excited about this- 😀

    • dimitris33 9:42 am on November 18, 2015 Permalink | Log in to Reply

      Great news Thanks!

    • capuderg 10:10 am on November 18, 2015 Permalink | Log in to Reply

      Great!!! So excited! Thanks Ryan for working on this!

    • Tran Ngoc Tuan Anh 10:36 am on November 18, 2015 Permalink | Log in to Reply

      This is great news.

      I’m wondering about media import. Some themes use licensed images that the authors don’t want imported, so they are usually replaced by a dummy image as a placeholder. Can we do that (maybe with a filter) with this new importer?

      Also, would it help in importing theme mods?

      • Ryan McCue 10:39 am on November 18, 2015 Permalink | Log in to Reply

        You can filter `wxr_importer.pre_process.post`, which gets passed the full post data, and go ahead and change anything in there. 🙂

        I’m not sure what the current state of theme mods is in the export data, so I can’t speak to that, but file an issue and I can take a look. 🙂

    • Primoz Cigler 10:42 am on November 18, 2015 Permalink | Log in to Reply

      Wow, incredible! Just awesome Ryan. It is clear from responses here that this was a huge pain for many people.

      Something came to my mind when I was reading this, I am not sure if it will be relevant, but still worth mentioning: would it be possible to utilize the REST API and switch to JSON instead? I believe it would make for much cleaner approach, as I was always struggling with any XML files, while JSON is so human-readable? I believe the import/export would be possible without the XML file at all, only import WP calling the export WP over the JSON calls.

      I would love to hear the feedback on my thinking and if that is something that should be considered in the future when migrating content over the web, as it sounds like a much leaner approach 🙂

      • Ryan McCue 10:48 am on November 18, 2015 Permalink | Log in to Reply

        As the lead developer of the API, I’m conscious of this. 😉

        There’s a number of reasons it’s better to stick with WXR/XML right now:

        • PHP has a streaming XML parser built-in, but it doesn’t have a streaming JSON parser out of the box. This means any streaming parsing would need to take place in userland (i.e. a library), which I suspect would cause performance to be worse.
        • WXR is compatible with RSS, which means the export format can be imported by a lot of other tools, not just WP. Switching to our own format might have benefits to us, but it would likely harm the wider ecosystem.
        • XML isn’t really a problem most of the time. It’s not the greatest format ever invented, but it’s reasonably readable, and there’s tonnes of tooling around it.

        With that said, if somebody wanted to try it, it’d be interesting to see at the least. 🙂 The XML parsing in this new version of the importer is contained in a few small methods, so you could probably swap that out for JSON reasonably easily.

        • Primoz Cigler 10:52 am on November 18, 2015 Permalink | Log in to Reply

          Gotcha, thank you for your swift reply!

          I don’t promise anything, but I might actually try to implement the REST API way. I will give you heads-up if I do.

    • sonjanyc 11:23 am on November 18, 2015 Permalink | Log in to Reply

      This is amazing, Ryan!! You will make so many people so very happy with this, me included!! 🙂

    • Alvaro Gois dos Santos 11:28 am on November 18, 2015 Permalink | Log in to Reply

      Wow. Awesome @rmccue, thanks!

      Let’s try that importer…

    • Damian 12:55 pm on November 18, 2015 Permalink | Log in to Reply

      Thanks Ryan, this is a problem we have all faced at some point. My concern now is the exporter. Will it also be updated? The way the exporter currently works is primitive. E.g. you can’t export featured images if you just export posts; you need to export the whole content.

      Or if you want to export just taxonomies because you have lots of them and you want to duplicate them in a new site.

      Thanks again!

      • Ryan McCue 1:35 pm on November 18, 2015 Permalink | Log in to Reply

        Right now, I don’t have any plans to work on the exporter (unless there’s performance gains to be had for the importer there), however there’s a few tickets already on that: #27048 and #17379 are the two most relevant here I think.

        At the end of the day, these things just need someone to sit down and spend some time on them. 🙂

    • Rich Tabor 1:32 pm on November 18, 2015 Permalink | Log in to Reply

      Really glad this is happening!

    • Josh Eaton 2:57 pm on November 18, 2015 Permalink | Log in to Reply

      Wait, so you’re saying I no longer need 8GB of RAM in my local VM just to import a WXR file? 🎉

    • Steven Gliebe 3:21 pm on November 18, 2015 Permalink | Log in to Reply

      This is music to my ears. Thank you for working on this!

    • Ben Doherty (Oomph, Inc) 5:33 pm on November 18, 2015 Permalink | Log in to Reply

      This is great, I’m very excited to see this development. I wonder if parallelization is also something that’s been considered, as PHP is just SLOW when walking through large files, and I found that parallelization can greatly improve overall import times, especially when munging XML files. My quick elevator pitch is that the importer spins off M processes, 1..N, each of which imports every M+Nth post from the file in parallel, and then the post-cleanup operation (translating URLs, IDs, etc) runs when all of the processes have completed.
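      One way to read the split described above, sketched in Python: with M workers numbered from 0, worker N takes posts N, N + M, N + 2M, and so on. The function name `posts_for_worker` and the post IDs are made up, and the sketch only shows how posts would be divided, not how the processes would run or write to the database.

```python
def posts_for_worker(post_ids, worker, workers):
    """Return the posts handled by one worker out of `workers`.

    Worker n (counting from 0) takes the posts at positions n,
    n + workers, n + 2 * workers, and so on.
    """
    return post_ids[worker::workers]


post_ids = list(range(1, 11))  # ten posts in file order
shares = [posts_for_worker(post_ids, n, 3) for n in range(3)]
print(shares)
```

      Every post lands in exactly one share, so the cleanup pass (translating URLs, IDs, etc.) can run once after all shares are done.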

      • Ryan McCue 11:55 am on November 19, 2015 Permalink | Log in to Reply

        Parallelisation is definitely something in my mind, but given the complex way that the object cache and database interact in concurrent requests (i.e. concurrent saves break basically everything in WP), not something I’ve put a huge amount of time into. 🙂

        I’d love to see someone have a go at it though! The new importer should let you build this as a plugin on top reasonably easily, so if you want to try it… 😉

    • pathartl 6:12 pm on November 18, 2015 Permalink | Log in to Reply

      A much needed update. I opened up #30887 a while ago when I first noticed the GUID issue, in case you wanted another ticket to close :). Great work as always.

    • Shmoo 2:26 am on November 19, 2015 Permalink | Log in to Reply

      Ryan should get a WordPress company sports car for doing this! 🚗 💨 vroom vroom..

      Can’t wait to test this.. Did you the WP Importer yesterday -10MB file #drama

    • lepardman 9:00 am on November 19, 2015 Permalink | Log in to Reply

      Awesome! Just a thought: it would be nice to have the ability to remap post types (maybe even taxonomies?) on import.

      Example: yesterday I was trying to import posts from an old site to a new one I’m working on. The site was so old that the developers used posts as a way to store the clients products and their actual news posts were just added as text to a huge list in the editor on a page. Some years after, a new developer added a new post type called news since posts were “taken”. So I figured that I at least could do an export of the latest news and the products. Then import them and switch post types, news as post and the products (old post type post) into a new post type called products. But that wasn’t possible, I just got an error that the post type news didn’t exist and the import failed.

      TL;DR: I couldn’t remap posts on import to other (nowadays more correct) post types. A nice way would be to be able to assign them to post type(s) just as you can change author (I guess a bulk option would be needed). I’m unsure if this would be possible but I’m throwing out the idea and then let’s hear what you guys say.

      (I guess I could do a search and replace in the xml files or create the correspondent post type temporarily, import and then do a search and replace in db but it would be so much nicer to have this built in)

      Thank you!

    • chzumbrunnen 3:14 pm on November 19, 2015 Permalink | Log in to Reply

      Wow, Ryan, that sounds great. If I understand you right, my biggest question is already somehow fixed.
      Normally for an import the old site has to be still reachable to get the images. Now with the new importer it should be possible to import images manually. But I’d like to go one step further and be able to enter the path to the images folder from where the images should be imported.

      • Ryan McCue 12:36 am on November 20, 2015 Permalink | Log in to Reply

        The tooling around this stuff will need some time to develop and mature, but I’d also love to have that eventually 🙂 Right now, should be technically possible, but just a pain to do so.

    • DoodleDogCody 3:56 pm on November 19, 2015 Permalink | Log in to Reply

      In regards to Media. I have noticed that with the current importer media items are not imported unless we select export all from the exporter. This seems a little odd to me. Media items should be imported if they exist in a post, page, or whatever post type that we export, as media items are not thought of as a separate post type by most people. I know that’s how they are stored in the database, but if I am building a new site for a client and need to export and import only their posts on the current site, I want to click export post and choose the date range of all. Then when I import I expect the images to import as well.

    • Ella Iseulde Van Dorpe 5:23 pm on November 19, 2015 Permalink | Log in to Reply

      ❀️

    • Samuel Wood (Otto) 12:33 am on November 20, 2015 Permalink | Log in to Reply

      Awesome. I’m gonna steal your XML code. 🙂

    • Michael Ecklund 8:02 pm on November 20, 2015 Permalink | Log in to Reply

      Thank you for taking the time to update the importer. It’s been begging for attention for quite some time.

    • RobbyMcCullough 7:42 pm on November 25, 2015 Permalink | Log in to Reply

      This is great news. Thanks, Ryan. Really excited to hear about this project.

    • Caleb Evans 1:06 am on November 29, 2015 Permalink | Log in to Reply

      Hey, @rmccue, very happy to hear that the Importer is getting some much-needed love. Do you have any plans to get the Importer merged into Core, or will it remain a separate plugin? I find it somewhat counterintuitive that WordPress can export its own content out of the box, but cannot import that exported content without a separate plugin. It’s not too big of a deal (just one less plugin to install for new installations), but it’s certainly something that would be nice to have, and it seems like the sensible thing to do.

    • pipdig 7:04 pm on November 29, 2015 Permalink | Log in to Reply

      Ryan, you have been added to my Christmas card list!

    • Caleb Evans 4:23 am on November 30, 2015 Permalink | Log in to Reply

      Hi, @rmccue. This all looks great, and it has me very excited. Out of curiosity, do you have any plans on trying to get this merged into Core? I find it strange that WordPress can export its own content, yet it cannot import that same exported content without a plugin. Thoughts?

      • Ryan McCue 4:27 am on November 30, 2015 Permalink | Log in to Reply

        No plans right now to integrate this into core. I think it makes more sense as a plugin personally. The export being built-in is fundamentally about your data being portable and free to access, whereas the importer doesn’t really affect data portability or freedom. Plus, having it as a plugin means we can improve it faster πŸ™‚

        (That said, others may disagree, so might be something to discuss further!)

    • Tevya 7:19 pm on January 13, 2016 Permalink | Log in to Reply

      So what’s the status on this currently? Is it ready for general use? We’ve been having the old importer fail to bring over images and media, even though the box is checked. Hoping this can save us.

    • helmar 6:34 am on February 13, 2016 Permalink | Log in to Reply

      Sounds super promising. Hope to read more about this soon, even have something to work with. If financial support is needed, I’m in.

    • birdrockdesigns 9:39 am on March 9, 2016 Permalink | Log in to Reply

      I’m really, really hoping you will be successful with the repo.

  • Frederick Ding 5:16 am on August 6, 2013 Permalink
    Tags: importers

    Migration update: try this importer 

    Hey everyone,

    The importer is largely unchanged from last week, with the exception of a few UI changes:

    • #341: Progress for posts/attachments/menu items is now shown correctly (in %d of %d format)
    • #342: The debug view (showing the raw data) now uses print_r through a special chars filter
    • #340: UI now has full-sentence strings to communicate how the process works and when the import is done, and Refresh/Abort buttons are shown above and below the progress.

    An import of a WordPress WXR file in progress

    A completed WordPress import

    I’ve also had the chance to run it against a large number of import files, including ones sent to me by generous volunteers who read some of my previous weekly updates (props @tieptoep). No catastrophes, yet!

    Obviously, it’s still a work in progress, but I’m now willing to risk a public beta. The usual disclaimers (please don’t use this on a production site) apply.

    Although I’m not aware of any other plugins that build on the WXR importer through its API, I nevertheless generated some PHPDoc API documentation using phpDocumentor 2, which might be handy if you decide to hook into this or reuse its components.

    I’d love to hear your feedback on the interface, on the general experience using this importer, and any errors or warnings that you encounter. Thanks!

     
    • thisislawatts 9:45 am on August 7, 2013 Permalink | Log in to Reply

      Really excited about this plugin; its development seems to coincide with my needing to transfer 6 WordPress blogs (~600 posts each) into 2. So you can only imagine my joy!

      I have downloaded the beta above, but I don’t seem to get anything beyond the initial ‘Import Status’, which shows how many posts it’s got to import; then clicking ‘Refresh’ I get -> http://thisis.la/pix/of-course.png. I am not sure what it’s doing here. I am running this on XAMPP so could it be an issue with the Cron not running? If you could point me towards a useful debugging point I will tell you every

      Then a second thought on the UI: these blogs that I am importing have a stack of users, which typically result in -> http://thisis.la/pix/wordpress-import-assign.png. And that is super boring to fill in, so I created a quick JS snippet to go through and autofill those values. https://gist.github.com/thisislawatts/6163831. Thinking it would be great if those usernames wrapped in ()’s were also wrapped in a span to spare my messy regex.

      Thanks

      • Frederick Ding 7:13 pm on August 7, 2013 Permalink | Log in to Reply

        Thanks for trying it out! I expected there to be some problems, so let’s see if we can figure out what’s going on 🙂

        Since you mentioned using XAMPP, I imagine it to be this scenario: On local installations (especially with hostnames that do not resolve in DNS), cron theoretically falls back to “alternative cron” mode. I haven’t yet tested with alternative cron; most likely it’s redirecting the admin user right after clicking on “Refresh”, and that fails to pass on the nonce. I think this is what’s happening. Let me try it out on a local install and see what happens.

        The JavaScript for prefilling users is a good idea, although it can also be done server-side. (It currently uses wp_dropdown_users(), which can pre-select a user.) Aside from this, I’d be fine with including the usernames in a data-* attribute; it’d make this a lot easier.

        Edit: I’ve added a new Trac ticket for this enhancement in GSoC #346; it’s related to previous requests that were punted (#8455, #16148).

    • TCBarrett 2:21 pm on August 14, 2013 Permalink | Log in to Reply

      This looks great. One of the problems with exporting is that you cannot export a single post type with attachments, it has to be all or nothing. Does this solve that problem? Or is there an equivalent project?

  • Frederick Ding 3:33 am on July 30, 2013 Permalink
    Tags: importers

    Migration update: cron importer part 2 

    Hey everybody — I have good news and bad news.

    Good news: I’ve finished porting all the individual import steps to the cron model and now have a mostly working frontend UI (largely unchanged from the previous iteration of the importer) that utilizes it.

    As of this evening, the cron model is able to parse, process, and finish importing two test XML files from the core unit tests (valid-wxr-1.1.xml and small-export.xml). The test case, which uses exactly the same assertions as the core unit test, passes all 193 assertions. (Update: an errorless import of the wptest.io sample data has been done.)

    WordPress cron import in progress

    A completed cron import

    Bad news: I wanted to tag a version and release a download today, but I’ve decided not to do so due to the unconfirmed stability of the importer. As some astute observers noted last week, storing the temporary data in the options table can blow out caches. Although I’ve attempted to mitigate this (see [2180] and this reference from a few years back on Core Trac), I still need to test this against some real systems before I release it and break your server.

    Those who are very interested can always check out the source in Subversion. I will post a comment under this post if a download is made available before my next weekly update.

    Although an overhaul of the XML parser, as suggested in the comments on last week’s post, is probably necessary to avoid memory and caching issues, my first priority was to finish the migration of processes to the cron tasks. As soon as I can post a working importer, I will immediately turn my attention to the XML parsing step.

     
  • Frederick Ding 12:03 am on July 23, 2013 Permalink
    Tags: importers

    Migration update: cron importer 

    Following last week’s update about the WP_Importer_Cron approach to writing importers and running import jobs, I’ve been steadily transitioning code from the current non-stateful, single-execution plugin to a stateful, step-wise process (#327).

    At the same time, I needed to separate presentation from logic/backend processing (#331) — something that @otto42 also recommended — in two ways:

    • Removing direct printf(), echo statements that were used by the WXR importer (example)
      and changing them to WP_Error objects (example of fatal error; of non-fatal warning)
    • Handling uploads and UI choices in a separate class

    Why must this be done now? Well, asynchronous tasks differ from PHP scripts directly responding to a browser request — we can’t depend on having access to submitted $_POST data, nor can we directly pipe output to the user. This change would also make it easier to understand what the code is doing from reading it, and to test programmatically.

    One dilemma I’ve encountered: how best to store the parsed import XML file. Since each step of the import (users, categories, plugins, etc) runs separately, we must…

    1. store all of the parsed data in variables, which are serialized into an option between runs
      (obviously, a huge amount of data for which this may not be the most robust or efficient method);
    2. re-parse the XML on each run
      (currently, parsers handle all parts of the XML at once, which means unnecessarily duplicated effort and time);
    3. modify the parsers to parse only part of the XML at a time; or
    4. split the XML file into chunks based on their contents (authors, categories, etc) and then feed only partial chunks to the parser at a time.

    Any thoughts? Solving this problem could also help the plugin deal with large XML files that we used to need to break up by hand before importing. (The Tumblr importer doesn’t have the same problem because there is no massive amount of data being uploaded at the beginning.)

    I haven’t yet finished transitioning all the steps; I’m afraid it won’t be possible to use this just yet. Before next Monday, I should have a downloadable plugin that’s safe to try.

     
    • dllh 1:34 am on July 23, 2013 Permalink

      You’ll want to be careful about trying to store data in options. For example, in an environment that’s using memcached, which has a size limit, you can blow up sites by trying to store too much data in options, so it’s not necessarily a matter only of efficiency.

      Also, if you use options, I imagine you also have to use something like an extra tracking option to know which parts of the import you’ve handled. This is just begging for race conditions.

      I wonder if there’s anything you can do using FormData to split the file into chunks client-side and assemble them server-side. I’m not sure what browser support is like, and I worry it’d be pretty brittle. Just thinking aloud.

      • Frederick Ding 1:44 am on July 23, 2013 Permalink

        You’re completely right about the caching implications — I’ve all but eliminated possibility #1.

        I was thinking about the possibilities of client-side file splitting, too, even though it’s not the most robust or reliable way. Last week, I looked briefly at https://github.com/blueimp/jQuery-File-Upload which supports chunking/multipart — but that’s not quite what we’d need, is it? The browser would need to operate along XML node boundaries, not just file size. (I’d be intrigued if it’s possible to utilize a browser’s native DOM engine to do that…)

    • Ryan McCue 1:50 am on July 23, 2013 Permalink

      So, with regards to XML parsing options: splitting the XML file is something that absolutely should not be done. You can do it, but only if you use proper XML serialization/deserialization, which is likely going to take up a chunk of time.

      Reparsing is a bit of a pain too, since the XML parsing usually takes up the largest amount of time there. The best bet with regards to memory is to use a SAX-style parser which streams the XML, whereas a DOM-style parser will read it all at once. SimpleXML is a DOM-style parser, so you should avoid that for performance, whereas xml_parse is SAX-based.
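
For readers unfamiliar with the distinction, here is a small SAX-style sketch using Python's `xml.sax` (standing in for PHP's `xml_parse`; the `TagCounter` class and the sample document are made up for illustration). The parser calls back once per element as it reads, and no document tree is ever held in memory.

```python
import io
import xml.sax
from collections import Counter


class TagCounter(xml.sax.ContentHandler):
    """Receives one callback per element; no document tree is built."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Called as soon as the opening tag has been read from the stream.
        self.counts[name] += 1


SAMPLE = b"""<rss><channel>
  <item><title>Hello world</title></item>
  <item><title>Second post</title></item>
</channel></rss>"""

handler = TagCounter()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.counts["item"], handler.counts["title"])
```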

      I’m biased as the lead developer of it, but you could use SimplePie, which a) is built into core, b) has built-in caching (via the transients API, but you’d probably want file-based caching here) and c) natively parses RSS (given that’s its job), which WXR is based on. It handles picking a parser internally, although it doesn’t use a stream parser due to internal implementation details, so you may want to steer clear of it for that reason. I’m relatively certain I could write a parser (in `WXR_Parser_*` form) in a few hours, so it should be easy to do.

      At the least, I’d take a look into how SimplePie handles caching. Basically, it’s a giant serialized array, usually saved to a file (although in WP, this uses transients).

      (If you do want to take the full SimplePie route, I’d love to help, most likely in a consulting role. It’d simplify the `WXR_Parser` classes significantly by moving the XML parsing out of those.)

      I’ve spent a fair bit of time messing with XML parsers, so feel free to pick my brain on this! 🙂

      • Ryan McCue 1:55 am on July 23, 2013 Permalink

        (What you really want here is a resumable SAX/pull parser, with some way to persist state across processes. With `xml_parse`, you should be able to feed in data as you stream it in, but you’ll still need to keep the data in memory since I don’t know of any streamable serialization in PHP. Please note that if you do use your own code here, there are security considerations that will need to be discussed. Your mentors should be able to help with that.)
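
The "feed in data as you stream it in" part can be sketched as follows, using Python's `xml.etree.ElementTree.XMLPullParser` as an analogue of feeding chunks to `xml_parse`. The chunk size and sample document are arbitrary, and this only shows incremental feeding inside a single process; persisting parser state across processes is not addressed here.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<rss><channel>"
    "<item><title>Hello world</title></item>"
    "<item><title>Second post</title></item>"
    "</channel></rss>"
)


def chunks(text, size):
    """Simulate reading a large file a few characters at a time."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


parser = ET.XMLPullParser(events=("end",))
titles = []
for chunk in chunks(SAMPLE, 16):
    parser.feed(chunk)  # hand the parser whatever has arrived so far
    for _event, elem in parser.read_events():
        if elem.tag == "item":
            # A complete <item> is available even though the file is not.
            titles.append(elem.findtext("title"))
            elem.clear()
parser.close()
print(titles)
```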

      • Frederick Ding 2:51 am on July 23, 2013 Permalink

        I should have remembered that you’re an expert on XML!

        I’m not familiar with the differences between SAX/pull/DOM, but if I’m reading this right, then pull — with a way to persist state — would be the most appropriate model to follow. (In PHP, XMLReader appears to be a pull parser that’s enabled by default since PHP 5.1.2 — so I’d just need to write a WXR_Parser_* to take advantage of it.)

        Thanks for pointing me in this direction!

        Edit: Actually, scratch my eagerness to use XMLReader. It appears to be weakly documented and most of what I can see involves using SimpleXML/DOM on nodes…

        • Ryan McCue 6:43 am on July 23, 2013 Permalink

          XMLReader or SAX are the ones I’d go for. There’s an IBM DeveloperWorks article that might help with that.

          The way I’d handle it is to keep state of where you currently are in the stack. WXR_Parser_XML uses this concept to handle it, but uses the SAX parser instead. Conceptually, the way it keeps track of the position is the way you’d want to do it (although there’s a bunch of cases it doesn’t handle).
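
A minimal sketch of that stack idea, written with Python's `xml.sax` rather than the PHP of WXR_Parser_XML (the `PathTracker` class and the paths it matches are illustrative assumptions, not the plugin's code). The stack is what lets the handler tell the site's `<title>` apart from a post's `<title>`.

```python
import io
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Keeps a stack of open elements so each text node knows where it is."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.site_title = ""
        self.post_titles = []
        self._text = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        path = "/".join(self.stack)
        text = "".join(self._text).strip()
        # Same tag name, different meaning depending on position.
        if path == "rss/channel/title":
            self.site_title = text
        elif path == "rss/channel/item/title":
            self.post_titles.append(text)
        self.stack.pop()
        self._text = []


SAMPLE = b"""<rss><channel>
  <title>My Site</title>
  <item><title>Hello world</title></item>
</channel></rss>"""

tracker = PathTracker()
xml.sax.parse(io.BytesIO(SAMPLE), tracker)
print(tracker.site_title, tracker.post_titles)
```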

          WXR_Parser_XML isn’t the best though, since all this data is loaded into memory and then pushed to the database later. Although it means you have cross dependencies, I’d consider inserting posts/etc as you go, rather than bunching them all up. This is a pretty fundamental rework of the internal parsing API, but it’s one that you’ll need for this sort of thing.

          Personally, I’d create two objects (a parser and a data handler) and use dependency injection to ensure that you still have a weak coupling between the two.
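
Roughly, the two-object split could look like the sketch below (Python; `StreamingParser`, `DataHandler` and `InMemoryHandler` are hypothetical names, not classes from the plugin). The parser knows only about XML and passes each finished post to whatever handler it was given, so posts are handled as they are read rather than collected first.

```python
import io
import xml.etree.ElementTree as ET
from typing import Protocol


class DataHandler(Protocol):
    def handle_post(self, post: dict) -> None: ...


class StreamingParser:
    """Knows about XML only; hands each finished post to the handler."""

    def __init__(self, handler: DataHandler):
        self.handler = handler  # injected, so it can be swapped in tests

    def parse(self, source) -> None:
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == "item":
                self.handler.handle_post({"title": elem.findtext("title")})
                elem.clear()


class InMemoryHandler:
    """Test double; a real handler would insert into the database instead."""

    def __init__(self):
        self.posts = []

    def handle_post(self, post: dict) -> None:
        self.posts.append(post)


SAMPLE = b"""<rss><channel>
  <item><title>Hello world</title></item>
  <item><title>Second post</title></item>
</channel></rss>"""

handler = InMemoryHandler()
StreamingParser(handler).parse(io.BytesIO(SAMPLE))
print(handler.posts)
```

Swapping `InMemoryHandler` for a database-backed handler would not require touching the parser, which is the point of the weak coupling.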

          (As a note, SimplePie also uses the SAX parser, but loads all the data into memory because it has to. This is a case where it’s a much better idea to use XMLReader directly.)

          Regarding XMLReader documentation, are there any specific examples? I’m happy to assist with specifics here, ping me at my email (see footer) if you’d like.

        • Ryan McCue 6:51 am on July 23, 2013 Permalink

          Also, the reason people are using XMLReader with SimpleXML/DOM is because they prefer the SimpleXML/DOM API, and the performance using the hybrid is better than straight SimpleXML. Personally, I’d stick with the one rather than switching, because there are some performance concerns with that.

  • Jen 11:00 pm on December 16, 2012 Permalink
    Tags: importers, twitter   

    Antsy for 3.6 to start and need a project? Who wants to make an official importer for the new Twitter archives? Would think we’ll want to add that into the importers list. Would suggest importer auto-assign “status” or “aside” post format (or make it an option in the plugin to choose format). Who’s in? I volunteer to ux review and test. 🙂
    http://thenextweb.com/twitter/2012/12/16/twitter-has-started-rolling-out-the-option-to-download-all-your-tweets/

     
    • Aaron Brazell 11:05 pm on December 16, 2012 Permalink

      I already was planning on doing this as a plugin, and I’ve been quiet for awhile. I can do this. But… I need to have the archives available to me, and my account doesn’t have it yet.

      • Jane Wells 11:26 pm on December 16, 2012 Permalink

        Mine either, but I’ll see if I can wrangle one we can use.

      • Ryan Duff 11:26 pm on December 16, 2012 Permalink

        Or as soon as someone gets and volunteers a copy of their archive. From the post it doesn’t seem there’s an api, but a set of html pages + xml, or json files (pick your poison)

        Also, from what I’ve heard the files are monthly, so if you’ve been on Twitter for 4 years you’d be looking at 48 json files to handle w/ an importer.

        Of course that’s all based off what I’ve read. I don’t have it yet. Just something to chew on.

        • Aaron Brazell 11:32 pm on December 16, 2012 Permalink

          Yeah it’ll be an interesting challenge but I wanna see what the data looks like first. If they turn it on for me, I’ve got 6 years of archives which would be a good stress test too.

          • Andrew Nacin 11:33 pm on December 16, 2012 Permalink

            Could handle the zip that they send you as-is. In fact, I imagine that would be the best approach for users.

            • Aaron Brazell 2:54 pm on December 17, 2012 Permalink

              I feel like we need to be able to parse the HTML, CSV and JSON in any zip file if we approach it that way. From the user perspective, I think you’re right but I sure hope the CSV and HTML are decent enough.

        • Samuel Wood (Otto) 11:38 pm on December 16, 2012 Permalink

          It’ll just be a matter of iterating the json. Simple stupid, for the most part.

    • Samuel Wood (Otto) 11:06 pm on December 16, 2012 Permalink

      If they actually roll it out to people unchanged, then it should be fairly trivial. When I get it on my account, I’ll let you know.

    • Phil Erb 11:25 pm on December 16, 2012 Permalink

      When archives are available to me, I’d love to help test.

    • Andrew Nacin 11:36 pm on December 16, 2012 Permalink

      We now have an API for importers, which means we can add this to wp-admin/import.php the moment it is done.

      When anyone gets access to their zip, please share it (or at least a sample so we can learn the format).

    • Aaron D. Campbell 12:29 am on December 17, 2012 Permalink

      Looks like my account has the option. I’m requesting an archive and will post it somewhere to use.

    • Myatu 6:53 am on December 17, 2012 Permalink

      Wouldn’t it be more sensible to have this as a 3rd party plugin, rather than having to maintain more bloat?

      • Peter Westwood 10:40 am on December 17, 2012 Permalink

        It will be a plugin anyway not in the core download – all the importers are plugins now.

        • Myatu 6:21 am on December 18, 2012 Permalink

          Actually forgot about that! Shows how often I’ve used that feature. I stand corrected 🙂

      • Jane Wells 11:39 am on December 17, 2012 Permalink

        As @westi states, all importers are plugins, not core code. I didn’t specify that in my post, since I took it as a given that core developers know that.

    • Simon Wheatley 8:34 am on December 17, 2012 Permalink

      Note that the current Twitter IDs overflow (?) and become corrupted if you convert them to integers on a 32-bit system. That one has got me before. (Apologies if that’s teaching everyone to suck eggs.)

      Also, would you mind putting in a filter for the post data before save… I’d prefer to store tweets in a custom post type. Ta! 🙂
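
Both points can be sketched together in Python (illustrative only; not the importer's code). The sample record is made up, and it assumes the archive carries the ID as both a number (`id`) and a string (`id_str`), as Twitter's API JSON does. `as_int32` shows generic 32-bit wraparound rather than PHP's exact behaviour, and `tweet_to_post` with its `post_filter` argument is a hypothetical stand-in for a filter hook.

```python
import ctypes
import json

# Hypothetical tweet record; assumes "id" and "id_str" fields are present.
RAW = '[{"id": 280433276342603776, "id_str": "280433276342603776", "text": "Hello"}]'


def as_int32(value):
    """Show what survives when a value is squeezed into 32 signed bits."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value


def tweet_to_post(tweet, post_filter=None):
    """Map a tweet to post data, keeping the ID as a string throughout."""
    post = {
        "post_type": "post",
        "post_content": tweet["text"],
        "tweet_id": tweet["id_str"],  # never cast to an integer
    }
    # Optional hook so callers can change the data before it is saved.
    return post_filter(post) if post_filter else post


tweet = json.loads(RAW)[0]
# The truncated value no longer matches the original ID.
print(as_int32(tweet["id"]) == tweet["id"])

# A caller-supplied filter can redirect tweets to a custom post type.
post = tweet_to_post(tweet, lambda p: {**p, "post_type": "tweet"})
print(post["tweet_id"], post["post_type"])
```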

    • Andrew Nacin 6:49 pm on December 17, 2012 Permalink

      I’ve outlined a potential plan for such an importer on a Trac ticket: https://core.trac.wordpress.org/ticket/22981. If you want to continue to discuss the idea, feel free to do so here. Implementation can occur on the ticket. (This is a plugin, but an official importer is also a core priority, hence the use of core trac.)

      I’ve also uploaded a tweet archive contributed by @chadhuber to the ticket. It does contain sane json.

    • Beau Lebens 9:23 pm on December 17, 2012 Permalink

      In case it helps, I already made one, packaged in here: https://wordpress.org/extend/plugins/keyring-social-importers/

  • Andrew Nacin 11:14 pm on November 21, 2012 Permalink
    Tags: importers   

    If you have a Tumblr blog, can you help test an updated version of the importer? It uses their OAuth API, which requires you to create an application. It’s simple and the plugin walks you through it. Here’s the ZIP file to the beta version. You can report bugs on #22422.

    Edit: The plugin has been released, the beta is over: https://wordpress.org/extend/plugins/tumblr-importer/

     