Using Playground for data liberation, site synchronization, and building streaming parsers

The contributors on the Playground team are increasing their focus on building data migration tools to help the Data Liberation project and unlock powerful new use cases for WordPress. The latest updates to Playground included a key prerequisite: the ability to manage multiple Playground instances, since a site transfer requires two WordPress sites.

Why work on data tools at all?

Research shows that there is no reliable, free, open source solution for:

  • Content import and export
  • Site import and export
  • Site transfer (including bulk transfers, e.g., Tumblr -> WordPress or Weebly -> WordPress)

To successfully execute site-to-site synchronization, WordPress core would need reliable data migration tools. Yes, there is the WXR content export. However, it doesn’t work for backing up a photography blog full of media files, plugins, API integrations, and custom tables. There are paid products available, but nothing in core.

Many Playground use-cases are about moving your data. To name just a few:

Why is this so hard?

Moving data is a complex problem. Consider migrating links. Imagine you’re moving a site from https://my-old-site.com to https://my-new-site.com/blog/. If you just moved the posts, all the links would still point to the old domain, so you’ll need an importer that can adjust all the URLs in your entire database. The typical tools like preg_replace or wp_search_replace can only replace some URLs correctly. They won’t reliably adjust deeply encoded data, such as this URL inside JSON inside an HTML comment inside a WXR export:

<wp:content>
    <!-- wp:image {"src": "https://\ud83d\ude80-old-site.com/%73ite/"} -->
</wp:content>

A reliable replacement operation requires parsing each data format carefully and replacing the relevant parts of the URL. Four parsers would be needed:

  • an XML parser, 
  • an HTML parser, 
  • a JSON parser, 
  • a WHATWG URL parser. 
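To make the layering concrete, here is a minimal Python sketch that unwraps the block comment from the example above one format at a time. It is an illustration under stated assumptions, not the team's importer: the old path `/site/`, the `NEW_BASE` target, and the `rewrite_block_comment` helper are hypothetical, a regular expression stands in for a real HTML parser, and Python's `urllib.parse` is not a WHATWG URL parser.

```python
import json
import re
from urllib.parse import unquote, urlsplit

# Hypothetical migration: everything under the old base moves to NEW_BASE.
OLD_HOST = "\U0001F680-old-site.com"  # the rocket emoji host from the example
OLD_PATH = "/site/"
NEW_BASE = "https://my-new-site.com/blog/"

# The block comment as it appears in the export: the host is JSON-escaped
# and part of the path is percent-encoded.
markup = r'<!-- wp:image {"src": "https://\ud83d\ude80-old-site.com/%73ite/"} -->'

# Naive approach: a plain substring replace never sees the old URL,
# because the raw text contains neither the emoji nor the word "site".
naive_result = markup.replace("https://" + OLD_HOST + OLD_PATH, NEW_BASE)


def rewrite_block_comment(comment: str) -> str:
    """Decode each layer, rewrite the URL, then re-encode (simplified)."""
    # Layer 1: extract the JSON attributes from the block comment.
    # A regex stands in for a real HTML parser in this sketch.
    match = re.fullmatch(r"<!-- (wp:\S+) (\{.*\}) -->", comment)
    if match is None:
        return comment
    block_name, raw_json = match.groups()

    # Layer 2: JSON decoding turns \ud83d\ude80 into the actual character.
    attrs = json.loads(raw_json)

    # Layer 3: URL parsing; percent-decoding turns %73ite into "site".
    url = urlsplit(attrs["src"])
    path = unquote(url.path)
    if url.hostname == OLD_HOST and path.startswith(OLD_PATH):
        # Simplification: query strings and fragments are dropped, and the
        # remaining path is not percent-encoded again.
        attrs["src"] = NEW_BASE + path[len(OLD_PATH):]

    return f"<!-- {block_name} {json.dumps(attrs)} -->"
```

Here `naive_result` is identical to `markup`, while `rewrite_block_comment(markup)` returns the comment with `src` pointing at the new base. Python's standard library is used purely for illustration; the point is that each encoding layer has to be decoded by a parser that understands it before the URL can be compared at all.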

Most of those tools are not available. PHP provides json_encode(), which isn’t free of issues, and that’s it. You can’t even rely on DOMDocument to parse XML because of its limited availability and non-streaming nature.

What needs to be built?

After about eight months of research, and many years of dealing with site transfer problems, using various WordPress synchronization plugins, making past attempts at data importing, and testing non-WordPress tools, @dmsnell and @zieladam came up with the following requirements:

  • Any parser would need to be streaming, re-entrant, fast, standards-compliant, and tested using a large body of possible inputs.
  • The data synchronization tools must account for data conflicts, WordPress plugins, invalid inputs, and unexpected power outages.
  • Errors must be non-fatal and retriable, and must allow manual resolution by the user. No data loss, ever.
  • The transfer target site should be usable as early as possible and show no broken links or images during the transfer.
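As a rough illustration of what "streaming" and "re-entrant" mean in practice, the Python sketch below splits a byte stream into newline-terminated records one chunk at a time and can serialize its state so that work resumes after an interruption. The `RecordScanner` and `ScannerState` names and the newline-delimited format are assumptions made for this example; they are not part of Playground or of any of the parsers mentioned here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScannerState:
    bytes_consumed: int = 0  # how much of the input stream has been fed so far
    pending: str = ""        # hex of bytes that do not yet form a full record


class RecordScanner:
    """Splits a byte stream into newline-terminated records, chunk by chunk."""

    def __init__(self, state=None):
        self.state = state or ScannerState()

    def feed(self, chunk: bytes) -> list:
        # Prepend whatever was left over from the previous chunk.
        buffer = bytes.fromhex(self.state.pending) + chunk
        # Everything before the last newline is complete; the tail is not.
        *complete, tail = buffer.split(b"\n")
        self.state.bytes_consumed += len(chunk)
        self.state.pending = tail.hex()
        # Only complete records are decoded, so a multi-byte UTF-8 character
        # split across two chunks is never decoded halfway.
        return [record.decode("utf-8") for record in complete]

    def pause(self) -> str:
        # The whole parser state fits in a small JSON document that could be
        # stored anywhere, for example in a database row.
        return json.dumps(asdict(self.state))

    @classmethod
    def resume(cls, saved: str) -> "RecordScanner":
        return cls(ScannerState(**json.loads(saved)))
```

A caller can feed one chunk, call `pause()`, discard the object, and later call `RecordScanner.resume(saved)` to continue with the next chunk from offset `bytes_consumed`. Memory use stays bounded by the size of one chunk plus one incomplete record, which is the property a non-streaming parser cannot offer.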

Prototypes or early drafts now exist for the following streaming libraries:

Moreover, WordPress core now has an HTML parser, and @dmsnell has been exploring a UTF-8 decoder that would enable fast and regex-less URL detection in long data streams.

There are still technical challenges to figure out, such as how to pause and resume the data streaming. As this work progresses, you’ll start seeing incremental improvements in Playground. One possible roadmap is shipping a reliable content importer, then a reliable site zip importer and exporter, then cloning a site, and then extending towards full-featured site transfers and synchronization.

Advantages of building it into Playground

Playground covers a lot of ground:

  • User-centric environment. The need to move data around comes naturally in Playground. Many people have asked for reliable WXR imports, site exports, synchronization with git, and the ability to share their Playground. Playground lets active users give feedback every step of the way.
  • Quality Assurance. Playground has flows in place to report problems and share reproduction links, making it a perfect environment for collaboration.
  • Space to mature the API. Playground doesn’t provide the same backward compatibility guarantees as WordPress core, so the team can prototype a parser, find a use case where the design breaks down, and start over.
  • Control over the runtime. With Playground, the team can lean on PHP extensions to validate ideas, test them on simulated slow hardware, and ship them to a tablet to see how they do when the app goes into the background and the internet is flaky.

Extending Playground methodically allows for building the spec-compliant, solid foundation WordPress needs.

What’s next?

The work is structured to ship a progression of meaningful user flows. For example, the Try WordPress extension guides you through extracting content from any website into a Playground site, and will enable WordPress plugins to receive and process the types of content specific to them. This gives you a choice in the outcome of your liberated data.

Furthermore, the team will take the necessary time to build rigorous and reliable software. An early version of this or that parser might ship once the architecture is solid. The goal is a solid design that will serve WordPress for years.

Progress will be communicated in the open to maintain feedback loops, and the team will keep using this work to ship new Playground features. You can read more detail in the GitHub kickoff post, Kickoff Data Liberation: Let’s Build WordPress-first Data Migration Tools, and follow along on the Data Liberation Tracking issue.

Props to @bph and @akirk for review.

+make.wordpress.org/core/

#dataliberation