Whitelisted WordCamp Production Data for Dev Environments

Right now WordCamp devs use a small subset of the production database that was manually created, because it wouldn’t be safe to keep copies of the production database in local environments.

That works well enough for most things, but we keep running into situations where reproducing bugs and testing fixes is much harder, and takes much longer, than it would if we had real-world data to work with.

So, I’d like to create a way to safely use a whitelisted copy of production data in local environments. Here’s how I envision it working:

- Create a script that runs on the production web server once a day
  - It would create a copy of the primary database on the production database server
  - Then run lots of SQL commands against that copy in order to redact anything that hasn’t been whitelisted
- Have another script in dev environments that uses sftp to download a copy of the whitelisted database once a day

The whitelist would contain a list of tables, columns, and keys that have been determined to not have any sensitive data. For example:

- wp_users – The table itself would be whitelisted, but only the ID, user_login, user_nicename, user_registered, user_status, display_name, spam, and deleted fields would be whitelisted. Because user_pass, user_email, and user_activation_key would not be in the whitelist, the script would replace the contents of those columns with [redacted] (or in the case of user_email, redacted@example.org).
- wp_usermeta – The table itself would be whitelisted, along with the umeta_id, user_id, meta_key, and meta_value columns, but only certain meta_key rows would be whitelisted. For instance, first_name, last_name, description, and wp_capabilities would be whitelisted, but session_tokens and wordcamp-qbo-oauth would not be.
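
To make the whitelist idea concrete, here is a minimal Python sketch of how the two examples above might be expressed and applied to a single row. The `WHITELIST` structure and the `redact_row` helper are hypothetical names made up for illustration; the real script would more likely run SQL against the database copy. The sketch also assumes that a `wp_usermeta` row with a non-whitelisted `meta_key` keeps its row but has its `meta_value` redacted, and that tables missing from the whitelist are dropped entirely, neither of which is decided above.

```python
# Hypothetical whitelist structure, for illustration only.
WHITELIST = {
    "wp_users": {
        "columns": {
            "ID", "user_login", "user_nicename", "user_registered",
            "user_status", "display_name", "spam", "deleted",
        },
    },
    "wp_usermeta": {
        "columns": {"umeta_id", "user_id", "meta_key", "meta_value"},
        # Only rows with these meta_key values keep their meta_value.
        "meta_keys": {"first_name", "last_name", "description", "wp_capabilities"},
    },
}

REDACTED = "[redacted]"
REDACTED_EMAIL = "redacted@example.org"


def redact_row(table, row):
    """Return a redacted copy of `row`, or None if `table` is not whitelisted."""
    rules = WHITELIST.get(table)
    if rules is None:
        # Assumption: tables that are not whitelisted are omitted entirely.
        return None

    clean = {}
    for column, value in row.items():
        if column in rules["columns"]:
            clean[column] = value
        elif column == "user_email":
            clean[column] = REDACTED_EMAIL
        else:
            clean[column] = REDACTED

    # Assumption: non-whitelisted meta keys keep the row but lose the value.
    allowed_keys = rules.get("meta_keys")
    if allowed_keys is not None and clean.get("meta_key") not in allowed_keys:
        clean["meta_value"] = REDACTED

    return clean
```

Defaulting to redaction for anything not explicitly listed means that a column added by a future plugin or core update would be redacted until someone reviews it and adds it to the whitelist.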

Additionally, the script would have some logic to redact potentially sensitive values within whitelisted columns. For example, any e-mail addresses inside a meta_value value would be replaced with redacted@example.org.
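
As a rough illustration of that second layer, the following Python sketch replaces anything that looks like an e-mail address inside a string. The pattern and the `redact_emails` name are assumptions for the example; the pattern is deliberately loose and is not a full address validator.

```python
import re

# Loose, illustrative pattern; an assumption, not a complete e-mail grammar.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(value):
    """Replace every e-mail-like substring in `value` with a placeholder."""
    return EMAIL_PATTERN.sub("redacted@example.org", value)
```

One caveat: meta_value often holds serialized PHP data, which stores the byte length of each string. A plain substitution like this changes the string without updating that length, so a real script would need to account for serialized values.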

What does Systems think about that? I’d do all the work to build the script, but I want to make sure you don’t have any security/privacy concerns.

#prio3

Barry 4:09 pm on February 20, 2017

Can you just create an export file like this one that can be used for local development? Your proposed idea seems fragile and very likely to leak private data at some point.
- Ian Dunn 4:30 pm on February 20, 2017
  
  I don’t think that’d be helpful in the kinds of situations we’re running into.
  
  As an alternative idea, what about setting up a staging server or w.org style sandbox? That’d give us a way to reproduce bugs and test fixes, without having to touch production, and without having any production data locally.
  
  That’d require much more work on your part, though. Is that the kind of thing Systems has time for?
  - Barry 4:31 pm on February 20, 2017
    
    > I don’t think that’d be helpful in the kinds of situations we’re running into.
    
    Could you provide the specific situations where you think your idea would help and my idea wouldn’t?
    - Ian Dunn 7:22 pm on February 20, 2017
      
      In general, they’re situations that are dependent on specific data, not something that we could predict and manually create sample data to test.
      
      A few examples:
      
      The Jetpack 4.4.2 migration moved posts from the `safecss` post type to the new `custom_css` post type, and our sanitization mangled some posts by converting the `>` CSS selector to an HTML entity. It didn’t show up in testing because none of our sample sites used that operator.
      
      Sometimes we get reports of particular posts missing from REST API endpoints. We can see it happening on production, but can’t reproduce it locally with sample data.
      
      There are lots of others, but I haven’t kept a list or anything, so it’s hard to remember specifics. I’d guess that we run into significant delays every other month, with shorter delays happening once or twice a month.
    - Ian Dunn 8:58 pm on March 14, 2017
      
      The sitemeta charset issue is another example.
    - Ian Dunn 5:33 pm on March 21, 2017
      
      Ran into another example today, with a report about some WordCamp sites not showing up in one of our tools. I was able to reproduce this one locally, but I had to spend extra time setting up new sample data to match what was happening on production.
      
      I fixed it locally, but still didn’t have a reasonable amount of confidence that it would completely fix it in production, and that it wouldn’t have any side-effects, since I was still only testing it on a small subset of the real data. So, I had to deploy the fix to production and hope that it didn’t cause any issues. It didn’t in this case, but if it had, then I would have had to revert it, reproduce the new issue locally, fix it, and again deploy without confidence that the commit wouldn’t introduce new bugs.
      
      If we had a sandbox with a copy of the production database (or whitelisted data locally, or some other reasonable solution), then reproducing the issue would have been much quicker, and I’d be able to properly test things before deploying them.
Ian Dunn 5:14 am on February 10, 2018

This is resolved by the new wordcamp role on w.org sandboxes.