Whitelisted WordCamp Production Data for Dev Environments

Right now WordCampWordCamp WordCamps are casual, locally-organized conferences covering everything related to WordPress. They're one of the places where the WordPress community comes together to teach one another what they’ve learned throughout the year and share the joy. Learn more. devs use a small subset of the production database that was manually created, because it wouldn’t be safe to keep copies of the production database in local environments.

That works good enough for most things, but we keep running into situations where reproducing bugs and testing fixes is much harder, and takes much longer, than it would if we had real-world data to work with.

So, I’d like to create a way to safely use a whitelisted copy of production data in local environments. Here’s how I envision it working:

  1. Create a script that runs on the production web server once a day
  2. It would create a copy of the primary database on the production database server
  3. Then run lots of SQL commands against that copy in order to redact anything that hasn’t been whitelisted
  4. Have another script in dev environments that uses sftp to download a copy of the whitelisted database once a day

The whitelist would contain a list of tables, columns, and keys that have been determined to not have any sensitive data. For example:

  • wp_users – The table itself would be whitelisted, but only the ID, user_login, user_nicename, user_registered, user_status, display_name, spam, and deleted fields would be whitelisted. Because user_pass, user_email, and user_activation_key would not be in the whitelist, the script would replace the contents of those columns with [redacted] (or in the case of user_email, redacted@example.org).
  • wp_usermeta – The table itself would be whitelisted, along with the umeta_id, user_id, meta_key, and meta_value columns, but only certain meta_key rows would be whitelisted. For instance, first_name, last_name, description, and wp_capabilities would be whitelisted, but session_tokens and wordcamp-qbo-oauth would not be.

Additionally, the script would have some logic to redact potentially sensitive values within whitelisted columns. For example, any e-mail addresses inside a meta_value value would be replaced with redacted@example.org.

What does Systems think about that? I’d do all the work to build the script, but I want to make sure you don’t have any security/privacy concerns.

#prio3