Git Mirror History Breakage
A few years ago, I started publishing a mirror of WordPress on GitHub. It was subsequently promoted to WordPress/WordPress. What I neglected to do, however, was provide an appropriate authors.txt file, until recently. That means that earlier commits are attributed to dummy e-mail addresses and as such cannot be associated with user accounts on GitHub. Considering the recent introduction of contributions on GitHub, this seems a shame. Also, if we were to move to Git in the future, we would probably want our official mirror to have the best possible data.
Proposed
That we re-run the git-svn import with a proper authors.txt file.
Upsides
We’ll have a proper Git mirror with good and consistent author data, that we can, if desired, use for a future migration to Git. Commits will be properly attributed in GitHub.
Downsides
This will break Git history. If you have a Git checkout of WordPress, either standalone or in a submodule, that’ll mean that you’ll have to rebase your master branch off of origin (or even better, blow the whole thing away and re-clone).
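The recovery described above can be tried safely in a throwaway directory first. The following is a minimal sketch, not an official procedure: it simulates an upstream rewrite with two local repositories and then resets a clone onto the new history. All directory names, author names, and e-mail addresses are made up for the demo, and it assumes the clone has no local work worth keeping.

```shell
#!/bin/sh
# Simulate an upstream history rewrite and recover a local clone.
# Everything happens in a temporary directory; names are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"

# "Upstream" repository with one commit under a dummy author.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo "hello" > file.txt
git add file.txt
git -c user.name="dummy" -c user.email="dummy@localhost" \
    commit -q -m "Initial import"
cd ..

# A downstream clone made before the rewrite.
git clone -q upstream checkout

# Upstream rewrites history: same content, corrected author, new hash.
cd upstream
git -c user.name="dummy" -c user.email="dummy@localhost" \
    commit -q --amend --no-edit --author="Joe User <joe@example.com>"
cd ..

# Recovery in the clone: fetch the new history and point master at it.
# WARNING: reset --hard discards local commits and uncommitted changes.
cd checkout
git fetch -q origin
git reset -q --hard origin/master

upstream_head=$(git -C ../upstream rev-parse HEAD)
checkout_head=$(git rev-parse HEAD)
echo "upstream: $upstream_head"
echo "checkout: $checkout_head"
```

If the clone does carry local commits, transplanting them with `git rebase --onto origin/master <old-base> master` is one option, where `<old-base>` stands for the last commit of the old history that the local work was built on. Re-cloning remains the simplest route.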
So: thoughts? Would this ruin your day?
Gustavo Bordoni 4:25 am on January 10, 2013 Permalink | Log in to Reply
If this means that WordPress is taking any sort of steps towards using Git as a solution for code versioning I’m all for it!
Mark Jaquith 5:43 am on January 10, 2013 Permalink | Log in to Reply
I’ll just add that we can’t commit to anything at this stage.
Japh 5:44 am on January 10, 2013 Permalink | Log in to Reply
I see what you did there.
John Saddington 11:16 am on January 10, 2013 Permalink | Log in to Reply
LOL. Seriously. Totally sucks to lose that historical contri trail…
Gustavo Bordoni 6:41 am on January 10, 2013 Permalink | Log in to Reply
I know, just expressing my feeling about git. haha
Claudio Sanches 3:28 pm on January 10, 2013 Permalink | Log in to Reply
I would also be happy to use Git to contribute.
I would certainly send several pull requests.
Andrew Nacin 3:35 pm on January 10, 2013 Permalink | Log in to Reply
You can already use Git to contribute. We operate using patches (and would probably continue to do so even if we switched).
Scott Taylor 4:26 pm on January 10, 2013 Permalink
Are the unit tests on Git yet?
Scott Taylor 4:25 am on January 10, 2013 Permalink | Log in to Reply
DO IT! And document the ideal way to mirror with authors.txt after, please. I mirrored a bunch of repos and forgot to do the authors part, and I haven’t collected the energy to start over yet.
Bryan Petty 5:18 am on January 10, 2013 Permalink | Log in to Reply
The authors file is simple, one author per line: “loginname = Joe User <user@example.com>”.
Run this to generate your initial authors file (from the root of your SVN checkout):
$ svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors.txt
Fill in the file with real names and email addresses.
I use a modified version of Mark’s script to mirror a ton of repos myself:
https://gist.github.com/3061041
It’s mostly self-explanatory, see forked gist for Mark’s version.
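Before feeding an authors file to an import, it can help to check that every placeholder line has actually been filled in. The function below is a small sketch of such a check. The name `check_authors` and the regular expression are assumptions made for illustration, not part of git-svn; the pattern only requires a login, a name, and something shaped like an e-mail address in angle brackets.

```shell
#!/bin/sh
# check_authors FILE
# Print (with line numbers) every line that does not look like
#   loginname = Full Name <user@host>
# and return non-zero if any such line exists.
# The pattern is an illustrative assumption; loosen or tighten as needed.
check_authors() {
    if grep -Evn '^[^ =]+ = .+ <[^<>@ ]+@[^<>@ ]+>$' "$1"; then
        return 1
    fi
    return 0
}

# Demo with a made-up file: one finished entry, one untouched placeholder.
authors_demo=$(mktemp)
printf '%s\n' \
    'joe = Joe User <joe@example.com>' \
    'jane = jane <jane>' > "$authors_demo"
check_authors "$authors_demo" || echo "authors file still has placeholder entries"
```

Once the file passes, it can be handed to the import with git-svn's `--authors-file` option, for example `git svn clone --stdlayout --authors-file=authors.txt <svn-url>`, where `<svn-url>` stands for the repository being mirrored.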
Mark Jaquith 6:02 am on January 10, 2013 Permalink | Log in to Reply
Another thing I’d do is contact everyone in that file and get them to doublecheck that we have an e-mail address that they’re likely to control for life. Probably best to use e-mail at a personal domain, if they have one, instead of Gmail or a company e-mail address that they might lose in the future.
Daryl Koopersmith 4:25 am on January 10, 2013 Permalink | Log in to Reply
I think an accurate repository is worth the temporary breakage.
Japh 4:26 am on January 10, 2013 Permalink | Log in to Reply
+1
Peter Westwood 9:34 am on January 10, 2013 Permalink | Log in to Reply
+∞
Boone Gorges 11:53 am on January 10, 2013 Permalink | Log in to Reply
+1
Tom Willmot 2:13 pm on January 10, 2013 Permalink | Log in to Reply
+1
Aaron D. Campbell 2:54 pm on January 10, 2013 Permalink | Log in to Reply
+1
mojowill 3:03 pm on January 10, 2013 Permalink | Log in to Reply
+1
Till 11:23 pm on January 10, 2013 Permalink | Log in to Reply
+1
Piet 10:45 am on January 11, 2013 Permalink | Log in to Reply
+1
Ryan McCue 4:27 am on January 10, 2013 Permalink | Log in to Reply
+1, I’d say we do it.
What’d be really cool is if we can get the props parsed so that git lists the commit author as whoever was prop’d, and the committer as the person who actually committed it. AFAIK, that’s not possible without a complicated script though.
Bryan Petty 4:56 am on January 10, 2013 Permalink | Log in to Reply
That would be cool, but I really can’t even think of any way to do something like this with the current repo as is without amending commits after the initial clone, which would be extremely resource intensive and could take weeks to do. Given that and the work involved with integrating the same process into the mirror updates for future commits as well, I would just say forget it.
Bryan Petty 5:04 am on January 10, 2013 Permalink | Log in to Reply
Actually, come to think about it, git-filter-branch might be able to handle this efficiently.
Mark Jaquith 6:00 am on January 10, 2013 Permalink | Log in to Reply
`git-filter-branch` indeed could do it. It probably wouldn’t be too bad.
Mark Jaquith 5:58 am on January 10, 2013 Permalink | Log in to Reply
Interesting idea. But wouldn’t be able to handle issues with multiple props recipients. But we could give it to the first person or just in this case give it to the committer.
Ryan McCue 7:39 am on January 10, 2013 Permalink | Log in to Reply
Multiple props authors mean that it’s ambiguous who actually created the patch, so the committer should be assigned credit lest we accidentally attribute it to the wrong person.
(Also, we’d probably want to make sure that we fix up typos, `rmmcue` for example.)
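The single-recipient rule discussed here (credit the props recipient only when exactly one person is named, otherwise leave credit with the committer) can be sketched as a small shell function. This is an illustration only: the function name `single_prop`, the regular expression, and the sample messages with their usernames and ticket numbers are all made up, and real commit messages are likely to contain props formats this pattern misses.

```shell
#!/bin/sh
# single_prop MESSAGE
# Print the props recipient if the message credits exactly one person.
# Print nothing if there are no props or several recipients, so that the
# committer keeps the credit. The pattern is a deliberately simple guess.
single_prop() {
    names=$(printf '%s\n' "$1" |
        grep -Eio 'props +[a-z0-9_.-]+(, *[a-z0-9_.-]+)*' |
        sed -E 's/^[Pp][Rr][Oo][Pp][Ss] +//' |
        tr ',' '\n' |
        sed -E 's/^ +//; s/\.$//' |
        grep -v '^$' |
        sort -u)
    if [ -n "$names" ] &&
       [ "$(printf '%s\n' "$names" | wc -l | tr -d ' ')" -eq 1 ]; then
        printf '%s\n' "$names"
    fi
}

# Made-up commit messages:
single_prop "Better error handling. props alice. fixes #1234."   # prints: alice
single_prop "Tidy up the loop. Props alice, bob. Fixes #5678."   # prints nothing
```

To apply something like this during a rewrite, the same logic would have to live inside a `git filter-branch --env-filter` script, which can read the message of the commit being rewritten via `git log -1 --format=%B "$GIT_COMMIT"` and export `GIT_AUTHOR_NAME` and `GIT_AUTHOR_EMAIL` when a single recipient is found. Mapping the login to a real name and address would still depend on the authors file.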
Peter Westwood 9:39 am on January 10, 2013 Permalink | Log in to Reply
While parsing props like this would be cool I don’t think it would accurately reflect the way our process has worked and I would much rather put effort into collecting the props to commit data into a format we can integrate into the WP.org profiles more easily.
I started on this a while back but haven’t finished yet; what I’m mostly missing is a 100% accurate props extraction method.
Ryan McCue 9:55 am on January 10, 2013 Permalink | Log in to Reply
At the moment, there are basically two forms of commits with props: 1) the committer is merely committing a patch that was on a ticket (this is where we’d want to split author/committer); and 2) the committer is writing the patch with inspiration from someone (we’d want author = committer in this case).
As far as I’ve seen, 1 seems to be the much more common case, but 2 is fairly common too. It could be a problem. (Regarding effort, it’s relatively simple using `git filter-branch`, so that shouldn’t be much of an issue.)
Michael Beckwith 4:27 am on January 10, 2013 Permalink | Log in to Reply
I do all my pulling of WP from the svn repo anyway, but I keep an eye on some development via github. No harm for my stuff
topdown 4:31 am on January 10, 2013 Permalink | Log in to Reply
I think that authors/contributors should be recognized when ever possible…
+1 I say fix it.
Mike Schinkel 4:48 am on January 10, 2013 Permalink | Log in to Reply
Go for it!
Bryan Petty 4:59 am on January 10, 2013 Permalink | Log in to Reply
I think you’re already aware that I actually use my own clone of the WP repo partly for this reason, but also because it’s nice having branch and tag names that are exactly the same as the branch and tag names in SVN. It would be nice if those were fixed up as well if you do this.
Mark Jaquith 5:56 am on January 10, 2013 Permalink | Log in to Reply
Yeah, if we’re doing this, we should take the time to iron out all other niggling issues. Would love to have your input on that. My issue with those branch names is that they create ambiguous references. So if you go to check out “3.5” it will check out the 3.5 branch. In order to check out the 3.5 tag, you need to do `git checkout tags/3.5`. Not the end of the world. Might be worth it to get everything cleaned up.
Hey, maybe we can just rebase me and retroactively teach me all these Git and Git-SVN subtleties!
Just don’t push me, man.
Ryan McCue 7:41 am on January 10, 2013 Permalink | Log in to Reply
The way I do it for SimplePie is to name the branches spelled out (à la WP.org release notice slugs), such as `one-dot-two`. That avoids the ambiguity there. However, that’s probably a pain for WP.
Another popular option I’ve seen: rename all tags (or all branches) to `vX.X` so that anything starting with `v` is the tag (or branch) and anything without is the opposite.
Boone Gorges 11:55 am on January 10, 2013 Permalink | Log in to Reply
For my own stuff I do something like this. `3.5` is the tag, and `3.5.x` is the branch. I think Drupal does it this way.
Aaron D. Campbell 2:57 pm on January 10, 2013 Permalink
Or enforce 3 digits for all tags and 2 for all branches, so 3.5 is a branch and 3.5.0 is the first 3.5.x tag
Bryan Petty 3:48 pm on January 10, 2013 Permalink | Log in to Reply
Mark does already do this, which is why the branches are named `#.#-branch`.
Anyway, git does assume you wanted the branch instead of the tag, but that’s almost always the case for me anyway. I almost never checkout the tags, and I don’t think anyone else does either (definitely not with SVN either). In the 5 months or so that I’ve had my mirror running, this has never gotten in my way once or annoyed me in any way.
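The branch-versus-tag ambiguity discussed in this thread is easy to reproduce in a scratch repository. The sketch below creates a tag and a branch that are both named `3.5` but point at different commits, then shows which one each checkout form selects. The repository, commit messages, and identity are made up for the demo.

```shell
#!/bin/sh
# Reproduce a tag and a branch sharing the name "3.5" in a scratch repo.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/master

# Hypothetical identity so the commits work on an unconfigured machine.
ident='-c user.name=Demo -c user.email=demo@example.com'

git $ident commit -q --allow-empty -m "3.5 release"
git tag 3.5                       # tag points at the first commit
git $ident commit -q --allow-empty -m "3.5 branch work"
git branch 3.5                    # branch points at the second commit

# Fully qualified ref names are never ambiguous.
tag_commit=$(git rev-parse "refs/tags/3.5^{commit}")
branch_commit=$(git rev-parse refs/heads/3.5)

# The short name selects the branch (git may warn about the ambiguity).
git checkout -q 3.5
short_name_head=$(git rev-parse HEAD)

# Prefixing with tags/ selects the tag and detaches HEAD.
git checkout -q tags/3.5
tags_prefix_head=$(git rev-parse HEAD)

echo "tag:    $tag_commit"
echo "branch: $branch_commit"
```

Any of the renaming schemes suggested in the comments (a `v` prefix, `3.5.x` for branches, or a `-branch` suffix) removes the need for the `tags/` prefix, at the cost of names that no longer match SVN exactly.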
Bryan Petty 3:58 pm on January 10, 2013 Permalink | Log in to Reply
One other issue that’s really minor is that there’s still an `iis` branch in SVN that didn’t make it into your mirror that probably should.
sourceforge 5:12 am on January 10, 2013 Permalink | Log in to Reply
it would be good, i have been asking /systems guys to install git as revision control, but it seemed only someone in some driver’s seat could ask for stuff there! git is fast, no problem if it breaks for a while! thanks for this! full ahead flank
Ozh 6:54 am on January 10, 2013 Permalink | Log in to Reply
I think it’s possible to modify the author of each commit afterwards, so you don’t break the whole history
https://gist.github.com/4032945
Ryan McCue 7:42 am on January 10, 2013 Permalink | Log in to Reply
That will change the commit hashes, since the author/committer is stored as part of the commit object (which is used to create the hashes). There’s no way (by design) to change these after the fact without doing this.
Ryan McCue 7:44 am on January 10, 2013 Permalink | Log in to Reply
(Also, forgot to note: even if this only changed one commit, this would cascade down through all subsequent commits, since the parent’s hash is also included in the commit object)
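The two points above, that the author is part of what gets hashed and that a changed parent hash cascades to every descendant, can be checked directly with Git's plumbing commands. This is a small sketch in a scratch repository. The names, e-mail addresses, and timestamp are made up, and every input except the author is pinned so that the author is the only difference between the two root commits.

```shell
#!/bin/sh
# Show that changing only the author changes the commit hash, and that the
# change propagates to child commits through the parent hash.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .

# Pin committer identity and both dates so only the author varies.
export GIT_COMMITTER_NAME="Committer" GIT_COMMITTER_EMAIL="committer@example.com"
export GIT_AUTHOR_DATE="1357776000 +0000" GIT_COMMITTER_DATE="1357776000 +0000"

# An empty tree is enough; file content is irrelevant here.
tree=$(git mktree </dev/null)

# Two root commits that differ only in their author.
old_root=$(GIT_AUTHOR_NAME="dummy" GIT_AUTHOR_EMAIL="dummy@localhost" \
    git commit-tree "$tree" -m "Root")
new_root=$(GIT_AUTHOR_NAME="Joe User" GIT_AUTHOR_EMAIL="joe@example.com" \
    git commit-tree "$tree" -m "Root")

# Two child commits that are identical except for their parent.
export GIT_AUTHOR_NAME="Joe User" GIT_AUTHOR_EMAIL="joe@example.com"
old_child=$(git commit-tree "$tree" -p "$old_root" -m "Child")
new_child=$(git commit-tree "$tree" -p "$new_root" -m "Child")

echo "roots:    $old_root vs $new_root"
echo "children: $old_child vs $new_child"
```

Both pairs of hashes differ, which is why fixing attribution on early commits forces new hashes for the entire history that follows them.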
aristath 6:57 am on January 10, 2013 Permalink | Log in to Reply
I think it would be a great step forward. Drupal also used to be in SVN and switched to Git a couple of years ago. It was entitled “the great git migration” and took almost a year to design, lay out and implement the whole process, but it was worth it. Using Git has many advantages! I believe that breaking the history is worth it in the long run.
Sure it might be a bit inconvenient at first, but I believe that it could really give a new boost to WordPress development.
aristath 6:59 am on January 10, 2013 Permalink | Log in to Reply
correction… Drupal used to be CVS, not SVN. But the principle is the same…
Ryan McCue 9:24 am on January 10, 2013 Permalink | Log in to Reply
To clarify: this isn’t about moving WordPress to Git, this is about fixing up the Git mirror of the SVN repo. This is a step we’d need to take if it was decided to move WP to Git, but it’s not the main goal.
Remkus de Vries 7:28 am on January 10, 2013 Permalink | Log in to Reply
Git ‘er done I say. Having to do a rebase / clone is no biggy at this stage.
Baki Goxhaj 8:59 am on January 10, 2013 Permalink | Log in to Reply
Re-cloning WordPress is not a big deal and adding appropriate author information is the way to go toward the future, thus I think it should be done — the sooner the better.
Tareq Hasan 9:33 am on January 10, 2013 Permalink | Log in to Reply
Surely go for it. A step towards SVN to Git.
Abhishek Ghosh 10:58 am on January 10, 2013 Permalink | Log in to Reply
Git is always a better option, but it needs care on an individual basis. There are many ways for a user to download. The developers get the chance to create better documentation or a guide. Cloning is not really difficult.
There are basic problems too, a good guide is needed for increasing awareness.
As practically we are not shifting, there is time.
Mark Rowatt Anderson 12:06 pm on January 10, 2013 Permalink | Log in to Reply
Two thumbs up – go for it!
Edward Caissie 1:26 pm on January 10, 2013 Permalink | Log in to Reply
It reads like a lot of great points above … and I am all for them, too. Any rebase / clone issues would be far outweighed by the eventual benefits this will bring.
Amy Hendrix (sabreuse) 1:31 pm on January 10, 2013 Permalink | Log in to Reply
+1 It’s really not a big deal to rebase now compared to not having a good history sometime later.
Tom Willmot 2:16 pm on January 10, 2013 Permalink | Log in to Reply
+1 Do it. We run everything with WordPress as submodule, would not be hard to re-clone.
aaronholbrook 2:30 pm on January 10, 2013 Permalink | Log in to Reply
+1, anything that would move us closer to using Git would be fantastic. Also not a big deal to re-clone if needed.
Chris Jean 3:00 pm on January 10, 2013 Permalink | Log in to Reply
Sounds like a bandaid that needs to be ripped off. Better now than later when even more people use it.
mojowill 3:04 pm on January 10, 2013 Permalink | Log in to Reply
I’d love to see a full move to GIT for everything on wporg!
Sam Parsons 10:45 pm on January 10, 2013 Permalink | Log in to Reply
I’m all for the update in order to improve the history and prepare for a possible move to git. I’m wondering whether you plan to send a little message (could it be automated?) to all those who have forked the repo on github?
https://github.com/WordPress/WordPress/network/members
That would be hugely helpful in communicating the upcoming changes in case those people don’t read this blog (perish the thought).
Mark Jaquith 6:47 am on January 11, 2013 Permalink | Log in to Reply
GitHub removed their private messaging feature, so I’d have no automated way of notifying everyone. This doesn’t concern me so much as we don’t accept pull requests on GitHub, so it’s not like their forks are functional in that way. I also think a lot of people fork repos and never update them from the upstream again. So they probably wouldn’t notice. And it’s easy enough to destroy a fork and refork it.
What I was considering doing was putting a note on our project description on GitHub, for the next few months, providing a link to a post that explained what happened and how to resolve the divergent Git history.
Mark Jaquith 6:51 am on January 11, 2013 Permalink | Log in to Reply
As the response was overwhelmingly positive (even from some of you who are traditionally serial devil’s advocates), I think we’re going to move forward with this. Thanks, all, for your feedback.
What I’ll likely do is consult with various people (@bpetty, notably) about implementation, double-check the e-mail addresses in my authors.txt file (recommending that everyone use addresses at personal domains that they’re likely to control indefinitely), and then push out a WordPress-Fixup repo for people to audit, before pushing the new history to the WordPress repo.
Bryan Petty 7:02 pm on January 11, 2013 Permalink | Log in to Reply
Confirming email addresses used would definitely be a good idea. I think a large portion of what you have now originally came from my list, which was meticulously put together from scouring plugin readmes, wp-hackers archives, and personal sites for publicly visible addresses since, at the time, I knew I wouldn’t be able to simply pull them from WP.org accounts used to make the commits (which would likely be the best source, aside from contacting everyone individually).
BFTrick 10:12 pm on January 15, 2013 Permalink | Log in to Reply
This sounds brilliant.