Skip to main content

How The Xen Project Streamlined Code Review By Analyzing Mailing List Data

By 2016-06-078月 22nd, 2017Blog

The Xen Project’s code contributions have grown more than 10 percent each year. Although growth is extremely healthy to the project as a whole, it has its growing pains. For the Xen Project, it led to issues with its code review process: maintainers believed that their review workload increased and a number of vendors claimed that it took significantly longer for contributions to be upstreamed, compared to the past.

The project developed some basic scripts that correlated development list traffic with Git commits, which showed indeed that it took longer for patches to be committed. In order to identify possible root causes, the project initially ran a number of surveys to identify possible causes for the slow down. Unfortunately, many of the observations made by community members contradicted each other, and were thus not actionable. To solve this problem, the Xen Project worked with Bitergia, a company that focuses on analyzing community software development processes, to better understand and address the issues at hand.

We recently sat down with Lars Kurth, who is the chairperson for the Xen Project, to discuss the overall growth of the Xen Project community as well as how the community was able to improve its code review process through software development analytics.

Like many FOSS projects, the Xen project code review process uses a mailing list-based review process, and this could be a good blueprint for projects that are finding themselves in the same predicament. Why has there been so much growth in the Xen Project Community?

Lars Kurth: The Xen Project hypervisor powers some of the biggest cloud computing companies in the world, including Alibaba’s Aliyun Cloud Services, Amazon Web Services, IBM Softlayer, Tencent, Rackspace and Oracle (to name a few).

It is also being increasingly used in new market segments such as the automotive industry, embedded and mobile as well as IoT. It is a platform of innovation that is consistently being updated to fit the new needs of computing with commits coming from developers across the world. We’ve experienced incredible community growth of 100 percent in the last five years. A lot of this growth has come from new geographic locations — most of the growth is from China and Ukraine. How did the project notice that there might be an issue and how did people respond to this?

Lars Kurth: In mid 2014, maintainers started to notice that their review workload increased. At the same time, some contributors noticed that it took longer to get their changes upstreamed. We first developed some basic scripts to prove that the total elapsed time from first code review to commit had indeed increased. I then ran a number of surveys, to be able to form a working thesis on the root causes.

In terms of response, there were a lot of differing opinions on what exactly was causing the process to slow down. Some thought that we did not have enough maintainers, some thought we did not have enough committers, others felt that the maintainers were not coordinating reviews well enough, while others felt that newcomers wrote lower quality code or there could be cultural and language issues.

Community members made a lot of assumptions based on their own worst experiences, without facts to support them. There were so many contradictions among the group that we couldn’t identify a clear root cause for what we saw. What were some of your initial ideas on how to improve this and why did you eventually choose to work with Bitergia for open analytics of the review process?

Lars Kurth: We first took a step back and looked at some things we could do that made sense without a ton of data. For example, I developed a training course for new contributors. I then did a road tour (primarily to Asia) to build personal relationships with new contributors and to deliver the new training.

In the year before, we started experimenting with design and architecture reviews for complex features. We decided to encourage these more without being overly prescriptive. We highlighted positive examples in the training material.

I also kicked off a number of surveys around our governance, to see whether maybe we have scalability issues. Unfortunately, we didn’t have any data to support this, and as expected different community members had different views. We did change our release cadence from 9-12 months to 6 months, to make it less painful for contributors if a feature missed a release.

It became increasingly clear that to make true progress, we would need reliable data. And to get that we needed to work with a software development analytics specialist. I had watched Bitergia for a while and made a proposal to the Xen Project Advisory Board to fund development of metrics collection tools for our code review process. How did you collect the data (including what tools you used) to get what you needed from the mailing list and Git repositories?

Lars Kurth: We used existing tools such as MLStats and CVSAnalY to collect mailing list and Git data. The challenge was to identify the different stages of a code review in the database that was generated by MLStats and to link it to the Git activity database generated by CVSAnalY. After that step we ended up with a combined code review database and ran statistical analysis over the combined database. Quite a bit of plumbing and filtering had to be developed from scratch for that to take place. Were there any challenges that you experienced along the way?

Lars Kurth: First we had to develop a reasonably accurate model of the code review process. This was rather challenging, as e-mail is essentially unstructured. Also, I had to act as bridge between Bitergia, which implemented the tools and the community. This took a significant portion of time. However, without spending that time, it would have been quite likely that the project would fail.

To de-risk the project, we designed it in two phases: the first phase focused on statistical analysis that allowed us to test some theories; the second phase focused on improving accuracy of the tools and making the data accessible to community stakeholders. What were your initial results from the analysis?

Lars Kurth: There were three key areas that we found were causing the slow down:

  • Huge growth in comment activity from 2013 to 2015

  • The time it took to merge patches (=time to merge) increased significantly from 2012 to the first half of 2014. However, from the second half of 2014  time to merge moved back to its long-term average. This was a strong indicator that the measures we took actually had an effect.

  • Complex patches were taking significantly longer to merge than small patches. As it turns out, a significant number of new features were actually rather complex. At the same time, the demands on the project to deliver better quality and security had also raised the bar for what could be accepted. How did the community respond to your data? How did you use it to help you make decisions about what was best to improve the process?

Lars Kurth: Most people were receptive to the data, but some were concerned that we were only able to match 60 percent of the code reviews to Git commits. For the statistical analysis, this was a big enough sample.

Further investigation showed that the main factor for this low match rate was caused by cross-posting of patches across FOSS communities. For example, some QEMU and Linux patches cross-posted for review on the Xen Project mailing lists, but the code did not end up in Xen. Once this was understood, a few key people in the community started to see the potential value of the new tools.

This is where stage two of the project came in. We defined a set of use cases and supporting data that broadly covered three areas:

  • Community use cases to encourage desired behavior: this would be metrics such as real review contributions (not justed  ACKed-by and Reviewed-by flags), comparing review activity against contributions.

  • Performance use cases that would allow us to spot issues early: these would allow us to filter time-related metrics by a number of different criteria such as complexity of a patch series

  • Backlog use cases to optimize process and focus: the intention here was to give contributors and maintainers tools to see what reviews are active, nearly complete, complete or stale. How have you made improvements based on your findings and what have been the end results for you?

Lars Kurth: We had to iterate the use cases, the data supporting them and how the data is shown. I expect that that process will continue, as more community members use the tools. For example, we realized that the code review process dashboard that was developed as part of the project is also useful for vendors to estimate how long it will take to get something upstreamed based on past performance.

Overall, I am very excited about this project, and although the initial contract with Bitergia has ended, we have an Outreachy intern working with Bitergia and me on the tools over the summer. How can this analysis support other projects with the similar code review processes?

Lars Kurth: I believe that projects like the Linux kernel and others that use e-mail based code review processes and Git should be able to use and build on our work. Hopefully, we will be able to create a basis for collaboration that helps different projects become more efficient and ultimately improve what we build.


The Linux Foundation
Follow Us