Final Report

About project

LibreOffice is a large and complex office suite with an extensive CI system that ensures new patches do not introduce bugs. Many unit tests are run in Jenkins when contributors submit their patches to Gerrit. Running all the tests across different platforms usually takes hours, especially at peak times. A better test selection method is therefore needed to reduce the testing load while maintaining high software quality.

Recently, machine learning has been used to predict whether a patch will pass a given test. This can greatly reduce the testing load, because tests that are very likely to pass can be skipped when a patch is submitted. This project therefore implements a machine-learning-based unit test selection algorithm that chooses which tests to run in the CI chain.

The model is available on GitHub and integrated into Jenkins.

Model performance

The testlabelselect model predicts the probability that each unit test fails for a given patch.

              | Fail (Predicted) | Pass (Predicted)
Fail (Actual) | 3860             | 203
Pass (Actual) | 191593           | 1109768

The testfailure model predicts the overall probability that a patch fails, based on patch features only.

              | Fail (Predicted) | Pass (Predicted)
Fail (Actual) | 614              | 527
Pass (Actual) | 2155             | 4863

The testoverall model improves upon testfailure by using the testlabelselect predictions to predict whether a patch will fail any unit test.

              | Fail (Predicted) | Pass (Predicted)
Fail (Actual) | 810              | 331
Pass (Actual) | 2413             | 4605

A smart inference step is built on top of the testlabelselect and testoverall predictions. By setting a threshold on the number of unit tests predicted to fail, it captures 91% of failures while reducing computation by 57%.

              | Fail (Predicted) | Pass (Predicted)
Fail (Actual) | 10617            | 1054
Pass (Actual) | 30103            | 39815

Currently, smart inference is integrated into Jenkins to save computation. If a patch is likely to fail at least one unit test, the sequential fast track is run, because the patch is expected to fail some unit tests and there is no need to run everything. If the patch is likely to pass, the normal track is run to ensure code correctness.
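The routing decision can be sketched roughly as follows. This is a minimal illustration, not the code used in Jenkins: the names `InferenceConfig`, `predicts_failure` and `choose_track`, all threshold values, and the way the two signals are combined (a logical OR) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class InferenceConfig:
    # All three values are illustrative placeholders, not the tuned settings.
    test_fail_prob: float = 0.5     # per-test probability counted as a predicted failure
    min_failed_tests: int = 1       # threshold on the number of predicted failing tests
    overall_fail_prob: float = 0.5  # cut-off for the patch-level testoverall score


def predicts_failure(per_test_probs: Mapping[str, float],
                     overall_prob: float,
                     cfg: InferenceConfig = InferenceConfig()) -> bool:
    """Return True if the patch is predicted to fail at least one unit test."""
    # Count the tests whose testlabelselect probability reaches the per-test cut-off.
    predicted_failures = sum(p >= cfg.test_fail_prob for p in per_test_probs.values())
    # Assumption: either signal on its own is enough to predict a failure.
    return (predicted_failures >= cfg.min_failed_tests
            or overall_prob >= cfg.overall_fail_prob)


def choose_track(per_test_probs: Mapping[str, float],
                 overall_prob: float,
                 cfg: InferenceConfig = InferenceConfig()) -> str:
    """Map the prediction to one of the two Jenkins tracks."""
    if predicts_failure(per_test_probs, overall_prob, cfg):
        return "sequential fast track"
    return "normal track"
```

Raising `min_failed_tests` or the probability cut-offs sends more patches to the normal track, which trades some saved computation for a lower risk of missing a failure.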

testlabelselect is not used directly to select unit tests because it cannot capture all failures: about 5% of failures would escape, which could cause severe problems.

Note: The tables are confusion matrices. Rows represent actual labels and columns represent predicted labels. For more information, please check this link.
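As a quick check, the recall figures quoted in this report can be recomputed from the four tables. The sketch below uses only the counts shown above; the class and field names are chosen for this example.

```python
from typing import NamedTuple


class ConfusionMatrix(NamedTuple):
    """Counts laid out as in the tables: rows are actual, columns are predicted."""
    fail_as_fail: int  # actual fail, predicted fail
    fail_as_pass: int  # actual fail, predicted pass (missed failure)
    pass_as_fail: int  # actual pass, predicted fail
    pass_as_pass: int  # actual pass, predicted pass

    def fail_recall(self) -> float:
        # Share of actual failures that were predicted as failures.
        return self.fail_as_fail / (self.fail_as_fail + self.fail_as_pass)

    def pass_recall(self) -> float:
        # Share of actual passes that were predicted as passes.
        return self.pass_as_pass / (self.pass_as_fail + self.pass_as_pass)


MATRICES = {
    "testlabelselect": ConfusionMatrix(3860, 203, 191593, 1109768),
    "testfailure": ConfusionMatrix(614, 527, 2155, 4863),
    "testoverall": ConfusionMatrix(810, 331, 2413, 4605),
    "smart inference": ConfusionMatrix(10617, 1054, 30103, 39815),
}

for name, m in MATRICES.items():
    print(f"{name}: fail recall {m.fail_recall():.0%}, pass recall {m.pass_recall():.0%}")
```

Fail recall is the share of real failures that are caught, and pass recall is the share of real passes that are correctly predicted as passes.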

Tasks

My Work during GSoC

During the 3-month GSoC program, I trained 3 XGBoost models (testlabelselect, testfailure, testoverall) to select which unit tests to run for a patch, and integrated the models into Jenkins.

In the first month, I trained testlabelselect to predict whether a patch will fail a test. Each (patch, test) pair is fed into the model, which predicts a value between 0 (pass) and 1 (fail). The first version of testlabelselect captured 90% of failing unit tests while skipping 80% of passing unit tests.
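To illustrate the (patch, test) pairing, here is a minimal sketch of scoring every test for one patch and keeping those above a threshold. The scorer is a toy stand-in, not the trained XGBoost model, and the feature name `touched_modules`, the test names and the 0.5 threshold are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping

# Hypothetical scorer type: takes patch features and a test name and returns a
# failure probability in [0, 1]. In the project, the trained testlabelselect
# model plays this role.
PairScorer = Callable[[Mapping[str, Any], str], float]


def score_tests(patch: Mapping[str, Any],
                tests: Iterable[str],
                scorer: PairScorer) -> Dict[str, float]:
    """Score every (patch, test) pair."""
    return {test: scorer(patch, test) for test in tests}


def select_tests(scores: Mapping[str, float], threshold: float) -> List[str]:
    """Keep tests whose predicted failure probability reaches the threshold,
    ordered from most to least likely to fail."""
    chosen = [test for test, prob in scores.items() if prob >= threshold]
    return sorted(chosen, key=lambda test: scores[test], reverse=True)


def toy_scorer(patch: Mapping[str, Any], test: str) -> float:
    # Stand-in rule for illustration only: a test is considered risky when the
    # patch touches the module named in the test's prefix.
    module = test.split("_")[0]
    return 0.9 if module in patch["touched_modules"] else 0.05


patch = {"touched_modules": {"sw"}}            # hypothetical patch features
tests = ["sw_uwriter", "sc_ucalc", "vcl_app"]  # hypothetical test names
selected = select_tests(score_tests(patch, tests, toy_scorer), threshold=0.5)
print(selected)
```

Lowering the threshold runs more tests and catches more failures; raising it skips more tests at the cost of letting more failures through.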

In the second month, I improved testlabelselect from 90% fail recall and 80% pass recall to 95% fail recall and 85% pass recall by reworking the feature extraction pipelines. I also trained testfailure, which predicts whether a patch will fail any unit test based solely on patch features, for the Jenkins integration. However, its performance (54% fail recall and 69% pass recall) is far worse than that of testlabelselect.

In the third month, I trained a new model, testoverall, an improvement on testfailure that uses the testlabelselect prediction results to predict whether a patch will fail any unit test. The model itself achieves 71% fail recall and 66% pass recall. With the smart inference algorithm, fail recall improves to 91%, at the cost of pass recall dropping to 57%.

What’s next

Since this project was built from scratch, there is room for future development:

  • Model improvement
  • Jenkins job improvement when the model reaches a better performance

Acknowledgement

I’m honored to have been chosen by LibreOffice to be part of the GSoC 2023 program. I’m also glad that most goals were achieved during the project period.

I’d like to thank Thorsten Behrens, Christian Lohmaier and Stéphane Guillou, who have been very helpful to my project.

I also want to thank the LibreOffice community, which has given me a lot of feedback throughout the project.

Lastly, I’d like to thank Mozilla’s bugbug and rust-code-analysis projects, whose work gave me a code base to build on.

Further resources

Mozilla’s work: https://hacks.mozilla.org/2020/07/testing-firefox-more-efficiently-with-machine-learning/

bugbug: https://github.com/mozilla/bugbug

rust-code-analysis: https://mozilla.github.io/rust-code-analysis/
