<p>LibreOffice CI Test Selection with Machine Learning, a GSoC'23 project by Baole Fang. Feed generated by Jekyll on 2023-09-07 (<a href="https://baolef.github.io/libreoffice-ci/feed.xml">feed.xml</a>).</p>

<h2 id="week-11"><a href="https://baolef.github.io/libreoffice-ci/2023/08/10/week11">Week 11</a> (2023-08-10)</h2>
<h3 id="model-improvement">Model improvement</h3>
<p>Now, the <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a> model no longer considers author features.</p>

<h2 id="week-10"><a href="https://baolef.github.io/libreoffice-ci/2023/08/03/week10">Week 10</a> (2023-08-03)</h2>
<h3 id="smart-inference">Smart inference</h3>
<p>Previously, Jenkins used only <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a> results to decide whether a patch would pass or fail. Since testfailure alone is not very accurate while <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> is, a better algorithm that also uses testselect's predictions is now used to decide whether a patch passes or fails.</p>
<p><a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testoverall.py">testoverall</a> is proposed to integrate <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect's</a> predictions into <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a>. Compared to <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a>, its failure recall increases significantly from 54% to 71%, while its pass recall drops slightly from 70% to 65%. Since failure recall matters much more than pass recall, this is a substantial improvement.</p>
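<p>One plausible way such an integration could work, as a sketch only: summarize testselect's per-test failure probabilities and append the summary to the commit features before classifying. The function and feature layout below are hypothetical; the actual testoverall implementation may combine the predictions differently.</p>

```python
# Hypothetical sketch: extend the commit feature vector with a summary
# of testselect's per-test failure probabilities, so a single overall
# classifier can see both signals. Not the project's actual code.

def overall_features(commit_features, test_probs):
    """Commit features extended with a summary of testselect's output."""
    summary = [
        max(test_probs),                    # riskiest single test
        sum(test_probs) / len(test_probs),  # average risk across tests
        sum(p >= 0.5 for p in test_probs),  # tests predicted to fail
    ]
    return commit_features + summary

# Toy commit with two commit features and three per-test probabilities.
x = overall_features([3.0, 120.0], [0.9, 0.4, 0.1])
print(x)
```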
<p>Because of its outstanding performance, <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testoverall.py">testoverall</a> replaces <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a> during inference.</p>
<p>In addition, a new condition is added to decide whether a patch should pass or fail. Originally, the decision only looked at whether the overall failing probability reached a threshold (0.4). Now the number of unit tests predicted to fail is also counted: if it reaches a threshold (10), the patch is likewise considered failed. With the improved algorithm, inference is able to recall 91% of failures while reducing computation by 57%.</p>
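<p>The combined decision rule can be sketched as follows (a minimal sketch: only the thresholds 0.4 and 10 come from the text above; the function name and per-test cutoff are hypothetical):</p>

```python
# Decision rule combining the overall failure probability with the
# number of unit tests individually predicted to fail.
# The 0.4 and 10 thresholds are the ones quoted above; everything
# else is illustrative.

PROB_THRESHOLD = 0.4   # overall failing probability cutoff
COUNT_THRESHOLD = 10   # number of individually failing tests cutoff

def predict_patch_fails(overall_prob, test_probs, per_test_cutoff=0.5):
    """Return True if the patch is predicted to fail.

    overall_prob: failure probability from the overall model.
    test_probs: per-unit-test failure probabilities from testselect.
    """
    if overall_prob >= PROB_THRESHOLD:
        return True
    # New condition: count how many unit tests are predicted to fail.
    n_failing = sum(1 for p in test_probs if p >= per_test_cutoff)
    return n_failing >= COUNT_THRESHOLD

# Example: low overall probability, but many individually risky tests.
print(predict_patch_fails(0.2, [0.9] * 12))  # True
print(predict_patch_fails(0.2, [0.9] * 3))   # False
```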
<h3 id="jenkins-integration">Jenkins integration</h3>
<p>Currently, the model is integrated into the Jenkins job <a href="https://ci.libreoffice.org/job/gerrit_master_ml/">gerrit_master_ml</a>. It first runs the machine learning model to predict whether the patch will pass or fail. If the patch is likely to fail, the <a href="https://ci.libreoffice.org/job/gerrit_master_seq/">fast track</a> is run; if it is likely to pass, the <a href="https://ci.libreoffice.org/job/gerrit_master/">normal build</a> is run.</p>

<h2 id="week-9"><a href="https://baolef.github.io/libreoffice-ci/2023/07/27/week9">Week 9</a> (2023-07-27)</h2>
<h3 id="model-improvement">Model improvement</h3>
<p>To improve model performance, a model based on grouped unit tests is implemented. Originally, the model was trained to predict at the level of around 700 individual unit tests, which is too many targets. To reduce the number of predictions, unit tests are grouped into 80 groups based on their parent folders, using the functions in <a href="https://github.com/baolef/libreoffice-ci/blob/group/dataset/mapping.py">mapping.py</a>. The performance has improved to:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Fail (Predicted)</th>
<th>Pass (Predicted)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fail (Actual)</td>
<td>3860</td>
<td>203</td>
</tr>
<tr>
<td>Pass (Actual)</td>
<td>191593</td>
<td>1109768</td>
</tr>
</tbody>
</table>
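<p>The failure recall and computation savings quoted for this model can be read directly off the confusion matrix above; as a quick check:</p>

```python
# Reading the confusion matrix above:
# rows are actual outcomes, columns are predicted outcomes.
fail_fail, fail_pass = 3860, 203          # actual failures
pass_fail, pass_pass = 191593, 1109768    # actual passes

# Failure recall: share of actual failures the model selects to run.
recall = fail_fail / (fail_fail + fail_pass)

# Computation saved: share of (commit, test) pairs predicted to pass,
# i.e. unit test runs that can be skipped.
total = fail_fail + fail_pass + pass_fail + pass_pass
saved = (fail_pass + pass_pass) / total

print(f"failure recall: {recall:.0%}")    # 95%
print(f"computation saved: {saved:.0%}")  # 85%
```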
<p><a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> is now able to recognize 95% of all failures (94% previously), while reducing computation by 85% (84% previously).</p>

<h2 id="week-8"><a href="https://baolef.github.io/libreoffice-ci/2023/07/20/week8">Week 8</a> (2023-07-20)</h2>
<h3 id="result-archive">Result archive</h3>
<p>Every time Jenkins runs the model, the inference results will be saved to <code class="language-plaintext highlighter-rouge">probability.csv</code>, which is archived by Jenkins.</p>
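<p>A minimal sketch of what such an archive step might look like (the column names, test names and file layout here are assumptions, not the project's actual format):</p>

```python
import csv

# Hypothetical inference results: per-unit-test failure probabilities
# for the current patch. In the real job these come from the model.
results = {"sw_uwriter": 0.82, "sc_subsequent": 0.07, "vcl_lifecycle": 0.01}

# Write them to probability.csv so Jenkins can archive the file
# as a build artifact, riskiest tests first.
with open("probability.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["test", "failure_probability"])
    for test, prob in sorted(results.items(), key=lambda kv: -kv[1]):
        writer.writerow([test, prob])
```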
<h3 id="jenkins-integration">Jenkins integration</h3>
<p>The model is integrated into a <a href="https://ci.libreoffice.org/job/gerrit_master_ml/">master job</a>. In this job, the model first decides whether the commit is likely to fail. If it is, <a href="https://ci.libreoffice.org/job/gerrit_linux_clang_dbgutil/">gerrit_linux_clang_dbgutil</a> is run first: if that build fails, the job returns -1; otherwise the remaining builds are run. If the model predicts that the commit is unlikely to fail, all builds are run in parallel as before.</p>

<h2 id="week-7"><a href="https://baolef.github.io/libreoffice-ci/2023/07/12/week7">Week 7</a> (2023-07-12)</h2>
<h3 id="jenkins-integration">Jenkins integration</h3>
<p>Currently, the model is integrated into <a href="https://ci.libreoffice.org/job/machine_learning_model/">Jenkins</a>. The average build duration is around 15s, and it is able to support 5 builds in parallel.</p>
<p>Its <a href="https://ci.libreoffice.org/job/machine_learning_model/lastBuild/console">output log</a> mainly contains, for each unit test, the probability that the patch fails it, together with the overall probability that the patch fails any test. The overall probability is shown on the build summary page.</p>

<h2 id="week-6"><a href="https://baolef.github.io/libreoffice-ci/2023/07/06/week6">Week 6</a> (2023-07-06)</h2>
<h3 id="jenkins-integration">Jenkins integration</h3>
<p>Currently, the model is integrated into <a href="https://ci.libreoffice.org/job/machine_learning_model/">Jenkins</a>. The average build duration is around 15s, and it is able to support 5 builds in parallel.</p>
<p>Its <a href="https://ci.libreoffice.org/job/machine_learning_model/lastBuild/console">output log</a> mainly contains, for each unit test, the probability that the patch fails it, together with the overall probability that the patch fails any test.</p>
<p>Further work will be done to better integrate the model into Jenkins.</p>

<h2 id="week-5"><a href="https://baolef.github.io/libreoffice-ci/2023/06/29/week5">Week 5</a> (2023-06-29)</h2>
<h3 id="model-inference">Model inference</h3>
<p><a href="https://github.com/baolef/libreoffice-ci/blob/main/test.py">Inference</a> is completed using <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a> and <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> to predict which unit tests should be run for a commit.</p>
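<p>The two-stage inference can be sketched as follows (the predict functions, test names and 0.5 cutoffs are illustrative stand-ins, not the project's actual interfaces):</p>

```python
# Two-stage inference: testfailure gates whether any tests are
# scheduled at all, and testselect then picks which ones to run.

def select_tests(commit, predict_failure, predict_per_test, cutoff=0.5):
    """Return the list of unit tests to run for a commit."""
    # Stage 1 (testfailure): is this commit risky at all?
    if predict_failure(commit) < cutoff:
        return []  # unlikely to fail anything: skip all tests
    # Stage 2 (testselect): which tests are likely to catch the failure?
    probs = predict_per_test(commit)
    return [test for test, p in probs.items() if p >= cutoff]

# Toy stand-ins for the trained models.
tests = select_tests(
    commit={"files": ["sw/source/core/doc.cxx"]},
    predict_failure=lambda c: 0.9,
    predict_per_test=lambda c: {"sw_uwriter": 0.8, "sc_subsequent": 0.1},
)
print(tests)  # ['sw_uwriter']
```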
<h3 id="model-sharing">Model sharing</h3>
<p>More training results, such as models, intermediate data and metrics, are shared in the <a href="https://github.com/baolef/libreoffice-ci">repository</a>.</p>

<h2 id="week-4"><a href="https://baolef.github.io/libreoffice-ci/2023/06/22/week4">Week 4</a> (2023-06-22)</h2>
<h3 id="model-training">Model training</h3>
<p>Two models are trained with the full dataset:</p>
<ul>
<li><a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testfailure.py">testfailure</a> predicts whether a commit will fail any unit test. It only considers commit features.</li>
<li><a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> predicts which unit tests the commit will fail. It only considers unit test features.</li>
</ul>
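<p>A minimal illustration of this feature split (all feature names here are hypothetical; only the commit-versus-test separation comes from the two models above):</p>

```python
# Each model sees exactly one feature family, never both together.

commit_features = {   # seen only by testfailure
    "files_changed": 3,
    "lines_added": 120,
    "touches_sw_module": 1,
}

test_features = {     # seen only by testselect
    "past_failures": 7,
    "runs_since_last_failure": 52,
}

# testfailure: commit features -> will any test fail?
X_failure = [list(commit_features.values())]

# testselect: test features -> will this particular test fail?
X_select = [list(test_features.values())]

print(len(X_failure[0]), len(X_select[0]))  # 3 2
```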
<p>These two models are based on <a href="https://github.com/mozilla/bugbug">bugbug</a>, but they share one main limitation: commit and unit test features are considered independently. A better approach would be to consider both kinds of features together to predict whether a <code class="language-plaintext highlighter-rouge">(commit, test)</code> pair will pass or fail.</p>

<h2 id="week-3"><a href="https://baolef.github.io/libreoffice-ci/2023/06/15/week3">Week 3</a> (2023-06-15)</h2>
<h3 id="model-training">Model training</h3>
<p>The basic <a href="https://github.com/baolef/libreoffice-ci/blob/main/train.py">model training</a> pipeline is completed with the <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> model. Further optimization is needed to reduce memory and time costs as well as to improve performance.</p>
<p>Currently, <a href="https://github.com/baolef/libreoffice-ci/blob/main/models/testselect.py">testselect</a> is trained on a subset of size 16384 (containing both the training and testing sets) of the full dataset of size 122019, due to memory cost. It reaches a failure recall of 91.4% while saving 90% of the unit test computational cost. Its detailed confusion matrix is shown below:</p>
<table>
<thead>
<tr>
<th> </th>
<th>Fail (Predicted)</th>
<th>Pass (Predicted)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fail (Actual)</td>
<td>480</td>
<td>45</td>
</tr>
<tr>
<td>Pass (Actual)</td>
<td>556910</td>
<td>5045893</td>
</tr>
</tbody>
</table>

<h2 id="week-2"><a href="https://baolef.github.io/libreoffice-ci/2023/06/07/week2">Week 2</a> (2023-06-07)</h2>
<h3 id="commit-feature-extraction">Commit feature extraction</h3>
<p><a href="https://github.com/baolef/libreoffice-ci/blob/data/dataset/mining.py">Commit feature extraction</a> is finished with multiprocessing. The commits come from the CSV table. Features are based on the patch (what the commit changes), code features, author features and so on. The output is saved in <code class="language-plaintext highlighter-rouge">data/commits.json</code>.</p>
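<p>A minimal sketch of the multiprocessing approach (the commit structure and feature names are hypothetical; the real logic lives in the linked mining.py):</p>

```python
import json
from multiprocessing import Pool

# Sketch of mining commit features in parallel: one worker function
# per commit, fanned out over a process pool, results saved as JSON.

def extract_features(commit):
    """Compute patch-based features for a single commit."""
    lines = commit["patch"].splitlines()
    return {
        "hash": commit["hash"],
        "files_changed": len(commit["files"]),
        "lines_added": sum(line.startswith("+") for line in lines),
        "lines_removed": sum(line.startswith("-") for line in lines),
    }

if __name__ == "__main__":
    commits = [
        {"hash": "abc123", "files": ["sw/doc.cxx"], "patch": "+new\n-old\n"},
        {"hash": "def456", "files": ["sc/view.cxx", "sc/view.hxx"], "patch": "+fix\n"},
    ]
    # Fan the extraction out over worker processes, then save the result.
    with Pool() as pool:
        features = pool.map(extract_features, commits)
    with open("commits.json", "w") as f:
        json.dump(features, f)
```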
<h3 id="unit-test-feature-extraction">Unit test feature extraction</h3>
<p><a href="https://github.com/baolef/libreoffice-ci/blob/data/dataset/test_history.py">Unit test feature extraction</a> is finished; it runs single-threaded with some speed-ups. It computes features of unit tests from <code class="language-plaintext highlighter-rouge">data/commits.json</code>.</p>