
Software Quality Prediction

An honest comparative study of predicting code quality from nine metrics, where the best model reaches only 0.4418 weighted F1, and that is the finding.

period

2025

status

research


best weighted f1: 0.4418; decision tree
samples: 1,600; 9 code metrics each
micro-avg auc: ~0.60; all models, moderate
classes: 3; high / medium / low

[Figure 00: system architecture of software-quality-prediction, showing the data flow]

The premise, and the honest result

Software-quality assurance is expensive and inconsistent when done by hand, so the appeal of predicting quality from cheap code metrics is obvious. This study evaluates whether four machine-learning models can classify code modules as High, Medium, or Low quality from nine metrics, and the honest answer is: not well. The best model, a decision tree, reaches only 0.4418 weighted F1. Rather than bury that result, I made explaining it the point of the study.

The dataset

1,600 samples with nine columns each: Lines of Code, Cyclomatic Complexity, Number of Functions, Code Churn, Comment Density, Number of Bugs, Has Unit Tests, Code Owner Experience, and the Quality Label target. The classes are roughly balanced (566 High, 533 Low, 501 Medium), so imbalance is not the excuse.
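As a rough sanity check on what "balanced" implies, the class counts above give the two trivial baselines any model has to beat. This is a small arithmetic sketch using only the counts quoted in the text; the variable names are illustrative.

```python
# Class counts as reported in the text above.
counts = {"High": 566, "Low": 533, "Medium": 501}
total = sum(counts.values())

# Always predicting the largest class yields the majority-class accuracy.
majority_acc = max(counts.values()) / total

# Uniform random guessing over three classes is right one time in three.
uniform_acc = 1 / len(counts)

for label, n in counts.items():
    print(f"{label:<6} {n:>4}  share={n / total:.3f}")
print(f"majority-class baseline accuracy: {majority_acc:.3f}")
print(f"uniform-guess baseline accuracy:  {uniform_acc:.3f}")
```

Both baselines land near 0.33 to 0.35, which is the yardstick for the accuracy figures reported in the results.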

Preprocessing was careful: mode imputation for the 80 missing values in three columns, Winsorizing at the 1st and 99th percentiles to tame outliers in Code Churn and Number of Functions, an absolute-value transform on Code Churn (negative values meant deleted lines), standardization with StandardScaler, and a stratified 90/10 train/test split.
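The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names, the random data, the order of the steps, and the choice to fit the scaler on the training split only are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; column names and values are hypothetical.
df = pd.DataFrame({
    "code_churn": rng.normal(0, 50, n).round(),        # negatives = deleted lines
    "num_functions": rng.poisson(12, n).astype(float),
    "has_unit_tests": rng.integers(0, 2, n).astype(float),
    "quality": rng.choice(["High", "Medium", "Low"], n),
})
df.loc[rng.choice(n, 10, replace=False), "has_unit_tests"] = np.nan

features = ["code_churn", "num_functions", "has_unit_tests"]

# 1. Mode imputation: fill each gap with the column's most frequent value.
for col in features:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 2. Absolute value on churn, so deletions count as change magnitude.
df["code_churn"] = df["code_churn"].abs()

# 3. Winsorize: clip to the 1st and 99th percentiles.
for col in ["code_churn", "num_functions"]:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lower=lo, upper=hi)

# 4. Stratified 90/10 split keeps class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["quality"],
    test_size=0.10, stratify=df["quality"], random_state=0,
)

# 5. Standardize, fitting on the training split only (an assumption here).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training portion alone is a common way to keep test-set statistics out of the model; whether the study did this is not stated in the text.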

Results

Model	Accuracy	Weighted F1
Decision Tree	0.44	0.4418
Random Forest	0.36	0.35
Neural Network	0.35	0.35
KNN	0.31	0.30

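All four models sit close to the one-in-three chance level for a balanced three-class problem: accuracy ranges from 0.31 to 0.44, and micro-average AUC is around 0.60, indicating only weak-to-moderate discriminative power. The simple decision tree beat the random forest and the neural network here, and showed the most balanced per-class behavior, while the other models were biased toward particular classes. With a 10% hold-out of 1,600 samples (about 160 test modules), gaps of a few points between models should be read cautiously.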
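For readers who want to see how such a comparison is typically wired up, here is a minimal sketch of the evaluation loop. It runs on a synthetic dataset from `make_classification`, so its numbers will not match the table; the hyperparameters and the `results` dictionary are assumptions, not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real metrics (hypothetical).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=0.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Default-ish hyperparameters; the study's actual settings are not given.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# One-hot encode the true labels for the micro-average AUC.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "weighted_f1": f1_score(y_test, pred, average="weighted"),
        # Micro-average: pool every (sample, class) pair into one binary task.
        "micro_auc": roc_auc_score(y_test_bin.ravel(), proba.ravel()),
    }

for name, r in results.items():
    print(f"{name:<15} acc={r['accuracy']:.2f} "
          f"f1={r['weighted_f1']:.2f} auc={r['micro_auc']:.2f}")
```

Weighted F1 averages per-class F1 by class support, which is why it tracks accuracy closely when the classes are roughly balanced.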

What the unsupervised view confirmed

To check whether the labels were the problem or the features were, I ran k-means without using the labels. The elbow method cleanly suggested k=3, matching the three quality tiers, which is encouraging. But projecting the clusters with PCA revealed heavy overlap between them. The natural structure in the code metrics does not line up with the quality labels. That is the crux: the features and the target are only weakly related.
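The unsupervised check can be sketched along these lines. The data here is synthetic (`make_blobs` plus randomly assigned labels that are unrelated to the features by construction), and the adjusted Rand index is an extra I am adding to put a number on "does not line up"; the study itself relied on the elbow plot and the PCA projection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic features with some cluster structure (hypothetical stand-in).
X, _ = make_blobs(n_samples=600, centers=3, n_features=8,
                  cluster_std=3.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Hypothetical quality labels drawn independently of the features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=len(X))

# Elbow method: inertia (within-cluster sum of squares) for each k.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 8)
}
for k, inertia in inertias.items():
    print(f"k={k}  inertia={inertia:.1f}")

# Cluster with k=3 and project to two principal components for plotting.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
coords = PCA(n_components=2, random_state=0).fit_transform(X)

# Adjusted Rand index: about 0 means cluster membership says nothing
# about the labels; 1 means the two partitions are identical.
ari = adjusted_rand_score(labels, km.labels_)
print(f"adjusted Rand index (clusters vs labels): {ari:.3f}")
```

Plotting `coords` colored once by `km.labels_` and once by `labels` gives the side-by-side view described above: clusters can be visible while the label colors stay scattered across them.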

Why I present this as a finding

Predicting software quality from code metrics alone is genuinely hard. The likely culprits: quality is partly subjective and inconsistently labeled, the truly predictive signals (version history, developer activity, review data) are not in these nine metrics, and the relationships that do exist are non-linear in ways these models struggle to capture.

A weaker version of this project would have cherry-picked a metric to look impressive. The stronger, more useful version reports the ceiling honestly and points at the real gap: better features, not fancier models. The playground lets you explore the cluster structure and the correlation matrix to see the weak signal for yourself.

stack

Python, scikit-learn, Neural Network, Decision Tree, Random Forest, KNN, k-means, PCA
