Software Quality Prediction

An honest study of a hard problem. The best model reaches a weighted F1 of 0.4418, and the unsupervised view shows why the signal is weak.

best weighted F1: 0.4418 (decision tree)
micro-avg AUC: ~0.60 (all models)
cluster overlap: high (PCA projection)
verdict: weak signal, reported honestly

samples: 1,600 (9 code metrics each)
classes: 3 (High / Medium / Low, balanced)
split: 90/10 (stratified)
clustering: k=3 (confirmed by elbow method)

supervised scoreboard / weighted F1

Decision Tree (best): 0.4418
Random Forest: 0.3500
Neural Network: 0.3500
KNN: 0.3000

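For a balanced three-class problem, uniform random guessing scores a weighted F1 of about 0.33. The random forest and the neural network sit just above that, and KNN falls slightly below it. The simple decision tree, not the random forest or neural network, leads.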
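To make the chance level concrete, here is a minimal sketch of support-weighted F1 written out by hand, evaluated on two naive predictors. The labels are hypothetical balanced data, not the study's test set, and `weighted_f1` is an illustrative helper rather than the code behind the scoreboard.

```python
import random
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        score += (n_true / total) * f1
    return score


rng = random.Random(0)
classes = ["High", "Medium", "Low"]

# Hypothetical balanced labels (assumption): 10,000 per class.
y_true = [classes[i % 3] for i in range(30_000)]
random_guess = [rng.choice(classes) for _ in y_true]
constant_guess = ["High"] * len(y_true)

print(f"uniform random guess:  {weighted_f1(y_true, random_guess):.3f}")
print(f"always predict 'High': {weighted_f1(y_true, constant_guess):.3f}")
```

On balanced classes the uniform guesser lands near 1/3, while the constant predictor scores 1/6, because only one class gets a non-zero F1 (0.5) and it carries a third of the weight. A score of 0.30 is therefore not evidence that a model learned anything.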

the nine code metrics / spread

Lines of Code: µ=4939.27, σ=2867.25
Cyclomatic Complexity: µ=25.08, σ=13.88
Num Functions: µ=103.18, σ=55.5
Code Churn: µ=102.57, σ=50.55
Comment Density: µ=0.55, σ=0.26
Num Bugs: µ=2.93, σ=1.72
Code Owner Experience: µ=5.05, σ=2.56

Wide, overlapping distributions across classes. The k-means projection below shows why the labels are hard to separate.

the pipeline

From raw data to a verifiable result

01 / dataset
Nine code metrics
1,600 modules, each described by nine code metrics, including lines of code, cyclomatic complexity, function count, churn, comment density, bug count, unit-test coverage, and owner experience. The classes are balanced across three quality tiers.
02 / preprocessing
Careful cleaning
Mode imputation for missing values, winsorizing at the 1st and 99th percentiles to cap outliers, an absolute-value transform on churn, and StandardScaler standardization (zero mean, unit variance).
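The cleaning steps in 02 / preprocessing can be sketched for a single column as follows. This is a dependency-free illustration under stated assumptions: the helper names and the sample `churn` values are hypothetical, the step order simply follows the description above, and the hand-written z-score stands in for StandardScaler (both use the population standard deviation).

```python
import statistics


def impute_mode(values):
    """Replace missing entries (None) with the most common observed value."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]


def winsorize(values, lower_pct=1, upper_pct=99):
    """Clip values to the given lower and upper percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


def standardize(values):
    """Z-score with the population standard deviation, as StandardScaler does."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / std for v in values]


def preprocess_column(values, take_abs=False):
    cleaned = impute_mode(values)
    cleaned = winsorize(cleaned)
    if take_abs:  # applied to churn only, per the description above
        cleaned = [abs(v) for v in cleaned]
    return standardize(cleaned)


# Hypothetical churn values: one missing entry, one negative, one outlier.
churn = [100, 95, None, 110, -105, 100, 98, 5000, 102, 100]
scaled = preprocess_column(churn, take_abs=True)
print([round(v, 2) for v in scaled])
```

In a real pipeline the percentile cut-offs, mode, mean, and standard deviation would be estimated on the training split only and then reused on the test split, so that the 10% hold-out does not leak into the cleaning step.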
03 / models
Four supervised, one unsupervised
Neural network, KNN, decision tree, and random forest for classification, plus k-means to probe the natural structure of the data.
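As a rough sketch of how such a comparison might be wired up with scikit-learn, the snippet below trains the four classifier families on a stratified 90/10 split and reports weighted F1. Everything specific here is an assumption: the data is synthetic noise with random labels, and the hyperparameters are defaults or arbitrary picks, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 1,600 rows, 9 features, 3 random labels.
X = rng.normal(size=(1600, 9))
y = rng.integers(0, 3, size=1600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")
    print(f"{name}: {scores[name]:.4f}")
```

Because the labels in this stand-in are pure noise, the printed scores show what "no signal" looks like on a 160-row test set, which is a useful reference when reading a scoreboard that hovers in the same range.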
04 / results
The ceiling is low
Below: the supervised scoreboard and the k-means / PCA projection. The decision tree leads, but every model hovers near chance for a three-class problem.
05 / why
The features are the limit
The elbow method cleanly finds k=3, matching the number of quality tiers, but PCA shows the clusters overlap heavily. The code metrics and the quality labels are only weakly related. Better features, not fancier models, are the fix.
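For readers unfamiliar with the elbow method mentioned in 05 / why, here is a minimal, dependency-free sketch: run k-means for several values of k, record the within-cluster sum of squares (inertia), and look for the k where the curve bends most. The data is three synthetic, well-separated 2-D blobs, far cleaner than the study's metrics. The farthest-point initialization and the second-difference bend heuristic are simplifications chosen to keep the example deterministic, not the study's procedure.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic init: each new center is the point farthest from those chosen."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, n_iter=20):
    """Lloyd's algorithm; returns the within-cluster sum of squared distances."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to its cluster mean; keep the old one if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]  # synthetic, hypothetical
points = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in blob_centers
    for _ in range(50)
]

inertia = {k: kmeans_inertia(points, k) for k in range(1, 6)}
# Bend of the curve at k: the discrete second difference of inertia.
bend = {k: inertia[k - 1] - 2 * inertia[k] + inertia[k + 1] for k in range(2, 5)}
elbow_k = max(bend, key=bend.get)

for k, value in inertia.items():
    print(f"k={k}: inertia={value:.1f}")
print("elbow at k =", elbow_k)
```

An elbow at k=3 only says the feature space has three compact groups. It says nothing about whether those groups line up with the High / Medium / Low labels, which is why a clean elbow and near-chance classifiers can coexist.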

