Imtiaz Hossain

playground / machine learning / software engineering

Software Quality Prediction

An honest study of a hard problem. The best model reaches a weighted F1 of 0.4418, and the unsupervised view shows why the signal is weak: the quality tiers overlap heavily in feature space.

best weighted F1: 0.4418 (decision tree)
micro-avg AUC: ~0.60 (all models)
cluster overlap: high (PCA projection)
verdict: weak signal (reported honestly)
samples: 1,600 (9 code metrics each)
classes: 3 (High / Medium / Low, balanced)
split: 90/10 (stratified)
clustering: k=3 (confirmed by elbow method)


supervised scoreboard / weighted F1

Decision Tree: 0.4418 (best)
Random Forest: 0.3500
Neural Network: 0.3500
KNN: 0.3000

For a balanced three-class problem, chance is ~0.33. Every model lands near it: random forest and the neural network sit barely above, and KNN falls slightly below. The simple decision tree, not the random forest or neural network, leads.
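To make the chance baseline concrete, here is a minimal sketch that estimates the weighted F1 of uniform random guessing on a balanced three-class test set. It is not the project's code: the function names, the 160-sample test size (10% of 1,600), and the trial count are assumptions chosen for illustration.

```python
import random
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * support[label]
    return total / len(y_true)


def chance_weighted_f1(n_samples=160, n_classes=3, n_trials=200, seed=0):
    """Average weighted F1 of uniform random guesses (illustrative baseline)."""
    rng = random.Random(seed)
    # Roughly balanced labels, mirroring the balanced High / Medium / Low tiers.
    y_true = [i % n_classes for i in range(n_samples)]
    scores = []
    for _ in range(n_trials):
        y_pred = [rng.randrange(n_classes) for _ in range(n_samples)]
        scores.append(weighted_f1(y_true, y_pred))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(f"chance weighted F1 ~ {chance_weighted_f1():.3f}")
```

Against a baseline of this kind, the decision tree's 0.4418 is a real but modest gain of roughly 0.11, while the other three models are hard to distinguish from guessing.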

the nine code metrics / spread

Lines of Code: µ=4939.27, σ=2867.25
Cyclomatic Complexity: µ=25.08, σ=13.88
Num Functions: µ=103.18, σ=55.5
Code Churn: µ=102.57, σ=50.55
Comment Density: µ=0.55, σ=0.26
Num Bugs: µ=2.93, σ=1.72
Code Owner Experience: µ=5.05, σ=2.56

Wide, overlapping distributions across classes. The k-means projection below shows why the labels are hard to separate.
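The unsupervised check can be sketched roughly as follows, assuming scikit-learn. The data here is a synthetic stand-in (random features with placeholder labels), not the project's dataset, and the helper names `elbow_inertias` and `cluster_and_project` are hypothetical. The sketch computes the elbow curve, fits k-means with k=3, projects the points onto two principal components, and measures how well the clusters agree with the labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values, seed=0):
    """Within-cluster sum of squares for each candidate k (the elbow curve)."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


def cluster_and_project(X, k=3, seed=0):
    """Fit k-means, then project the same points onto the first two PCs."""
    clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    coords = PCA(n_components=2).fit_transform(X)
    return clusters, coords


# Synthetic stand-in: 1,600 rows x 9 features, labels unrelated to the features.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(1600, 9)))
y = rng.integers(0, 3, size=1600)  # placeholder quality tiers

inertias = elbow_inertias(X, range(1, 7))
clusters, coords = cluster_and_project(X, k=3)

# Adjusted Rand index near 0 means the clusters carry no label information.
ari = adjusted_rand_score(y, clusters)
print({k: round(v, 1) for k, v in inertias.items()})
print(f"adjusted Rand index vs. labels: {ari:.3f}")
```

Plotting `coords` coloured first by `clusters` and then by `y` is one way to see the overlap the page describes: clean cluster regions that do not line up with the quality tiers.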

the pipeline

From raw data to a verifiable result

  1. 01 / dataset

    Nine code metrics

    1,600 modules, each described by nine code metrics, including lines of code, cyclomatic complexity, function count, churn, comment density, bug count, unit-test coverage, and owner experience. Balanced across three quality tiers.

  2. 02 / preprocessing

    Careful cleaning

    Mode imputation for missing values, winsorizing at the 1st/99th percentiles for outliers, an absolute-value transform on churn, and StandardScaler standardization.

  3. 03 / models

    Four supervised, one unsupervised

    Neural network, KNN, decision tree, and random forest for classification, plus k-means to probe the natural structure of the data.

  4. 04 / results

    The ceiling is low

    Below: the supervised scoreboard and the k-means / PCA projection. The decision tree leads, but every model hovers near chance for a three-class problem.

  5. 05 / why

    The features are the limit

    The elbow method cleanly finds k=3, matching the number of quality tiers, but PCA shows the clusters overlap heavily. The code metrics and the quality labels are only weakly related. Better features, not fancier models, are the fix.
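Steps 02 through 04 of the pipeline can be sketched as below, assuming scikit-learn. This is an illustration rather than the author's implementation: the data is synthetic, `CHURN_COL` is a hypothetical column index, the order of the preprocessing steps is an assumption, and the model hyperparameters are library defaults. Scores on random data will sit near chance and say nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

CHURN_COL = 3  # hypothetical position of the code-churn column


def preprocess(X_train, X_test):
    """Abs-transform churn, impute the mode, winsorize at 1%/99%, standardize.

    All statistics are fitted on the training split only (assumed order).
    """
    X_train, X_test = X_train.copy(), X_test.copy()
    for X in (X_train, X_test):
        X[:, CHURN_COL] = np.abs(X[:, CHURN_COL])
    imputer = SimpleImputer(strategy="most_frequent")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)
    lo, hi = np.percentile(X_train, [1, 99], axis=0)
    X_train, X_test = np.clip(X_train, lo, hi), np.clip(X_test, lo, hi)
    scaler = StandardScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test)


# Synthetic stand-in data: 1,600 rows, 9 features, 3 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1600, 9)).round(1)
X[rng.random(X.shape) < 0.02] = np.nan  # sprinkle in missing values
y = np.arange(1600) % 3

# Stratified 90/10 split, as on the page.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
X_tr, X_te = preprocess(X_tr, X_te)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=300, random_state=0),
    "KNN": KNeighborsClassifier(),
}
scores = {
    name: f1_score(y_te, model.fit(X_tr, y_tr).predict(X_te), average="weighted")
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.4f}")
```

Fitting the imputer, the winsorizing bounds, and the scaler on the training split alone keeps test-set statistics out of the model, which matters when the margin over chance is this thin.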

evaluation artifacts