
News Topic Classification

Twenty-seven experiments in one grid. Hover any cell to compare accuracy against training cost, from a 1.5s logistic regression to a 35-minute BERT.

[Interactive grid: 9 models x 3 preprocessing pipelines; ★ marks the best run. Best run: BERT-Base, preprocessing none, WordPiece. Panels: macro-F1 vs field; training cost (log scale).]

BERT buys 0.016 macro-F1 over Bi-GRU for roughly 100x the training time.

the pipeline

From raw data to a verifiable result

01 / dataset
Four topics, imbalanced
102,002 training and 12,000 test headlines across four topics: Science & Technology, Business, Sports, and World News. The training set is imbalanced by a factor of 3.4x; the test set is balanced.
02 / preprocessing
Three pipelines
None (raw, HTML included), Extreme (stemming and full stopword removal), and Optimum (lemmatization, negation-preserving). Each is applied identically to every model.
none: worst-case baseline / extreme: Porter stemming / optimum: from EDA
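As a rough illustration of what "negation-preserving" means in practice, the sketch below contrasts full stopword removal with a variant that keeps negation words. It uses only the standard library; the tiny stopword list, the negation set, and the function names are assumptions made for this example, not the project's actual code, and stemming and lemmatization are left out.

```python
import html
import re

# Illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and",
             "not", "no", "never"}
# Negation words that a negation-preserving pipeline keeps (assumed set).
NEGATIONS = {"not", "no", "never"}

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Strip HTML tags, decode entities, lowercase, and split into tokens."""
    text = html.unescape(TAG_RE.sub(" ", text))
    return TOKEN_RE.findall(text.lower())


def remove_stopwords(tokens, keep_negations):
    """Drop stopwords, optionally sparing negation words."""
    drop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [t for t in tokens if t not in drop]


headline = "<b>Fed</b> is not raising rates &amp; markets rally"
tokens = tokenize(headline)
full_removal = remove_stopwords(tokens, keep_negations=False)
negation_preserving = remove_stopwords(tokens, keep_negations=True)
```

With full removal the headline loses "not" and its meaning flips; the negation-preserving variant keeps it.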
03 / representations
TF-IDF to WordPiece
TF-IDF for the classical models, from-scratch Skip-gram embeddings for six recurrent variants, and WordPiece for BERT-Base.
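For readers unfamiliar with TF-IDF, here is a minimal pure-Python sketch of the textbook variant (relative term frequency times the log of inverse document frequency). Libraries typically add smoothing and normalization, so the weights below will not match any particular implementation; the function name and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical headlines, already tokenized).
docs = [
    ["stocks", "rally", "on", "earnings"],
    ["team", "wins", "on", "penalty"],
]
vectors = tfidf(docs)
```

A term that appears in every document, such as "on" here, gets weight zero, while terms unique to one headline carry the signal.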
04 / training
Nine architectures
From logistic regression to bidirectional gated networks to a fine-tuned transformer, all under inverse-frequency class weighting on an 8GB RTX 3070.
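Inverse-frequency class weighting can be sketched as follows. The formula used here, n_samples / (n_classes * class_count), is one common convention and is assumed rather than taken from the project; the label counts are toy values, not the real class distribution.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy labels (hypothetical): one class twice as frequent as the others.
labels = ["sports"] * 6 + ["business"] * 3 + ["world"] * 3
weights = inverse_frequency_weights(labels)
```

Rarer classes receive proportionally larger weights, so each class contributes about equally to the loss despite the imbalance.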
05 / results
The 27-run matrix
Below: every model times every pipeline. The story is in the contrasts, so hover cells to compare macro-F1 against training cost.
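Macro-F1 is the unweighted mean of per-class F1 scores, so every topic counts equally regardless of its size. A minimal sketch of the computation, with a hypothetical function name and toy labels:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical): three items of class "a", one of class "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = macro_f1(y_true, y_pred)
```

In this toy case the per-class F1 scores are 0.8 for "a" and 2/3 for "b", so the macro average sits below the 0.75 accuracy because the minority class is weighted as heavily as the majority class.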
06 / insight
BERT breaks the rule
Preprocessing that helps shallow models hurts BERT: its WordPiece tokenizer handles raw HTML gracefully and is degraded by stemming. Preprocessing is model-specific, not universal.
