News Topic Classification
Twenty-seven experiments in one grid. Hover any cell to compare macro-F1 against training cost, from a 1.5s logistic regression to a 35-minute BERT.
- best macro-F1: 0.9376 (BERT-Base, none)
- best non-transformer: 0.9214 (Bi-GRU, 18.7s)
- BERT train time: ~35 min (vs 19s for Bi-GRU)
- F1 gap: 0.016 (BERT over Bi-GRU)
- corpus: 102k training headlines, 4 classes
- test set: 12,000, balanced across classes
- experiments: 27 (9 models x 3 pipelines)
- imbalance: 3.4x, handled by class weighting
[Interactive results grid: rows are the 9 models (LogReg, DNN, RNN, GRU, LSTM, Bi-RNN, Bi-GRU, Bi-LSTM, BERT-Base); columns are the 3 preprocessing pipelines (none, extreme, optimum); ★ marks the best run; hover any cell.]
best run: BERT-Base (none / WordPiece)
- macro-F1: 0.9376
- training time: 35.2 min
[Charts: macro-F1 vs field; training cost (log scale)]
BERT buys 0.016 macro-F1 over Bi-GRU for roughly 100x the training time.
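The trade-off above is simple arithmetic on the numbers already quoted on this page. A quick check, using only those figures (the variable names are illustrative):

```python
# Figures quoted on this page.
bert_f1, bigru_f1 = 0.9376, 0.9214
bert_seconds = 35.2 * 60   # 35.2 min expressed in seconds
bigru_seconds = 18.7

f1_gap = bert_f1 - bigru_f1
time_ratio = bert_seconds / bigru_seconds

print(f"F1 gap: {f1_gap:.4f}")                    # F1 gap: 0.0162
print(f"training-time ratio: {time_ratio:.0f}x")  # training-time ratio: 113x
```

The exact ratio is about 113x, which the headline rounds to roughly 100x.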
the pipeline
From raw data to a verifiable result
- 01 / dataset
Four topics, imbalanced
102,002 training and 12,000 test headlines across Science & Technology, Business, Sports, and World News. Training is imbalanced 3.4x; the test set is balanced.
- 02 / preprocessing
Three pipelines
None (raw, HTML included), Extreme (stemming and full stopword removal), and Optimum (lemmatization, negation-preserving). Each is applied identically to every model.
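The three pipelines can be sketched in a few lines. This is a toy illustration, not the project's code: the stopword list is tiny, `crude_stem` is a stand-in for a real Porter stemmer, the `lemmatize` hook defaults to a no-op, and stripping HTML in the Extreme and Optimum pipelines is an assumption drawn from "raw, HTML included" applying only to None.

```python
import html
import re

# Tiny illustrative stopword list (assumption; a real pipeline would use a full one).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "on", "and",
             "not", "no", "never"}
NEGATIONS = {"not", "no", "never"}

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def crude_stem(token):
    """Stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Drop HTML tags, decode entities, lowercase, split into word tokens."""
    text = html.unescape(TAG_RE.sub(" ", text)).lower()
    return TOKEN_RE.findall(text)


def pipeline_none(text):
    """Raw text, HTML included."""
    return text


def pipeline_extreme(text):
    """Stemming plus full stopword removal (negations are dropped too)."""
    return " ".join(crude_stem(t) for t in tokenize(text) if t not in STOPWORDS)


def pipeline_optimum(text, lemmatize=lambda t: t):
    """Lemmatization hook plus stopword removal that keeps negation words."""
    removable = STOPWORDS - NEGATIONS
    return " ".join(lemmatize(t) for t in tokenize(text) if t not in removable)


headline = "<b>Stocks</b> are not rallying on the news"
for fn in (pipeline_none, pipeline_extreme, pipeline_optimum):
    print(f"{fn.__name__}: {fn(headline)}")
```

On this headline the Extreme pipeline discards "not" along with the other stopwords, while Optimum keeps it, which is the point of negation-preserving filtering.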
none: worst-case baseline / extreme: Porter stemming / optimum: from EDA
- 03 / representations
TF-IDF to WordPiece
TF-IDF for the classical models, from-scratch Skip-gram embeddings for six recurrent variants, and WordPiece for BERT-Base.
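For readers new to TF-IDF, here is a minimal standard-library sketch. It uses the textbook weighting, term frequency times log(N / document frequency); the weighting variant the project actually used is not stated here, and libraries often apply smoothing or normalization on top.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


toy_headlines = ["stocks rally", "stocks fall"]  # hypothetical examples
vectors = tfidf(toy_headlines)
print(vectors[0])
```

A term that appears in every headline ("stocks") gets weight zero, while a term unique to one headline ("rally") gets a positive weight.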
- 04 / training
Nine architectures
From logistic regression to bidirectional gated networks to a fine-tuned transformer, all under inverse-frequency class weighting on an 8GB RTX 3070.
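Inverse-frequency class weighting can be sketched as follows. The formula, weight = N / (K x class count), is one common convention and is an assumption here, since the exact formula used is not given; the label counts are toy values chosen only to reproduce a 3.4x imbalance.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical class counts with a 3.4x gap between largest and smallest class.
toy_labels = ["a"] * 340 + ["b"] * 200 + ["c"] * 150 + ["d"] * 100
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Under this convention the rarest class is weighted 3.4 times as heavily as the most common one, so each class contributes equally to the total loss.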
- 05 / results
The 27-run matrix
Below: every model crossed with every pipeline. The story is in the contrasts, so hover cells to compare macro-F1 against training cost.
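Macro-F1 is the unweighted mean of per-class F1, so every topic counts equally regardless of how many headlines it has. A minimal standard-library sketch, with made-up labels for illustration:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical predictions, not results from the grid.
y_true = ["sports", "sports", "world", "business"]
y_pred = ["sports", "world", "world", "business"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```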
- 06 / insight
BERT breaks the rule
Preprocessing that helps shallow models hurts BERT: its WordPiece tokenizer handles raw HTML gracefully and is degraded by stemming. Preprocessing is model-specific, not universal.
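One plausible mechanism, offered as an illustration rather than a finding of these experiments: WordPiece splits words by greedy longest-match against a fixed vocabulary, so a truncated stem that is not in the vocabulary gets broken into fragments that a whole word would not. The sketch below implements that greedy matching with a tiny hypothetical vocabulary; "busi" is the kind of truncated form a Porter-style stemmer can produce from "business".

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has about 30k entries.
toy_vocab = {"business", "bus", "##i", "##iness"}
print(wordpiece("business", toy_vocab))  # ['business']
print(wordpiece("busi", toy_vocab))      # ['bus', '##i']
```

The unstemmed word maps to a single piece, while the stem is split into pieces that carry a different meaning.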
evaluation artifacts