Methodology
How the RFM segmentation, signal detection, and CLV regression pipeline works.
Dataset Disclosure — B2B Context
The UCI Online Retail dataset records wholesale B2B transactions from a UK gift and novelty distributor (December 2010–December 2011). Customer entities are primarily retail businesses purchasing inventory for resale — not individual end consumers. Quantities per line item frequently exceed 100 units; a meaningful number of accounts show lifetime revenue exceeding £20,000 — consistent with wholesale account values, not individual retail spending.
All RFM segmentation, CLV modelling, and win-back recommendations are demonstrated on this dataset to illustrate analytical methodology. The strategic recommendations are framed as if applied to a B2C retail environment — the intended deployment context. In a production engagement, the identical techniques would apply to direct consumer transaction data (loyalty programme records, app purchase histories, or CRM exports) from platforms such as Flipkart, Nykaa, or Reliance Retail.
Win-back response rates of 10–20% assumed in ROI modelling are consumer retail benchmarks. B2B wholesale win-back rates are typically lower (5–12%), but each recovered account carries a higher revenue impact.
RFM Scoring Methodology
Each customer receives three scores — Recency (R), Frequency (F), and Monetary (M) — on a 1–5 scale using quintile-based binning. Score 5 indicates the strongest behaviour in that dimension; score 1 the weakest. Segments are then assigned based on score combinations.
| Segment | Criteria |
|---|---|
| High Value | R≥4, F≥4, M≥4 |
| Loyal | R≥3, F≥3, M≥3 (not High Value) |
| Promising | R≥3, F≥2, M≥2 |
| At-Risk | R≤2, F≥3, M≥3 |
| Lost | (R=1, F=1) or (R≤2, F≤2) |
| Low Value | All others |
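The scoring and segment rules above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names `recency_days`, `frequency`, and `monetary`, the helper names `score_rfm` and `assign_segment`, and the sample values are all hypothetical. It also assumes that rules are evaluated top to bottom (first match wins) and that ties are broken by rank before binning, neither of which the text states explicitly.

```python
import pandas as pd


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Add 1-5 quintile scores to a frame with recency_days, frequency, monetary."""
    out = rfm.copy()

    def quintile(col: str, labels: list) -> pd.Series:
        # Assumption: rank with method="first" to break ties, so qcut always
        # finds five distinct bin edges (many customers share frequency == 1).
        ranks = out[col].rank(method="first")
        return pd.qcut(ranks, 5, labels=labels).astype(int)

    # Fewer days since the last purchase is better, so recency labels are reversed.
    out["R_score"] = quintile("recency_days", [5, 4, 3, 2, 1])
    out["F_score"] = quintile("frequency", [1, 2, 3, 4, 5])
    out["M_score"] = quintile("monetary", [1, 2, 3, 4, 5])
    return out


def assign_segment(r: int, f: int, m: int) -> str:
    """Apply the segment table top to bottom; the first matching rule wins (assumed)."""
    if r >= 4 and f >= 4 and m >= 4:
        return "High Value"
    if r >= 3 and f >= 3 and m >= 3:
        return "Loyal"
    if r >= 3 and f >= 2 and m >= 2:
        return "Promising"
    if r <= 2 and f >= 3 and m >= 3:
        return "At-Risk"
    # (R=1, F=1) is a subset of (R<=2, F<=2), so one check covers both clauses.
    if r <= 2 and f <= 2:
        return "Lost"
    return "Low Value"


# Hypothetical per-customer aggregates, for illustration only.
example = pd.DataFrame({
    "recency_days": [1, 5, 10, 20, 40, 80, 120, 200, 300, 370],
    "frequency": [30, 20, 15, 10, 8, 5, 3, 2, 1, 1],
    "monetary": [9000, 7000, 5000, 3000, 2000, 1000, 500, 300, 100, 50],
})

scored = score_rfm(example)
scored["segment"] = [
    assign_segment(r, f, m)
    for r, f, m in zip(scored["R_score"], scored["F_score"], scored["M_score"])
]
```

Because the Loyal and Promising criteria overlap, the order of the checks in `assign_segment` determines the result; reordering them would change segment sizes.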
Data Preprocessing Pipeline
| Step | Action | Volume | Rationale |
|---|---|---|---|
| 1 | Raw dataset | 541,909 rows | UCI Online Retail, Dec 2010–Dec 2011 |
| 2 | Cancelled order removal | ~9,685 rows removed | InvoiceNo prefixed 'C' and Quantity < 0. Cancellations are not netted against their originating orders, so M values are slightly overstated for high-return customers. |
| 3 | Null CustomerID removal | ~135,080 rows removed | Anonymous guest purchases cannot be attributed. These rows represent a material revenue share, so all aggregate revenue claims in this project cover attributed customers only. |
| 4 | Duplicate row removal | ~5,268 rows removed | Exact duplicates across all 8 columns; near-duplicates resolved by retaining most complete Description. |
| 5 | UnitPrice anomaly removal | ~1,456 rows removed | UnitPrice ≤ 0. Rows with UnitPrice = 0 excluded from monetary calculations but retained for frequency counts. |
| 6 | Final dataset | 392,692 rows, 4,338 customers | Non-product service codes (POST, DOT) excluded from SKU analysis. |
Quantity outliers (e.g. Quantity > 5,000) were retained — consistent with wholesale bulk ordering and critical for accurately representing the highest-value accounts.
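The filtering steps in the table can be sketched as a single pandas function. This is an illustrative sketch rather than the project's actual pipeline: the function name `clean_online_retail` and the sample rows are hypothetical, the column names follow the UCI Online Retail schema, and the near-duplicate resolution and service-code exclusion described in the table are left out.

```python
import pandas as pd


def clean_online_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 2-5 of the preprocessing table, in order."""
    df = df.copy()
    df["InvoiceNo"] = df["InvoiceNo"].astype(str)

    # Step 2: drop cancellations (InvoiceNo starts with 'C' and Quantity < 0).
    is_cancel = df["InvoiceNo"].str.startswith("C") & (df["Quantity"] < 0)
    df = df[~is_cancel]

    # Step 3: drop rows that cannot be attributed to a customer.
    df = df.dropna(subset=["CustomerID"])

    # Step 4: drop exact duplicates across all columns.
    df = df.drop_duplicates()

    # Step 5: drop rows with a non-positive unit price.
    df = df[df["UnitPrice"] > 0]

    # Quantity outliers are deliberately left in place.
    return df.reset_index(drop=True)


# Hypothetical sample rows, for illustration only.
columns = ["InvoiceNo", "StockCode", "Description", "Quantity",
           "InvoiceDate", "UnitPrice", "CustomerID", "Country"]
raw = pd.DataFrame([
    ["536365", "85123A", "HEART T-LIGHT HOLDER", 6, "2010-12-01", 2.55, 17850.0, "United Kingdom"],
    ["536365", "85123A", "HEART T-LIGHT HOLDER", 6, "2010-12-01", 2.55, 17850.0, "United Kingdom"],
    ["C536379", "D", "Discount", -1, "2010-12-01", 27.50, 14527.0, "United Kingdom"],
    ["536414", "22139", None, 56, "2010-12-01", 0.00, None, "United Kingdom"],
    ["536500", "22632", "HAND WARMER", 6, "2010-12-02", 0.00, 13047.0, "United Kingdom"],
    ["536501", "POST", "POSTAGE", 1, "2010-12-02", 18.00, 12583.0, "France"],
], columns=columns)

cleaned = clean_online_retail(raw)
```

Applying the filters in this fixed order matters for the per-step row counts, since a row can fail more than one check (for example, a cancellation with a null CustomerID).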
CLV Regression
| Component | Value |
|---|---|
| Features | R_score, F_score, M_score (quintile-scored 1–5) |
| Target | clv_proxy (winsorized at P99 = £1,057,680) |
| Model | Linear Regression (scikit-learn) |
| R² | 0.1426 (14.3% of CLV variance explained) |
Note: clv_proxy is computed as (total_revenue / customer_lifespan_days) × 365 × frequency. Single-transaction customers (lifespan = 1 day) produce artificially inflated values; winsorization at P99 caps these extremes before training.
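The target construction and model fit can be sketched as below. This is a minimal example under stated assumptions: the customer values are made up, the column names `total_revenue`, `lifespan_days`, and `clv_winsor` are hypothetical, winsorization is applied to the upper tail only, and R² is computed in-sample rather than on a held-out split.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical per-customer aggregates and scores, for illustration only.
customers = pd.DataFrame({
    "total_revenue": [1000.0, 500.0, 2400.0, 150.0, 8000.0, 320.0, 60.0, 4500.0],
    "lifespan_days": [100, 1, 300, 30, 365, 60, 1, 200],
    "frequency": [4, 1, 12, 2, 5, 3, 1, 10],
    "R_score": [4, 2, 5, 3, 5, 3, 1, 4],
    "F_score": [3, 1, 5, 2, 4, 3, 1, 4],
    "M_score": [3, 2, 4, 1, 5, 2, 1, 5],
})

# clv_proxy = (total_revenue / customer_lifespan_days) * 365 * frequency
customers["clv_proxy"] = (
    customers["total_revenue"] / customers["lifespan_days"] * 365 * customers["frequency"]
)

# Winsorize the upper tail at the 99th percentile (assumed: upper tail only).
cap = customers["clv_proxy"].quantile(0.99)
customers["clv_winsor"] = customers["clv_proxy"].clip(upper=cap)

# Fit ordinary least squares on the three RFM scores.
X = customers[["R_score", "F_score", "M_score"]]
y = customers["clv_winsor"]
model = LinearRegression().fit(X, y)

# In-sample coefficient of determination.
r2 = model.score(X, y)
```

In this toy data the second customer has a one-day lifespan, so a £500 purchase annualizes to a clv_proxy of £182,500 and is the value that the cap pulls down. On real data a train/test split would give a more honest R² than the in-sample score used here.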