Methodology

How the RFM segmentation, signal detection, and CLV regression pipeline works.

Dataset Disclosure — B2B Context

The UCI Online Retail dataset records wholesale B2B transactions from a UK gift and novelty distributor (December 2010–December 2011). Customer entities are primarily retail businesses purchasing inventory for resale — not individual end consumers. Quantities per line item frequently exceed 100 units; a meaningful number of accounts show lifetime revenue exceeding £20,000 — consistent with wholesale account values, not individual retail spending.

All RFM segmentation, CLV modelling, and win-back recommendations are demonstrated on this dataset to illustrate analytical methodology. The strategic recommendations are framed as if applied to a B2C retail environment — the intended deployment context. In a production engagement, the identical techniques would apply to direct consumer transaction data (loyalty programme records, app purchase histories, or CRM exports) from platforms such as Flipkart, Nykaa, or Reliance Retail.

The win-back response rates of 10–20% assumed in the ROI modelling are consumer retail benchmarks. B2B wholesale win-back rates are typically lower (5–12%), but each recovered account carries a higher revenue impact.

RFM Scoring Methodology

Each customer receives three scores — Recency (R), Frequency (F), and Monetary (M) — on a 1–5 scale using quintile-based binning. Score 5 indicates the strongest behaviour in that dimension; score 1 the weakest. Segments are then assigned based on score combinations.

Segment	Criteria
High Value	R≥4, F≥4, M≥4
Loyal	R≥3, F≥3, M≥3 (not High Value)
Promising	R≥3, F≥2, M≥2
At-Risk	R≤2, F≥3, M≥3
Lost	(R=1, F=1) or (R≤2, F≤2)
Low Value	All others
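
The scoring and segment rules above can be sketched in plain Python. This is an illustrative sketch, not the project's actual code: the function names `quintile_scores` and `assign_segment` are hypothetical, the rank-based tie handling is an assumption, and the rules are assumed to be evaluated top to bottom in table order, with the first match winning.

```python
from typing import List


def quintile_scores(values: List[float], higher_is_better: bool = True) -> List[int]:
    """Rank-based quintile binning: returns a 1-5 score for each value.

    Assumption: ties are broken by input order, so every quintile
    holds roughly the same number of customers.
    """
    n = len(values)
    # Order positions from weakest to strongest behaviour.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = rank * 5 // n + 1  # ranks 0..n-1 map evenly onto 1..5
    return scores


def assign_segment(r: int, f: int, m: int) -> str:
    """Apply the segment table top to bottom; the first matching rule wins."""
    if r >= 4 and f >= 4 and m >= 4:
        return "High Value"
    if r >= 3 and f >= 3 and m >= 3:
        return "Loyal"
    if r >= 3 and f >= 2 and m >= 2:
        return "Promising"
    if r <= 2 and f >= 3 and m >= 3:
        return "At-Risk"
    if r <= 2 and f <= 2:  # also covers the R=1, F=1 case
        return "Lost"
    return "Low Value"


# Hypothetical recency values in days since last purchase.
# Fewer days is stronger behaviour, so higher_is_better=False.
recency_days = [3, 400, 30, 200, 10, 90, 365, 5, 150, 60]
r_scores = quintile_scores(recency_days, higher_is_better=False)

print(assign_segment(2, 4, 3))  # At-Risk
```

Frequency and Monetary would be scored with the default `higher_is_better=True`. Because several rules overlap (a customer scoring 3/3/3 satisfies both Loyal and Promising), the evaluation order matters.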

Data Preprocessing Pipeline

Step	Action	Volume	Rationale
1	Raw dataset	541,909 rows	UCI Online Retail, Dec 2010–Dec 2011
2	Cancelled order removal	~9,685 rows removed	InvoiceNo prefixed 'C' and Quantity < 0. Cancellations are not netted against their originating orders, so M values are slightly overstated for high-return customers.
3	Null CustomerID removal	~135,080 rows removed	Anonymous guest purchases cannot be attributed to a customer. They represent a material share of revenue, so all aggregate revenue claims in this project cover attributed customers only.
4	Duplicate row removal	~5,268 rows removed	Exact duplicates across all 8 columns; near-duplicates resolved by retaining most complete Description.
5	UnitPrice anomaly removal	~1,456 rows removed	UnitPrice ≤ 0. Rows with UnitPrice = 0 excluded from monetary calculations but retained for frequency counts.
6	Final dataset	392,692 rows, 4,338 customers	Non-product service codes (POST, DOT) excluded from SKU analysis.

Quantity outliers (e.g. Quantity > 5,000) were retained — consistent with wholesale bulk ordering and critical for accurately representing the highest-value accounts.
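
Steps 2 to 5 of the table can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function name `clean_transactions` and the toy rows are hypothetical, near-duplicate resolution is omitted, and rows with UnitPrice ≤ 0 are simply dropped. Column names follow the UCI Online Retail schema.

```python
import pandas as pd


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 2-5 of the preprocessing table to a raw transactions frame."""
    df = raw.copy()

    # Step 2: drop cancellations (InvoiceNo prefixed 'C' with negative Quantity).
    cancelled = df["InvoiceNo"].astype(str).str.startswith("C") & (df["Quantity"] < 0)
    df = df[~cancelled]

    # Step 3: drop rows that cannot be attributed to a customer.
    df = df.dropna(subset=["CustomerID"])

    # Step 4: drop exact duplicates across all columns.
    # (Near-duplicate resolution by Description is not shown here.)
    df = df.drop_duplicates()

    # Step 5: drop price anomalies (assumption: UnitPrice <= 0 rows are removed outright).
    df = df[df["UnitPrice"] > 0]

    # Quantity outliers are deliberately kept, as described above.
    return df.reset_index(drop=True)


# Hypothetical toy rows, one per cleaning rule; not taken from the real dataset.
raw = pd.DataFrame({
    "InvoiceNo": ["100001", "100001", "C100002", "100003", "100004", "100005"],
    "StockCode": ["A1", "A1", "B2", "C3", "D4", "E5"],
    "Description": ["ITEM A", "ITEM A", "ITEM B", "ITEM C", "ITEM D", "ITEM E"],
    "Quantity": [6, 6, -1, 56, 10, 6000],
    "InvoiceDate": pd.to_datetime(["2010-12-01"] * 6),
    "UnitPrice": [2.55, 2.55, 27.50, 1.25, 0.0, 1.04],
    "CustomerID": [17850.0, 17850.0, 14527.0, None, 12345.0, 13000.0],
    "Country": ["United Kingdom"] * 6,
})

cleaned = clean_transactions(raw)
print(len(raw), "->", len(cleaned))
```

In the toy frame, one row each is removed as a duplicate, a cancellation, an unattributed purchase, and a zero-price line, while the 6,000-unit bulk order survives.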

CLV Regression

Component	Specification
Features	R_score, F_score, M_score (quintile-scored 1–5)
Target	clv_proxy (winsorized at P99 = £1,057,680)
Model	Linear Regression (scikit-learn)
In-sample R²	0.1426 (14.3% of CLV variance explained)

Note: clv_proxy is computed as (total_revenue / customer_lifespan_days) × 365 × frequency. Single-transaction customers (lifespan = 1 day) produce artificially inflated values; winsorizing the target at P99 caps these values before training.
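
The proxy formula, the P99 cap, and the in-sample R² can be illustrated with NumPy. This is a sketch under stated assumptions: the helper names and the six toy customers are hypothetical, and `np.linalg.lstsq` is swapped in for scikit-learn's `LinearRegression` to keep the example dependency-light. Both fit ordinary least squares with an intercept, so the in-sample R² is computed the same way.

```python
import numpy as np


def clv_proxy(total_revenue, lifespan_days, frequency):
    """(total_revenue / customer_lifespan_days) * 365 * frequency, element-wise."""
    revenue = np.asarray(total_revenue, dtype=float)
    lifespan = np.asarray(lifespan_days, dtype=float)
    return revenue / lifespan * 365.0 * np.asarray(frequency, dtype=float)


def winsorize_upper(values, pct=99.0):
    """Cap values at the given upper percentile; returns (capped, cap)."""
    values = np.asarray(values, dtype=float)
    cap = np.percentile(values, pct)
    return np.minimum(values, cap), cap


def fit_ols_r2(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, in-sample R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Hypothetical customers; the third and sixth have a single transaction (lifespan = 1 day).
total_revenue = [500.0, 1200.0, 80.0, 3000.0, 250.0, 40.0]
lifespan_days = [200, 365, 1, 100, 50, 1]
frequency = [4, 10, 1, 6, 2, 1]
rfm_scores = [[4, 4, 3], [5, 5, 4], [2, 1, 1], [5, 5, 5], [3, 2, 2], [1, 1, 1]]

proxy = clv_proxy(total_revenue, lifespan_days, frequency)
capped, cap = winsorize_upper(proxy, pct=99.0)
coef, r2 = fit_ols_r2(rfm_scores, capped)

print("proxy for the £80 single-transaction customer:", proxy[2])
print("P99 cap:", cap, "in-sample R^2:", round(r2, 4))
```

The third toy customer shows the inflation effect: a single £80 purchase annualises to a proxy of £29,200, above customers with far more revenue and repeat orders.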
