RFM Decision Intelligence

Methodology

How the RFM segmentation, signal detection, and CLV regression pipeline works.

Dataset Disclosure — B2B Context

The UCI Online Retail dataset records wholesale B2B transactions from a UK gift and novelty distributor (December 2010–December 2011). Customer entities are primarily retail businesses purchasing inventory for resale — not individual end consumers. Quantities per line item frequently exceed 100 units; a meaningful number of accounts show lifetime revenue exceeding £20,000 — consistent with wholesale account values, not individual retail spending.

All RFM segmentation, CLV modelling, and win-back recommendations are demonstrated on this dataset to illustrate analytical methodology. The strategic recommendations are framed as if applied to a B2C retail environment — the intended deployment context. In a production engagement, the identical techniques would apply to direct consumer transaction data (loyalty programme records, app purchase histories, or CRM exports) from platforms such as Flipkart, Nykaa, or Reliance Retail.

The 10–20% win-back response rates assumed in the ROI modelling are consumer retail benchmarks. B2B wholesale win-back rates are typically lower (5–12%), but each recovered account carries a higher revenue impact.

RFM Scoring Methodology

Each customer receives three scores — Recency (R), Frequency (F), and Monetary (M) — on a 1–5 scale using quintile-based binning. Score 5 indicates the strongest behaviour in that dimension; score 1 the weakest. Segments are then assigned based on score combinations.

| Segment | Criteria |
|---|---|
| High Value | R≥4, F≥4, M≥4 |
| Loyal | R≥3, F≥3, M≥3 (not High Value) |
| Promising | R≥3, F≥2, M≥2 |
| At-Risk | R≤2, F≥3, M≥3 |
| Lost | (R=1, F=1) or (R≤2, F≤2) |
| Low Value | All others |
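The scoring and segment rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names `quintile_scores` and `assign_segment`, the rank-based tie handling, the top-to-bottom rule precedence, and the five sample customers are all assumptions made for the example.

```python
def quintile_scores(values, higher_is_better=True):
    """Return a 1-5 score per value using rank-based quintiles.

    Ties are broken by input order (an assumption, chosen so that every
    quintile holds roughly the same number of customers).
    """
    n = len(values)
    # Sort positions so the weakest behaviour comes first (rank 0 -> score 1).
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = 1 + (rank * 5) // n
    return scores


def assign_segment(r, f, m):
    """Apply the segment rules top to bottom; the first match wins (assumed precedence)."""
    if r >= 4 and f >= 4 and m >= 4:
        return "High Value"
    if r >= 3 and f >= 3 and m >= 3:
        return "Loyal"
    if r >= 3 and f >= 2 and m >= 2:
        return "Promising"
    if r <= 2 and f >= 3 and m >= 3:
        return "At-Risk"
    if (r == 1 and f == 1) or (r <= 2 and f <= 2):
        return "Lost"
    return "Low Value"


# Hypothetical customers: days since last purchase, order count, total spend (GBP).
recency_days = [3, 40, 120, 300, 15]
frequency = [12, 5, 2, 1, 8]
monetary = [5200.0, 900.0, 310.0, 45.0, 2100.0]

r = quintile_scores(recency_days, higher_is_better=False)  # fewer days is better
f = quintile_scores(frequency)
m = quintile_scores(monetary)
segments = [assign_segment(*rfm) for rfm in zip(r, f, m)]
print(list(zip(r, f, m, segments)))
```

Recency is scored with `higher_is_better=False` because a smaller number of days since the last purchase is the stronger behaviour. Because the rules overlap (every Loyal customer also satisfies the Promising criteria), the order in which they are checked matters.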

Data Preprocessing Pipeline

| Step | Action | Volume | Rationale |
|---|---|---|---|
| 1 | Raw dataset | 541,909 rows | UCI Online Retail, Dec 2010–Dec 2011 |
| 2 | Cancelled order removal | ~9,685 rows removed | InvoiceNo prefixed 'C' and Quantity < 0. Cancellations are not netted against originating orders, so M values are slightly overstated for high-return customers. |
| 3 | Null CustomerID removal | ~135,080 rows removed | Anonymous guest purchases cannot be attributed. They represent a material revenue share, so all aggregate revenue claims in this project cover attributed customers only. |
| 4 | Duplicate row removal | ~5,268 rows removed | Exact duplicates across all 8 columns; near-duplicates resolved by retaining the most complete Description. |
| 5 | UnitPrice anomaly removal | ~1,456 rows removed | UnitPrice ≤ 0. Rows with UnitPrice = 0 excluded from monetary calculations but retained for frequency counts. |
| 6 | Final dataset | 392,692 rows, 4,338 customers | Non-product service codes (POST, DOT) excluded from SKU analysis. |

Quantity outliers (e.g. Quantity > 5,000) were retained — consistent with wholesale bulk ordering and critical for accurately representing the highest-value accounts.
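Steps 2 to 5 amount to a sequence of row filters, which the following sketch applies in one pass over plain dictionaries. It is an illustration under stated assumptions rather than the project's pipeline: `clean_transactions` is a hypothetical name, a missing CustomerID is represented as `None`, the cancellation rule is read literally (both conditions must hold), and the near-duplicate resolution and the zero-price frequency exception described in the table are not modelled. The eight column names are those of the UCI Online Retail file.

```python
COLUMNS = ("InvoiceNo", "StockCode", "Description", "Quantity",
           "InvoiceDate", "UnitPrice", "CustomerID", "Country")


def clean_transactions(rows):
    """Apply steps 2-5 in order and return (kept_rows, removed_counts)."""
    removed = {"cancelled": 0, "null_customer": 0, "duplicate": 0, "bad_price": 0}
    seen = set()
    kept = []
    for row in rows:
        # Step 2: cancelled orders ('C' prefix and negative quantity).
        if str(row["InvoiceNo"]).startswith("C") and row["Quantity"] < 0:
            removed["cancelled"] += 1
            continue
        # Step 3: rows that cannot be attributed to a customer.
        if row["CustomerID"] is None:
            removed["null_customer"] += 1
            continue
        # Step 4: exact duplicates across all 8 columns (first occurrence kept).
        key = tuple(row[c] for c in COLUMNS)
        if key in seen:
            removed["duplicate"] += 1
            continue
        seen.add(key)
        # Step 5: non-positive unit prices.
        if row["UnitPrice"] <= 0:
            removed["bad_price"] += 1
            continue
        # Large quantities are deliberately kept: no outlier filter here.
        kept.append(row)
    return kept, removed


# Hypothetical rows for illustration only.
base = dict(InvoiceNo="536365", StockCode="85123A", Description="HOLDER",
            Quantity=6, InvoiceDate="2010-12-01 08:26", UnitPrice=2.55,
            CustomerID=17850, Country="United Kingdom")
sample = [
    base,
    dict(base),                                      # exact duplicate
    {**base, "InvoiceNo": "C536379", "Quantity": -1},  # cancellation
    {**base, "StockCode": "X", "CustomerID": None},    # anonymous purchase
    {**base, "StockCode": "Y", "UnitPrice": 0.0},      # zero price
    {**base, "StockCode": "Z", "Quantity": 6000},      # bulk order, retained
]
kept, removed = clean_transactions(sample)
print(len(kept), removed)
```

Counting removals per step makes it straightforward to compare a run against the volumes reported in the table.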

CLV Regression

Features: R_score, F_score, M_score (quintile-scored 1–5)

Target: clv_proxy (winsorized at P99 = £1,057,680)

Model: Linear Regression (scikit-learn)

In-sample R²: 0.1426 (14.3% of CLV variance explained)
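To make the in-sample R² figure concrete, the sketch below fits an ordinary least squares model on three score features and evaluates R² on the same rows used for fitting. It is dependency-free on purpose: it solves the normal equations directly instead of calling scikit-learn's LinearRegression, so it is a stand-in for the fitting step, not the project's code. The function names and the six sample rows are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    A = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * yi for a, yi in zip(A, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][k] for i in range(k)]


def predict(beta, X):
    """Apply fitted coefficients to each feature row."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]


def r_squared(y, y_hat):
    """Share of variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical (R_score, F_score, M_score) rows and target values.
scores = [(1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 5, 2), (5, 3, 5), (2, 4, 4)]
clv = [7.0, 12.0, 10.5, 13.0, 25.0, 17.0]

beta = fit_ols(scores, clv)
in_sample_r2 = r_squared(clv, predict(beta, scores))
print(beta, in_sample_r2)
```

Because the score is computed on the training rows, it measures fit rather than predictive accuracy; a held-out split would be needed to estimate the latter.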

Note: clv_proxy is computed as (total_revenue / customer_lifespan_days) × 365 × frequency. Single-transaction customers (lifespan = 1 day) produce artificially inflated values because their revenue is annualised from a single day; winsorization at P99 caps these values before training.
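The formula and the capping step can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the percentile uses linear interpolation between order statistics, and only the upper tail is capped (the source states the P99 threshold but not whether a lower bound is also applied). The sample values are invented.

```python
def clv_proxy(total_revenue, lifespan_days, frequency):
    """Annualised revenue rate scaled by purchase frequency."""
    return (total_revenue / lifespan_days) * 365 * frequency


def percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def winsorize_upper(values, p=99):
    """Cap every value above the p-th percentile at that percentile."""
    cap = percentile(values, p)
    return [min(v, cap) for v in values]


# A customer spending 3,650 over a full year in 10 orders ...
regular = clv_proxy(3650.0, 365, 10)
# ... versus a single 1,000 order with lifespan = 1 day.
one_off = clv_proxy(1000.0, 1, 1)
print(regular, one_off)

# Hypothetical proxy values with one extreme entry.
capped = winsorize_upper([1.0, 2.0, 3.0, 4.0, 100.0])
print(capped)
```

In this example the one-off buyer's proxy is ten times the regular customer's despite lower total spend, which is the inflation the note describes.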

Tableau Dashboards

Dashboard 1 — RFM Segmentation Overview

Dashboard 2 — Segment Deep Dive