Dynamic Pricing with Learning
Contextual bandits · Exploration vs exploitation
Pricing under unknown demand response: each price the seller posts is both a revenue decision and a probe that reveals information about the demand curve. The seller must balance exploration (try uncertain prices to learn) against exploitation (charge the current best price to earn). Distinct from markdown optimisation, which assumes the demand function is known. Foundational papers: Besbes & Zeevi (2009, 2012); den Boer (2015) survey; Auer-Cesa-Bianchi-Fischer (2002) UCB; Thompson sampling (1933).
Why it matters
Market scale · revenue lift · theoretical guarantees · production deployments
Where the decision sits
Online price-setting · e-commerce · travel · ride-hailing
Dynamic pricing with learning applies whenever a seller can change the offered price many times during the selling horizon, observes the resulting sales, and faces an unknown price-response curve. Unlike markdown optimisation (where demand parameters are known and the question is the optimal schedule), here every posted price doubles as an experiment. Examples: e-commerce SKUs without prior elasticity estimates, ride-hailing surge multipliers, airline fare classes, hotel ADR, ad-auction bid pricing, and digital subscription pricing.
Problem & formulation
Stochastic bandit on a continuous (or discrete) price arm space
True (unknown) demand model
A linear price-response with Gaussian noise; the parameter vector \(\theta = (a, b)\) is unknown to the seller:
Intercept \(a > 0\) (max demand at zero price), slope \(b > 0\) (price sensitivity), noise scale \(\sigma\).
Sequential protocol
| Symbol | Meaning | Domain |
|---|---|---|
| \(t \in \{1, \ldots, T\}\) | Round in the selling horizon | discrete |
| \(p_t \in [p_{\min}, p_{\max}]\) | Price posted in round \(t\) (decision) | bounded |
| \(D_t\) | Realised noisy demand at price \(p_t\) | random |
| \(\theta = (a, b)\) | True demand parameters (unknown) | \(\mathbb{R}^2_+\) |
| \(H_t\) | History \(\{(p_s, D_s)\}_{s < t}\) up to round \(t\) | filtration |
Objective — maximise expected revenue
Each \(p_t\) is a function of the observed history — a policy rather than a fixed schedule.
Regret — the standard performance metric
The shortfall versus a clairvoyant oracle who knew \(\theta\) from round 1 and always charged \(p^{\ast} = a / (2b)\).
Algorithm 1 — Greedy (exploit-only)
Estimate \(\hat{\theta}_t\) by least squares on \(H_t\); always pick the empirically best price:
Simple but can lock onto a wrong estimate forever — incomplete learning trap.
Algorithm 2 — \(\varepsilon\)-greedy
Forced exploration breaks the lock-in; regret is linear in \(T\) for fixed \(\varepsilon\) but practical.
Algorithm 3 — UCB / LinUCB
Pick the price with the highest optimistic expected revenue — mean plus a confidence-radius bonus:
Optimism in the face of uncertainty. Cumulative regret \(\tilde{O}(d\sqrt{T})\) for d-dim context (Li et al., 2010).
Algorithm 4 — Thompson sampling
Sample a parameter from the posterior, then act greedily on the sample:
Both Thompson sampling and UCB achieve sub-linear regret; Thompson is often better in practice.
Real-world → OR mapping
How a pricing manager’s vocabulary translates to the bandit
| On the pricing desk | In the bandit |
|---|---|
| Number of price-tests over the horizon | \(T\) |
| Allowed price band | \([p_{\min}, p_{\max}]\) |
| Unknown elasticity / willingness-to-pay | \(\theta = (a, b)\) |
| Posted price in week / hour / minute \(t\) | \(p_t\) |
| Observed units sold | \(D_t\) |
| “A/B test budget” (random-pricing share) | \(\varepsilon\) in \(\varepsilon\)-greedy |
| Aggressiveness of exploration | \(\alpha\) in UCB |
| Lost revenue vs. omniscient oracle | \(R(T)\) |
Interactive solver
Simulate a learning seller against a true (hidden) demand curve; compare algorithms
Under the hood
We maintain a conjugate Bayesian posterior on \((a, b)\) with a Normal prior and noise std \(\sigma\), updated via Bayesian linear regression on the design matrix \(X = [\,\mathbf{1}, -\mathbf{p}\,]\) and observations \(\mathbf{D}\). At each round, the chosen rule sees the current posterior mean \(\hat{\theta}_t\) and covariance \(\Sigma_t\) and emits a price. Greedy: \(p = \hat{a}_t / (2\hat{b}_t)\) clipped to the price band. UCB: scan a discretised price grid and pick the maximum of \(p\,(\hat{a}_t - \hat{b}_t p) + \alpha \sqrt{\mathrm{Var}[p\,D(p) \mid H_t]}\). Thompson: sample \(\theta_t \sim \mathcal{N}(\hat{\theta}_t, \Sigma_t)\), then act greedily on the sample. Posterior updates run in \(O(d^2)\) per round; total simulation cost is linear in \(T\). The oracle baseline always charges \(p^{\ast} = a / (2b)\) and the random baseline draws uniformly from \([p_{\min}, p_{\max}]\).
Reading the solution
What the cumulative-revenue and price-trace tell you
Three patterns to watch for
- Sub-linear regret curve. The regret-vs-oracle gap should grow as \(\sqrt{T}\), not linearly. On the cumulative-revenue chart, the algorithm curve should run roughly parallel to (just below) the oracle line after a few dozen rounds — a clear sign learning is converging.
- Convergence of \(p_t\) to \(p^{\ast}\). In the lower panel, the posted-price scatter should funnel toward the oracle-optimal price as \(t\) grows. UCB and Thompson tighten faster than \(\varepsilon\)-greedy, which keeps probing forever.
- Posterior std on \(b\) shrinks like \(1/\sqrt{t}\). Wide posterior at \(t = 5\), narrow at \(t = 200\). If it does not shrink, the algorithm is not exploring enough — raise \(\alpha\) (UCB) or \(\varepsilon\) (greedy).
Sensitivity questions the simulator answers
- Push \(\sigma\) up — how much longer does it take to identify \(p^{\ast}\)? (Identification is hardest exactly when noise is high.)
- Switch from UCB to \(\varepsilon\)-greedy — does the regret gap widen at large \(T\)? (Yes, because \(\varepsilon\)-greedy has linear regret for fixed \(\varepsilon\).)
- Tighten the price band \([p_{\min}, p_{\max}]\) around the true \(p^{\ast}\) — how does final regret change? (Smaller band, less to lose from sub-optimal prices.)
- Drop \(\alpha\) toward zero — UCB collapses to greedy and risks getting stuck on a wrong estimate.
Model extensions
From single-SKU baseline to production retail-pricing variants
Contextual bandits
Per-customer-segment pricing where the demand parameters depend on observed features (device, geography, channel). LinUCB reduces to ridge regression with a UCB bonus per arm.
Personalisation →Inventory-constrained pricing
Limited stock turns the bandit into a finite-horizon revenue management problem: each unit sold is one less to test on. Couples directly to clearance pricing.
Markdown →Competitor-aware pricing
Demand depends on (own price, competitor price). The bandit becomes a two-player game; equilibrium learning replaces single-agent regret.
Fairness & regulatory constraints
Prohibitions on price discrimination by protected attributes; floors required by law (alcohol, tobacco). Constrained-bandit literature (Wu et al. 2016).
Robust / misspecified pricing
The true demand may not be linear. Robust bandits hedge against model misspecification — pay a small price for guarantee.
Off-policy learning from logs
Use historical sales logs (collected under a different pricing policy) to evaluate or train new policies without live experiments. IPS, doubly-robust estimators.
Joint pricing + assortment
Decide which SKUs to display and at what prices, jointly. Demand for one product depends on the prices of substitutes shown.
Assortment →Reinforcement learning
Long-horizon dynamics — today’s low price affects tomorrow’s reference price and customer expectations. Value-function methods replace one-shot bandits.
Key references
Foundational dynamic-pricing-with-learning literature
Back to the retail domain
Dynamic pricing with learning lives in the Price × Operational cell of the 4P decision matrix — the always-on revenue lever for SKUs with unknown elasticity.
Open Retail Landing