Dynamic Pricing with Learning

Contextual bandits · Exploration vs exploitation

Pricing under unknown demand response: each price the seller posts is both a revenue decision and a probe that reveals information about the demand curve. The seller must balance exploration (try uncertain prices to learn) against exploitation (charge the current best price to earn). Distinct from markdown optimisation, which assumes the demand function is known. Foundational papers: Besbes & Zeevi (2009, 2012); den Boer (2015) survey; Auer-Cesa-Bianchi-Fischer (2002) UCB; Thompson sampling (1933).

Why it matters

Market scale · revenue lift · theoretical guarantees · production deployments

~$20B+

Estimated global market for dynamic-pricing software and consulting across retail, travel, and ride-hailing.

Source: Gartner / Forrester industry estimates.

5–15%

Documented revenue lift when learning-based pricing replaces rule-based or static pricing in retail and e-commerce case studies.

Source: industry case studies; den Boer (2015) survey.

$\tilde{O}(d\sqrt{T})$

LinUCB cumulative-regret guarantee: with d-dimensional context and T rounds, regret grows only as $\sqrt{T}$ — the seller’s loss vs. the omniscient oracle is sub-linear.

Li, Chu, Langford & Schapire (2010), WWW.

M+/day

Amazon retail makes millions of price changes daily; Uber surge, airline yield, and hotel revenue management run continuous learning loops on price.

Industry reports; Misic & Perakis (2020) review.

Where the decision sits

Online price-setting · e-commerce · travel · ride-hailing

Dynamic pricing with learning applies whenever a seller can change the offered price many times during the selling horizon, observes the resulting sales, and faces an unknown price-response curve. Unlike markdown optimisation (where demand parameters are known and the question is the optimal schedule), here every posted price doubles as an experiment. Examples: e-commerce SKUs without prior elasticity estimates, ride-hailing surge multipliers, airline fare classes, hotel ADR, ad-auction bid pricing, and digital subscription pricing.

Observedemand history

Estimateprice-response $\hat{\theta}_t$

Choose priceexplore vs exploit

Observe salenoisy demand $D_t$

Updateposterior on $\theta$

Problem & formulation

Stochastic bandit on a continuous (or discrete) price arm space

OR family

Stochastic Bandits

Regret bound

$\tilde{O}(d\sqrt{T})$ (LinUCB)

Solver realism

★★★ Sub-linear regret

Reference

Besbes & Zeevi (2009)

True (unknown) demand model

A linear price-response with Gaussian noise; the parameter vector $\theta = (a, b)$ is unknown to the seller:

$$D(p) \;=\; a \;-\; b\,p \;+\; \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Intercept $a > 0$ (max demand at zero price), slope $b > 0$ (price sensitivity), noise scale $\sigma$.

Sequential protocol

Symbol	Meaning	Domain
$t \in \{1, \ldots, T\}$	Round in the selling horizon	discrete
$p_t \in [p_{\min}, p_{\max}]$	Price posted in round $t$ (decision)	bounded
$D_t$	Realised noisy demand at price $p_t$	random
$\theta = (a, b)$	True demand parameters (unknown)	$\mathbb{R}^2_+$
$H_t$	History $\{(p_s, D_s)\}_{s < t}$ up to round $t$	filtration

Objective — maximise expected revenue

$$\max_{\{p_t\}_{t=1}^{T}} \;\; \mathbb{E}\!\left[\, \sum_{t=1}^{T} p_t \cdot D_t \,\right] \qquad \text{subject to } p_t \;=\; \pi_t(H_t).$$

Each $p_t$ is a function of the observed history — a policy rather than a fixed schedule.

Regret — the standard performance metric

$$R(T) \;=\; T \cdot \max_{p} \bigl(\,p \cdot \mathbb{E}[D(p)]\,\bigr) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} p_t \cdot D_t\right]$$

The shortfall versus a clairvoyant oracle who knew $\theta$ from round 1 and always charged $p^{\ast} = a / (2b)$.

Algorithm 1 — Greedy (exploit-only)

Estimate $\hat{\theta}_t$ by least squares on $H_t$; always pick the empirically best price:

$$p_t^{\text{greedy}} \;=\; \arg\max_{p} \;\; p \cdot \bigl(\hat{a}_t \;-\; \hat{b}_t\,p\bigr) \;=\; \frac{\hat{a}_t}{2\,\hat{b}_t}.$$

Simple but can lock onto a wrong estimate forever — incomplete learning trap.

Algorithm 2 — $\varepsilon$-greedy

$$p_t \;=\; \begin{cases} \text{uniform random in } [p_{\min}, p_{\max}] & \text{w.p. } \varepsilon, \\[2pt] \hat{a}_t / (2\,\hat{b}_t) & \text{w.p. } 1 - \varepsilon. \end{cases}$$

Forced exploration breaks the lock-in; regret is linear in $T$ for fixed $\varepsilon$ but practical.

Algorithm 3 — UCB / LinUCB

Pick the price with the highest optimistic expected revenue — mean plus a confidence-radius bonus:

$$p_t^{\text{UCB}} \;=\; \arg\max_{p} \;\; \widehat{\mathbb{E}}[\,\text{Rev}(p)\,] \;+\; \alpha \cdot \sqrt{\widehat{\mathrm{Var}}[\,\text{Rev}(p)\,]}.$$

Optimism in the face of uncertainty. Cumulative regret $\tilde{O}(d\sqrt{T})$ for d-dim context (Li et al., 2010).

Algorithm 4 — Thompson sampling

Sample a parameter from the posterior, then act greedily on the sample:

$$\theta_t \;\sim\; \mathbb{P}\bigl(\theta \,\big|\, H_t\bigr), \qquad p_t \;=\; \arg\max_{p} \;\; p \cdot \mathbb{E}\bigl[D(p) \,\big|\, \theta_t\bigr].$$

Both Thompson sampling and UCB achieve sub-linear regret; Thompson is often better in practice.

Real-world → OR mapping

How a pricing manager’s vocabulary translates to the bandit

On the pricing desk	In the bandit
Number of price-tests over the horizon	$T$
Allowed price band	$[p_{\min}, p_{\max}]$
Unknown elasticity / willingness-to-pay	$\theta = (a, b)$
Posted price in week / hour / minute $t$	$p_t$
Observed units sold	$D_t$
“A/B test budget” (random-pricing share)	$\varepsilon$ in $\varepsilon$-greedy
Aggressiveness of exploration	$\alpha$ in UCB
Lost revenue vs. omniscient oracle	$R(T)$

Interactive solver

Simulate a learning seller against a true (hidden) demand curve; compare algorithms

Bandit pricing simulator

Linear demand · Bayesian (Normal-inverse-gamma) posterior on $(a, b)$

★★★ Sub-linear regret

True intercept $a$

Hidden — the seller does not know

True slope $b$

Hidden price sensitivity

Demand noise $\sigma$

Std-dev of $\varepsilon$

Horizon $T$

Number of pricing rounds

Random seed

Reproducible runs

Algorithm

Choose the learning rule

Price floor $p_{\min}$

Price ceiling $p_{\max}$

$\varepsilon$ (for epsilon-greedy)

Random-price probability

$\alpha$ (UCB exploration)

Confidence multiplier

—

Cumulative revenue ($)

—

Regret vs oracle ($)

—

Final $\hat{p}^{\ast}$ ($)

—

Exploration rounds

—

Posterior std on $b$

—

Optimality gap (%)

Algorithm cumulative revenue Oracle (knows $\theta$) cumulative Random-price cumulative Posted price $p_t$ (lower panel) Oracle price $p^{\ast}$

Under the hood

We maintain a conjugate Bayesian posterior on $(a, b)$ with a Normal prior and noise std $\sigma$, updated via Bayesian linear regression on the design matrix $X = [\,\mathbf{1}, -\mathbf{p}\,]$ and observations $\mathbf{D}$. At each round, the chosen rule sees the current posterior mean $\hat{\theta}_t$ and covariance $\Sigma_t$ and emits a price. Greedy: $p = \hat{a}_t / (2\hat{b}_t)$ clipped to the price band. UCB: scan a discretised price grid and pick the maximum of $p\,(\hat{a}_t - \hat{b}_t p) + \alpha \sqrt{\mathrm{Var}[p\,D(p) \mid H_t]}$. Thompson: sample $\theta_t \sim \mathcal{N}(\hat{\theta}_t, \Sigma_t)$, then act greedily on the sample. Posterior updates run in $O(d^2)$ per round; total simulation cost is linear in $T$. The oracle baseline always charges $p^{\ast} = a / (2b)$ and the random baseline draws uniformly from $[p_{\min}, p_{\max}]$.

Reading the solution

What the cumulative-revenue and price-trace tell you

Three patterns to watch for

Sub-linear regret curve. The regret-vs-oracle gap should grow as $\sqrt{T}$, not linearly. On the cumulative-revenue chart, the algorithm curve should run roughly parallel to (just below) the oracle line after a few dozen rounds — a clear sign learning is converging.
Convergence of $p_t$ to $p^{\ast}$. In the lower panel, the posted-price scatter should funnel toward the oracle-optimal price as $t$ grows. UCB and Thompson tighten faster than $\varepsilon$-greedy, which keeps probing forever.
Posterior std on $b$ shrinks like $1/\sqrt{t}$. Wide posterior at $t = 5$, narrow at $t = 200$. If it does not shrink, the algorithm is not exploring enough — raise $\alpha$ (UCB) or $\varepsilon$ (greedy).

Sensitivity questions the simulator answers

Push $\sigma$ up — how much longer does it take to identify $p^{\ast}$? (Identification is hardest exactly when noise is high.)
Switch from UCB to $\varepsilon$-greedy — does the regret gap widen at large $T$? (Yes, because $\varepsilon$-greedy has linear regret for fixed $\varepsilon$.)
Tighten the price band $[p_{\min}, p_{\max}]$ around the true $p^{\ast}$ — how does final regret change? (Smaller band, less to lose from sub-optimal prices.)
Drop $\alpha$ toward zero — UCB collapses to greedy and risks getting stuck on a wrong estimate.

Model extensions

From single-SKU baseline to production retail-pricing variants

Contextual bandits

Per-customer-segment pricing where the demand parameters depend on observed features (device, geography, channel). LinUCB reduces to ridge regression with a UCB bonus per arm.

Personalisation →

Inventory-constrained pricing

Limited stock turns the bandit into a finite-horizon revenue management problem: each unit sold is one less to test on. Couples directly to clearance pricing.

Markdown →

Competitor-aware pricing

Demand depends on (own price, competitor price). The bandit becomes a two-player game; equilibrium learning replaces single-agent regret.

Fairness & regulatory constraints

Prohibitions on price discrimination by protected attributes; floors required by law (alcohol, tobacco). Constrained-bandit literature (Wu et al. 2016).

Robust / misspecified pricing

The true demand may not be linear. Robust bandits hedge against model misspecification — pay a small price for guarantee.

Off-policy learning from logs

Use historical sales logs (collected under a different pricing policy) to evaluate or train new policies without live experiments. IPS, doubly-robust estimators.

Joint pricing + assortment

Decide which SKUs to display and at what prices, jointly. Demand for one product depends on the prices of substitutes shown.

Assortment →

Reinforcement learning

Long-horizon dynamics — today’s low price affects tomorrow’s reference price and customer expectations. Value-function methods replace one-shot bandits.

Key references

Foundational dynamic-pricing-with-learning literature

Besbes, O. & Zeevi, A. (2009).

Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms.

Operations Research 57(6): 1407–1420. doi:10.1287/opre.1080.0640

Besbes, O. & Zeevi, A. (2012).

Blind network revenue management.

Operations Research 60(6): 1537–1550. doi:10.1287/opre.1120.1099

den Boer, A. V. (2015).

Dynamic pricing and learning: Historical origins, current research, and new directions.

Surveys in Operations Research and Management Science 20(1): 1–18. doi:10.1016/j.sorms.2015.03.001

Li, L., Chu, W., Langford, J. & Schapire, R. E. (2010).

A contextual-bandit approach to personalised news article recommendation.

WWW 2010: 661–670. doi:10.1145/1772690.1772758

Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002).

Finite-time analysis of the multiarmed bandit problem.

Machine Learning 47(2): 235–256. doi:10.1023/A:1013689704352

Thompson, W. R. (1933).

On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.

Biometrika 25(3–4): 285–294. doi:10.1093/biomet/25.3-4.285

Misic, V. V. & Perakis, G. (2020).

Data analytics in operations management: A review.

M&SOM 22(1): 158–169. doi:10.1287/msom.2019.0805

Bertsimas, D. & Perakis, G. (2006).

Dynamic pricing: A learning approach.

In Mathematical and Computational Models for Congestion Charging, Springer.

Back to the retail domain

Dynamic pricing with learning lives in the Price × Operational cell of the 4P decision matrix — the always-on revenue lever for SKUs with unknown elasticity.

Open Retail Landing

Educational simulator · linear-demand and Gaussian-noise assumptions · production deployments need elasticity validation, fairness constraints, and competitive monitoring.

Symbol	Meaning	Domain
\(t \in \{1, \ldots, T\}\)	Round in the selling horizon	discrete
\(p_t \in [p_{\min}, p_{\max}]\)	Price posted in round \(t\) (decision)	bounded
\(D_t\)	Realised noisy demand at price \(p_t\)	random
\(\theta = (a, b)\)	True demand parameters (unknown)	\(\mathbb{R}^2_+\)
\(H_t\)	History \(\{(p_s, D_s)\}_{s < t}\) up to round \(t\)	filtration

On the pricing desk	In the bandit
Number of price-tests over the horizon	\(T\)
Allowed price band	\([p_{\min}, p_{\max}]\)
Unknown elasticity / willingness-to-pay	\(\theta = (a, b)\)
Posted price in week / hour / minute \(t\)	\(p_t\)
Observed units sold	\(D_t\)
“A/B test budget” (random-pricing share)	\(\varepsilon\) in \(\varepsilon\)-greedy
Aggressiveness of exploration	\(\alpha\) in UCB
Lost revenue vs. omniscient oracle	\(R(T)\)