Personalisation & Recommendation

Per-user MNL · Contextual bandits

Choose what to show to which customer to maximise expected revenue (or CLV): a per-user assortment problem layered on top of a predicted utility model. When utilities are known, the revenue-ordered MNL property (Talluri & van Ryzin 2004) still applies per user. When they must be learned online, the problem becomes a contextual bandit (Li, Chu, Langford & Schapire 2010). Industry scale: Netflix, Amazon and YouTube run this loop millions of times per minute. Handbook reference: Ricci, Rokach & Shapira (eds.), Recommender Systems Handbook.

Why it matters

The digital-retail revenue lever with the highest leverage

~35%

Share of Amazon’s sales attributed to personalised recommendations — the single most-cited scale reference in recommender literature.

Source: McKinsey retail analytics (2013 estimate; an outside estimate rather than a company-reported figure).

~75%

Share of Netflix viewing hours driven by recommendations — the canonical success case for CF-style personalisation.

Source: Gomez-Uribe & Hunt (2016), ACM TMIS.

+5–15%

Typical e-commerce revenue lift documented when replacing popularity-sort with MNL-aware personalised ranking on the homepage.

Source: Industry papers; Gallino & Moreno analyses.

$\tilde O(\sqrt{KT})$

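Order of cumulative regret for contextual-bandit algorithms after $T$ rounds with $K$ arms: an upper bound on the reward lost to exploration, growing sub-linearly in $T$. For linear models such as LinUCB the bound is usually stated in the feature dimension $d$ instead, as $\tilde O(d \sqrt{T})$.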

Source: Li, Chu, Langford & Schapire (2010), WWW.

Where the decision sits

Each page-view, each email, each push-notification — a recommendation decision

Recommendation is a decision made at every session: homepage, search, category, product-detail, cart-add, checkout, post-purchase email, push notification. The retailer picks $K$ items to show, the customer clicks (or doesn’t) and optionally buys, and the system learns. Two decision models matter: when utilities are known (offline batch-scored from last night’s model), use MNL assortment per user; when utilities must be learned as users arrive (new customer, cold-start item, or A/B-test phase), use a contextual bandit.

Context: user + session

Rank / select $K$: recommendation

Click / buy / skip: customer response

Update model: nightly or online

Problem & formulation

Per-user MNL + contextual-bandit learning

Decision model

Per-user subset selection

Demand model

Contextual MNL / LinUCB

Complexity

O(N log N) per user (offline)

Reference

Talluri-vanRyzin; Li et al. 2010

Sets and parameters

Symbol	Meaning	Domain / unit
$u \in \mathcal{U}$	User (current visitor)	finite set
$i \in \mathcal{N}$	Candidate item in the catalog	finite set
$x_u$	User context vector (features, history, device)	$\mathbb{R}^{d_x}$
$z_i$	Item feature vector	$\mathbb{R}^{d_z}$
$\theta$	Learned parameter vector ($d$ is the dimension of the joint feature map $\phi$)	$\mathbb{R}^d$
$\hat u_{ui}$	Predicted utility (log-odds of click / purchase)	real
$r_i$	Revenue on purchase of item $i$	\$ (currency)
$K$	Slate size / recommendation capacity	integer

Contextual utility model

Utility is learned from interaction data. Simplest form is bilinear:

$$\hat u_{ui} \;=\; \theta^{\top} \phi(x_u, z_i) \;=\; \theta^{\top} (x_u \otimes z_i)$$

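$\phi$ is a joint user-item feature map. With the outer product it has one coordinate per (user feature, item feature) pair, so $\theta$ must have the same length as $\phi(x_u, z_i)$. Matrix factorisation, $\hat u_{ui} = p_u^{\top} q_i$, is the canonical collaborative-filtering form (Koren, Bell & Volinsky 2009); modern systems typically put deep networks on top.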
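As a small illustration of the bilinear form, the sketch below flattens the outer product $x_u \otimes z_i$ and takes its inner product with $\theta$. The function names and the toy vectors are illustrative assumptions, not part of the page's solver. It also shows that the factorisation form $p_u^{\top} q_i$ is the special case in which $\theta$ is a flattened identity matrix.

```python
def outer_features(x_u, z_i):
    """Flattened outer product: one entry per (user feature, item feature) pair."""
    return [xa * zb for xa in x_u for zb in z_i]


def bilinear_utility(theta, x_u, z_i):
    """Predicted utility theta^T (x_u ⊗ z_i)."""
    phi = outer_features(x_u, z_i)
    if len(theta) != len(phi):
        raise ValueError("theta must have len(x_u) * len(z_i) entries")
    return sum(t * p for t, p in zip(theta, phi))


def mf_utility(p_u, q_i):
    """Matrix-factorisation utility p_u^T q_i."""
    return sum(a * b for a, b in zip(p_u, q_i))


# Toy vectors (assumed values, for illustration only).
x_u = [1.0, 0.5]
z_i = [2.0, -1.0]

# A flattened 2x2 identity matrix weights only the matching coordinates,
# so the bilinear model reduces to a plain dot product.
theta_identity = [1.0, 0.0, 0.0, 1.0]
```

Here `bilinear_utility(theta_identity, x_u, z_i)` and `mf_utility(x_u, z_i)` agree; a general $\theta$ additionally lets every user feature interact with every item feature.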

Per-user MNL offline optimisation

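Given the utility vector $\hat u_{u \cdot}$, the retailer picks an assortment $S_u$ of at most $K$ items per user. Under MNL, with the no-purchase option normalised to weight 1, the click/purchase probability is:

$$P_i(S_u) \;=\; \frac{e^{\hat u_{ui}}}{1 + \sum_{j \in S_u} e^{\hat u_{uj}}}$$

Expected revenue per session is $R(S_u) = \sum_{i \in S_u} r_i \cdot P_i(S_u)$. By the Talluri-van Ryzin revenue-ordered property, the unconstrained optimum is a prefix of the items sorted by $r_i$ in descending order; the user's utilities determine where the prefix ends. When the cap $K$ binds, the best prefix of length at most $K$ is a fast heuristic rather than a guaranteed optimum. Diversity and fairness requirements add further constraints.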
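The revenue-ordered search can be sketched in a few lines. This is a minimal illustration assuming one user's utilities and the item revenues are given as plain lists; the function names are hypothetical.

```python
import math


def mnl_probs(utilities, slate):
    """MNL choice probabilities for items in `slate`; the outside option has weight 1."""
    weights = {i: math.exp(utilities[i]) for i in slate}
    denom = 1.0 + sum(weights.values())
    return {i: w / denom for i, w in weights.items()}


def expected_revenue(utilities, revenues, slate):
    """R(S) = sum over i in S of r_i * P_i(S)."""
    probs = mnl_probs(utilities, slate)
    return sum(revenues[i] * probs[i] for i in slate)


def best_revenue_ordered_slate(utilities, revenues, k):
    """Evaluate every prefix (length 1..k) of the revenue-sorted items, keep the best."""
    order = sorted(range(len(revenues)), key=lambda i: -revenues[i])
    best_slate, best_rev = [], 0.0
    for m in range(1, min(k, len(order)) + 1):
        slate = order[:m]
        rev = expected_revenue(utilities, revenues, slate)
        if rev > best_rev:
            best_slate, best_rev = slate, rev
    return best_slate, best_rev


# Toy instance (assumed values): equal utilities, revenues 10, 8 and 1.
slate, rev = best_revenue_ordered_slate([0.0, 0.0, 0.0], [10.0, 8.0, 1.0], k=3)
# Prefix revenues are 10/2 = 5, 18/3 = 6 and 19/4 = 4.75, so the best slate is [0, 1].
```

Adding the cheap third item lowers revenue because it takes choice probability away from the two expensive items, which is why the search stops at a prefix rather than filling the slate.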

Online learning: LinUCB

When $\theta$ is unknown, balance exploration and exploitation. The LinUCB rule (Li et al. 2010) picks at each round the arm with the highest upper confidence bound:

$$i_t \;=\; \arg\max_i \;\Bigl[\, \hat{\theta}_t^{\top} \phi_{u_t, i} \;+\; \alpha \sqrt{\phi_{u_t, i}^{\top} A_t^{-1} \phi_{u_t, i}} \,\Bigr]$$

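$A_t = \lambda I + \sum_{s < t} \phi_s \phi_s^{\top}$ is the regularised Gram matrix of the features played so far, $\hat\theta_t = A_t^{-1} \sum_{s < t} y_s \phi_s$ is the ridge-regression estimate from the observed rewards $y_s$, and $\alpha$ tunes exploration. Regret is $\tilde O(d \sqrt{T})$, which is sub-linear, so the average per-round regret vanishes as $T$ grows. The same explore-exploit trade-off appears in dynamic pricing with learning.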
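A compact sketch of the shared-parameter LinUCB rule follows, in plain Python. The class name, the `lam` regulariser argument and the Sherman-Morrison rank-one update of $A^{-1}$ are implementation choices made here for illustration; they are not taken from Li et al. (2010) or from this page's solver.

```python
import math


class LinUCB:
    """Shared-parameter LinUCB that maintains A^{-1} directly."""

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha
        # A starts as lam * I, so its inverse is (1 / lam) * I.
        self.A_inv = [[(1.0 / lam if r == c else 0.0) for c in range(dim)]
                      for r in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * features

    def _matvec(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def theta(self):
        """Ridge estimate theta_hat = A^{-1} b."""
        return self._matvec(self.b)

    def ucb(self, phi):
        """Point estimate plus alpha times the confidence width."""
        mean = sum(t * p for t, p in zip(self.theta(), phi))
        a_inv_phi = self._matvec(phi)
        width = math.sqrt(max(0.0, sum(p * q for p, q in zip(phi, a_inv_phi))))
        return mean + self.alpha * width

    def select(self, arm_features):
        """Index of the arm with the highest upper confidence bound."""
        return max(range(len(arm_features)), key=lambda i: self.ucb(arm_features[i]))

    def update(self, phi, reward):
        """Sherman-Morrison update of A^{-1} after adding phi phi^T to A."""
        v = self._matvec(phi)  # A^{-1} phi (A^{-1} is symmetric)
        denom = 1.0 + sum(p * q for p, q in zip(phi, v))
        n = len(phi)
        for r in range(n):
            for c in range(n):
                self.A_inv[r][c] -= v[r] * v[c] / denom
        for j in range(n):
            self.b[j] += reward * phi[j]
```

A typical loop calls `select` on the feature vectors of the candidate items for the current user, shows the chosen item, and passes the observed click or purchase reward to `update`. Larger `alpha` widens the confidence term and so explores more.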

Diversity / fairness constraints

Pure revenue maximisation over-concentrates: same top items to every user. Real systems add:

$$\sum_{i \in S_u} \mathbb{1}\{i \in \text{category } c\} \;\leq\; C_c \qquad \forall c \qquad \text{(category cap)}$$

$$\sum_{u, i \in S_u} \mathbb{1}\{\text{producer}(i) = p\} \;\geq\; \underline F_p \qquad \forall p \qquad \text{(producer fairness)}$$

Submodular-coverage objectives (maximum marginal relevance) are a standard way to induce diversity without hard constraints — Carbonell & Goldstein 1998.
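Maximum marginal relevance can be sketched as a greedy re-ranker: each step picks the item with the best trade-off between its own relevance and its similarity to items already in the slate. The cosine similarity on item vectors, the `trade_off` parameter name and the toy data are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mmr_slate(relevance, item_vectors, k, trade_off=0.7):
    """Greedy MMR: score = trade_off * relevance - (1 - trade_off) * max similarity to slate."""
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(item_vectors[i], item_vectors[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

        best = max(sorted(candidates), key=score)  # sorted for deterministic ties
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (assumed): items 0 and 1 are near-duplicates, item 2 is different.
relevance = [0.9, 0.85, 0.5]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
diverse = mmr_slate(relevance, vectors, k=2, trade_off=0.5)    # [0, 2]
greedy = mmr_slate(relevance, vectors, k=2, trade_off=1.0)     # [0, 1]
```

With `trade_off=1.0` the rule reduces to plain top-$K$ by relevance; lowering it trades some relevance for a less redundant slate, without any hard category cap.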

Interactive solver

6 users × 10 items · per-user MNL assortment, offline

Personalised assortment solver

Revenue-ordered per user · compare with non-personalised baseline

★★★ Exact when the cap $K$ does not bind (MNL revenue-ordered)

Inputs: users; items; slate size $K$; user preference strength (variation in user utilities); price spread (item-price variation); seed.

Outputs: revenue per user, personalised (\$); revenue per user, non-personalised (\$); personalisation lift; unique items across users; average price in personalised slates; average P(click something).

Legend: recommended to user (high utility, chosen for slate); not recommended (low utility or prefix skipped); high predicted utility.

Under the hood

The scenario generator creates a low-rank utility matrix: each user has a latent-factor vector $p_u \in \mathbb{R}^2$, each item has $q_i \in \mathbb{R}^2$ and a price $r_i$, and predicted utility is $\hat u_{ui} = p_u^{\top} q_i$ (plus noise, scaled by “preference strength”). Per user, the solver sorts items by price in descending order, evaluates the MNL revenue of each prefix of length $1..K$, and keeps the best one. This is the revenue-ordered optimum whenever the cap $K$ does not bind. The non-personalised baseline sorts items by average utility across users, picks the top $K$, and shows that same slate to every user. Personalisation lift is the percentage revenue gain of the personalised slates over this baseline.
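The steps above can be approximated by the following sketch. It is not the page's actual solver: the Gaussian latent factors, the base price of 10, the uniform price perturbation, and the use of “preference strength” as a plain multiplier (with no added noise) are all assumptions made to keep the example short.

```python
import math
import random


def mnl_revenue(util_row, prices, slate):
    """Expected MNL revenue of `slate` for one user; outside option has weight 1."""
    weights = [math.exp(util_row[i]) for i in slate]
    return sum(prices[i] * w for i, w in zip(slate, weights)) / (1.0 + sum(weights))


def simulate(n_users=6, n_items=10, k=4, strength=1.0, price_spread=0.5, seed=0):
    rng = random.Random(seed)
    # Rank-2 latent factors for users and items (assumed Gaussian).
    p = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(n_users)]
    q = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(n_items)]
    # Assumed price model: base 10, perturbed by up to +/- price_spread.
    prices = [10.0 * (1.0 + price_spread * rng.uniform(-1, 1)) for _ in range(n_items)]
    util = [[strength * (pu[0] * qi[0] + pu[1] * qi[1]) for qi in q] for pu in p]

    # Personalised: best price-ordered prefix of length 1..k for each user.
    price_order = sorted(range(n_items), key=lambda i: -prices[i])
    prefixes = [price_order[:m] for m in range(1, min(k, n_items) + 1)]
    personalised = [max(prefixes, key=lambda s: mnl_revenue(row, prices, s))
                    for row in util]

    # Baseline: one shared slate, the top-k items by average utility.
    avg_util = [sum(row[i] for row in util) / n_users for i in range(n_items)]
    shared = sorted(range(n_items), key=lambda i: -avg_util[i])[:k]

    rev_pers = sum(mnl_revenue(r, prices, s) for r, s in zip(util, personalised)) / n_users
    rev_base = sum(mnl_revenue(r, prices, shared) for r in util) / n_users
    return {
        "price_order": price_order,
        "personalised_slates": personalised,
        "shared_slate": shared,
        "revenue_personalised": rev_pers,
        "revenue_baseline": rev_base,
        "lift_pct": 100.0 * (rev_pers - rev_base) / rev_base,
    }
```

Calling `simulate(seed=1)` returns the slates and both revenue figures for one synthetic scenario. Because the baseline slate is chosen by utility rather than by price, the lift in this sketch is not guaranteed to be positive for every seed.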

Reading the solution

Three patterns to watch for

Personalisation lift scales with preference heterogeneity. When all users have similar tastes, personalisation doesn’t help much (everyone wants the top items). Crank up “preference strength” and the lift widens.
High-price anchoring still holds per user. The revenue-ordered property is per-user: even with personalisation, each user's slate is a prefix of the items sorted by price, and that user's utilities determine where the prefix is cut.
Coverage widens with personalisation. The unique-items-count KPI rises when users get their own slates — a good proxy for fairness to long-tail items and to producers.

Sensitivity questions

What if I increase slate size $K$? — per-user revenue rises but marginally; adding a 5th item gains less than the 4th (cannibalisation under MNL).
What if prices narrow (price spread → 0.2)? — personalisation lift shrinks: ranking by utility converges to ranking by popularity.
What if I add a cold-start user (random $p_u$)? — the offline model's utilities for that user are uninformative, so it wastes slate slots; bandit exploration beats the offline policy for such users.

Model extensions

Matrix factorisation

Classic collaborative filtering: learn $p_u, q_i$ from past interactions to build the utility matrix. The approach was popularised by the Netflix Prize (Koren, Bell & Volinsky 2009).

Deep-learning recommenders

Two-tower neural nets (Covington-Adams-Sargin 2016 YouTube), transformers (BERT4Rec), and multimodal embeddings. Production standard at hyperscalers.

Contextual bandits (LinUCB / Thompson)

Online learning when utilities are unknown: explore slates to reduce uncertainty; LinUCB (Li et al. 2010), Thompson sampling (Agrawal-Goyal 2013).

Diversity via MMR / DPP

Induce slate diversity via Maximum Marginal Relevance (Carbonell-Goldstein 1998) or Determinantal Point Processes (Kulesza-Taskar 2012).

CLV-aware recommendation

Objective is long-term CLV, not single-session revenue. Reinforcement-learning framing; recent research frontier.

Assortment-under-MNL (aggregate)

Offline non-personalised assortment optimisation is the special case when all users share one utility vector.

Cross-channel personalisation

Consistent recommendations across web / app / email / push / store — joint optimisation with fatigue penalty.

Fairness + producer exposure

Long-tail items and new producers need exposure; add lower-bound constraints or post-hoc re-ranking. Biega-Gummadi-Weikum 2018.

Key references

Ricci, F., Rokach, L. & Shapira, B. (eds.) (2011, 3rd ed. 2022).

Recommender Systems Handbook.

Springer. doi:10.1007/978-1-0716-2197-4 (The field’s standard handbook.)

Koren, Y., Bell, R. & Volinsky, C. (2009).

Matrix factorization techniques for recommender systems.

IEEE Computer 42(8): 30–37. doi:10.1109/MC.2009.263 (Netflix-prize canonical paper.)

Li, L., Chu, W., Langford, J. & Schapire, R. E. (2010).

A contextual-bandit approach to personalized news article recommendation.

WWW 2010. doi:10.1145/1772690.1772758 (LinUCB.)

Talluri, K. T. & van Ryzin, G. J. (2004).

Revenue management under a general discrete choice model.

Management Science 50(1): 15–33. doi:10.1287/mnsc.1030.0147

Agrawal, S. & Goyal, N. (2013).

Thompson sampling for contextual bandits with linear payoffs.

ICML 2013.

Carbonell, J. & Goldstein, J. (1998).

The use of MMR, diversity-based reranking for reordering documents and producing summaries.

SIGIR 1998.

Kulesza, A. & Taskar, B. (2012).

Determinantal point processes for machine learning.

Foundations and Trends in Machine Learning 5(2-3). doi:10.1561/2200000044

Gomez-Uribe, C. A. & Hunt, N. (2016).

The Netflix recommender system: Algorithms, business value, and innovation.

ACM Transactions on Management Information Systems 6(4): 1–19. doi:10.1145/2843948

Covington, P., Adams, J. & Sargin, E. (2016).

Deep neural networks for YouTube recommendations.

RecSys 2016. doi:10.1145/2959100.2959190

Back to the retail domain

Personalisation sits in the Promotion × Operational cell — millions of micro-decisions per minute, powering the 4th P with algorithms.

Educational solver · low-rank synthetic utilities, offline MNL · production systems layer learning (bandits, transformers, fairness) on top.
