The role of entropy in discrete choice models, part 1: the logit case
January 5, 2026
Part of the blog “Math of Choice”, based on Alfred Galichon’s forthcoming book, Discrete Choice Models: Mathematical Methods, Econometrics, and Data Science, Princeton University Press, April 2026.
Welcome to the first installment of a blog series exploring the rich mathematical landscape of discrete choice analysis. This series aims to go beyond the standard econometric recipes to uncover structural, geometric, and interdisciplinary connections—spanning optimal transport, convex analysis, and information theory—that often remain hidden in traditional textbook treatments. Whether you are an economist, a data scientist, or a mathematician, these posts will offer complementary insights into the mechanics of choice.
In this first blog post, we review the basics of random utility models with a focus on the logit model, and demonstrate the role played by entropy.
Basics of random utility models
In a random utility model, a consumer \(i\) chooses one option \(y\) from a set of alternatives \([Y] = \{1, \dots, Y\}\). The utility that consumer \(i\) derives from option \(y\) is defined as:
$$ U_{iy} = U_y + \varepsilon_{iy},$$
where \(U_y\) is the systematic utility—the part of utility based on observable attributes that is common to all consumers, and \(\varepsilon_{iy}\) is the random utility—an individual-specific random shock, drawn from a known probability distribution \(\mathcal P\). While the econometrician only sees the aggregate outcome (the “macro level”), the individual knows their specific \(\varepsilon_{iy}\) term (the “micro level”) and acts rationally to maximize their own utility.
Welfare. Consumer \(i\) solves the following optimization problem, which defines their indirect utility:
$$ u_i = \max_{y \in [Y]} \{ U_y + \varepsilon_{iy} \}. $$
To define the social welfare, we aggregate (or rather, we average) these values across the population. The welfare function \(G(U)\) is defined as the expectation with respect to \( \mathcal{P} \), the distribution of the vector of utility shocks \( (\varepsilon_{y}) \), of the indirect utility:
$$ G(U) = \mathbb{E}_{\mathcal{P}} \left[ \max_{y \in [Y]} \{ U_y + \varepsilon_{y} \} \right].$$
Market shares. The market share map \(\boldsymbol{\pi}(U) \) associates the market shares \(\pi_y\) of each option \(y\) to a vector of systematic utilities \(U\). Under suitable assumptions that ensure a consumer is almost never indifferent between two options, it is defined by:
$$\boldsymbol{\pi}_y(U) = \mathbb{P}_{\mathcal{P}} \left( U_y + \varepsilon_y \geq U_z + \varepsilon_z, \forall z \in [Y] \right).$$
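Both objects are straightforward to approximate by simulation: draw a large panel of shocks, let each simulated consumer pick their best option, and average. A minimal Monte Carlo sketch, where the utility values and sample size are illustrative and centered Gumbel shocks are used for concreteness (any distribution \(\mathcal P\) would do):

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([0.5, 1.0, -0.2])   # illustrative systematic utilities
n = 200_000                      # number of simulated consumers

# Draw i.i.d. shocks eps_{iy}; shifting a standard Gumbel by minus the
# Euler-Mascheroni constant makes it zero-mean ("centered")
gamma = 0.5772156649015329
eps = rng.gumbel(loc=-gamma, scale=1.0, size=(n, U.size))

utilities = U + eps                  # U_y + eps_{iy} for every consumer i
choices = utilities.argmax(axis=1)   # each consumer maximizes their utility

G_hat = utilities.max(axis=1).mean()                   # estimate of G(U)
pi_hat = np.bincount(choices, minlength=U.size) / n    # market shares
```

With Gumbel shocks, `G_hat` and `pi_hat` converge to the closed-form logit expressions derived later in the post, which makes the simulation easy to sanity-check.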
There is an important connection between the welfare function \(G(U)\) and the market shares map \(\boldsymbol{\pi}(U) \). Imagine we increase \(U_y\) for a single \( y\in [Y]\) by a tiny amount \(\delta\). Then:
- consumers who were already choosing option \(y\) (a proportion \(\pi_y\)) will see their welfare increase by exactly \(\delta\);
- however, some consumers might switch to option \(y\) from other options because of this increase. But since they were previously almost indifferent between their old choice and \(y\), the net gain in welfare from switching is of a second-order magnitude.
Therefore, to a first-order approximation, the total increase in social welfare is simply the proportion of existing users multiplied by the utility increase: \(\Delta G \approx \pi_y \cdot \delta\). Taking the limit as \(\delta \to 0\) shows that \({\partial G} / {\partial U_y} = \pi_y\), which leads to:
Theorem (Daly-Zachary-Williams, th. 1.2.1 p. 16). Under standard regularity assumptions, the welfare function \(G\) is differentiable with respect to the systematic utilities \(U\), and its gradient is exactly the vector of market shares:
$$ \nabla G(U) = \boldsymbol{\pi}(U). $$
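The first-order argument above is easy to check numerically: fix one panel of simulated shocks (common random numbers), estimate \(G\) on it, and compare a finite-difference gradient to the empirical market shares. A sketch under illustrative assumptions (the utilities, sample size, and normal shock distribution are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
U = np.array([0.5, 1.0, -0.2])   # illustrative systematic utilities
n = 100_000
# One fixed panel of shocks, reused across evaluations of G;
# standard normal here purely for illustration
eps = rng.standard_normal((n, U.size))

def G_hat(V):
    """Empirical welfare: average over consumers of max_y (V_y + eps_y)."""
    return (V + eps).max(axis=1).mean()

# Empirical market shares at U
pi_hat = np.bincount((U + eps).argmax(axis=1), minlength=U.size) / n

# Bump each U_y by a small delta: welfare rises by roughly pi_y * delta,
# because switchers were nearly indifferent and gain only second-order amounts
delta = 1e-4
grad = np.array([(G_hat(U + delta * np.eye(U.size)[y]) - G_hat(U)) / delta
                 for y in range(U.size)])
# grad is (approximately) equal to pi_hat, as the theorem predicts
```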
The logit specialization
The logit model arises from a specific assumption about these random shocks: the \(\varepsilon_{iy}\) terms are independent and follow a centered Gumbel distribution. As a reminder, the centered, i.e. zero-mean, Gumbel distribution has the c.d.f. \(F(x) = \exp(-\exp(-(x+\gamma)))\), where \(\gamma \approx 0.577\) is the Euler-Mascheroni constant. In this specific case, the integral defining the welfare function \(G(U)\) has a beautiful closed-form solution known as the “log-sum-exp” function (also called the softmax, as it is a smooth approximation of the maximum):
$$ G(U) = \log \left( \sum_{y \in [Y]} \exp(U_y) \right) $$
and by the Daly-Zachary-Williams theorem, the gradient of this welfare function gives us the market shares, which is the familiar Gibbs distribution:
$$ {\boldsymbol\pi}_y(U) = \frac{\exp(U_y)}{\sum_{z \in [Y]} \exp(U_z)}. $$
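In code, both closed forms are one-liners, but the exponentials overflow for large utilities; the standard remedy is to shift by \(\max_y U_y\) before exponentiating, which leaves both formulas unchanged. A minimal sketch (the utility values are illustrative):

```python
import numpy as np

def welfare(U):
    """Log-sum-exp G(U), computed stably by shifting by max(U)."""
    m = U.max()
    return m + np.log(np.exp(U - m).sum())

def shares(U):
    """Gibbs/softmax market shares; the shift by max(U) avoids overflow."""
    e = np.exp(U - U.max())
    return e / e.sum()

U = np.array([0.5, 1.0, -0.2])
G = welfare(U)     # ≈ 1.6459
pi = shares(U)     # nonnegative, sums to 1
```

Note that `shares(U + c)` equals `shares(U)` for any constant `c` (only utility differences matter for choices), while `welfare(U + c)` equals `welfare(U) + c`.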
Aggregation of random utility. This brings us to the core insight of this post: how does the random noise \(\varepsilon\) aggregate at the macroscopic (i.e. the econometrician’s) level? In the logit model, i.e., when \( \mathcal{P} \) is the distribution of i.i.d. centered Gumbel variables, we can express the social welfare \(G(U)\) as:
Claim (Proposition 1.3.2 p. 21). We have:
$$ G(U) = \mathbb{E}_{\mathcal{P}} \left[ \max_{y \in [Y]} \{ U_y + \varepsilon_{y} \} \right] = \max_{\pi \geq 0: \sum_y \pi_y = 1} \left\{ \sum_{y \in [Y]} \pi_y U_y - \sum_{y \in [Y]} \pi_y \log \pi_y \right\}.$$
This equation will be explained in detail in the next post, in a more general setting, using convex duality. But it already tells a powerful story: the market as a whole acts as if a representative agent were maximizing a trade-off between utility and entropy, made of two terms:
- the first term \( \sum_y \pi_y U_y \) represents order: expected individuals’ systematic utilities;
- the second term \(-\sum_y \pi_y \log \pi_y\) represents disorder: the Shannon entropy.
In the logit model, the Gumbel-distributed individual shocks at the microscopic level aggregate perfectly into Shannon entropy at the macroscopic level. The “randomness” of the individual has become the “entropy” of the group.
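The claim is also easy to verify numerically: plugging the Gibbs shares into the utility-plus-entropy objective recovers the log-sum-exp welfare exactly, and no other point of the simplex does better. A quick sketch (the utilities and the number of random simplex draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
U = np.array([0.5, 1.0, -0.2])   # illustrative systematic utilities

def objective(pi):
    """Expected systematic utility plus Shannon entropy of the shares."""
    return pi @ U - (pi * np.log(pi)).sum()

G = np.log(np.exp(U).sum())            # closed-form logit welfare
pi_star = np.exp(U) / np.exp(U).sum()  # Gibbs/softmax shares

# The Gibbs shares attain the welfare value exactly...
attained = objective(pi_star)
# ...and dominate random interior points of the simplex
draws = rng.dirichlet(np.ones(U.size), size=10_000)
best_random = max(objective(p) for p in draws)
```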
In the next post, we will see how this logic generalizes to random utility models beyond logit, where different assumptions on \(\varepsilon\) yield different forms of “generalized” entropy.