Beyond the logit model: generalized entropy
The role of entropy in discrete choice models, part 2: generalized entropy
January 12, 2026
Part of the blog “Math of Choice”, based on Alfred Galichon’s forthcoming book, Discrete Choice Models: Mathematical Methods, Econometrics, and Data Science, Princeton University Press, April 2026.
In our previous post, we explored how the popular logit model aggregates individual random utility into a clean, macroscopic entropy term—specifically, the famous Gibbs-Shannon entropy. But does this beautiful connection hold if we step outside the specific assumptions of the logit model?
The answer is yes. In this post, we show that every Random Utility Model (RUM) aggregates into a specific form of entropy. Whether the underlying noise is normal (probit), uniform, or something exotic, society still behaves as if it were maximizing a welfare function regularized by a “Generalized Entropy of Choice”.
The general setting
Let’s move beyond the specific Gumbel distribution used in the logit case. Consider a general framework where the vector of random shocks \(\varepsilon\) is drawn from an arbitrary continuous distribution \(\mathcal{P}\). The individual still maximizes their utility \(U_y + \varepsilon_y\). At the aggregate level, we define the social welfare function \(G(U)\) as the expected maximum utility:
$$ G(U) = \mathbb{E}_{\mathcal{P}} \left[ \max_{y \in [Y]} \{ U_y + \varepsilon_y \} \right]. $$
Regardless of \(\mathcal{P}\), the gradient of this function yields the market shares \(\boldsymbol{\pi}\), that is, \(\partial G(U) / \partial U_y = \pi_y(U)\). This result, seen in the previous post, is known as the Daly-Zachary-Williams theorem.
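To make this concrete, here is a minimal Monte Carlo sketch of the gradient property for a non-logit model. The utilities `U`, the choice of i.i.d. standard normal shocks, the sample size, and the step `h` are all illustrative assumptions, not values taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def welfare_and_shares(U, eps):
    """Monte Carlo estimates of G(U) = E[max_y (U_y + eps_y)] and of choice frequencies."""
    total = U + eps                               # shape (n_draws, n_options)
    G = total.max(axis=1).mean()                  # expected maximum utility
    choices = total.argmax(axis=1)                # option picked by each simulated agent
    shares = np.bincount(choices, minlength=len(U)) / len(eps)
    return G, shares

U = np.array([0.5, 0.0, -0.3])                    # hypothetical systematic utilities
eps = rng.standard_normal((200_000, 3))           # i.i.d. normal shocks (probit-type noise)

G, shares = welfare_and_shares(U, eps)

# Central finite differences of G, reusing the same draws to reduce simulation noise
h = 0.05
grad = np.array([
    (welfare_and_shares(U + h * e, eps)[0] - welfare_and_shares(U - h * e, eps)[0]) / (2 * h)
    for e in np.eye(3)
])

print("simulated market shares:", shares)
print("finite-difference gradient of G:", grad)
```

Up to simulation and discretization error, the two printed vectors should agree, even though no closed form for \(G\) is used.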
To see why entropy appears in general models, we must first look at the geometric properties of the social welfare function. A key property of this function is that \(G\) is a convex function of the systematic utilities \(U\). Why? Because the maximum function is convex, and taking an expectation (which is a weighted sum) preserves convexity. This convexity is crucial because, in mathematics, convex functions always have a “dual” representation. Just as the logit model had a dual representation involving Shannon entropy, the general welfare function \(G(U)\) has a dual function \(G^*(\pi)\) defined by the Legendre-Fenchel transform, or convex conjugate:
$$ G^*(\pi) = \max_{U \in \mathbb{R}^Y} \left\{ \sum_{y \in [Y]} \pi_y U_y - G(U) \right\}. $$
This function \(G^*(\pi)\) is what we call the generalized entropy of choice. It is defined on the set of valid market share vectors \(\pi\), that is, vectors with nonnegative entries such that \(\sum_{y \in [Y]} \pi_y \leq 1\). Outside of this set, it takes value \(+\infty\).
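As an illustration, the conjugate can be computed numerically and compared with a known closed form. The sketch below assumes the standard log-sum-exp form of the logit welfare, \(G(U) = \log \sum_y e^{U_y}\) (no outside option, an assumption of this example), and evaluates \(G^*\) at a hypothetical share vector summing to one. Plain gradient ascent on \(U\) is just one simple way to solve the inner maximization.

```python
import numpy as np

def G_logit(U):
    """Logit welfare: log-sum-exp of the systematic utilities, computed stably."""
    m = U.max()
    return m + np.log(np.exp(U - m).sum())

def softmax(U):
    """Gradient of G_logit, i.e. the logit market shares."""
    z = np.exp(U - U.max())
    return z / z.sum()

def conjugate(pi, n_steps=2000, step=1.0):
    """Approximate G*(pi) = max_U { pi.U - G(U) } by gradient ascent on U."""
    U = np.zeros_like(pi)
    for _ in range(n_steps):
        U += step * (pi - softmax(U))    # gradient of the concave objective in U
    return pi @ U - G_logit(U), U

pi = np.array([0.5, 0.3, 0.2])           # hypothetical market shares
G_star, U_opt = conjugate(pi)
closed_form = np.sum(pi * np.log(pi))    # sum_y pi_y log pi_y

print("numerical G*(pi):", G_star)
print("sum pi log pi   :", closed_form)
```

At the optimum the first-order condition reads \(\pi = \nabla G(U)\), so the maximizing \(U\) is a utility vector that rationalizes the shares \(\pi\).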
The variational principle
The variational principle in convex analysis states that a (closed) convex function is characterized by its convex conjugate: it is the convex conjugate of its own convex conjugate, a result known as the Fenchel-Moreau theorem. This duality allows us to flip the problem around. Instead of defining welfare \(G(U)\) as an expectation of maxima, we can express it as an optimization problem over market shares. This is the aggregate choice problem:
$$ G(U) = \max_{\boldsymbol{\pi}} \left\{ \sum_{y \in [Y]} \boldsymbol{\pi}_y U_y - G^*(\boldsymbol{\pi}) \right\}, $$
or, keeping in mind the domain of \(G^\ast\), equivalently:
$$
\begin{aligned}
G(U) = \max_{\pi \in \mathbb{R}^Y} & \left\{ \sum_{y \in [Y]} \pi_y U_y - G^*(\pi) \right\} \\
\text{s.t. } & \pi_y \geq 0, \sum_{y \in [Y]} \pi_y \leq 1.
\end{aligned}
$$
This result (Proposition 1.3.1 in the book) reveals the hidden structure of discrete choice. It tells us that the aggregate market behaves as if a single representative agent were maximizing a net utility consisting of two terms:
- the expected systematic utility \(\sum \pi_y U_y\): the average systematic utility of the options, weighted by their market shares.
- the (generalized) entropic regularization \(-G^*(\pi)\): a penalty for concentrating market share too heavily on any single option.
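The representative-agent problem can be checked numerically in the logit case. The sketch below assumes the log-sum-exp welfare and the entropy \(\sum_y \pi_y \log \pi_y\) on shares summing to one (no outside option). The utilities and the random candidate share vectors are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy_penalty(pi):
    """Logit generalized entropy G*(pi) = sum_y pi_y log pi_y, with 0 log 0 = 0."""
    safe = np.where(pi > 0, pi, 1.0)              # log(1) = 0 handles zero shares
    return np.sum(pi * np.log(safe), axis=-1)

def aggregate_objective(pi, U):
    """Net utility of the representative agent: sum_y pi_y U_y - G*(pi)."""
    return pi @ U - entropy_penalty(pi)

U = np.array([1.0, 0.2, -0.5])                    # hypothetical systematic utilities
G = np.log(np.exp(U).sum())                       # logit welfare G(U)
pi_star = np.exp(U) / np.exp(U).sum()             # logit market shares

# Random candidate share vectors on the simplex, for comparison
candidates = rng.dirichlet(np.ones(3), size=10_000)
values = aggregate_objective(candidates, U)

print("G(U):", G)
print("objective at logit shares:", aggregate_objective(pi_star, U))
print("best random candidate:", values.max())
```

In this logit case the gap between \(G(U)\) and the objective at any other \(\pi\) is the Kullback-Leibler divergence between \(\pi\) and the logit shares, which is why no candidate can beat the logit shares.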
The meaning of generalized entropy
What does this abstract mathematical object represent physically? It turns out to have a striking interpretation. The generalized entropy \(G^*(\boldsymbol{\pi})\) is equal to minus the expected heterogeneity required to rationalize the market shares \(\boldsymbol{\pi}\).
To understand the physical meaning of \(G^\ast\), let us define \(y^\star(\varepsilon)\) as the optimal choice of an agent with utility shock \(\varepsilon\). Let us also define \(\pi^\star\) as the vector of market shares that maximizes the aggregate choice problem, which corresponds to the observed demand \(\boldsymbol{\pi}(U)\). We can write the social welfare \(G(U)\) in two equivalent ways. From the microscopic perspective, it is the expected utility of the optimal choice, \( G(U) = \mathbb{E}[ U_{y^\star(\varepsilon)} + \varepsilon_{y^\star(\varepsilon)} ] \). From the macroscopic perspective, using the variational formula with the optimal \(\pi^\star\), it is \( G(U) = \sum_{y} \pi^\star_y U_y - G^*(\pi^\star) \). Since the average systematic utility \(\sum \pi^\star_y U_y\) matches the expected individual systematic utility \(\mathbb{E}[ U_{y^\star(\varepsilon)} ]\), comparing the two expressions reveals that the entropic penalty must balance the expected random utility:
Claim (Proposition 1.3.1 p. 21). Mathematically, we have:
$$ G^*(\boldsymbol{\pi}) = - \mathbb{E} \left[ \varepsilon_{y^\star(\varepsilon)} \right], $$
where \(y^\star(\varepsilon)\) is the optimal choice for a consumer with shock \(\varepsilon\), conditional on the aggregate choice being \(\boldsymbol{\pi}\). In other words, \(-G^*(\boldsymbol{\pi})\) measures the “cost” of the random noise needed to sustain the observed behavior.
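This identity is easy to test by simulation in the logit case, where the left-hand side has a closed form. The sketch below assumes i.i.d. Gumbel shocks recentered to have mean zero (location \(-\gamma\), with \(\gamma\) the Euler-Mascheroni constant), so that \(G(U)\) is exactly the log-sum-exp. The utilities and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

U = np.array([1.0, 0.2, -0.5])                    # hypothetical systematic utilities
n = 500_000

# Mean-zero Gumbel shocks: a standard Gumbel has mean euler_gamma, so shift by it
eps = rng.gumbel(loc=-np.euler_gamma, scale=1.0, size=(n, 3))

choice = (U + eps).argmax(axis=1)                 # y*(eps) for each simulated agent
eps_chosen = eps[np.arange(n), choice]            # shock of the chosen option

minus_mean_shock = -eps_chosen.mean()             # Monte Carlo estimate of -E[eps_{y*}]

pi = np.exp(U) / np.exp(U).sum()                  # logit market shares at U
entropy = np.sum(pi * np.log(pi))                 # G*(pi) = sum_y pi_y log pi_y

print("-E[eps_{y*}] (simulated):", minus_mean_shock)
print("sum pi log pi (exact)   :", entropy)
```

The chosen shock has a positive mean even though each \(\varepsilon_y\) has mean zero, because agents select the option whose shock happens to be favorable. That selection effect is what the generalized entropy measures.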
Examples of Generalized Entropy:
- Logit model: As we saw in part 1, if \(\varepsilon\) is Gumbel-distributed, \(G^*(\pi)\) is the Gibbs-Shannon entropy (up to a sign flip): \(\sum \pi_y \log \pi_y\).
- Binomial model: In a simple binary choice where the random shock has cumulative distribution function \(F\), the entropy takes the form of an integral of the quantile function \(F^{-1}\): \( G^*(\pi_1) = -\int_{0}^{\pi_1} F^{-1}(1-m) dm \).
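As a consistency check between these two examples, suppose (as an assumption made here for illustration) that \(F\) is the standard logistic distribution, \(F(x) = 1/(1+e^{-x})\), which is the distribution of the difference of two independent Gumbel shocks. Its quantile function is \(F^{-1}(t) = \log \frac{t}{1-t}\), so \(F^{-1}(1-m) = \log \frac{1-m}{m}\) and the integral can be computed in closed form:

$$
\begin{aligned}
G^*(\pi_1) &= -\int_0^{\pi_1} \log\frac{1-m}{m}\, dm = \int_0^{\pi_1} \left[ \log m - \log(1-m) \right] dm \\
&= \Big[ m \log m + (1-m)\log(1-m) \Big]_0^{\pi_1} = \pi_1 \log \pi_1 + (1-\pi_1)\log(1-\pi_1).
\end{aligned}
$$

This is the two-option version of \(\sum \pi_y \log \pi_y\), with \(1-\pi_1\) as the share of the other option, so the binary formula recovers the logit entropy in this special case.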
Conclusion
The “disorder” of individual choices always aggregates into a coherent structure at the macro level. While the logit model gives us the most familiar version (Gibbs-Shannon entropy), every discrete choice model has its own unique entropy signature. By understanding this link, we can use powerful tools from convex analysis and optimal transport to analyze demand and welfare.
In the next post, we will discuss the connection between the inversion of discrete choice models and optimal transport in detail, showing how this geometric framework provides powerful computational tools for econometrics.
Reference
[DCM] Galichon, Alfred. 2026. Discrete Choice Models: Mathematical Methods, Econometrics, and Data Science. Princeton University Press. Chapter 1.