When we model the probability of a variable $latex x$ by $latex p(x) = {e^{-\frac{F(x)}{T}}}$, $latex F(x)$ is often referred to as the free energy.
The name is there for historical reasons. The Gibbs-Boltzmann distribution for a configuration is proportional to $latex e^{-\frac{H}{k_B T}}$. The closest explanation I have found is this:
$latex p(H,T) = We^{-\frac{U}{kT}} = e^{\frac{S}{k} - \frac{U}{kT}} = e^{-\frac{U - TS}{kT}}$
and $latex U-TS$ is the Helmholtz free energy. Unfortunately, I doubt the above explanation is actually correct, since it completely ignores the partition function. In my old notes on exponential families, I also mentioned free energy and the Legendre transformation. I am also aware that the Legendre transformation is just the convex conjugate. But I have forgotten all of this by now.
Update:
I reread my notes and MacKay’s textbook. I am not sure that the way LeCun’s slides use the term free energy is correct. It seems that the free energy should be $latex F = -T \ln Z$. The sign is a bit tricky though, as shown below.
With Boltzmann distribution,
$latex p(x) = \frac{e^{-\beta \mathcal{E}(x)}}{\sum_x e^{-\beta \mathcal{E}(x)}}$,
where $latex \beta = 1/T$ is the inverse temperature. As usual, we will denote the denominator as the partition function $latex Z(\beta) = \sum_x e^{-\beta \mathcal{E}(x)}$ and we will write $latex A(\beta) = \ln Z(\beta)$ as the log-partition function.
Note that $latex A(\beta)$ has a nice cumulant-generating property. It can be readily verified that
$latex \frac{\partial A}{\partial \beta}=-E[\mathcal{E}(X)]$ and $latex \frac{\partial^2 A}{\partial \beta^2} = Var(\mathcal{E})$. Since $latex \frac{\partial^2 A}{\partial \beta^2} = Var(\mathcal{E}) \ge 0$, $latex A(\beta)$ is a convex function. So the convex conjugate of the convex conjugate of $latex A(\beta)$ is $latex A(\beta)$ itself.
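The two derivative identities above can be checked numerically. The following is only a minimal sketch: the four energy levels, the value of $latex \beta$, and the step size are made-up values for illustration, and the derivatives are approximated by central finite differences.

```python
import math

# Hypothetical energy levels of a toy four-state system (not from the post).
energies = [0.0, 0.5, 1.3, 2.0]

def log_partition(beta):
    """A(beta) = ln sum_x exp(-beta * E(x))."""
    return math.log(sum(math.exp(-beta * e) for e in energies))

def boltzmann(beta):
    """Boltzmann probabilities p(x) = exp(-beta * E(x)) / Z(beta)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def energy_mean_and_variance(beta):
    """Mean and variance of the energy under the Boltzmann distribution."""
    p = boltzmann(beta)
    mean = sum(pi * e for pi, e in zip(p, energies))
    var = sum(pi * (e - mean) ** 2 for pi, e in zip(p, energies))
    return mean, var

beta, h = 0.7, 1e-4  # arbitrary inverse temperature and finite-difference step

# Central finite differences for the first and second derivatives of A.
dA = (log_partition(beta + h) - log_partition(beta - h)) / (2 * h)
d2A = (log_partition(beta + h) - 2 * log_partition(beta)
       + log_partition(beta - h)) / h ** 2

mean, var = energy_mean_and_variance(beta)
print("dA/dbeta   =", dA, " vs  -E[energy]  =", -mean)
print("d2A/dbeta2 =", d2A, " vs  Var(energy) =", var)
```

Each printed pair should agree up to finite-difference error, and the second pair is nonnegative, which is the convexity of $latex A(\beta)$.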
Consider the convex conjugate of $latex A(\beta)$,
$latex A^*(v) = \sup_\beta \left( \beta v - A(\beta) \right)$
And the optimal $latex \beta$ satisfies $latex v + E[\mathcal{E}(X)] = 0$. In other words, given a $latex v$, the inverse temperature, $latex \beta$, should be chosen such that the average energy is equal to $latex -v$. Let’s write $latex v(\beta)$ and $latex \beta(v)$ for $latex v$ and $latex \beta$ that satisfy the condition $latex v(\beta)=-E_\beta[\mathcal{E}(X)]$.
And what the heck is $latex A^*(v)$? It turns out that it is simply the negative entropy $latex -H(\beta(v))$, because
$latex H(\beta) = -E[\ln p(X)] = E[\beta \mathcal{E}(X)]+E[\ln Z] = -\beta v(\beta)+\ln Z =-A^*(v(\beta))$.
So we also have
$latex HT= E[\mathcal{E}(X)] + T \ln Z$, where $latex T = 1/\beta$.
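As a sanity check on the claim that the conjugate is the negative entropy, here is a small numerical sketch. The energy levels, the choice $latex \beta_0 = 0.7$, and the search interval are assumptions made for illustration. The supremum is found by a ternary search, which works here because $latex \beta v - A(\beta)$ is concave in $latex \beta$.

```python
import math

# Hypothetical energy levels of a toy four-state system (not from the post).
energies = [0.0, 0.5, 1.3, 2.0]

def log_partition(beta):
    """A(beta) = ln Z(beta)."""
    return math.log(sum(math.exp(-beta * e) for e in energies))

def boltzmann(beta):
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(beta):
    """H(beta) = -E[ln p(X)] under the Boltzmann distribution."""
    return -sum(p * math.log(p) for p in boltzmann(beta))

def mean_energy(beta):
    return sum(p * e for p, e in zip(boltzmann(beta), energies))

def conjugate(v, lo=-20.0, hi=20.0, iters=200):
    """A*(v) = sup_beta (beta * v - A(beta)), by ternary search on [lo, hi]."""
    def objective(b):
        return b * v - log_partition(b)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    b = (lo + hi) / 2
    return objective(b), b

beta0 = 0.7                      # arbitrary inverse temperature
v = -mean_energy(beta0)          # v(beta0) = -E[energy]
a_star, beta_star = conjugate(v)

print("maximizing beta =", beta_star, " (expected", beta0, ")")
print("A*(v) =", a_star, " vs  -H =", -entropy(beta0))

# The same numbers also satisfy H T = E[energy] + T ln Z with T = 1 / beta0.
T = 1 / beta0
print("H T =", entropy(beta0) * T,
      " vs ", mean_energy(beta0) + T * log_partition(beta0))
```

The maximizing $latex \beta$ recovers $latex \beta_0$, which is the statement that $latex v$ and $latex \beta$ determine each other through the average energy.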
Defining the free energy as $latex F = -T\ln Z = E[\mathcal{E}(X)] - HT$, we have
$latex \frac{\partial F}{\partial T}= -\ln Z - T \frac{\partial A}{\partial \beta} \frac{\partial \beta}{\partial T}= -\ln Z - \frac{E[\mathcal{E}(X)]}{T} = -H(\beta)$.
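The derivative of the free energy with respect to temperature can also be checked numerically. Again this is only a sketch on a made-up four-state system, with an arbitrary temperature and a central finite difference standing in for the derivative.

```python
import math

# Hypothetical energy levels of a toy four-state system (not from the post).
energies = [0.0, 0.5, 1.3, 2.0]

def boltzmann(T):
    """Boltzmann probabilities at temperature T (beta = 1 / T)."""
    weights = [math.exp(-e / T) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def free_energy(T):
    """F(T) = -T ln Z(1 / T)."""
    return -T * math.log(sum(math.exp(-e / T) for e in energies))

def entropy(T):
    return -sum(p * math.log(p) for p in boltzmann(T))

def mean_energy(T):
    return sum(p * e for p, e in zip(boltzmann(T), energies))

T, h = 1.5, 1e-5  # arbitrary temperature and finite-difference step

# Central finite difference for dF/dT.
dF = (free_energy(T + h) - free_energy(T - h)) / (2 * h)

print("dF/dT =", dF, " vs  -H =", -entropy(T))
print("F =", free_energy(T), " vs  E[energy] - H T =",
      mean_energy(T) - entropy(T) * T)
```

Since the entropy of a discrete distribution is nonnegative, the computed slope is never positive, so $latex F$ does not increase with temperature.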
Remarks
I’m still not sure why free energy is called “free”. I guess the best explanation is that $latex u - HT$, with $latex u$ the average energy, is traditionally the Helmholtz free energy. While it seems to be related to the energy available for work in thermodynamics, I doubt the same interpretation can be applied in this simple model. For one thing, note that two things happen when we increase the temperature of the system: the system becomes more likely to hop to higher energy states and, since it can spread over more states, the entropy increases. Since the entropy is nonnegative, the “free” energy decreases as the temperature increases, because $latex \frac{\partial F}{\partial T}=-H(\beta)$.
LeCun’s definition in his lecture is very similar but not the same. $latex F$ there is a function of some configuration ($latex x$ and $latex y$). The definition here is essentially the weighted sum of the $latex F$ in LeCun’s lecture, weighted by the probabilities of the states.