As the name suggests, the generalized linear model (GLM) is a generalization of the linear regression model. In a classic linear model with input $latex X$, the output $latex Y$ is modeled by
$latex Y|X \sim \mathcal{N} (B X, \sigma^2 I)$.
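As a minimal sketch of this baseline (the variable names and shapes are illustrative, with $latex X$ stacked as an $latex n \times p$ design matrix), the maximum-likelihood estimate of $latex B$ under the Gaussian model is just the ordinary least-squares solution:

```python
import numpy as np

# Illustrative data; names and shapes here are assumptions, not from the post.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))            # design matrix, one row per sample
B_true = np.array([1.0, -2.0, 0.5])
Y = X @ B_true + rng.normal(scale=0.1, size=n)

# Under Y|X ~ N(XB, sigma^2 I), the MLE of B is the least-squares solution.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B_hat)  # close to B_true
```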
GLM extends this model in two ways. First, instead of having the mean be simply $latex BX$, we can have the mean be $latex g^{-1}(BX)$, where $latex g(\cdot)$ is known as the link function. This allows us to model a $latex Y$ whose mean does not range over the entire real line. For example, if we model $latex Y$ as the probability of finding a target, we can use the link function $latex g(\mu) = \mbox{logit}(\mu) = \log\frac{\mu}{1-\mu}$, which maps a probability in $latex (0,1)$ onto the entire real line.
Second, we no longer have to model $latex Y|X$ as a normal distribution. In fact, given the link function, a normal distribution may not even make sense. Continuing with the above example, if we model $latex Y|X$ as a normal distribution, $latex Y$ can fall outside the bounds of $latex [0,1]$, defeating the purpose of introducing the link function to handle responses that do not live on the entire real line.
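To make the mapping concrete, here is a quick NumPy sketch (function names are illustrative) of the logit link and its inverse, the sigmoid, which carry a mean in $latex (0,1)$ to the real line and back:

```python
import numpy as np

def logit(mu):
    # Link: maps a mean in (0, 1) to the whole real line.
    return np.log(mu / (1.0 - mu))

def inv_logit(eta):
    # Inverse link (sigmoid): maps any real number back into (0, 1).
    return 1.0 / (1.0 + np.exp(-eta))

mu = np.array([0.1, 0.5, 0.9])
eta = logit(mu)         # roughly [-2.197, 0.0, 2.197]
print(inv_logit(eta))   # recovers [0.1, 0.5, 0.9]
```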
In fact, the link function and the distribution naturally hook together. Consider the canonical exponential family distribution $latex p_\theta(y) = \exp(\langle \theta, y \rangle - b(\theta)) f(y)$ (for simplicity, let's ignore the dispersion parameter $latex \phi$ here; one may absorb it into $latex \theta$ anyway, right?). Note that $latex b(\theta)$ is the log partition function and $latex \mu=b'(\theta)$. Thus, for the $latex i$th sample, we have $latex \theta_i = b'^{-1}(\mu_i) = b'^{-1}(g^{-1}(B X_i)) \triangleq h(BX_i)$, where $latex h=(g\circ b')^{-1}$.
If we use the canonical link function, which is simply $latex g(\cdot)=b'^{-1}(\cdot)$ (note that $latex \mu = b'(\theta)$, and "canonically" we want $latex BX=\theta=g(\mu)=g(b'(\theta))$), then $latex h$ is simply the identity function. The overall log-likelihood that we try to maximize is then
$latex l = \sum_{i=1}^n \left( \langle y_i, \theta_i \rangle - b(\theta_i) \right) = \sum_{i=1}^n \left( \langle y_i, BX_i \rangle - b(BX_i) \right)$.
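As a quick sanity check, take the Bernoulli distribution with mean $latex \mu$: its pmf can be written as $latex p(y) = \exp\left(y \log\frac{\mu}{1-\mu} + \log(1-\mu)\right)$, so the natural parameter is $latex \theta = \log\frac{\mu}{1-\mu}$ (the logit of $latex \mu$) and $latex b(\theta) = -\log(1-\mu) = \log(1+e^\theta)$. One can verify that $latex b'(\theta) = \frac{e^\theta}{1+e^\theta} = \mu$, so the canonical link $latex g = b'^{-1}$ is exactly the logit, which is why the logit link and the Bernoulli model hook together in logistic regression below.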
Logistic regression
Let’s take logistic regression as an example.
$latex l(B|Y,X) = \sum_{i=1}^n (Y_iX_i^\top B – \log (1+e^{X_i^\top B})$
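A minimal NumPy sketch of this log-likelihood (the names are illustrative: `X` is the $latex n \times p$ design matrix, `Y` a 0/1 response vector, `B` the coefficient vector):

```python
import numpy as np

def log_likelihood(B, X, Y):
    # l(B | Y, X) = sum_i ( Y_i X_i^T B - log(1 + exp(X_i^T B)) )
    eta = X @ B                      # linear predictor X_i^T B for each sample
    return np.sum(Y * eta - np.log1p(np.exp(eta)))
```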
One can maximize the above log-likelihood with gradient ascent or, more sophisticatedly, the Newton-Raphson method. Note that the gradient is
$latex \nabla l = \sum_{i=1}^n \left( Y_i - \frac{e^{X_i^\top B}}{1+e^{X_i^\top B}} \right) X_i$
and Hessian is
$latex H = - \sum_{i=1}^n \frac{e^{X_i^\top B}}{(1+e^{X_i^\top B})^2}X_iX_i^\top$.
Consequently, we can iterate $latex B$ with
$latex B^{(k+1)} \leftarrow B^{(k)} - \left[ H(B^{(k)}) \right]^{-1} \nabla l(B^{(k)})$
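Putting the gradient, the Hessian, and the update together, a minimal Newton-Raphson sketch might look as follows (again with illustrative names, and without step-size control or convergence checks):

```python
import numpy as np

def newton_raphson_logistic(X, Y, n_iter=25):
    """Fit logistic-regression coefficients B by Newton-Raphson (no safeguards)."""
    p = X.shape[1]
    B = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ B)))          # e^{X_i^T B} / (1 + e^{X_i^T B})
        grad = X.T @ (Y - mu)                         # sum_i (Y_i - mu_i) X_i
        H = -(X * (mu * (1.0 - mu))[:, None]).T @ X   # -sum_i mu_i (1 - mu_i) X_i X_i^T
        B = B - np.linalg.solve(H, grad)              # B <- B - H^{-1} grad l
    return B
```

Since the Hessian is negative definite, this update moves uphill on the log-likelihood at each iteration.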
Reference
https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/lecture-slides/MIT18_650F16_GLM.pdf