So, I just showed you that empirically the likelihood will converge,
but theoretically it can also be proved that EM algorithm will
converge to a local maximum.
So here's just an illustration of what happened and a detailed explanation.
This required more knowledge about that,
some of that inequalities, that we haven't really covered yet.
So here what you see is on the X dimension, we have a c0 value.
This is a parameter that we have.
On the y axis we see the likelihood function.
So this curve is the original likelihood function,
and this is the one that we hope to maximize.
And we hope to find a c0 value at this point to maximize this.
But in the case of Mitsumoto we can not easily find an analytic solution
to the problem.
So, we have to resolve the numerical errors, and
the EM algorithm is such an algorithm.
It's a Hill-Climb algorithm.
That would mean you start with some random guess.
Let's say you start from here, that's your starting point.
And then you try to improve this by moving this to
another point where you can have a higher likelihood.
So that's the ideal hill climbing.
And in the EM algorithm, the way we achieve this is to do two things.
First, we'll fix a lower bound of likelihood function.
So this is the lower bound.
So we already know it's improving the lower bound.
So we definitely improve this original likelihood function,
which is above this lower bound.
So, in our example,
the current guess is parameter value given by the current generation.
And then the next guess is the re-estimated parameter values.
From this illustration you can see the next guess
is always better than the current guess.
Unless it has reached the maximum, where it will be stuck there.
So the two would be equal.
So, the E-step is basically
to compute this lower bound.
We don't directly just compute this likelihood function but
we compute the length of the variable values and
these are basically a part of this lower bound.
This helps determine the lower bound.
The M-step on the other hand is to maximize the lower bound.
It allows us to move parameters to a new point.
And that's why EM algorithm is guaranteed to converge to a local maximum.
Now, as you can imagine, when we have many local maxima,
we also have to repeat the EM algorithm multiple times.
In order to figure out which one is the actual global maximum.
And this actually in general is a difficult problem in numeral optimization.
So here for example had we started from here,
then we gradually just climb up to this top.
So, that's not optimal, and we'd like to climb up all the way to here,
so the only way to climb up to this gear is to start from somewhere here or here.
So, in the EM algorithm, we generally would have to start from different points
or have some other way to determine a good initial starting point.
The general idea is that we will have two steps to improve the estimate of.
In the E-step we roughly [INAUDIBLE] how many there are by predicting values
of useful hidden variables that we would use to simplify the estimation.
In our case, this is the distribution that has been used to generate the word.
In the M-step then we would exploit such augmented data which would make
it easier to estimate the distribution, to improve the estimate of parameters.
Here improve is guaranteed in terms of the likelihood function.
Note that it's not necessary that we will have a stable convergence of
parameter value even though the likelihood function is ensured to increase.
There are some properties that have to be satisfied in order for the parameters
also to convert into some stable value.
Now here data augmentation is done probabilistically.
we're not going to just say exactly what's the value of a hidden variable.
But we're going to have a probability distribution over the possible values of
these hidden variables.
So this causes a split of counts of events probabilistically.
And in our case we'll split the word counts between the two distributions.
(Appendix added on August 4th, 2014, see at the end)
Introduction to EM
EM (Expectation Maximization) is a general technique for solving model learning problem including latent variables. Given a set of observed data and a set of latent variables , we want to learn the parameters by maximizing the log-likelihood of data, i.e. , including latent variable, we have the objective function for learning:
However, this makes it difficult for us to utilize some standard optimization techniques like Gradient Descent, etc.. Facing this difficulty, we can include two ideas to help: 1, we can do the the whole learning step by step, i.e. iteratively, so that at each step we will have a old set of parameters, from the -th iteration, that we can use; 2, we can find an lower bound for objective function 1, and we optimize towards it, since it is an lower bound that we can guarantee to improve (or not decrease), we can (somehow) guarantee the objective function to be increasing (or not decreasing). Keep these two ideas in mind, we can do the derivation as below:
In the above derivation, we use last iteration estimated parameters to find a lower bound, which we can just maximize to indirectly improve the original likelihood function. The equality holds when in Jensen’s inequality is a constant variable, in above case that is , is a constant. This equality gives us a estimate of latent variable under the model:
Many textbook about EM utilizes the so-called Q-function:
Where we can view as the complete data log-likelihood ( is incomplete data, are latent variables), as predictions of latent variables by using current estimated parameters, and Q-function as an expectation of the log-likelihood of the complete data. Then we can formalize the procedure for applying the EM algorithm:
Prepare the Q-function 4
Randomly initialize the parameters
Estimate latent variables according current estimated parameters:
Maximize the Q-function 4 w.r.t. the parameters given last estimated parameters
Some properties about EM: it is proved that under EM procedure, if the lower bound Q-function is increasing (or not decreasing), then the log-likelihood function will be increasing (or not decreasing) and the iteration will finally go to a converged point (one way to see it is subtract with ). It is worth noting that EM can only find local maxima (a steady high point, but not the highest one), and so the initialization of EM matters.
Applying EM in pLSA
pLSA (Probabilistic Latent Semantic Analysis [1, 2]) is a technique for mining topics of texts. In pLSA, the log-likelihood function is:
Where indicates documents, denotes words, means topic assignments. is the number of documents, is the size of vocabulary, is number of topics.
In the log-likelihood function 5, we find a structure with latent variable that is very similar to the log-likelihood function 1 of a general EM problem, so we can utilize the similar derivation that utilized by EM to further derive the log-likelihood function 5, however, we can also do it by applying EM algorithm. To do so, we need to derive Q-function for EM. First, as the equation 5 is not the complete data log-likelihood, which is:
To apply Q-function, we have to be very careful now, as it will get very tricky as we include the random variable , which is a multivariate random variable that has slots with each slot of states. In Q-function, we have to sum over all possible , which in this case seems lead to an exponential of terms. However that is not the case, as is independent to each other. We can write the Q-function as:
(Still, the above derivation is not at the most detailed level compared to how tricky it is).
Now we have prepared the Q-function 7 and established the E step, which is to calculate (using Bayes’ Rule), we can go ahead and optimize the Q-function. Before doing so, we should notice the constraints: and , we can include Lagrange multipliers and into the Q-function to form a complete objective function:
Setting equation 8 to zero w.r.t. to parameters and yields:
By substituting equation 9 into normalization constraints and , we find a closed form to our maximized parameters estimation:
To summarize, the EM algorithm for solving pLSA is:
Initialize the parameters and (note that implementation-wise, for , random is better than uniform, which will get stuck, for , uniform is better than random)
Two more general ways of understanding EM:
Lower bound optimization: given we are optimizing an objective function includes latent variables , since this is directly difficult, we apply Jensen’s inequality in a specific way (see E.q. 2) to get a lower bound of original objective, now the trick of EM is that we improve the objective function by improving the tight lower bound: in E step, we estimate the distribution of latent variable such that (after E step) lower bound is equal to the objective (to do so, the equality condition of Jensen’s equality is used); in M step, we optimize the lower bound w.r.t. parameters (getting rid of constant, M step is optimizing Q-function), since from E step, the objective is tightly lower bounded, the improvement of the objective in M step is guaranteed to be greater or equal to the improvement of the lower bound. Repeat E and M steps, you can see 1. the objective will not decrease (as the lower bound will not decrease) 2. the objective will converge, since it will not decrease, and there must be a upper limit.
Similar to lower bound optimization point of view, what is shown on Bishop’s book “Pattern Recognition and Machine Learning”, chapter 9.4 is even more general. I very much recommend readers to take a look at it.
 T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of the fifteenth conference on uncertainty in artificial intelligence, 1999, pp. 289-296.
 T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine learning, vol. 42, iss. 1-2, pp. 177-196, 2001.