To briefly state the difference between generative models and discriminative models, I would say that a generative model specifies the joint probability $p(\mathbf{x}, y)$, while a discriminative model specifies the conditional probability $p(y \mid \mathbf{x})$. Andrew Ng and Michael Jordan wrote a paper on this comparison, with logistic regression and naive Bayes as the examples. [Ng, Jordan, 2001]

Assume that we have a set of $K$ features $f_k(y, \mathbf{x})$, $k = 1, \dots, K$. Then a naive Bayes model, as a generative model, can be written in log-linear form as $p(y, \mathbf{x}) = \frac{\exp \left[ \sum_k \lambda_k f_k (y, \mathbf{x}) \right]}{\sum_{\tilde{\mathbf{x}}, \tilde{y}} \exp \left[ \sum_k \lambda_k f_k (\tilde{y}, \tilde{\mathbf{x}}) \right]}$, where the normalization in the denominator runs over all possible pairs $(\tilde{y}, \tilde{\mathbf{x}})$.
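To make the formula concrete, here is a minimal sketch that evaluates this joint distribution on a tiny discrete space. The feature map and the weights $\lambda_k$ are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Two labels and two binary features of x, so four possible x vectors.
labels = [0, 1]
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def features(y, x):
    # f_k(y, x): one indicator per (label, feature-position) pair.
    f = np.zeros(4)
    for j, xj in enumerate(x):
        f[2 * y + j] = xj
    return f

lam = np.array([0.5, -0.3, 0.2, 1.0])  # hypothetical weights lambda_k

# Unnormalized scores exp[sum_k lam_k f_k(y, x)] over the whole joint space.
scores = np.array([[np.exp(lam @ features(y, x)) for x in xs] for y in labels])
joint = scores / scores.sum()          # normalize over all (y, x) pairs

print(joint)            # p(y, x): a proper joint distribution
print(joint.sum())      # 1.0
```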

To train a naive Bayes model, the objective is maximum likelihood: we choose the parameters that maximize the joint likelihood $\prod_i p(y_i, \mathbf{x}_i)$ of the training data, or equivalently minimize the negative log-likelihood.
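Under the standard naive Bayes factorization $p(y, \mathbf{x}) = p(y) \prod_j p(x_j \mid y)$, this maximization has a closed form: the likelihood-maximizing parameters are just relative frequencies in the training data. A sketch with made-up toy data (the `alpha` parameter adds optional Laplace smoothing; `alpha=0` recovers the pure maximum-likelihood estimate):

```python
import numpy as np

X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1]])  # toy binary features
y = np.array([0, 0, 1, 1, 0])

def fit_naive_bayes(X, y, n_classes=2, alpha=1.0):
    prior = np.zeros(n_classes)
    cond = np.zeros((n_classes, X.shape[1]))   # p(x_j = 1 | y)
    for c in range(n_classes):
        Xc = X[y == c]
        prior[c] = len(Xc) / len(X)            # relative class frequency
        cond[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return prior, cond

def joint_log_prob(x, c, prior, cond):
    # log p(y=c, x) = log p(y=c) + sum_j log p(x_j | y=c)
    p = np.where(x == 1, cond[c], 1 - cond[c])
    return np.log(prior[c]) + np.log(p).sum()

prior, cond = fit_naive_bayes(X, y)
print(joint_log_prob(np.array([1, 0]), 0, prior, cond))
```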

Logistic regression, and its multiclass generalization, the maximum entropy classifier, are discriminative models. But Ng and Jordan argued that the two families are theoretically related, as the probability of $y$ given $\mathbf{x}$ derived from the naive Bayes model above has exactly the logistic-regression form, i.e., $p( y \mid \mathbf{x}) =\frac{\exp \left[ \sum_k \lambda_k f_k (y, \mathbf{x}) \right]}{\sum_{\tilde{y}} \exp \left[ \sum_k \lambda_k f_k (\tilde{y}, \mathbf{x}) \right]}$, where the normalization now runs over the labels alone.
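A sketch of this conditional form, reusing the illustrative features and weights from the joint-model example above; the only change is that the normalization runs over $y$ alone:

```python
import numpy as np

def features(y, x):
    # Same illustrative feature map as in the joint-model sketch.
    f = np.zeros(4)
    for j, xj in enumerate(x):
        f[2 * y + j] = xj
    return f

lam = np.array([0.5, -0.3, 0.2, 1.0])  # hypothetical weights lambda_k

def p_y_given_x(x, labels=(0, 1)):
    logits = np.array([lam @ features(y, x) for y in labels])
    logits -= logits.max()              # numerical stabilization
    probs = np.exp(logits)
    return probs / probs.sum()          # normalize over labels, not (y, x)

print(p_y_given_x((1, 1)))              # a proper distribution over the labels
```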

To train a logistic regression, the loss function is the cross entropy, i.e., the negative conditional log-likelihood $-\sum_i \log p(y_i \mid \mathbf{x}_i)$. For maximum entropy classifiers, the same log-linear solution arises from the opposite direction: maximize the entropy of the conditional distribution subject to the constraint that the feature expectations match their empirical values, a problem dual to maximum conditional likelihood.
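A minimal sketch of training binary logistic regression by gradient descent on the average cross entropy. The toy data and hyperparameters (`lr`, the step count) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy labels

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid: p(y=1 | x)
    # Gradient of the average cross entropy -mean[y log p + (1-y) log(1-p)]
    grad_w = X.T @ (p - y) / len(y)
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
```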

In application there is one more difference: classification with the generative model goes through Bayes' rule, $p(y \mid \mathbf{x}) \propto p(\mathbf{x} \mid y)\, p(y)$, so the class prior $p(y)$ enters explicitly, whereas the discriminative model outputs $p(y \mid \mathbf{x})$ directly and needs no explicit prior.
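The role of the prior is easy to see in a sketch of generative classification; the parameter values below are illustrative, in the style of the toy naive Bayes model above:

```python
import numpy as np

prior = np.array([0.6, 0.4])                 # p(y): enters the decision rule
cond = np.array([[0.8, 0.4],                 # p(x_j = 1 | y), per class
                 [0.2, 0.7]])

def predict_generative(x):
    log_joint = [
        np.log(prior[c]) + np.log(np.where(x == 1, cond[c], 1 - cond[c])).sum()
        for c in range(2)
    ]
    return int(np.argmax(log_joint))         # argmax_y p(x | y) p(y)

print(predict_generative(np.array([1, 0])))  # shifting the prior can flip this
```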

Another example of a generative-discriminative pair is the hidden Markov model (HMM) and the conditional random field (CRF). [Sutton, McCallum, 2010]
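This pairing can be seen in miniature on a toy tagging problem: the HMM scores the joint $p(\mathbf{y}, \mathbf{x})$, while the linear-chain CRF scores the conditional $p(\mathbf{y} \mid \mathbf{x})$, whose normalizer $Z(\mathbf{x})$ sums over all label sequences via the forward algorithm. All parameters below are illustrative; the CRF's weights are simply taken to be the HMM's log-probabilities:

```python
import numpy as np

log_pi = np.log([0.6, 0.4])                   # HMM initial distribution
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])      # transition p(y_t | y_{t-1})
log_B = np.log([[0.5, 0.4, 0.1],              # emission p(x_t | y_t)
                [0.1, 0.3, 0.6]])

def hmm_log_joint(ys, xs):
    # log p(y, x) = log pi(y_1) + sum_t log A + sum_t log B
    s = log_pi[ys[0]] + log_B[ys[0], xs[0]]
    for t in range(1, len(xs)):
        s += log_A[ys[t - 1], ys[t]] + log_B[ys[t], xs[t]]
    return s

def crf_log_conditional(ys, xs):
    score = hmm_log_joint(ys, xs)             # unnormalized sequence score
    # Forward algorithm for log Z(x), summing over all label sequences.
    alpha = log_pi + log_B[:, xs[0]]
    for t in range(1, len(xs)):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, xs[t]]
    log_Z = np.logaddexp.reduce(alpha)
    return score - log_Z                      # log p(y | x)

xs = [0, 2, 1]
print(crf_log_conditional([0, 1, 1], xs))
```

With this particular choice of weights the CRF recovers exactly the HMM's posterior $p(\mathbf{y} \mid \mathbf{x})$; a CRF trained discriminatively would instead fit arbitrary feature weights to maximize the conditional likelihood.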

• A. Y. Ng, M. I. Jordan, “On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes,” Advances in Neural Information Processing Systems 14 (2001), 841-848. [PDF]
• C. Sutton, A. McCallum, “An Introduction to Conditional Random Fields,” arXiv:1011.4088 (2010). [arXiv]

Feature image from http://www.evolvingai.org/fooling.