Parts-based learning by Non-Negative Matrix Factorisation

Visualising the principal components of portrait facial images. ‘Eigenfaces’ are the decomposed images along the directions of largest variance.

Why we can’t relate to eigenfaces

Traditional methods like Principal Component Analysis (PCA) decompose a dataset into some form of latent representation, e.g. eigenvectors, which can be meaningless when visualised: what actually is my first principal component? Non-Negative Matrix Factorisation (NNMF), introduced by Lee and Seung in their 1999 Nature paper [1], showed that data (e.g. a set of facial portraits) could instead be deconstructed into parts, extracting features like the nose, the eyes, and a smile.

The difference between PCA and NNMF arises from a constraint in their derivation: NNMF does not allow the weights (or the basis features themselves) to be negative. PCA, on the other hand, approximates each image as a linear combination of all of its basis vectors, with weights of either sign.
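To make that concrete: both models write a data vector v (here, a vectorised face) as a weighted sum of r basis vectors; only the constraints on the weights and basis differ. Roughly,

v \approx \sum_{a=1}^{r} h_a \, w_a, \qquad \text{NNMF: } h_a \ge 0,\; w_a \ge 0 \text{ element-wise}, \qquad \text{PCA: } h_a,\, w_a \text{ free in sign, with the } w_a \text{ orthonormal.}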

Eigenfaces are meaningless directions of largest variance

A linear combination of distributed components can feel difficult to relate to when you try to visualise the data. NNMF is interesting because it extracts parts from the data that, when visualised, we can actually relate to.

In the seminal paper, Lee and Seung take a series of facial portraits and, by running the NNMF algorithm, decompose the images to extract the following features:

An image taken from “Learning the parts of objects by non-negative matrix factorization” (Lee & Seung, 1999) [1]

We can see (moving from the top left to the bottom right) that each feature looks like a different facial part: the eyes, the jaw structure, the nose, the T-zone and the eyebrows all come out. Compare this with the figure at the top of this article and you will see how differently the two models explain the structure of the same data.

Likewise, this method can also be applied to articles of text, where each article is decomposed into a mixture of topics:

An example of topic modelling using NNMF: link

On the right we have our data matrix M, which is decomposed (in the same manner as the image example) into the topics matrix A and the weights matrix W. (Note: reference [3] provides a list of stop words to remove before building M.)
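As a rough sketch of how this looks in code (this is not the example's actual pipeline: it assumes a pre-built term-by-document count matrix M, a matching cell array vocab of terms, and MATLAB's built-in nnmf from the Statistics and Machine Learning Toolbox):

% M: (terms x documents) word-count matrix, stop words already removed (see [3])
k = 10;                          % number of topics to extract (illustrative choice)
[A, W] = nnmf(M, k);             % A: terms x topics, W: topics x documents
[~, order] = sort(A(:, 1), 'descend');
disp(vocab(order(1:10)))         % ten highest-weighted words in topic 1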

To appreciate how such a simple method can give such interpretable results, let's first go over the mathematics:

Mathematics of Non-Negative Matrix Factorisation

The goal of NNMF is to decompose an image database (matrix V) into two smaller matrices W and H, with the added constraint that all entries of W and H are non-negative:
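Written out (my notation, following [1]):

V \approx W H, \qquad V \in \mathbb{R}^{N \times M}_{\ge 0}, \quad W \in \mathbb{R}^{N \times r}_{\ge 0}, \quad H \in \mathbb{R}^{r \times M}_{\ge 0},

or, entry-wise,

V_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}.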

V is the matrix of our image database. The r columns of W are called basis images; each column of H is called an encoding and is in one-to-one correspondence with a face in V. Check reference [1] for more information.

Let V be our matrix of data. In the case of images, you vectorise each image and concatenate the vectors side by side to get an (N x M) matrix, with N pixels per image and M images in total. The iterative procedure used to derive the approximation of V is as follows:

Source: Reference 1
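Transcribed into equations (these are the multiplicative update rules as given in [1]; any transcription slips are mine):

H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}, \qquad
W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad
W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}.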

We initialise W and H randomly and then iterate as follows:

  1. Normalise the H matrix
  2. Incorporate this normalised H matrix into the update of W
  3. Normalise the W matrix
  4. Calculate the improvement in the following objective function (if the improvement is still above a threshold, go back to step 1; otherwise stop):
Source: Reference 1
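For reference, the objective function maximised in [1] is

F = \sum_{i=1}^{N} \sum_{\mu=1}^{M} \left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right],

subject to the non-negativity constraints on W and H.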

Unlike PCA, NNMF has no closed-form solution, so this iterative method (similar in spirit to gradient descent) keeps going until the improvements become marginal. Note: the code is at the end of this post.

In reality, the exact objective function and update rules are not the important point here. The observant reader will notice that it is the non-negativity constraint that characterises the difference between PCA and NNMF, and it is this constraint that yields an approachable latent representation, i.e. parts-based learning. In NNMF, the latent features are combined purely additively (each given a non-negative weight) to recreate each sample, and many of these weights can be zero.

PCA, on the other hand, can be described in the same framework: it constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This allows a distributed representation but not an additive one, hence the difference in the latent representations the two models learn.
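If you want to see the contrast yourself, here is a minimal sketch (assuming a non-negative data matrix V with one vectorised image per column, MATLAB's built-in nnmf and pca from the Statistics and Machine Learning Toolbox, and placeholder image dimensions imgH and imgW):

r = 49;                              % number of parts, as in [1]
[A, S] = nnmf(V, r);                 % parts-based: columns of A are non-negative basis images
coeff = pca(V');                     % distributed: columns of coeff are signed eigenfaces
figure; imagesc(reshape(A(:, 1), imgH, imgW)); colormap gray;      % a "part"
figure; imagesc(reshape(coeff(:, 1), imgH, imgW)); colormap gray;  % an eigenface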


The model you use makes a huge difference to its interpretability, its usefulness and the inferences you draw from it. I'm a proponent of using various methods as benchmarks, because without measuring performance against some form of null hypothesis you can't really gauge how effective a model is.

Given that, each model also has its own marginal benefits, so it makes sense to try to learn something from each one. In the case above, simply adding a non-negativity constraint gives the user an entirely different result. I encourage the reader to take the code below away and see what other features they can come out with!


References:

  1. Lee, D. D. & Seung, H. S. (1999). “Learning the parts of objects by non-negative matrix factorization.” Nature, 401, 788–791.
  2. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
  3. List of common (“stop”) words: http://xpo6.com/list-of-englishstop-words/

Code for NNMF

The code below is in MATLAB (don’t ask…) but it should be easy enough to translate into any language.

% V is assumed to be the (n x m) non-negative data matrix:
% one vectorised image (or document) per column.
[n, m] = size(V);

R = 49;          % number of basis images (rank of the factorisation, can be adjusted)
stopconv = 40;   % stopping criterion (can be adjusted)
niter = 1000;    % maximum number of iterations (can be adjusted)

cons = zeros(m, m);
consold = cons;
inc = 0;

% counter for convergence checks
j = 0;

% initialise random W and H
W = rand(n, R);
H = rand(R, m);

for i = 1:niter
    % update equation for H as per the Nature paper, with normalisation
    x1 = repmat(sum(W, 1)', 1, m);
    H = H .* (W' * (V ./ (W * H))) ./ x1;

    % update equation for W as per the Nature paper, with normalisation
    x2 = repmat(sum(H, 2)', n, 1);
    W = W .* ((V ./ (W * H)) * H') ./ x2;

    % test convergence every 10 iterations
    if mod(i, 10) == 0
        j = j + 1;

        % adjust small values to avoid underflow
        H = max(H, eps);
        W = max(W, eps);

        % construct connectivity matrix
        [~, index] = max(H, [], 1);     % find the largest factor for each sample
        mat1 = repmat(index, m, 1);     % spread index down
        mat2 = repmat(index', 1, m);    % spread index right
        cons = mat1 == mat2;

        if sum(sum(cons ~= consold)) == 0   % connectivity matrix has not changed
            inc = inc + 1;                  % accumulate count
        else
            inc = 0;                        % else restart count
        end

        % print iteration number, stable count and number of changed elements
        fprintf('\t%d\t%d\t%d\n', i, inc, sum(sum(cons ~= consold)));

        if inc > stopconv
            break                           % assume convergence if connectivity stops changing
        end

        consold = cons;
    end
end
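To run this on your own data you only need to set up V beforehand. A hedged example (the variable faces and the 19x19 image size are placeholders, loosely matching the face patches used in [1]):

% faces: a (19*19) x M array with one vectorised face per column (load it however you like)
V = double(faces);
V = V / max(V(:));    % scale into [0, 1]; V must be non-negative
% ... run the script above, then visualise the R basis images:
figure;
for a = 1:49
    subplot(7, 7, a);
    imagesc(reshape(W(:, a), 19, 19));
    axis off;
end
colormap gray;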
