2

I have a n x 2 array from flow cytometry data representing forward scatter and side scatter for a cell (there are n cells). These values represent physical characteristics of the cells and I wish to filter the cells. When plotted as a scatter plot, the data shows a strong elliptical cloud and then there more dispersed cells. I wish to "gate" this data such that I keep the dominant cloud and filter out all the rest (in the image below I would like to retain the dots that are inside the gray elliptical boundary. enter image description here

What I would like is to get the a binary n x 1 array where the value at index i is 1 if this cell is within the cloud and 0 if not.

I actually don't know how to filter out the data outside the ellipse. But I tried doing K-means specifying 4 clusters. However the dominant cluster was detected as a single group (see figure below). enter image description here I need to be able to detect the dominant cluster programatically. I would be grateful if someone can help with this. The sample data is here FS_SS.txt (hosted at AnonFiles.com)

Lee Sande
  • 457
  • 5
  • 15
  • That dominant cluster from 0.5e5 to 1.5e5 definitely looks like a [multivariate normal probability function](http://www.mathworks.com/help/stats/mvnpdf.html). In fact it looks like it's bimodal, with a second distribution centered near (0.5e5,0.75e5). – Mark Mikofski Oct 30 '13 at 16:35
  • So, how can I go about finding this dominant cluster, drawing a boundary (say at mu + 2*sd) and filtering out the ones outside? – Lee Sande Oct 30 '13 at 16:59
  • 2
    Do you have an idea how many of your data points belong to that elliptical group? If its the vast majority, and the rest are outliers, you could approximately estimate the parameters of that twodimensional normal distribution simply from the whole dataset, and then select data points based on one of the isolines of that distribution. – A. Donda Oct 30 '13 at 17:07
  • 1
    From repeated fits to multivariate pdf ([see my answer below](http://stackoverflow.com/a/19689412/1020470)) it looks like about 8000 of the 10k points are in the ellipse. @A-Donda, we had the same idea! – Mark Mikofski Oct 30 '13 at 17:10
  • :) I guess this is actually expectation-maximization, but not sure. Great that it works... – A. Donda Oct 30 '13 at 17:17
  • @A.Donda Thank you for your response. I expect over 75% of the points to fall in this dominant cluster. – Lee Sande Oct 30 '13 at 18:36

1 Answers1

3

If you have the statistical toolbox, try something like this:

a = dlmread('~\downloads\-data-anonfiles-1383150325725.txt'); % read data
p = mvnpdf(a,mean(a),cov(a)); % multivariate PDF of your data
p_sample = numel(p)*p/sum(p); % normalize pdf to number of samples
thresh = 0.5; % set an arbitrary threshold to filter
idx_thresh = p_sample > thresh; % logical indices of samples that meet the threshold
a_filtered = a(idx_thresh,:);

then repeat this again with filtered data.

 p = mvnpdf(a,mean(a_filtered),cov(a_filtered));
 p_sample = numel(p)*p/sum(p); % normalize pdf to number of samples
 thresh = 0.1; % set an arbitrary threshold to filter
 idx_thresh = p_sample > thresh; % logical indices of samples that meet the threshold
 a_filtered = a_filtered (idx_thresh,:);

I was able to pull out most of the dominant distribution in just 2 iterations. But I think you would want to repeat until the mean(a_filtered) and cov(a_filtered) reached steady state values. Plot them as a function of iteration, and when they approach a flat line then you've found the correct values.

This is equivalent to filtering with an rotated ellipse, but IMO it is easier and more useful because now you actually have the 5 mvnpdf parameters (mu_x, mu_y, sigma_xx, sigma_yy, sigma_xy) required to reproduce the distribution. If you model the iso-line (p(x,y) = thresh) as an rotated ellipse, you would have to manipulate the minor and major axes (a,b), the translation coordinates (h,k) and the rotation (theta) to get the mvnpdf parameters.

Then after extracting the first distribution, you can repeat the process to find the secondary distribution.

Mark Mikofski
  • 19,398
  • 2
  • 57
  • 90
  • 1
    Nice solution! But actually, the isolines of a twodimensional normal distribution are ellipses. And both the distribution and the ellipse are described by five parameters (the covariance matrix is symmetric). – A. Donda Oct 30 '13 at 17:21
  • Filtering out cells with x-axis values less than 0.65e5 and reducing the threshold to 0.3, does a suitable job on the first try. Thanks very much @MarkMikofski for the solution – Lee Sande Oct 30 '13 at 19:00
  • @LeeSande Since you're satisfied, can you mark the questioned as answered? Thanks! – Mark Mikofski Oct 30 '13 at 22:24
  • @LeeSande Under the down arrow there is a silhouette of a check-mark, click it and it should turn green. – Mark Mikofski Oct 31 '13 at 17:23