How do I show a scatter plot in Python after doing PCA?

Question

I made up a random data set of my own: a text file with 18 rows and 5 columns, all integer entries.

I managed to run PCA, but now I am stuck: I cannot produce a scatter plot. Here is my code:

# Print the raw file contents
with open(r'<path>mydata.txt') as f:
    print(f.read())


# Parse each non-empty line into a list of ints
with open(r'<path>mydata.txt') as f:
    emp = []
    for line in f:
        line = line.split()
        if line:
            line = [int(i) for i in line]
            emp.append(line)


from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
X = emp
pca = PCA(n_components=3, whiten=True).fit(X)
X_pca = pca.transform(X) #regular PCA

Now, with PCA done and my variances known, how do I plot?

Here is a sample of my data set:

2    1    2    3    0
2    3    2    3    0
1    3    1    1    0
1    5    2    1    0
2    3    1    1    0
3    3    0    1    0
7    1    1    1    1
7    2    2    1    1
1    1    1    4    1
3    2    3    2    1
2    2    2    2    1
1    3    2    3    1
2    3    2    1    2
2    2    1    1    2
7    5    3    2    2
3    4    2    4    2
2    1    1    1    2
7    1    3    3    2

Add some context to the problem: what are you trying to display with the scatter plot, what are the columns in your sample dataset, and what code have you written to solve the problem? — Satyadev, May 22 '17 at 06:16
The last column in the sample data represents a type, I have divided the data into three types. The data is similar to Fisher's Iris dataset, with numbers fudged. I want the scatter plot to show me the different types, as a clustering. — The Doctor, May 22 '17 at 06:18
Does [this](http://stackoverflow.com/questions/10336614/scatter-plot-in-matplotlib) answer your question? — sknt, May 22 '17 at 06:19
@Skynet After PCA, since my data is now reduced to 3 dimensions, which arrays should I consider? Because, the data has been made to a list of lists if you can see my code above. Now, I want to do a scatter plot after PCA, so that the points are clustered. Data is similar to Fisher Iris data. — The Doctor, May 22 '17 at 06:23
So are you asking us, how you can visualize certain rows/columns (which ones would that be?) of your data in a scatterplot, or are you asking us, which rows/columns you should consider? In the first case, we can help you, in the second, you might be asking in the wrong place. There is a StackExchange site dedicated to statistics called [CrossValidated](https://stats.stackexchange.com/), for instance. — Thomas Kühn, May 22 '17 at 06:30
@ThomasKühn No, I am simply asking how to make a clustered scatter plot of the data points after I do PCA. Say my data's last column has only 3 values: 0, 1, 2. Then I should get a clustered scatter plot in 3 different colors. — The Doctor, May 22 '17 at 06:36
Ok, I still didn't get it, but I'll try my luck. See my answer and comment there, if you need more. — Thomas Kühn, May 22 '17 at 06:37

Thomas Kühn · Accepted Answer · 2017-05-22T11:00:05.307

Is this what you are asking for?

import numpy as np
from matplotlib import pyplot as plt


data1 = [np.random.normal(0,0.1, 10), np.random.normal(0,0.1,10)]
data2 = [np.random.normal(1,0.2, 10), np.random.normal(2,0.3,10)]
data3 = [np.random.normal(-2,0.1, 10), np.random.normal(1,0.5,10)]


plt.scatter(data1[0],data1[1])
plt.scatter(data2[0],data2[1])
plt.scatter(data3[0],data3[1])

plt.show()

The result for the three different data sets would look something like this:

EDIT:

Hopefully I now understand your question better. Here is the new code:

import numpy as np
from matplotlib import pyplot as plt    

with open(r'mydata.txt') as f:
    emp= []
    for line in f:
        line = line.split() 
        if line:            
            line = [int(i) for i in line]
            emp.append(line)


from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
X = emp
pca = PCA(n_components=3, whiten=True).fit(X)
X_pca = pca.transform(X) #regular PCA

jobs = ['A', 'B', 'C']
job_id = np.array([e[4] for e in emp])

fig, axes = plt.subplots(3,3, figsize=(5,5))

for row in range(axes.shape[0]):
    for col in range(axes.shape[1]):
        ax = axes[row,col]
        if row == col:
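            # Hide ticks and tick labels on the diagonal panels
            # (booleans; recent matplotlib no longer accepts 'off')
            ax.tick_params(
                axis='both', which='both',
                bottom=False, top=False,
                labelbottom=False,
                left=False, right=False,
                labelleft=False
            )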
            ax.text(0.5,0.5,jobs[row],horizontalalignment='center')
        else:
            ax.scatter(X_pca[:,row][job_id==0],X_pca[:,col][job_id==0],c='r')
            ax.scatter(X_pca[:,row][job_id==1],X_pca[:,col][job_id==1],c='g')
            ax.scatter(X_pca[:,row][job_id==2],X_pca[:,col][job_id==2],c='b')
fig.tight_layout()
plt.show()

I named the jobs 'A', 'B', and 'C' with the ids 0, 1, and 2, respectively. From the last column of emp, I create a numpy array that holds these ids. In the crucial plotting commands, I mask the data by the job ids, so each scatter call draws only the points of one job. Hope this helps.
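The masking step can also be written as a loop, which makes it easier to add a legend. This is only a minimal sketch: `scatter_by_label` is a hypothetical helper, and `points` and `labels` are small made-up stand-ins for `X_pca` and `job_id`, not the data from the question.

```python
import numpy as np
from matplotlib import pyplot as plt


def scatter_by_label(points, labels, names, ax=None):
    """Draw one scatter call per label value so every group
    gets its own colour and its own legend entry."""
    if ax is None:
        _, ax = plt.subplots()
    for value, name in enumerate(names):
        # Boolean array: True where the row belongs to this group
        mask = labels == value
        ax.scatter(points[mask, 0], points[mask, 1], label=name)
    ax.legend()
    return ax


# Made-up stand-ins for X_pca and job_id: 6 points, 2 per job type
points = np.array([[0.0, 0.1], [0.2, 0.0],
                   [1.0, 2.0], [1.1, 2.2],
                   [-2.0, 1.0], [-2.1, 0.9]])
labels = np.array([0, 0, 1, 1, 2, 2])

ax = scatter_by_label(points, labels, ['A', 'B', 'C'])
```

With the real data, the call would be something like `scatter_by_label(X_pca, job_id, jobs)` followed by `plt.show()`.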

The plot produced by the 3x3 grid code looks like this:

EDIT 2:

If you want only one plot in which you correlate, say, the first and the second column of X_pca with each other, the code becomes much simpler:

import numpy as np
from matplotlib import pyplot as plt

with open(r'mydata.txt') as f:
    emp= []
    for line in f:
        line = line.split() 
        if line:            
            line = [int(i) for i in line]
            emp.append(line)


from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
X = emp
pca = PCA(n_components=3, whiten=True).fit(X)
X_pca = pca.transform(X) #regular PCA

jobs = ['A', 'B', 'C']
job_id = np.array([e[4] for e in emp])

row = 0
col = 1

plt.scatter(X_pca[:,row][job_id==0],X_pca[:,col][job_id==0],c='r')
plt.scatter(X_pca[:,row][job_id==1],X_pca[:,col][job_id==1],c='g')
plt.scatter(X_pca[:,row][job_id==2],X_pca[:,col][job_id==2],c='b')

plt.show()

The result looks like this:

I strongly suggest that you read the documentation of the functions used in these examples.
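One thing to keep in mind: the code above passes all five columns of `emp` to PCA, including the job type. If the job type is meant only as a label for colouring the points, it may be better to leave it out of the decomposition. The sketch below assumes exactly that, and uses the 18 sample rows from the question; the names `data` and `features` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# The 18 sample rows from the question; the last column is the job type
data = np.array([
    [2, 1, 2, 3, 0], [2, 3, 2, 3, 0], [1, 3, 1, 1, 0],
    [1, 5, 2, 1, 0], [2, 3, 1, 1, 0], [3, 3, 0, 1, 0],
    [7, 1, 1, 1, 1], [7, 2, 2, 1, 1], [1, 1, 1, 4, 1],
    [3, 2, 3, 2, 1], [2, 2, 2, 2, 1], [1, 3, 2, 3, 1],
    [2, 3, 2, 1, 2], [2, 2, 1, 1, 2], [7, 5, 3, 2, 2],
    [3, 4, 2, 4, 2], [2, 1, 1, 1, 2], [7, 1, 3, 3, 2],
])

features = data[:, :4]   # the four measured columns
job_id = data[:, 4]      # the label column, kept out of the PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(features)

print(X_pca.shape)                      # (18, 2)
print(pca.explained_variance_ratio_)    # share of variance per component
```

`X_pca` and `job_id` from this sketch can be passed to the `plt.scatter` calls shown earlier without further changes.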

I am probably unable to phrase my question properly. Here is the [screenshot](http://imgur.com/a/NJAzU) of what I want to achieve. Just that my raw data has no attributes associated with it but only numbers. Can this be done? — The Doctor, May 22 '17 at 06:54
The columns in my data represent, say, company, skill, age, location and job type. After doing PCA, I want the scatter plot to cluster my data into 3 types, each associated with one type of job. Much like what Fisher's iris data does, clustering it into 3 groups based on flower species. Similarly, my scatter plot, should cluster into 3 groups based on job type 0,1 or 2. — The Doctor, May 22 '17 at 06:58
Ok, now I feel stupid (I don't know much about statistics). I ran your code with the example input, and the resulting `X_pca` is a 3x4 matrix. Do you want the scatterplot generated from the original data in `emp`, or from `X_pca` and do you want just one plot, or an array of plots like [this one](https://en.wikipedia.org/wiki/Iris_flower_data_set#/media/File:Iris_dataset_scatterplot.svg)? Am I assuming right, that the example data set you show is the content of `emp`? — Thomas Kühn, May 22 '17 at 07:16
I want the scatter plot from x_pca . A single plot will do, but if you can please help me with an array of plots, I would much appreciate. — The Doctor, May 22 '17 at 07:19
Yes, the example data I gave is an instance of the original data set. Emp is simply the list of lists of data. — The Doctor, May 22 '17 at 07:33
One more question: in your example data, the last column (supposedly `job type`, i.e. the information you want to group your clusters by) is always zero, so there would only be one cluster if I use that example data -- am I correct? — Thomas Kühn, May 22 '17 at 07:43
It can be 0,1 or 2, like a tuple can be (2, 1, 1, 1, 2) or ( 1, 3, 2, 3, 1). I just copy pasted a sample and it all happened to be zero. I have posted a few more data now in the question. Please see. — The Doctor, May 22 '17 at 07:45
Thank you. This works. Just one more, how do I obtain a single plot for all three jobs with clustered points?? — The Doctor, May 22 '17 at 09:27
Basically just take the three lines of the form `ax.scatter(X_pca[:,row][job_id==0],X_pca[:,col][job_id==0],c='r')` (see the code), and for `row` and `col` use the indices of the columns that you want to correlate. — Thomas Kühn, May 22 '17 at 09:45
Will that work for a single graph instead of an array of graphs? — The Doctor, May 22 '17 at 10:52

score 2 · Answer 2 · answered Oct 16 '17 at 08:54

Based on your comment that you want to get this (https://i.stack.imgur.com/VsicE.jpg), here is how to do it using the sklearn library:

In this example I am using the iris data:

PART 1: Plot only the scatter plot

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from numpy import linalg as LA
import pandas as pd
from scipy import stats

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
X = stats.zscore(X)

pca = PCA()
x_new = pca.fit_transform(X)

plt.scatter(x_new[:,0], x_new[:,1], c = y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

RESULT 1
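On the scaling step: `stats.zscore` standardizes each column to zero mean and unit variance. As far as I know, scikit-learn's `StandardScaler` does the same thing with the same (population) standard deviation, so either can be used before PCA. The following sketch checks this on made-up random data; it is an illustration, not part of the original answer.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Made-up data: three columns on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0, 10, 100], scale=[1, 5, 50], size=(50, 3))

X_z = stats.zscore(X)                     # column-wise, ddof=0 by default
X_s = StandardScaler().fit_transform(X)   # also uses the population std

print(np.allclose(X_z, X_s))
print(X_z.mean(axis=0).round(6), X_z.std(axis=0).round(6))
```

Without scaling, the column with the largest spread would dominate the first principal component.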

PART 2: in case you want to plot the famous biplot

#Create the biplot function
def biplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    # Rescale the scores so that each axis spans a range of 1
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    # Note: y is the global class-label array defined in PART 1
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        # One arrow per original variable, given by its loadings on PC1 and PC2
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()


#Call the function. Use only the 2 PCs.
biplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()

RESULT 2
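To see what the arrows in the biplot represent, it can help to print the array that is passed as `coeff`. The sketch below repeats the PCA from PART 1 and lists, for each iris feature, its loading on PC1 and PC2. Using `iris.feature_names` here is my own choice for illustration; the same list could presumably be passed as the `labels` argument of `biplot` instead of the default "Var1", "Var2", and so on.

```python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = stats.zscore(iris.data)
pca = PCA().fit(X)

# components_ has shape (n_components, n_features);
# transposing the first two rows gives one (PC1, PC2) pair per feature
coeff = pca.components_[0:2, :].T
print(coeff.shape)  # (4, 2)

for name, (pc1, pc2) in zip(iris.feature_names, coeff):
    print("{:20s} PC1={:+.2f}  PC2={:+.2f}".format(name, pc1, pc2))
```

Each row of `coeff` is the end point of one arrow, so a long arrow along the horizontal axis means that feature contributes strongly to PC1.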
