I am new to Python. I would like to perform hierarchical clustering on an N-by-P dataset that contains some missing values. I am planning to use the scipy.cluster.hierarchy.linkage function, which takes a distance matrix in condensed form. Does Python have a method to compute a distance matrix for data containing missing values? (In R, the dist function automatically takes care of missing values, but scipy.spatial.distance.pdist does not seem to handle them.)

AMR

FairyOnIce
You can take a look at the Imputer method of scikit-learn. It uses some kind of interpolation based on the neighbouring cells. – Moritz Jul 15 '15 at 06:26
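The imputation route suggested in the comment above can be sketched with NumPy alone. This is a minimal illustration, not the behaviour of any particular scikit-learn class: the helper name impute_column_means and the example array are hypothetical, and it assumes every column has at least one observed value.

```python
import numpy as np

def impute_column_means(dat):
    """Replace each NaN with the mean of the observed values in its column.

    Hypothetical helper for illustration; assumes no column is entirely NaN.
    """
    dat = np.asarray(dat, dtype=float)
    col_means = np.nanmean(dat, axis=0)           # per-feature mean, ignoring NaN
    return np.where(np.isnan(dat), col_means, dat)

# Made-up 3 x 3 data matrix with two missing entries
example = np.array([
    [1.0, 2.0, np.nan],
    [1.0, np.nan, 3.0],
    [4.0, 6.0, 3.0],
])
filled = impute_column_means(example)
```

Once the matrix has no NaN entries, scipy.spatial.distance.pdist can be applied to it directly. The trade-off is that imputed values are treated as if they had been observed, which can pull rows with many missing values toward the column means.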
1 Answer
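I could not find a built-in method to compute a distance matrix for data with missing values, so here is my naive solution. It is based on Euclidean distance: for each pair of rows it takes the mean squared difference over the features observed in both rows (note that no square root is applied).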
import numpy as np

def getMissDist(x, y):
    # Mean squared difference over the features observed in both x and y
    return np.nanmean((x - y) ** 2)

def getMissDistMat(dat):
    # Symmetric Npat-by-Npat matrix of pairwise distances between rows of dat
    Npat = dat.shape[0]
    dist = np.zeros((Npat, Npat))
    for ix in range(Npat):
        x = dat[ix, :]
        for iy in range(ix):
            y = dat[iy, :]
            dist[ix, iy] = getMissDist(x, y)
            dist[iy, ix] = dist[ix, iy]
    return dist
Assuming that dat is an N (number of cases) by P (number of features) data matrix with missing values, one can perform hierarchical clustering on dat as:
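import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier
distMat = getMissDistMat(dat)
# Convert the square matrix to the condensed form that linkage expects
condensDist = dist.squareform(distMat)
link = hier.linkage(condensDist, method='average')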
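Since the question mentions R's dist, here is a hedged sketch of a vectorized variant that follows the rule described in R's documentation: the sum of squared differences over the shared features is scaled up by P divided by the number of features used, and then the square root is taken. The function name nan_euclidean_matrix and the example data are hypothetical. The broadcasting step needs O(N² · P) memory, so it is only suitable for moderately sized data.

```python
import numpy as np

def nan_euclidean_matrix(dat):
    """Pairwise Euclidean distances that skip missing (NaN) features.

    For each pair of rows, squared differences are summed over the features
    observed in both rows, rescaled by P / (number of shared features), and
    square-rooted. Pairs with no shared features are left as NaN.
    """
    dat = np.asarray(dat, dtype=float)
    n_features = dat.shape[1]
    # diff[i, j, k] is NaN whenever feature k is missing in row i or row j
    diff = dat[:, None, :] - dat[None, :, :]
    shared = np.sum(~np.isnan(diff), axis=2)      # features usable per pair
    sq_sum = np.nansum(diff ** 2, axis=2)         # sum over shared features
    out = np.full(sq_sum.shape, np.nan)
    ok = shared > 0
    out[ok] = np.sqrt(sq_sum[ok] * n_features / shared[ok])
    return out

# Made-up 3 x 3 data matrix with two missing entries
example = np.array([
    [1.0, 2.0, np.nan],
    [1.0, np.nan, 3.0],
    [4.0, 6.0, 3.0],
])
dist_mat = nan_euclidean_matrix(example)
```

In this example the first two rows share only their first feature, where they agree, so their distance is 0 even though the rows are not identical. That is a general caveat of distances computed on shared features only. Any NaN entries (pairs with no shared features) would also need to be handled before the matrix is converted to condensed form and passed to linkage.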

FairyOnIce