sklearn聚集聚类链接矩阵

发布于 2021-01-29 15:56:05

我试图绘制一个完整链接scipy.cluster.hierarchy.dendrogram,但发现它scipy.cluster.hierarchy.linkage比慢sklearn.AgglomerativeClustering

但是,sklearn.AgglomerativeClustering不会返回聚类之间的距离和所需的原始观测值的数量scipy.cluster.hierarchy.dendrogram。有办法带走他们吗?

关注者
0
被浏览
50
1 个回答
  • 面试哥
    面试哥 2021-01-29
    为面试而生,有面试问题,就找面试哥。

    我做了一个步骤,无需修改sklearn和递归函数。使用前请注意:

    • 合并距离有时会相对于子级合并距离减小。我添加了三种处理这些情况的方法:以最大,什么都不做或以l2范数增加。l2规范逻辑尚未验证。请检查自己最适合您的。

    导入软件包:

    from sklearn.cluster import AgglomerativeClustering
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram
    

    计算重量和距离的功能:

    def get_distances(X,model,mode='l2'):
        distances = []
        weights = []
        children=model.children_
        dims = (X.shape[1],1)
        distCache = {}
        weightCache = {}
        for childs in children:
            c1 = X[childs[0]].reshape(dims)
            c2 = X[childs[1]].reshape(dims)
            c1Dist = 0
            c1W = 1
            c2Dist = 0
            c2W = 1
            if childs[0] in distCache.keys():
                c1Dist = distCache[childs[0]]
                c1W = weightCache[childs[0]]
            if childs[1] in distCache.keys():
                c2Dist = distCache[childs[1]]
                c2W = weightCache[childs[1]]
            d = np.linalg.norm(c1-c2)
            cc = ((c1W*c1)+(c2W*c2))/(c1W+c2W)
    
            X = np.vstack((X,cc.T))
    
            newChild_id = X.shape[0]-1
    
            # How to deal with a higher level cluster merge with lower distance:
            if mode=='l2':  # Increase the higher level cluster size suing an l2 norm
                added_dist = (c1Dist**2+c2Dist**2)**0.5 
                dNew = (d**2 + added_dist**2)**0.5
            elif mode == 'max':  # If the previrous clusters had higher distance, use that one
                dNew = max(d,c1Dist,c2Dist)
            elif mode == 'actual':  # Plot the actual distance.
                dNew = d
    
    
            wNew = (c1W + c2W)
            distCache[newChild_id] = dNew
            weightCache[newChild_id] = wNew
    
            distances.append(dNew)
            weights.append( wNew)
        return distances, weights
    

    使用2个子群集制作2个群集的样本数据:

    # Make 4 distributions, two of which form a bigger cluster
    X1_1 = np.random.randn(25,2)+[8,1.5]
    X1_2 = np.random.randn(25,2)+[8,-1.5]
    X2_1 = np.random.randn(25,2)-[8,3]
    X2_2 = np.random.randn(25,2)-[8,-3]
    
    # Merge the four distributions
    X = np.vstack([X1_1,X1_2,X2_1,X2_2])
    
    # Plot the clusters
    colors = ['r']*25 + ['b']*25 + ['g']*25 + ['y']*25
    plt.scatter(X[:,0],X[:,1],c=colors)
    

    样本数据:

    聚类样本数据

    拟合聚类模型

    model = AgglomerativeClustering(n_clusters=2,linkage="ward")
    model.fit(X)
    

    调用该函数以查找距离,并将其传递给树状图

    distance, weight = get_distances(X,model)
    linkage_matrix = np.column_stack([model.children_, distance, weight]).astype(float)
    plt.figure(figsize=(20,10))
    dendrogram(linkage_matrix)
    plt.show()
    

    Ouput树状图:
    在此处输入图片说明



知识点
面圈网VIP题库

面圈网VIP题库全新上线,海量真题题库资源。 90大类考试,超10万份考试真题开放下载啦

去下载看看