algorithms.py source file - Python code snippet

python

Project: AbTextSumm Author: StevenLOL

import numpy


def jaccard(v1, v2):
    '''
    Return the Jaccard distance (1 - Jaccard index) between two documents.

    Due to the idiosyncrasies of my code, the Jaccard index is computed a
    bit differently here. The theory is the same, but the implementation
    might look unusual. I do not have two vectors containing the words of
    both documents; instead I have two equally sized vectors. The columns
    of the vectors are the same and represent the words in the whole corpus.
    If an entry is 1, the word is present in the document; if it is 0, the
    word is not present. So we first find the indices of the words in each
    document, and then the Jaccard index is calculated from those indices.
    '''

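    # Indices of the words present in each document
    indices1 = set(numpy.nonzero(v1)[0].tolist())
    indices2 = set(numpy.nonzero(v2)[0].tolist())
    inter = len(indices1 & indices2)
    un = len(indices1 | indices2)
    if un == 0:
        # Both documents are empty; treating them as identical is an assumption
        return 0.0
    # Jaccard distance = 1 - |intersection| / |union|
    dist = 1 - inter / float(un)
    return dist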
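
The docstring's idea (binary presence vectors over a shared vocabulary, compared by the indices of their nonzero entries) can be illustrated with a small worked example. This is only a sketch: `jaccard_distance`, the four-word vocabulary, and the two documents are hypothetical and not part of AbTextSumm. It is written in plain Python rather than NumPy so that it runs on its own, and returning 0.0 for two empty documents is an assumption.

```python
def jaccard_distance(v1, v2):
    # Hypothetical stand-in for jaccard(): collect the positions of the
    # nonzero entries, i.e. the words present in each document.
    idx1 = {i for i, x in enumerate(v1) if x}
    idx2 = {i for i, x in enumerate(v2) if x}
    union = idx1 | idx2
    if not union:
        # Assumption: two empty documents count as identical.
        return 0.0
    return 1 - len(idx1 & idx2) / float(len(union))


# Made-up corpus vocabulary: ["cat", "dog", "fish", "bird"]
doc_a = [1, 1, 0, 0]  # contains "cat" and "dog"
doc_b = [0, 1, 1, 0]  # contains "dog" and "fish"

# One shared word ("dog") out of three distinct words, so distance = 1 - 1/3
d_ab = jaccard_distance(doc_a, doc_b)
# A document compared with itself has distance 0
d_aa = jaccard_distance(doc_a, doc_a)
```

Because the result is a distance rather than a similarity, identical documents score 0 and documents with no words in common score 1.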