import numpy


def jaccard(v1, v2):
    '''
    Due to the idiosyncrasies of my code, the Jaccard index is computed a bit
    differently here. The theory is the same, but the implementation may look
    unusual. I do not have two vectors containing the words of both documents;
    instead I have two equally sized vectors. The columns of the vectors
    are the same and represent the words in the whole corpus. If an entry
    is 1, the word is present in the document. If it is 0, it is not present.
    So first we find the indices of the words present in each document, and
    then the Jaccard distance (1 - |intersection| / |union|) is calculated
    from those indices.
    '''
    # Indices of the nonzero entries, i.e. the words present in each document.
    indices1 = set(numpy.nonzero(v1)[0].tolist())
    indices2 = set(numpy.nonzero(v2)[0].tolist())
    inter = len(indices1 & indices2)
    un = len(indices1 | indices2)
    # Guard against division by zero when both documents are empty.
    # Assumption: two empty documents are treated as identical (distance 0).
    if un == 0:
        return 0.0
    dist = 1 - inter / float(un)
    return dist
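
To make the docstring concrete, here is a minimal sketch of the same calculation on two small presence vectors. The name `jaccard_distance`, the sample vectors, and the choice to return 0.0 when both vectors are all zeros are assumptions made for illustration; they are not part of the original code. It uses boolean masks instead of index sets, which should give the same result for 1-D 0/1 vectors.

```python
import numpy


def jaccard_distance(v1, v2):
    # Hypothetical helper: treats any nonzero entry as "word present".
    a = numpy.asarray(v1).astype(bool)
    b = numpy.asarray(v2).astype(bool)
    union = numpy.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty documents are considered identical.
        return 0.0
    inter = numpy.logical_and(a, b).sum()
    return 1.0 - inter / float(union)


# Corpus of five words; each vector marks which words a document contains.
doc1 = numpy.array([1, 0, 1, 1, 0])  # words 0, 2, 3
doc2 = numpy.array([0, 0, 1, 1, 1])  # words 2, 3, 4

# Intersection = {2, 3} (size 2), union = {0, 2, 3, 4} (size 4).
print(jaccard_distance(doc1, doc2))  # 0.5
```

Identical documents give a distance of 0, and documents with no words in common give a distance of 1.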