How to get the most informative features for each class from a scikit-learn classifier?

Posted 2021-01-29 18:15:17

The NLTK package provides a method, show_most_informative_features(), that finds the most important features for both classes. Its output looks like this:

   contains(outstanding) = True              pos : neg    =     11.1 : 1.0
        contains(seagal) = True              neg : pos    =      7.7 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.8 : 1.0
         contains(damon) = True              pos : neg    =      5.9 : 1.0
        contains(wasted) = True              neg : pos    =      5.8 : 1.0
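
Each row of that output can be read as a likelihood ratio: how much more probable the feature value is under one label than under the other. The following is a minimal sketch of that reading. The probabilities and the `likelihood_ratio_row` helper are illustrative assumptions, not NLTK's own code.

```python
def likelihood_ratio_row(feature, p_by_label):
    """Format one row in the style of the output shown above.

    p_by_label maps each of two labels to P(feature is present | label).
    """
    # Put the label under which the feature is more probable first.
    (hi_label, hi_p), (lo_label, lo_p) = sorted(
        p_by_label.items(), key=lambda item: item[1], reverse=True
    )
    ratio = hi_p / lo_p
    return "%s = True   %s : %s = %.1f : 1.0" % (feature, hi_label, lo_label, ratio)


# Hypothetical probabilities, chosen only to illustrate the arithmetic.
row = likelihood_ratio_row("contains(example)", {"pos": 0.06, "neg": 0.02})
print(row)
```

The label printed first is the class the feature points toward, which is exactly the per-class association the question asks for.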

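As answered in the question "How to get most informative features for scikit-learn classifiers?", this can also be done in scikit-learn. However, for a binary classifier, the answer to that question only outputs the best features themselves.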

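So my question is: how can I identify the class associated with each feature, as in the example above (outstanding is most informative for the pos class, and seagal is most informative for the neg class)?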

EDIT: What I actually want is a list of the most informative words for each class. How can I do that? Thanks!

1 answer
  • 面试哥
    面试哥 2021-01-29

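    In the binary classification case, the coefficient array seems to have been flattened to a single row.

    Let's try relabeling the data with only two labels: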

    import codecs
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    
    trainfile = 'train.txt'
    
    # Vectorize the data: each line of train.txt is one document.
    word_vectorizer = CountVectorizer(analyzer='word')
    trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
    tags = ['bs', 'pt', 'bs', 'pt']
    
    # Train a multinomial naive Bayes classifier on the two labels.
    mnb = MultinomialNB()
    mnb.fit(trainset, tags)
    
    # Note: MultinomialNB.coef_ exists only in older scikit-learn releases
    # (deprecated in 0.24, removed in 1.1).
    print(mnb.classes_)
    print(mnb.coef_[0])
    print(mnb.coef_[1])  # raises IndexError in the binary case
    

    [out]:

    ['bs' 'pt']
    [-5.55682806 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -5.55682806
     -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088 -4.86368088
     -4.1705337  -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806
     -5.55682806 -5.55682806 -4.86368088 -4.45821577 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806
     -5.55682806 -5.55682806 -5.55682806 -4.45821577 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -4.86368088 -5.55682806 -5.55682806 -5.55682806 -5.55682806 -5.55682806
     -5.55682806 -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -4.86368088
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.45821577 -4.86368088
     -4.86368088 -4.45821577 -4.86368088 -4.86368088 -4.86368088 -5.55682806
     -4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -5.55682806
     -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -5.55682806
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -5.55682806
     -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -4.86368088 -4.1705337  -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -5.55682806
     -4.86368088 -4.45821577 -4.86368088 -4.86368088]
    Traceback (most recent call last):
      File "test.py", line 24, in <module>
        print mnb.coef_[1]
    IndexError: index 1 is out of bounds for axis 0 with size 1
    

    So let's do some diagnostics:

    print(mnb.feature_count_)
    print(mnb.coef_[0])
    

    [out]:

    [[ 1.  0.  0.  1.  1.  1.  0.  0.  1.  1.  0.  0.  0.  1.  0.  1.  0.  1.
       1.  1.  2.  2.  0.  0.  0.  1.  1.  0.  1.  0.  0.  0.  0.  0.  2.  1.
       1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  1.  0.  0.
       0.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  1.  1.  0.  1.  0.
       1.  2.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  1.  0.  1.  1.
       0.  1.  0.  0.  0.  1.  1.  1.  0.  0.  1.  0.  1.  0.  1.  0.  1.  1.
       1.  0.  0.  1.  0.  0.  0.  4.  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.
       0.  0.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  1.  1.  0.  0.  0.  0.
       0.  0.  1.  0.  0.  1.  0.  0.  0.  0.]
     [ 0.  1.  1.  0.  0.  0.  1.  1.  0.  0.  1.  1.  3.  0.  1.  0.  1.  0.
       0.  0.  1.  2.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.  1.  1.  0.  0.
       0.  0.  0.  2.  1.  1.  1.  1.  1.  0.  0.  1.  1.  1.  1.  0.  1.  1.
       1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  0.  0.  1.  0.  1.
       0.  0.  1.  1.  2.  1.  1.  2.  1.  1.  1.  0.  1.  0.  0.  1.  0.  0.
       1.  0.  1.  1.  1.  0.  0.  0.  1.  1.  0.  1.  0.  1.  0.  1.  0.  0.
       0.  1.  1.  0.  1.  1.  1.  3.  1.  1.  0.  1.  1.  1.  1.  1.  0.  1.
       1.  1.  0.  1.  1.  1.  1.  1.  1.  0.  1.  1.  0.  0.  1.  1.  1.  1.
       1.  1.  0.  1.  1.  0.  1.  2.  1.  1.]]
    [-5.55682806 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -5.55682806
     -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088 -4.86368088
     -4.1705337  -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806
     -5.55682806 -5.55682806 -4.86368088 -4.45821577 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806
     -5.55682806 -5.55682806 -5.55682806 -4.45821577 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -4.86368088 -5.55682806 -5.55682806 -5.55682806 -5.55682806 -5.55682806
     -5.55682806 -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -4.86368088
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.45821577 -4.86368088
     -4.86368088 -4.45821577 -4.86368088 -4.86368088 -4.86368088 -5.55682806
     -4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -5.55682806
     -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -5.55682806
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -5.55682806
     -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -4.86368088 -4.1705337  -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
     -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -4.86368088
     -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -5.55682806
     -4.86368088 -4.45821577 -4.86368088 -4.86368088]
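
The four distinct values in `coef_[0]` line up with Laplace-smoothed log probabilities computed from the second row of `feature_count_` (the `pt` row), which suggests that in this binary case `coef_[0]` describes only the second class. Below is a minimal check in plain Python. The constants are counted from the arrays printed above, `alpha = 1.0` is assumed to be the classifier's default smoothing, and `smoothed_log_prob` is a hypothetical helper name.

```python
import math

N_FEATURES = 154   # number of entries in each feature_count_ row printed above
PT_TOTAL = 105     # sum of the second ('pt') row of feature_count_
ALPHA = 1.0        # assumed smoothing parameter (MultinomialNB's default)


def smoothed_log_prob(count, class_total=PT_TOTAL, n_features=N_FEATURES, alpha=ALPHA):
    """Laplace-smoothed log P(feature | class) for a multinomial model."""
    return math.log((count + alpha) / (class_total + alpha * n_features))


# Counts of 0, 1, 2 and 3 in the 'pt' row give the four values seen in coef_[0].
for count in (0, 1, 2, 3):
    print(count, round(smoothed_log_prob(count), 8))
```

For instance, feature index 12 has a count of 3 in the `pt` row and is the first entry with the value -4.1705337 in `coef_[0]`.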
    

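    It seems the features were counted and the coefficients were then flattened after vectorization to save memory, so let's line each coefficient up with its feature name and per-class counts: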

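    coef_features_c1_c2 = []
    
    # Pair each coefficient with its feature name and its count in each class.
    # get_feature_names() is the older API; newer scikit-learn uses get_feature_names_out().
    for feat, coef, c1, c2 in zip(word_vectorizer.get_feature_names(), mnb.coef_[0],
                                  mnb.feature_count_[0], mnb.feature_count_[1]):
        coef_features_c1_c2.append((coef, feat, c1, c2))
    
    # Sort by coefficient, lowest first.
    for i in sorted(coef_features_c1_c2):
        print(i)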
    

    [out]:

    (-5.5568280616995374, u'acuerdo', 1.0, 0.0)
    (-5.5568280616995374, u'al', 1.0, 0.0)
    (-5.5568280616995374, u'alex', 1.0, 0.0)
    (-5.5568280616995374, u'algo', 1.0, 0.0)
    (-5.5568280616995374, u'andaba', 1.0, 0.0)
    (-5.5568280616995374, u'andrea', 1.0, 0.0)
    (-5.5568280616995374, u'bien', 1.0, 0.0)
    (-5.5568280616995374, u'buscando', 1.0, 0.0)
    (-5.5568280616995374, u'como', 1.0, 0.0)
    (-5.5568280616995374, u'con', 1.0, 0.0)
    (-5.5568280616995374, u'conseguido', 1.0, 0.0)
    (-5.5568280616995374, u'distancia', 1.0, 0.0)
    (-5.5568280616995374, u'doprinese', 1.0, 0.0)
    (-5.5568280616995374, u'es', 2.0, 0.0)
    (-5.5568280616995374, u'est\xe1', 1.0, 0.0)
    (-5.5568280616995374, u'eulex', 1.0, 0.0)
    (-5.5568280616995374, u'excusa', 1.0, 0.0)
    (-5.5568280616995374, u'fama', 1.0, 0.0)
    (-5.5568280616995374, u'guasch', 1.0, 0.0)
    (-5.5568280616995374, u'ha', 1.0, 0.0)
    (-5.5568280616995374, u'incident', 1.0, 0.0)
    (-5.5568280616995374, u'ispit', 1.0, 0.0)
    (-5.5568280616995374, u'istragu', 1.0, 0.0)
    (-5.5568280616995374, u'izbijanju', 1.0, 0.0)
    (-5.5568280616995374, u'ja\u010danju', 1.0, 0.0)
    (-5.5568280616995374, u'je', 1.0, 0.0)
    (-5.5568280616995374, u'jedan', 1.0, 0.0)
    (-5.5568280616995374, u'jo\u0161', 1.0, 0.0)
    (-5.5568280616995374, u'kapaciteta', 1.0, 0.0)
    (-5.5568280616995374, u'kosova', 1.0, 0.0)
    (-5.5568280616995374, u'la', 1.0, 0.0)
    (-5.5568280616995374, u'lequio', 1.0, 0.0)
    (-5.5568280616995374, u'llevar', 1.0, 0.0)
    (-5.5568280616995374, u'lo', 2.0, 0.0)
    (-5.5568280616995374, u'misije', 1.0, 0.0)
    (-5.5568280616995374, u'muy', 1.0, 0.0)
    (-5.5568280616995374, u'm\xe1s', 1.0, 0.0)
    (-5.5568280616995374, u'na', 1.0, 0.0)
    (-5.5568280616995374, u'nada', 1.0, 0.0)
    (-5.5568280616995374, u'nasilja', 1.0, 0.0)
    (-5.5568280616995374, u'no', 1.0, 0.0)
    (-5.5568280616995374, u'obaviti', 1.0, 0.0)
    (-5.5568280616995374, u'obe\u0107ao', 1.0, 0.0)
    (-5.5568280616995374, u'parecer', 1.0, 0.0)
    (-5.5568280616995374, u'pone', 1.0, 0.0)
    (-5.5568280616995374, u'por', 1.0, 0.0)
    (-5.5568280616995374, u'po\u0161to', 1.0, 0.0)
    (-5.5568280616995374, u'prava', 1.0, 0.0)
    (-5.5568280616995374, u'predstavlja', 1.0, 0.0)
    (-5.5568280616995374, u'pro\u0161losedmi\u010dnom', 1.0, 0.0)
    (-5.5568280616995374, u'relaci\xf3n', 1.0, 0.0)
    (-5.5568280616995374, u'sjeveru', 1.0, 0.0)
    (-5.5568280616995374, u'taj', 1.0, 0.0)
    (-5.5568280616995374, u'una', 1.0, 0.0)
    (-5.5568280616995374, u'visto', 1.0, 0.0)
    (-5.5568280616995374, u'vladavine', 1.0, 0.0)
    (-5.5568280616995374, u'ya', 1.0, 0.0)
    (-5.5568280616995374, u'\u0107e', 1.0, 0.0)
    (-4.863680881139592, u'aj', 0.0, 1.0)
    (-4.863680881139592, u'ajudou', 0.0, 1.0)
    (-4.863680881139592, u'alpsk\xfdmi', 0.0, 1.0)
    (-4.863680881139592, u'alpy', 0.0, 1.0)
    (-4.863680881139592, u'ao', 0.0, 1.0)
    (-4.863680881139592, u'apresenta', 0.0, 1.0)
    (-4.863680881139592, u'bl\xedzko', 0.0, 1.0)
    (-4.863680881139592, u'come\xe7o', 0.0, 1.0)
    (-4.863680881139592, u'da', 2.0, 1.0)
    (-4.863680881139592, u'decepcionantes', 0.0, 1.0)
    (-4.863680881139592, u'deti', 0.0, 1.0)
    (-4.863680881139592, u'dificuldades', 0.0, 1.0)
    (-4.863680881139592, u'dif\xedcil', 1.0, 1.0)
    (-4.863680881139592, u'do', 0.0, 1.0)
    (-4.863680881139592, u'druh', 0.0, 1.0)
    (-4.863680881139592, u'd\xe1', 0.0, 1.0)
    (-4.863680881139592, u'ela', 0.0, 1.0)
    (-4.863680881139592, u'encontrar', 0.0, 1.0)
    (-4.863680881139592, u'enfrentar', 0.0, 1.0)
    (-4.863680881139592, u'for\xe7as', 0.0, 1.0)
    (-4.863680881139592, u'furiosa', 0.0, 1.0)
    (-4.863680881139592, u'golf', 0.0, 1.0)
    (-4.863680881139592, u'golfistami', 0.0, 1.0)
    (-4.863680881139592, u'golfov\xfdch', 0.0, 1.0)
    (-4.863680881139592, u'hotelmi', 0.0, 1.0)
    (-4.863680881139592, u'hra\u0165', 0.0, 1.0)
    (-4.863680881139592, u'ide', 0.0, 1.0)
    (-4.863680881139592, u'ihr\xedsk', 0.0, 1.0)
    (-4.863680881139592, u'intranspon\xedveis', 0.0, 1.0)
    (-4.863680881139592, u'in\xedcio', 0.0, 1.0)
    (-4.863680881139592, u'in\xfd', 0.0, 1.0)
    (-4.863680881139592, u'kde', 0.0, 1.0)
    (-4.863680881139592, u'kombin\xe1cie', 0.0, 1.0)
    (-4.863680881139592, u'komplex', 0.0, 1.0)
    (-4.863680881139592, u'kon\u010diarmi', 0.0, 1.0)
    (-4.863680881139592, u'lado', 0.0, 1.0)
    (-4.863680881139592, u'lete', 0.0, 1.0)
    (-4.863680881139592, u'longo', 0.0, 1.0)
    (-4.863680881139592, u'ly\u017eova\u0165', 0.0, 1.0)
    (-4.863680881139592, u'man\u017eelky', 0.0, 1.0)
    (-4.863680881139592, u'mas', 0.0, 1.0)
    (-4.863680881139592, u'mesmo', 0.0, 1.0)
    (-4.863680881139592, u'meu', 0.0, 1.0)
    (-4.863680881139592, u'minha', 0.0, 1.0)
    (-4.863680881139592, u'mo\u017enos\u0165ami', 0.0, 1.0)
    (-4.863680881139592, u'm\xe3e', 0.0, 1.0)
    (-4.863680881139592, u'nad\u0161en\xfdmi', 0.0, 1.0)
    (-4.863680881139592, u'negativas', 0.0, 1.0)
    (-4.863680881139592, u'nie', 0.0, 1.0)
    (-4.863680881139592, u'nieko\u013ek\xfdch', 0.0, 1.0)
    (-4.863680881139592, u'para', 0.0, 1.0)
    (-4.863680881139592, u'parecem', 0.0, 1.0)
    (-4.863680881139592, u'pod', 0.0, 1.0)
    (-4.863680881139592, u'pon\xfakaj\xfa', 0.0, 1.0)
    (-4.863680881139592, u'potrebuj\xfa', 0.0, 1.0)
    (-4.863680881139592, u'pri', 0.0, 1.0)
    (-4.863680881139592, u'prova\xe7\xf5es', 0.0, 1.0)
    (-4.863680881139592, u'punham', 0.0, 1.0)
    (-4.863680881139592, u'qual', 0.0, 1.0)
    (-4.863680881139592, u'qualquer', 0.0, 1.0)
    (-4.863680881139592, u'quem', 0.0, 1.0)
    (-4.863680881139592, u'rak\xfaske', 0.0, 1.0)
    (-4.863680881139592, u'rezortov', 0.0, 1.0)
    (-4.863680881139592, u'sa', 0.0, 1.0)
    (-4.863680881139592, u'sebe', 0.0, 1.0)
    (-4.863680881139592, u'sempre', 0.0, 1.0)
    (-4.863680881139592, u'situa\xe7\xf5es', 0.0, 1.0)
    (-4.863680881139592, u'spojen\xfdch', 0.0, 1.0)
    (-4.863680881139592, u'suplantar', 0.0, 1.0)
    (-4.863680881139592, u's\xfa', 0.0, 1.0)
    (-4.863680881139592, u'tak', 0.0, 1.0)
    (-4.863680881139592, u'talianske', 0.0, 1.0)
    (-4.863680881139592, u'teve', 0.0, 1.0)
    (-4.863680881139592, u'tive', 0.0, 1.0)
    (-4.863680881139592, u'todas', 0.0, 1.0)
    (-4.863680881139592, u'tr\xe1venia', 0.0, 1.0)
    (-4.863680881139592, u've\u013ek\xfd', 0.0, 1.0)
    (-4.863680881139592, u'vida', 0.0, 1.0)
    (-4.863680881139592, u'vo', 0.0, 1.0)
    (-4.863680881139592, u'vo\u013en\xe9ho', 0.0, 1.0)
    (-4.863680881139592, u'vysok\xfdmi', 0.0, 1.0)
    (-4.863680881139592, u'vy\u017eitia', 0.0, 1.0)
    (-4.863680881139592, u'v\xe4\u010d\u0161ine', 0.0, 1.0)
    (-4.863680881139592, u'v\u017edy', 0.0, 1.0)
    (-4.863680881139592, u'zauj\xedmav\xe9', 0.0, 1.0)
    (-4.863680881139592, u'zime', 0.0, 1.0)
    (-4.863680881139592, u'\u010dasu', 0.0, 1.0)
    (-4.863680881139592, u'\u010fal\u0161\xedmi', 0.0, 1.0)
    (-4.863680881139592, u'\u0161vaj\u010diarske', 0.0, 1.0)
    (-4.4582157730314274, u'de', 2.0, 2.0)
    (-4.4582157730314274, u'foi', 0.0, 2.0)
    (-4.4582157730314274, u'mais', 0.0, 2.0)
    (-4.4582157730314274, u'me', 0.0, 2.0)
    (-4.4582157730314274, u'\u010di', 0.0, 2.0)
    (-4.1705337005796466, u'as', 0.0, 3.0)
    (-4.1705337005796466, u'que', 4.0, 3.0)
    

    Now we can see a pattern: higher coefficients seem to favor one class, while the other tail favors the other class. So you can simply do this:

    import codecs
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    
    trainfile = 'train.txt'
    
    # Vectorize the data: each line of train.txt is one document.
    word_vectorizer = CountVectorizer(analyzer='word')
    trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
    tags = ['bs', 'pt', 'bs', 'pt']
    
    # Train a multinomial naive Bayes classifier on the two labels.
    mnb = MultinomialNB()
    mnb.fit(trainset, tags)
    
    def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
        class_labels = classifier.classes_
        # Older API; newer scikit-learn uses get_feature_names_out().
        feature_names = vectorizer.get_feature_names()
        # Sort once: lowest coefficients first, highest last.
        ranked = sorted(zip(classifier.coef_[0], feature_names))
        topn_class1 = ranked[:n]
        topn_class2 = ranked[-n:]
    
        # Lowest coefficients are reported for the first class.
        for coef, feat in topn_class1:
            print(class_labels[0], coef, feat)
    
        print()
    
        # Highest coefficients are reported for the second class.
        for coef, feat in reversed(topn_class2):
            print(class_labels[1], coef, feat)
    
    
    most_informative_feature_for_binary_classification(word_vectorizer, mnb)
    

    [out]:

    bs -5.5568280617 acuerdo
    bs -5.5568280617 al
    bs -5.5568280617 alex
    bs -5.5568280617 algo
    bs -5.5568280617 andaba
    bs -5.5568280617 andrea
    bs -5.5568280617 bien
    bs -5.5568280617 buscando
    bs -5.5568280617 como
    bs -5.5568280617 con
    
    pt -4.17053370058 que
    pt -4.17053370058 as
    pt -4.45821577303 či
    pt -4.45821577303 me
    pt -4.45821577303 mais
    pt -4.45821577303 foi
    pt -4.45821577303 de
    pt -4.86368088114 švajčiarske
    pt -4.86368088114 ďalšími
    pt -4.86368088114 času
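
One caveat worth checking on your own data: in the counts listed earlier, `que` occurs 4 times in `bs` and 3 times in `pt`, and `de` occurs twice in each, yet both appear among the top `pt` features, because the ranking uses one class's coefficients only. A possible alternative is to rank words by the difference between the two classes' smoothed log probabilities. The sketch below does this in plain Python on a handful of rows copied from that listing. Totals are computed over this subset only, so the scores are illustrative, and `log_ratio_ranking` is a hypothetical helper, not part of scikit-learn.

```python
import math

# (word, count in 'bs', count in 'pt'), copied from the sorted listing above.
# Only a small subset is used, so totals and scores are illustrative.
COUNTS = [
    ("que", 4, 3),
    ("es", 2, 0),
    ("lo", 2, 0),
    ("de", 2, 2),
    ("as", 0, 3),
    ("foi", 0, 2),
    ("mais", 0, 2),
    ("acuerdo", 1, 0),
]


def log_ratio_ranking(counts, alpha=1.0):
    """Return sorted (score, word) pairs, score = log P(word|pt) - log P(word|bs)."""
    n_features = len(counts)
    total_bs = sum(c_bs for _, c_bs, _ in counts)
    total_pt = sum(c_pt for _, _, c_pt in counts)
    scored = []
    for word, c_bs, c_pt in counts:
        lp_bs = math.log((c_bs + alpha) / (total_bs + alpha * n_features))
        lp_pt = math.log((c_pt + alpha) / (total_pt + alpha * n_features))
        scored.append((lp_pt - lp_bs, word))
    return sorted(scored)


ranking = log_ratio_ranking(COUNTS)
top_bs = [word for _, word in ranking[:3]]          # most negative scores lean 'bs'
top_pt = [word for _, word in ranking[-3:]][::-1]   # most positive scores lean 'pt'
print("bs:", top_bs)
print("pt:", top_pt)
```

With a score of this kind, a word that is frequent in both classes lands near zero instead of at the top of either list.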
    

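    Actually, if you read @larsmans's comment carefully, he gave a hint about the coefficients for binary classes in "How to get most informative features for scikit-learn classifiers?".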


