Matching keywords in a pandas column against other lists of elements
I have a pandas DataFrame like this:
word_list
['nuclear','election','usa','baseball']
['football','united','thriller']
['marvels','hollywood','spiderman']
....................
....................
....................
I also have several lists, each named after a category, for example:
movies=['spiderman','marvels','thriller']
sports=['baseball','hockey','football']
politics=['election','china','usa']
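and many other categories.
All I want is to match the keywords in the pandas column word_list against my category lists. For each keyword that matches, the name of the corresponding list should go into a separate column; any keyword that is not in any list should simply be labelled miscellaneous.
So the output I am looking for is: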
word_list matched_list_names
['nuclear','election','usa','baseball'] politics,sports,miscellaneous
['football','united','thriller'] sports,movies,miscellaneous
['marvels','spiderman','hockey'] movies,sports
.................... .....................
.................... .....................
.................... ....................
I managed to get the matching keywords with:
for words in df['word_list']:
    for word in words:
        if word in movies:
            print(word)
But this only gives me the matching keywords. How can I get the list names and add them to a pandas column?
-
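You can first flatten the dictionary of lists into a keyword-to-category mapping, then look up each word with .get, using miscellaneous as the default for non-matched values. Convert the result to a set to get the unique categories, and finally convert it to a string with join. Because a set is unordered, the order of the names within each string can vary:
movies=['spiderman','marvels','thriller']
sports=['baseball','hockey','football']
politics=['election','china','usa']
d = {'movies':movies, 'sports':sports, 'politics':politics}
# invert the dictionary: keyword -> name of the list it belongs to
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
f = lambda x: ','.join(set([d1.get(y, 'miscellaneous') for y in x]))
df['matched_list_names'] = df['word_list'].apply(f)
print (df)
                                 word_list             matched_list_names
0       [nuclear, election, usa, baseball]  politics,miscellaneous,sports
1             [football, united, thriller]    miscellaneous,sports,movies
2  [marvels, hollywood, spiderman, budget]           miscellaneous,movies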
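To make the flattening step concrete, here is a minimal sketch in plain Python (no pandas needed) of what the inverted dictionary contains and how .get falls back to the default. The helper name categorize is hypothetical and only used for illustration.

```python
movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']

d = {'movies': movies, 'sports': sports, 'politics': politics}

# Invert the mapping: every keyword points to the name of its category list.
# If a keyword appeared in two lists, the list that comes later in d would win.
d1 = {word: name for name, words in d.items() for word in words}


def categorize(words):
    """Hypothetical helper: look up each word, defaulting to 'miscellaneous'."""
    return [d1.get(w, 'miscellaneous') for w in words]


print(d1['usa'])  # politics
print(categorize(['football', 'united', 'thriller']))
# ['sports', 'miscellaneous', 'movies']
```

The lookup keeps one entry per word, so duplicates such as two politics keywords in the same row are only collapsed later, when the result is deduplicated.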
A similar solution to the apply call above, using a list comprehension:
df['matched_list_names'] = [','.join(set([d1.get(y, 'miscellaneous') for y in x])) for x in df['word_list']]
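If the order of the category names matters, one possible variation (an assumption on my part, not something the question asks for) is to deduplicate with dict.fromkeys instead of set, which keeps the names in first-seen order on Python 3.7+. The sketch below rebuilds the sample frame from the question so it runs on its own; match_names is a hypothetical helper name.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({'word_list': [
    ['nuclear', 'election', 'usa', 'baseball'],
    ['football', 'united', 'thriller'],
    ['marvels', 'hollywood', 'spiderman'],
]})

d = {
    'movies': ['spiderman', 'marvels', 'thriller'],
    'sports': ['baseball', 'hockey', 'football'],
    'politics': ['election', 'china', 'usa'],
}
# keyword -> category name
d1 = {word: name for name, words in d.items() for word in words}


def match_names(words):
    """Hypothetical helper: category names in first-seen order, no duplicates."""
    names = [d1.get(w, 'miscellaneous') for w in words]
    # dict.fromkeys drops duplicates while preserving insertion order
    return ','.join(dict.fromkeys(names))


df['matched_list_names'] = df['word_list'].apply(match_names)
print(df['matched_list_names'].tolist())
# ['miscellaneous,politics,sports', 'sports,miscellaneous,movies', 'movies,miscellaneous']
```

Unlike the set-based version, this gives the same string on every run, which can make the column easier to compare or test.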