How to match all key-value pairs in Python when the code takes too long to run
Posted on 2021-01-29 15:02:45
User-item affinity and recommendations:
I am building a table for a "customers who bought this item also bought" algorithm.
Input dataset
productId userId
Prod1 a
Prod1 b
Prod1 c
Prod1 d
prod2 b
prod2 c
prod2 a
prod2 b
prod3 c
prod3 a
prod3 d
prod3 c
prod4 a
prod4 b
prod4 d
prod4 a
prod5 d
prod5 a
Required output
Product1 Product2 score
Prod1 prod3
Prod1 prod4
Prod1 prod5
prod2 Prod1
prod2 prod3
prod2 prod4
prod2 prod5
prod3 Prod1
prod3 prod2
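For reference, the score column appears to be the number of users two products have in common, divided by the total number of distinct users. A quick hand check of one pair from the sample data, assuming that definition:

```python
# Hand check of a single affinity score from the sample data above.
prod1_users = {"a", "b", "c", "d"}   # users who bought Prod1
prod3_users = {"c", "a", "d"}        # users who bought prod3
total_users = 4                      # distinct users in the dataset

score = len(prod1_users & prod3_users) / total_users
print(score)  # 3 common users out of 4 -> 0.75
```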
Code used:
#Get list of unique items
itemList = list(set(main["productId"].tolist()))
#Get count of distinct users (note: userId, not productId)
userCount = len(set(main["userId"].tolist()))
#Create an empty data frame to store item affinity scores for items.
itemAffinity = pd.DataFrame(columns=('item1', 'item2', 'score'))
rowCount = 0
#For each item in the list, compare with other items.
for ind1 in range(len(itemList)):
    #Get list of users who bought item 1.
    item1Users = main[main.productId == itemList[ind1]]["userId"].tolist()
    #Get item 2 - items that are not item 1 and have not been analyzed already.
    for ind2 in range(ind1, len(itemList)):
        if ind1 == ind2:
            continue
        #Get list of users who bought item 2.
        item2Users = main[main.productId == itemList[ind2]]["userId"].tolist()
        #Find score: the number of common users divided by the total users.
        commonUsers = len(set(item1Users).intersection(set(item2Users)))
        score = commonUsers / userCount
        #Add a score for (item 1, item 2).
        itemAffinity.loc[rowCount] = [itemList[ind1], itemList[ind2], score]
        rowCount += 1
        #Add a score for (item 2, item 1); the same score applies regardless of order.
        itemAffinity.loc[rowCount] = [itemList[ind2], itemList[ind1], score]
        rowCount += 1
#Check final result
itemAffinity
This code works fine on the sample dataset, but it takes far too long on a dataset with 100,000 rows. Please help me optimize it.
1 Answer
The key here is to create a Cartesian product of productId. See the code below.
Method 1 (suitable for smaller datasets)
result = (main.drop_duplicates(['productId', 'userId'])
              .assign(cartesian_key=1)
              .pipe(lambda x: x.merge(x, on='cartesian_key'))
              .drop('cartesian_key', axis=1)
              .loc[lambda x: (x.productId_x != x.productId_y) & (x.userId_x == x.userId_y)]
              .groupby(['productId_x', 'productId_y']).size()
              .div(main['userId'].nunique()))

result

Prod1  prod2    0.75
Prod1  prod3    0.75
Prod1  prod4    0.75
Prod1  prod5    0.5
prod2  Prod1    0.75
prod2  prod3    0.5
prod2  prod4    0.5
prod2  prod5    0.25
prod3  Prod1    0.75
prod3  prod2    0.5
prod3  prod4    0.5
prod3  prod5    0.5
prod4  Prod1    0.75
prod4  prod2    0.5
prod4  prod3    0.5
prod4  prod5    0.5
prod5  Prod1    0.5
prod5  prod2    0.25
prod5  prod3    0.5
prod5  prod4    0.5
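The self-merge on a constant key is what produces the Cartesian product. The trick in isolation, as a minimal sketch with a made-up three-row frame:

```python
import pandas as pd

# Made-up frame just to illustrate the constant-key self-merge.
df = pd.DataFrame({"productId": ["A", "B", "C"]})

pairs = (df.assign(cartesian_key=1)                       # same constant on every row
           .pipe(lambda x: x.merge(x, on="cartesian_key"))  # so every row matches every row
           .drop("cartesian_key", axis=1))

print(len(pairs))  # 3 x 3 = 9 ordered pairs, including self-pairs
```

The overlapping productId column is suffixed to productId_x and productId_y by the merge, which is why the filters above refer to those names.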
Method 2
result = (main.groupby(['productId', 'userId']).size()
              .clip(upper=1)
              .unstack()
              .assign(key=1)
              .reset_index()
              .pipe(lambda x: x.merge(x, on='key'))
              .drop('key', axis=1)
              .loc[lambda x: (x.productId_x != x.productId_y)]
              .set_index(['productId_x', 'productId_y'])
              .pipe(lambda x: x.set_axis(x.columns.str.split('_', expand=True), axis=1, inplace=False))
              .swaplevel(axis=1)
              .pipe(lambda x: (x['x'] + x['y']))
              .fillna(0)
              .div(2)
              .mean(axis=1))