Python

Python-加快将分类变量转换为其数字索引的速度

发布于 2021-01-29 14:10:56

我需要将Pandas数据框中的一列分类变量转换为与索引相对应的数值，并将其转换为该列中唯一分类变量的数组（长话短说！），这是一个实现以下目的的代码段：

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

转换数据帧：

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

放入数据框：

如预期的。但是我的问题是，当我尝试在大数据文件上运行类似代码时，我的傻傻的for循环（我想到的唯一方式）会像糖蜜一样缓慢。我只是想知道是否有人想到是否有任何方法可以更有效地做到这一点。预先感谢您的任何想法。

关注者

被浏览

102

1 个回答

面试哥 2021-01-29

为面试而生，有面试问题，就找面试哥。

用途factorize：

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1

文件

编辑：

如Jeff评论中所述，那么最好是将column转换为，categorical主要是因为更少的内存使用量：

df['col'] = df['col'].astype("category")

时间：

有趣的是，在大dfpandas中，速度更快numpy。我不敢相信。

len(df)=500k：

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop

len(df)=5k：

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5：

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

测试代码 ：

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)


df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))

知识点

Python

面圈网VIP题库全新上线，海量真题题库资源。 90大类考试，超10万份考试真题开放下载啦

去下载看看