Removing low-frequency values from a pandas.DataFrame

Posted on 2021-01-29 17:04:02

How can I remove values that occur rarely (i.e. with low frequency) from a column of a pandas.DataFrame? Example:

In [4]: df[col_1].value_counts()

Out[4]: 0       189096
        1       110500
        2        77218
        3        61372
              ...
        2065         1
        2067         1
        1569         1
        dtype: int64

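So my question is: how do I remove values like 2065, 2067, 1569 and the others? And how can I do this for every column whose .value_counts() looks like this?

Update: By "low" I mean values like 2065. That value occurs in col_1 1 (one) time, and I want to remove values like that.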

1 Answer

面试哥 2021-01-29

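I see two ways you could do this.

For the entire DataFrame

This method handles values that occur rarely across the entire DataFrame, replacing them with NaN. It relies on built-in functions, so no explicit loop is needed and it runs faster.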

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
         columns = ['A', 'B'])

threshold = 10 # Values that occur this many times or fewer will be removed.
value_counts = df.stack().value_counts() # Counts over the entire DataFrame
to_remove = value_counts[value_counts <= threshold].index # The rare values
df.replace(to_remove, np.nan, inplace=True) # Rare values become NaN
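
Because the random data above changes on every run, a small fixed example can make the effect easier to see. The following is only a sketch: the helper name mask_rare_values and the sample data are made up for illustration, and it swaps replace() for isin() plus mask(), which should produce the same result while leaving the input DataFrame untouched.

```python
import numpy as np
import pandas as pd


def mask_rare_values(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Return a copy of df where values occurring `threshold` times or fewer
    (counted over the whole frame) are replaced by NaN.

    Hypothetical helper, not part of pandas.
    """
    # Flatten all cells into one Series and count each distinct value.
    counts = pd.Series(df.to_numpy().ravel()).value_counts()
    rare = counts[counts <= threshold].index
    # isin() marks the rare cells; mask() turns the marked cells into NaN.
    return df.mask(df.isin(rare))


# Illustrative data: 0 and 1 are common, 2065 and 2067 each occur once.
example = pd.DataFrame({"A": [0, 0, 0, 1, 2065], "B": [0, 1, 1, 1, 2067]})
masked = mask_rare_values(example, threshold=1)
print(masked)
```

With threshold=1 this mirrors the case in the question: only values that occur a single time are affected.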

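Column by column

This method handles entries that occur infrequently within each individual column, replacing them with NaN.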

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
         columns = ['A', 'B'])

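threshold = 10 # Values that occur this many times or fewer will be removed.
for col in df.columns:
    value_counts = df[col].value_counts() # Counts within this column only
    to_remove = value_counts[value_counts <= threshold].index
    # Assign the result back: calling replace(..., inplace=True) on df[col]
    # may act on a temporary copy and leave df unchanged.
    df[col] = df[col].replace(to_remove, np.nan)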
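
The per-column idea can also be written without an explicit loop, and if "remove" is meant to drop rows rather than blank out cells, dropna() can follow. This is a sketch under assumptions: the helper name mask_rare_per_column and the sample data are hypothetical, and whether to drop rows depends on what you want to keep.

```python
import numpy as np
import pandas as pd


def mask_rare_per_column(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Return a copy of df where, within each column, values occurring
    `threshold` times or fewer are replaced by NaN.

    Hypothetical helper, not part of pandas.
    """
    def mask_column(col: pd.Series) -> pd.Series:
        # Map every cell to the number of times its value occurs in this column.
        cell_counts = col.map(col.value_counts())
        # Keep cells whose value is frequent enough; the rest become NaN.
        return col.where(cell_counts > threshold)

    return df.apply(mask_column)


# Illustrative data: 2065 occurs once in A; every value in B occurs at least twice.
example = pd.DataFrame({"A": [0, 0, 1, 1, 2065], "B": [5, 5, 5, 1569, 1569]})
result = mask_rare_per_column(example, threshold=1)

# Optional: drop every row that contains at least one masked value.
rows_kept = result.dropna()
print(result)
print(rows_kept)
```

Note that dropna() discards the whole row, including the values in other columns that were not rare, so it fits best when rows with any rare value should be excluded entirely.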
