如何删除在特定列中的值为NaN的Pandas DataFrame行
我有这个DataFrame,只想要EPS列不是的记录NaN:
>>> df
STK_ID EPS cash
STK_ID RPT_Date
601166 20111231 601166 NaN NaN
600036 20111231 600036 NaN 12
600016 20111231 600016 4.3 NaN
601009 20111231 601009 NaN NaN
601939 20111231 601939 2.5 NaN
000001 20111231 000001 NaN NaN
…例如df.drop(....)要得到这个结果的数据框:
STK_ID EPS cash
STK_ID RPT_Date
600016 20111231 600016 4.3 NaN
601939 20111231 601939 2.5 NaN
我怎么做?
-
不要
drop
。就拿行,其中EPS
是有限的:import numpy as np df = df[np.isfinite(df['EPS'])]
-
这个问题已经解决,但是…
…还要考虑伍特(Wouter)在其原始评论中提出的解决方案。
dropna()
大熊猫内置了处理丢失数据(包括)的功能。除了通过手动执行可能会提高的性能之外,这些功能还带有多种可能有用的选项。In [24]: df = pd.DataFrame(np.random.randn(10,3)) In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan; In [26]: df Out[26]: 0 1 2 0 NaN NaN NaN 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 4 NaN NaN 0.050742 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 8 NaN NaN 0.637482 9 -0.310130 0.078891 NaN
In [27]: df.dropna() #drop all rows that have any NaN values Out[27]: 0 1 2 1 2.677677 -1.466923 -0.750366 5 -1.250970 0.030561 -2.678622 7 0.049896 -0.308003 0.823295
In [28]: df.dropna(how='all') #drop only if ALL columns are NaN Out[28]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 4 NaN NaN 0.050742 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 8 NaN NaN 0.637482 9 -0.310130 0.078891 NaN
In [29]: df.dropna(thresh=2) #Drop row if it does not have at least two values that are **not** NaN Out[29]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 5 -1.250970 0.030561 -2.678622 7 0.049896 -0.308003 0.823295 9 -0.310130 0.078891 NaN
In [30]: df.dropna(subset=[1]) #Drop only if NaN in specific column (as asked in the question) Out[30]: 0 1 2 1 2.677677 -1.466923 -0.750366 2 NaN 0.798002 -0.906038 3 0.672201 0.964789 NaN 5 -1.250970 0.030561 -2.678622 6 NaN 1.036043 NaN 7 0.049896 -0.308003 0.823295 9 -0.310130 0.078891 NaN
还有其他选项(请参见http://pandas.pydata.org/pandas-docs/stable/generation/pandas.DataFrame.dropna.html上的文档),包括删除列而不是行。