Python

How to select the last row, and how to access a PySpark DataFrame by index?

Posted on 2021-01-29 17:41:25

From a PySpark SQL DataFrame like this:

name age city
abc   20  A
def   30  B

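How do I get the last row? (With df.limit(1) I can get the first row of the DataFrame into a new DataFrame; I am looking for the same thing for the last row.)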

And how can I access DataFrame rows by index, for example row 12 or row 200?

In pandas I can do:

df.tail(1) # for last row
df.ix[rowno or index] # by index
df.loc[] or by df.iloc[]

I am just curious how to access a PySpark DataFrame in this way, or in some other way.

Thanks

1 Answer

面试哥 2021-01-29

How to get the last row.

A long and ugly way, which assumes that all columns are orderable:
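```
from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

# Tag each row with an increasing id, take the max struct (structs are
# compared field by field, "_id" first), then unpack it and drop the id.
last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))
```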
If not all columns can be ordered, you can try:
```
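# Add the increasing id, then fetch the largest id to the driver
with_id = df.withColumn("_id", monotonically_increasing_id())
i = with_id.select(max_("_id")).first()[0]

# Keep only the row carrying that id; the result is still a DataFrame
with_id.where(col("_id") == i).drop("_id")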
```
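Both snippets rely on the ids from `monotonically_increasing_id` being increasing, not on their being consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The following is a plain-Python model of that layout, not Spark code; the helper name `sketch_monotonic_id` and the partition sizes are made up for illustration.

```python
def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Model of the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Hypothetical DataFrame split into two partitions holding 2 and 3 rows
partition_sizes = [2, 3]
ids = [
    sketch_monotonic_id(p, r)
    for p, n in enumerate(partition_sizes)
    for r in range(n)
]

print(ids)       # [0, 1, 8589934592, 8589934593, 8589934594]
print(max(ids))  # id of the last row of the last partition
```

Under this model the largest id belongs to the last row of the last partition, which is why taking the maximum works, while the gap between partitions shows why `_id` cannot stand in for a positional index such as 12 or 200.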
Note: there is a `last` function in `pyspark.sql.functions` / `o.a.s.sql.functions`, but given the description of the corresponding expressions, it is not a good choice here.

How can I access the dataframe rows by index?
