How do I return a "tuple type" from a UDF in PySpark?
Posted on 2021-01-29 15:06:26
All the data types in pyspark.sql.types are:
__all__ = [
"DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
"TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
"LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]
I have to write a UDF (in PySpark) that returns an array of tuples. What should I pass as the second argument to the udf method, i.e. its return type? It would be something like ArrayType(TupleType())
…
1 Answer
There is no such thing as a TupleType in Spark. Product types are represented as structs with fields of specific types. For example, if you want to return an array of pairs (integer, string), you can use a schema like this:

from pyspark.sql.types import *

schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)
]))
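As an aside, on Spark 2.3 and later the return type can also be passed to udf as a DDL-formatted type string instead of a DataType object; a minimal sketch of the equivalent declaration (the variable name schema_ddl is just an illustration):

# Equivalent return type as a DDL-formatted string (Spark 2.3+)
schema_ddl = "array<struct<char:string,count:int>>"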
Example usage:
from pyspark.sql.functions import udf
from collections import Counter

char_count_udf = udf(
    lambda s: Counter(s).most_common(),
    schema
)

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])
df.select("*", char_count_udf(df["value"])).show(2, False)

## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
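Because the UDF returns an array of structs, the nested fields can be flattened afterwards, for example by exploding the array and selecting the struct fields with dot notation. A minimal sketch, assuming the df and char_count_udf defined above (the column aliases counts and pair are arbitrary names chosen here):

from pyspark.sql.functions import explode

# One row per (char, count) struct, then flatten the struct into plain columns.
result = df.select("id", char_count_udf(df["value"]).alias("counts"))
result.select("id", explode("counts").alias("pair")) \
      .select("id", "pair.char", "pair.count") \
      .show()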