How to unwrap a nested Struct column into multiple columns?

Posted on 2021-01-29 14:09:47

I’m trying to expand a DataFrame column with a nested struct type (see below)
to multiple columns. The Struct schema I’m working with looks something like
{"foo": 3, "bar": {"baz": 2}}.

Ideally, I’d like to expand the above into two columns ("foo" and
"bar.baz"). However, when I tried .select("data.*") (where data is
the Struct column), I only got columns foo and bar, where bar is still a
struct.

Is there a way to expand both levels of the Struct?

1 Answer
  • 面试哥 2021-01-29

    You can select the nested field by its full path, data.bar.baz, and alias it as bar.baz. Given this DataFrame:

    df.show()
    +-------+
    |   data|
    +-------+
    |[3,[2]]|
    +-------+
    
    df.printSchema()
    root
     |-- data: struct (nullable = false)
     |    |-- foo: long (nullable = true)
     |    |-- bar: struct (nullable = false)
     |    |    |-- baz: long (nullable = true)
    

    In PySpark:

    import pyspark.sql.functions as F
    # Select each nested field by its dotted path; reference the "bar.baz" alias later as "`bar.baz`".
    df.select(F.col("data.foo").alias("foo"), F.col("data.bar.baz").alias("bar.baz")).show()
    +---+-------+
    |foo|bar.baz|
    +---+-------+
    |  3|      2|
    +---+-------+
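
If the struct has many fields or deeper nesting, listing every path by hand gets tedious. A minimal sketch of one way to generate the paths instead: walk the schema in the plain-dict form that df.schema.jsonValue() returns and collect the dotted path of every non-struct leaf. The helper name leaf_paths and the hand-written schema_json below are illustrative assumptions, not part of the original answer; the dict mirrors the printSchema() output above so the snippet runs without a Spark session. Field names that themselves contain dots are not handled.

```python
def leaf_paths(schema_json, prefix=""):
    """Yield the dotted path of every leaf (non-struct) field in a schema dict.

    schema_json is assumed to have the shape produced by df.schema.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ..., ...}, ...]}.
    """
    for field in schema_json["fields"]:
        path = f"{prefix}.{field['name']}" if prefix else field["name"]
        field_type = field["type"]
        # Nested structs appear as dicts with "type": "struct"; simple types are strings.
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            yield from leaf_paths(field_type, path)
        else:
            yield path


# Hand-written stand-in for df.schema.jsonValue(), matching the schema printed above.
schema_json = {
    "type": "struct",
    "fields": [
        {
            "name": "data",
            "nullable": False,
            "metadata": {},
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "foo", "type": "long", "nullable": True, "metadata": {}},
                    {
                        "name": "bar",
                        "nullable": False,
                        "metadata": {},
                        "type": {
                            "type": "struct",
                            "fields": [
                                {"name": "baz", "type": "long", "nullable": True, "metadata": {}},
                            ],
                        },
                    },
                ],
            },
        },
    ],
}

paths = list(leaf_paths(schema_json))
# Drop the top-level column name ("data") to get the output column names.
aliases = [p.split(".", 1)[1] for p in paths]

print(paths)    # ['data.foo', 'data.bar.baz']
print(aliases)  # ['foo', 'bar.baz']
```

With a real DataFrame, the same lists could be built from df.schema.jsonValue() and passed back to select, for example df.select([F.col(p).alias(a) for p, a in zip(paths, aliases)]). Replacing the dot in each alias with an underscore (bar_baz) would avoid the backtick quoting that a dotted column name requires later.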
    

