Feeding a DataFrame into randomForest in PySpark

Posted on 2021-01-29 15:05:41

I have a DataFrame that looks like this:

+--------------------+------------------+
|            features|           labels |
+--------------------+------------------+
|[-0.38475, 0.568...]|          label1  |
|[0.645734, 0.699...]|          label2  |
|     .....          |          ...     |
+--------------------+------------------+

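Both columns are of String type (StringType()), and I would like to feed this DataFrame into the spark ml
randomForest. To do that, I need to convert the features column into a vector of floats. Does anyone know how to do this?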

1 answer
  • 面试哥 2021-01-29

    If you are using Spark 2.x, I believe this is what you need:

    from pyspark.sql.functions import udf
    from pyspark.mllib.linalg import Vectors
    from pyspark.ml.linalg import VectorUDT
    from pyspark.ml.feature import StringIndexer
    
    df = spark.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))
    
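    def parse(s):
        # Parse a string such as "[0.1, 0.2]" with the mllib parser, then
        # convert the result to the ml vector type; malformed input becomes null.
        try:
            return Vectors.parse(s).asML()
        except Exception:
            return None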
    
    parse_ = udf(parse, VectorUDT())
    
    parsed = df.withColumn("features", parse_("features"))
    
    indexer = StringIndexer(inputCol="label", outputCol="label_indexed")
    
    indexer.fit(parsed).transform(parsed).show()
    ## +----------------+------+-------------+
    ## |        features| label|label_indexed|
    ## +----------------+------+-------------+
    ## |[-0.38475,0.568]|label1|          0.0|
    ## |[0.645734,0.699]|label2|          1.0|
    ## +----------------+------+-------------+
    

    With Spark 1.6 the parsing and indexing steps are not much different:

    from pyspark.sql.functions import udf
    from pyspark.ml.feature import StringIndexer
    from pyspark.mllib.linalg import Vectors, VectorUDT
    
    df = sqlContext.createDataFrame([("[-0.38475, 0.568]", "label1"), ("[0.645734, 0.699]", "label2")], ("features", "label"))
    
    parse_ = udf(Vectors.parse, VectorUDT())
    
    parsed = df.withColumn("features", parse_("features"))
    
    indexer = StringIndexer(inputCol="label", outputCol="label_indexed")
    
    indexer.fit(parsed).transform(parsed).show()
    ## +----------------+------+-------------+
    ## |        features| label|label_indexed|
    ## +----------------+------+-------------+
    ## |[-0.38475,0.568]|label1|          0.0|
    ## |[0.645734,0.699]|label2|          1.0|
    ## +----------------+------+-------------+
    

    Vectors has a parse method that does the conversion you are trying to accomplish.
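
To make the parsing step concrete without needing Spark, here is a rough pure-Python illustration of what turning such a string into floats amounts to. The helper name parse_dense is hypothetical, and this is not how Vectors.parse is implemented; it is only a sketch of the same idea for dense, bracketed strings.

```python
import ast


def parse_dense(s):
    """Hypothetical helper: "[0.1, 0.2]" -> [0.1, 0.2], or None if malformed."""
    try:
        # literal_eval only accepts Python literals, so arbitrary code is rejected.
        values = ast.literal_eval(s.strip())
        # Accept only a flat list or tuple of numbers.
        if not isinstance(values, (list, tuple)):
            return None
        return [float(v) for v in values]
    except (ValueError, SyntaxError, TypeError, AttributeError):
        # Covers malformed text, nested values, and None input.
        return None


print(parse_dense("[-0.38475, 0.568]"))  # [-0.38475, 0.568]
print(parse_dense("not a vector"))       # None
```

On Spark 2.x or later, a list produced this way could be passed to pyspark.ml.linalg.Vectors.dense inside a UDF as an alternative to going through the mllib parser.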


