Spark SQL Optimization Practices at ByteDance, by Guo Jun
2020-03-01
- 1. Spark SQL
- 2.
- 3. Spark SQL / Druid ETL OLAP
- 4. Spark SQL • Spark SQL • Spark Shuffle
- 5. Spark SQL architecture (diagram): SQL / DataFrame / Dataset → Parser → Unresolved Logical Plan → Analyzer (Catalog) → Resolved Logical Plan → Optimizer (RBO) → Optimized Logical Plan → Query Planner → Physical Plan → Cost Model (CBO) → Selected Physical Plan → AE → DAG → RDDs. The Catalyst label also appears in the diagram.
- 6. Spark SQL • Bucket Join
- 7. Spark SQL: Bucket Join (diagram). Sort Merge Join with Shuffle: Table 1 (partition 0 … partition m) and Table 2 (partition 0 … partition k) each go through Shuffle and Sort into partition 0 … partition n, and matching partitions are then joined.
- 8. Spark SQL: Bucket Join (diagram). Table 1 and Table 2 are each stored as bucket 0 … bucket n, and bucket i of Table 1 is joined with bucket i of Table 2. Diagram labels: Shuffle, Join.
- 9. Spark SQL: Bucket outputPartitioning:
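
The contrast between slides 7 and 8 can be sketched in a few lines. What follows is a minimal pure-Python simulation of the bucket join idea, not Spark code and not the implementation described in the talk. The function names (`bucket_id`, `write_bucketed`, `merge_join`, `bucket_join`) are hypothetical, and Python's built-in `hash` stands in for whatever hash function the engine actually uses. The assumption is that both tables were bucketed and sorted on the join key, with the same number of buckets, when they were written. Bucket i of one table can then be merge-joined directly with bucket i of the other, with no shuffle or sort at query time.

```python
def bucket_id(key, num_buckets):
    # Stand-in hash function (assumption): a real engine uses its own hash.
    return hash(key) % num_buckets


def write_bucketed(rows, num_buckets):
    """Simulate writing a table bucketed and sorted by key (done once, at write time)."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join(left, right):
    """Inner sort-merge join of two lists of (key, value) already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for jj in range(j, j_end):
                    out.append((lk, left[i][1], right[jj][1]))
                i += 1
            j = j_end
    return out


def bucket_join(left_buckets, right_buckets):
    """Join bucket i with bucket i; no data moves between buckets."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("this sketch requires equal bucket counts on both sides")
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_join(lb, rb))
    return out


table1 = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
table2 = [(2, "x"), (5, "y"), (5, "z"), (7, "w")]
joined = bucket_join(write_bucketed(table1, 3), write_bucketed(table2, 3))
```

The hashing and sorting cost is paid once per table at write time, so every later join on the same key can skip it. This sketch only handles equal bucket counts on both sides.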