大数据实时计算 Flink SQL 解密 伍翀@阿里巴巴
2020-03-01 588浏览
- 1.C C T D 0 2 8 1
- 2.8 1 大数据实时计算 Flink SQL 解密 0 2 C C T D 伍 翀(云邪) 2018.05
- 3.姓名: 伍翀 花名: 云邪 阿里巴巴 C C T 0 2 硕士毕业于 新一代实时计算引擎 北京理工大学 BlinkSQL 开发与优化 2015 阿里巴巴 实时计算引擎 JStorm 的开发与设计 D 2016 8 1 Flink Committer 2017年2月 2017 Now
- 4.目录 1 Background 2 Flink SQL 基本概念 3 Flink SQL 核心功能 C C T D 5 4 0 2 8 1 Flink SQL 优化 阿里云流计算产品
- 5.Background Part I D C C T 0 2 8 1
- 6.Alibaba Blink + Apache Flink = Alibaba’s Improvements D C C T 0 2 8 1 Blink = Alibaba Blink 阿里巴巴Blink团队有 20+ flink contributor,6 名 committer,向社区贡 献了数百个Commit Blink Runtime + Flink SQL
- 7.团队工作 贡献社区 主导制定 Flink SQL 语义 • DynamicTable 2016-2017 • Retraction 2016-2017 D C C T 贡献社区 0 2 8 1 完善 Flink SQL 功能 • Agg, Join, Window 2017 • 跑通全部 TPCH Query 2018 部分贡献社区 性能提升 • 大量的查询优化 2017-2018 资源配置自动化 2018
- 8.Flink SQL Section 2 D C C T 0 2 8 1
- 9.Why SQL? 8 1 0 2 One Query, One Result Declarative Optimized D C C T Understandable Stable Unify
- 10.SQL 不是为流设计的 批处理 数据是有限的 D C C T 批计算查询返回一个结果并结束 没有Retraction 0 2 8 1 流处理 流数据是无穷的 流上的查询不断产生结果且不会结束 有对历史数据的修改(Retraction)
- 11.动态表(Dynamic Table) 动态表(Dynamic Table): 数据会随着时间变化的表 Stream 0 2 8 1 Dynamic Table user clicks Mary 1 Bob 1 Mary 2 LIz 1 Bob 2 Mary 3 D C C T Apply Changelog user clicks Mary 3 1 2 Bob 1 2 Liz 1 流与动态表的对偶性
- 12.动态表 + 连续查询 连续查询(Continuous Query):持续运行的查询 Stream Stream C C T 连续查询 D 连续查询 Stream Stream 8 1 0 2 连续查询 Stream
- 13.流计算 Retraction 输入流 without retraction clicks user url click_cnt Mary ./home user cnt Bob ./cart Mary 2 1 Mary ./prod?id=1 Bob 1 SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user 8 1 输出流 动态表1 D C C T 0 2 Mary, 1 Bob, 1 Mary, 2 动态表2 result cnt freq 1 2 2 1 SELECT cnt, COUNT(cnt) as freq FROM click_cnt GROUP BY cnt 应该为1
- 14.流计算 Retraction 输入流 with retraction clicks user url click_cnt Mary ./home user cnt Bob ./cart Mary 2 1 Mary ./prod?id=1 Bob 1 SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user 8 1 输出流 动态表1 D C C T 0 2 + Mary, 1 + Bob, 1 - Mary, 1 动态表2 result cnt freq 1 1 2 2 1 + Mary, 2 SELECT cnt, COUNT(cnt) as freq FROM click_cnt GROUP BY cnt 由查询优化器判断是否需要Retraction,用户无感知。 结果正确
- 15.0 2 8 1 世界上不需要所谓的 Stream SQL C C T 标准的 ANSI SQL 就可以用来定义流计算 D
- 16.Flink SQL 核心功能 Group Agg 8 1 0 2 UDF/UDTF/UDAF DDL & DML D C C T Window Agg Join Over Agg
- 17.-- 定义数据源表 Loading Data CREATE TABLE clicks ( user VARCHAR, cTime TIMESTAMP, url VARCHAR ) WITH ( type = 'kafka', topic = 'click_topic', … ); C C T D 8 1 0 2 SELECT * FROM clicks user cTime url Mary 12:00:00 ./home Bob 12:00:00 ./cart Mary 12:00:05 ./prod?id=1
- 18.-- 定义数据结果表 Saving Data CREATE TABLE last_clicks ( user VARCHAR, cTime TIMESTAMP, url VARCHAR, PRIMARY KEY (user) ) WITH ( type = 'mysql', … ); C C T D 0 2 8 1 INSERT INTO last_clicks SELECT * FROM clicks
- 19.CREATE TABLE mysql_clicks ( user VARCHAR, cTime TIMESTAMP, url VARCHAR, PRIMARY KEY (user) ) WITH ( type = 'mysql', … ); Multi Output C C T CREATE TABLE hbase_clicks ( user VARCHAR, cTime TIMESTAMP, url VARCHAR, PRIMARY KEY (user) ) WITH ( type = 'hbase', … ); 0 2 8 1 CREATE VIEW taobao_clicks AS SELECT * FROM clicks WHERE url LIKE 'http://taobao.com%’ D INSERT INTO mysql_result SELECT * FROM taobao_clicks INSERT INTO hbase_result SELECT * FROM taobao_clicks
- 20.Group Aggregate 从历史到现在每个用户点击的次数 clicks user cTime Mary 12:00:00 ./home Bob ./cart 12:00:00 C C T url Mary 12:00:05 ./prod?id=1 Mary 12:01:45 ./prod?id=7 D 8 1 0 2 SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user result user cnt Mary 2 3 1 Bob 1
- 21.Window Aggregate 每小时每个用户点击的次数 clicks user cTime url Mary 12:00:00 ./home Bob 12:00:00 ./cart Mary 12:02:00 ./prod?id=2 Mary 12:55:00 ./home Bob 13:01:00 ./prod?id=4 Liz 13:30:00 ./cart Liz 13:59:00 ./home 0 2 8 1 SELECT user, TUMBLE_END( cTime, INTERVAL '1' HOURS) AS endT, COUNT(url) AS cnt FROM clicks GROUP BY user, TUMBLE( cTime, INTERVAL '1' HOURS) D C C T result user endT cnt Mary 13:00:00 3 Bob 13:00:00 1 Bob 14:00:00 1 Liz 14:00:00 2
- 22.双流 JOIN:支持 INNER, LEFT, RIGHT, FULL, SEMI, ANTI Orders orderId productId orderTime 5 30 10:17:00 6 10 10:17:05 9 10 11:02:00 12 10 11:24:11 Shipments orderId shipTime 5 10:55:00 6 C C T D 8 1 SELECT o.orderId, o.productId, o.orderTime, s.shipTime FROM Orders AS o JOIN Shipments AS s ON o.orderId = s.orderId 0 2 result orderId productId orderTime shipTime 10:20:00 5 30 10:17:00 10:55:00 9 11:58:00 6 10 10:17:05 10:20:00 12 11:44:00 9 10 11:02:00 11:58:00 12 10 11:24:11 11:44:00
- 23.维表 JOIN:支持 INNER, LEFT CREATE TABLE Products ( productId VARCHAR, productName VARCHAR, price DECIMAL, PRIMARY KEY (productId), PERIOD FOR SYSTEM_TIME ) WITH ( type = 'hbase' … ); C C T 8 1 0 2 D SELECT o.*, p.* FROM Orders AS o JOIN Products FOR SYSTEM_TIME AS OF PROCTIME() AS p ON o.productId = p.productId
- 24.8 1 0 2 聊几个优化 C C T D
- 25.异步维表 JOIN Sync. IO Async. IO a Wait for Response b w Concurrent Processing 0 2 b w a 8 1 a C C T DataBase D Reduced Throughput c d DataBase a b Increased Throughput c d b Send Request Receive Request Wait
- 26.异步维表 JOIN CREATE TABLE Products ( productId VARCHAR, productName VARCHAR, price DECIMAL, PRIMARY KEY (productId), PERIOD FOR SYSTEM_TIME ) WITH ( type = 'hbase', 一行配置的改动 async = 'true' … ); D C C T 8 1 0 2 SELECT o.*, p.* FROM Orders AS o JOIN Products FOR SYSTEM_TIME AS OF PROCTIME() AS p ON o.productId = p.productId
- 27.如何处理数据倾斜Data-Skew 8 1 Map C C T D 0 2 Agg Map Map Agg
- 28.如何处理数据倾斜Data-Skew 反压 8 1 Map C C T D 反压 0 2 Agg Map Agg 反压 Map Hot!!
- 29.如何处理数据倾斜Data-Skew 0 2 8 1 Local-Global Aggregation 优化 D C C T
- 30.如何处理数据倾斜Data-Skew Local-Global Aggregation Simple Aggregation Map C C T Agg Map 8 1 Map D 0 2 Map Local Agg Global Agg Local Agg Global Agg Agg Map Map Local Agg
- 31.Local-Global 带来 20X 的性能提升 D C C T 8 1 优化前 0 2 优化后
- 32.C C T 0 2 阿里云流计算产品 Section 3 D 8 1
- 33.D C C T 0 2 8 1
- 34.D C C T 0 2 8 1
- 35.拥抱开源,做世界第一的计算平台 Thanks Q&A D C C T 0 2 8 1
- 36.