基于 Blink SQL 的阿里实时计算平台(Stream Compute)
2020-02-27 1366浏览
- 1.基于 Blink SQL 的阿里实时计算平台 StreamCompute 伍 翀(云邪) 2017.10
- 2.
- 3.
- 4.
- 5.
- 6.About me
- 7.目录 1 Background 2 Blink 介绍 3 Blink SQL 4 StreamCompute 平台 5 应用案例
- 8.Background Part I
- 9.开源流计算引擎 Kafka Stream
- 10.开源流计算引擎 10+ Kafka Stream
- 11.阿里流计算引擎 Galaxy
- 12.流计算的挑战 低延迟 高吞吐 准确性 易用性
- 13.毫秒级延迟 每秒千万级吞吐 低延迟 准确性 Exactly-once 语义 Event time 处理 高吞吐 Apache Flink 易用性 SQL / Table API / DataStream API
- 14.Alibaba Blink
- 15.Blink = 企业级 Flink • Blink 拥抱 Apache Flink 开源社区 • Blink 不断将优化改进回馈给开源社区 大规模部署 提升稳定性 状态性能优化 StreamSQL 语法语义 • Blink 在 Flink 社区影响力 • 5 名 Flink Committer • 连续 3 次赞助、作为讲师参加 Flink Forward, Hadoop Summit SQL 优化
- 16.Blink Ecosystem in Alibaba 产品 阿里集团内部 搜索/推荐/广告/数据平台/反作弊等等 平台 阿里云 公有云&专有云 实时计算 StreamCompute 平台 ANSI SQL Blink DataSet API DataStream API Runtime Engine 集群管理 Resource Management Storage
- 17.Blink SQL Section 3
- 18.Why SQL? One Query, Same Result Declarative Optimized Understandable Stable Unify
- 19.SQL 不是为流设计的 数据是有限集 流数据是无穷的 DBMS 能访问到所有数据 流数据是随着时间到达的 SQL 查询返回一个结果并结束 流上的查询不断产生结果且不会结束
- 20.动态表(Dynamic Table) n 动态表(Dynamic Table): 数据会随着时间变化的表 Apply Stream Changelog 动态表
- 21.流与动态表的二象性 Stream user clicks Stream Apply Dynamic Table user clicks Changelog user clicks Mary 1 Bob 1 Mary 2 1 LIz 1 Bob 2 Bob 2 Mary 3 Mary 3 Mary 1 Bob 1 Mary 2 LIz Mary 3 1 2 Bob 1 2 Liz 1
- 22.流与动态表的二象性 Stream user Dynamic Table clicks Mary 1 Bob 1 Mary 2 LIz 1 Bob 2 Mary 3 Apply Changelog Stream user clicks Mary 3 1 2 Bob 1 2 Liz 1
- 23.动态表 + 连续查询 n 连续查询(Continuous Query):持续运行的查询 Stream 连续查询 Stream
- 24.动态表 + 连续查询 输入流 连续查询 动态表 输出流 clicks user url Mary ./home Bob ./cart Mary ./prod?id=1 Liz ./home Liz ./prod?id=3 result SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user user cnt Bob, 1 Mary 2 3 1 Mary, 2 Bob 1 Liz, 1 Liz 2 1 Liz, 2 Mary, 3 Mary ./prod?id=7 Stream Mary, 1 连续查询 Stream
- 25.Blink SQL 核心功能 nDDL:CREATE TABLE etc. nDML:INSERT etc. n 用户自定义函数:UDF/UDTF/UDAF n 基本功能:SELECT, WHERE, GROUP BY, UNION n 高级功能:JOIN, TopN n 丰富的窗口:滑动、滚动、会话、Over n 性能优化:分段优化、Micro Batch etc.
- 26.DDL CREATE FUNCTION contains AS 'com.alibaba.blink.Contains'; CREATE TABLE clicks ( user VARCHAR, cTime TIMESTAMP, url VARCHAR ) WITH ( type = 'kafka', topic = 'click_topic', … ); CREATE VIEW taobao_clicks(user, cTime, url) AS SELECT * FROM clicks WHERE contains(url, 'taobao.com');
- 27.DML CREATE TABLE large_orders ( orderId VARCHAR, productId VARCHAR, orderTime TIMESTAMP, units BIGINT, PRIMARY KEY (orderId) ) WITH ( type = 'hbase', tableName = 'large_orders', … ); INSERT INTO large_orders SELECT * FROM orders WHERE units > 1000;
- 28.Grouping Aggregate clicks user cTime url Mary 12:00:00 ./home Bob ./cart 12:00:00 Mary 12:00:05 ./prod?id=1 Liz 12:01:00 ./home Liz 12:01:30 ./prod?id=3 Mary 12:01:45 ./prod?id=7 result SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user user cnt Mary 2 3 1 Bob 1 Liz 1 2
- 29.Window Aggregate clicks user cTime url Mary 12:00:00 ./home Bob 12:00:00 ./cart Mary 12:02:00 ./prod?id=2 Mary 12:55:00 ./home Bob 13:01:00 ./prod?id=4 Liz 13:30:00 ./cart Liz 13:59:00 ./home Mary 14:00:00 ./prod?id=1 Liz 14:02:00 ./prod?id=8 Bob 14:30:00 ./prod?id=7 Bob 14:40:00 ./home result SELECT user, TUMBLE_END( cTime, INTERVAL '1' HOURS) AS endT, COUNT(url) AS cnt FROM clicks GROUP BY TUMBLE( cTime, INTERVAL '1' HOURS), user user endT cnt Mary 13:00:00 3 Bob 13:00:00 1 Bob 14:00:00 1 Liz 14:00:00 2 Mary 15:00:00 1 Bob 15:00:00 2 Liz 15:00:00 1
- 30.双流 JOIN Orders orderId productId orderTime 5 30 10:17:00 6 10 10:17:05 9 10 11:02:00 12 10 11:24:11 Shipments orderId shipTime 5 10:55:00 6 10:20:00 9 11:58:00 12 11:44:00 SELECT o.orderId, o.productId, o.orderTime, s.shipTime FROM Orders AS o JOIN Shipments AS s ON o.orderId = s.orderId result orderId productId orderTime shipTime 5 30 10:17:00 10:55:00 6 10 10:17:05 10:20:00 9 10 11:02:00 11:58:00 12 10 11:24:11 11:44:00
- 31.流 JOIN 表 —— Join 当前表 SELECT o.*, p.* FROM Orders AS o JOIN Products FOR SYSTEM_TIME AS OF PROCTIME() AS p ON o.productId = p.productId Get p.* o.* Orders Stream o.*, p.*
- 32.流 JOIN 表 —— Join 历史表 SELECT o.*, p.unitPrice FROM Orders AS o JOIN Products FOR SYSTEM_TIME AS OF o.orderTime AS p ON o.productId = p.productId Get p.* with o.orderTime o.* Orders Stream o.*, p.*
- 33.查询优化 SELECT user, ROW_NUMBER() OVER (ORDER BY lastLogin) AS rank n micro-batch FROM users; n async-join table n multi-join table merge n TopN optimization SELECT * FROM ( SELECT user, ROW_NUMBER() OVER (ORDER BY lastLogin) AS rank FROM users) WHERE rank <= 10; 我们为这种查询设计了一种特殊的算子,来计算 TopN
- 34.StreamCompute Just SQL ? SQL SQL 只解决了一部分问题
- 35.StreamCompute 实时计算平台 Section 4
- 36.
- 37.
- 38.应用案例 Section 5
- 39.应用案例 n 实时数据统计(报表、大屏)
- 40.应用场景 n 实时海量数据处理:搜索、推荐等
- 41.应用场景 n 在线机器学习
- 42.总结 n Blink SQL 使用标准的 SQL 语义,统一流和批 n Blink SQL 已经大规模应用于阿里巴巴 (1000+ Jobs) n Blink SQL 提供非常完备的功能 n StreamCompute 平台极大提高了开发效率
- 43.THANKS --------- Q&A Section --------