Li Kun, Head of Huawei's Big Data Design Department: CarbonData in Practice for High-Performance Interactive Big Data Analytics
2020-02-27
- 1.CarbonData: Interactive Big Data Analytics in Practice. Li Kun, 2017-05
- 2.Agenda • Why CarbonData is needed • Introduction to CarbonData • Performance tests • Use cases • Future plans
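- 3.Enterprises run many kinds of data applications, from business intelligence and batch processing to machine learning • Applications: Report & Dashboard, OLAP & Ad-hoc, Batch processing, Machine learning, Realtime Analytics • Data: Big table (e.g. CDR, transaction, Web log, …), Small table, Small table, Unstructured data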
- 4.Challenges from data • Data size: tens of billions of rows • Single table > 10 B rows • Fast growing • Multi-dimensional • Every record has > 100 dimensions • New dimensions are added occasionally • Rich in detail (fine granularity) • Billion-level high cardinality • 1B terminals × 200K cells × 1,440 minutes = 2.88 × 10^17 (288,000 trillion) combinations
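To make the scale in the last bullet concrete, the product can be worked out directly. This is a small arithmetic sketch that uses only the three figures given on the slide (1B terminals, 200K cells, 1,440 minutes per day); the variable names are illustrative and not from the original deck.

```python
# Figures taken from the slide; variable names are illustrative.
terminals = 1_000_000_000      # 1B terminals
cells = 200_000                # 200K cells
minutes_per_day = 24 * 60      # 1,440 minutes in a day

# Number of distinct (terminal, cell, minute) combinations.
combinations = terminals * cells * minutes_per_day

trillion = 10**12              # the unit used on the slide (万亿 = 10^12)

print(combinations)
print(combinations // trillion)
```

With the figures as written, the product is 2.88 × 10^17, that is, 288,000 trillion possible combinations per day.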
- 5.Challenges from applications • Enterprise application integration • SQL 2003 standard syntax • BI integration, JDBC/ODBC • Flexible queries with no fixed pattern • Any combination of dimensions • OLAP vs. detail-record queries • Full scan vs. small scan • Precise search and fuzzy search • Figure labels: Multi-dimensional OLAP Query, Full Scan Query, Small Scan Query
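The query shapes listed on this slide can be sketched with a toy example. The block below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in SQL engine, and the `cdr` table, its columns, and its rows are invented for the example rather than taken from the deck.

```python
import sqlite3

# Hypothetical CDR-like table; column names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cdr (terminal_id TEXT, cell_id TEXT, city TEXT, minute INTEGER, bytes INTEGER)"
)
rows = [
    ("t001", "c10", "Shenzhen", 0, 500),
    ("t001", "c11", "Shenzhen", 1, 300),
    ("t002", "c10", "Shenzhen", 0, 200),
    ("t003", "c20", "Beijing", 0, 700),
]
conn.executemany("INSERT INTO cdr VALUES (?, ?, ?, ?, ?)", rows)

# Full-scan OLAP query: aggregate over an arbitrary combination of dimensions.
olap = conn.execute(
    "SELECT city, cell_id, SUM(bytes) FROM cdr "
    "GROUP BY city, cell_id ORDER BY city, cell_id"
).fetchall()

# Small-scan detail query: precise search for one terminal's records.
detail = conn.execute(
    "SELECT cell_id, minute, bytes FROM cdr WHERE terminal_id = ? ORDER BY minute",
    ("t001",),
).fetchall()

# Fuzzy search: pattern match on a dimension value.
fuzzy = conn.execute(
    "SELECT DISTINCT terminal_id FROM cdr WHERE cell_id LIKE 'c1%' ORDER BY terminal_id"
).fetchall()

print(olap)
print(detail)
print(fuzzy)
```

The point of the sketch is that one table has to serve all three patterns: the first query touches every row, while the second and third touch only a small slice of the data.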
- 6.How to choose storage? How to build the data platform?
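- 7.Option 1: NoSQL database • Key-value store: low latency, < 5 ms • Data can be accessed only by key, one value per key • Suited to serving real-time applications, not to analytical applications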
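The limitation described on this slide can be sketched in a few lines. In the block below a plain Python `dict` stands in for a key-value store; the keys, fields, and values are invented for illustration and do not model any particular NoSQL product.

```python
# A plain dict stands in for a key-value store; keys and values are invented.
store = {
    "t001": {"city": "Shenzhen", "bytes": 800},
    "t002": {"city": "Shenzhen", "bytes": 200},
    "t003": {"city": "Beijing", "bytes": 700},
}

# Point lookup: the access pattern a key-value store is built for.
record = store["t002"]

# Analytical question (total bytes per city): there is no key for "city",
# so every value has to be read and aggregated by the caller.
bytes_per_city = {}
for value in store.values():
    city = value["city"]
    bytes_per_city[city] = bytes_per_city.get(city, 0) + value["bytes"]

print(record)
print(bytes_per_city)
```

The lookup touches one entry, while the aggregation has to visit all of them, which is why this option fits real-time serving better than analytics.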
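- 8.Option 2: Parallel database • Parallel scan + fast compute: fine-grained control over parallel computation, suited to small and medium-scale data analysis (data marts) • Questionable scalability and fault tolerance • Cluster size < 100 data nodes: scalability has an upper limit • Not suitable for big batch jobs: weak intra-query fault tolerance • Not suited to massive-scale data analysis (enterprise data warehouses)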
- 9.Option 3: Search engine • All columns indexed • Fast searching • Simple aggregation • Suited to multi-condition filtering and text analysis • Designed for search, not OLAP • Not for TopN, join, or multi-level aggregation: cannot handle complex computation • 3~4X data expansion in size • No SQL support: proprietary syntax, hard to migrate
- 10.Option 4: SQL on Hadoop • Modern distributed architecture that scales well in computation • Pipeline-based: