Ozone: Next-Generation Data Lake Storage (Junping Du)

2020-03-01

  • 1.Ozone – Next-Gen Storage for Data Lake. Junping Du, Head of Massive Storage and Computing at Tencent Big Data, ASF Member
  • 2.About Me
  • 3.About Me: Junping Du
      ● ASF Member, Apache Hadoop PMC Member & Committer
      ● Chair of Tencent Open Source Alliance
      ● Director @ Tencent Big Data
      ● Years of open source experience, with a focus on big data
  • 4.Agenda
      ● The ABCs of Data Lake
      ● Overview of Apache Ozone
      ● Ozone Architecture, Design and Details
      ● Current Status, Work In Progress and Release Plan
      ● Ozone in Tencent Scenario
  • 5.What is Data Lake? A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics - from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
  • 6.Why Data Lake? [Figure labels: BI, Machine Learning, Data IDE, Data Management, Data Lake, Database, Data Warehouse, Object Store, NoSQL]
  • 7.Core Features of Data Lake Solution
      A Data Lake Solution includes Data Lake Storage, Data Lake Management and Data Lake Analytics.
      ● Data Lake Analytics: different patterns of analytic workloads to process different data
      ● Data Lake Management: metadata governance to manage the lifecycle of heterogeneous data
      ● Data Lake Storage: storage system to store different patterns of data, structured and unstructured
      Difference compared to data warehouse:
      ● Follows the natural structure of data
      ● Mainly used for ad-hoc queries and heterogeneous analytics
      ● Not good at routine data modeling, data marts and data governance
  • 8.Scenario of Data Lake
      ● Ad-hoc queries to investigate the value of data
      ● Data exploration
      ● Interactive queries on heterogeneous data
      [Figure labels: Data Scientist, Data Analyst, BI, Data Lake Analytics, Object Store, NoSQL Storage]
  • 9.Target Persona
      ● Data Analyst
          ○ Create different analysis models
          ○ Create layered data marts
          ○ Data modeling
          ○ Data visualization
      ● Data Engineer
          ○ Data collection and ingestion
          ○ ETL task scheduling
          ○ Data preprocessing, data governance
      ● Data Scientist
          ○ Interactive data exploration
          ○ ML and DL
          ○ Hyper-parameter tuning
      ● Project Manager
          ○ Project management
          ○ User management
          ○ Coordination
  • 10.Theme of Data Lake (Storage side): Scalability, Cloud, Machine Learning
  • 11.Retrospect for HDFS Architecture
      ● The NameNode (NN) stores all metadata in memory
      ● Low-latency metadata operations
      ● Easy scaling: IO + PBs + clients
      ● Metadata in memory is both the strength and the weakness of HDFS
  • 12.Why Ozone?
      ● HDFS has scaling problems
      ● Some users have a "make your HDFS healthy" day
      ● 200 million files for regular users
      ● Companies with committers/core devs: 400-600 million
      ● New opportunities and challenges
          ○ Cloud
          ○ Streaming
          ○ Small files are the norm
  • 13.Scaling Challenges
      When scaling a cluster up to 4K+ nodes with about 500M files:
          ○ Namespace metadata in NN
          ○ Block management in NN
          ○ File operation concurrency
          ○ Block report handling
          ○ Client/RPC 150K++
          ○ Slow NN startup
      Small files in HDFS make things worse!
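To make the namespace pressure on slide 13 concrete, the following is a rough back-of-envelope sketch rather than a measurement. It assumes about 150 bytes of NameNode heap per namespace object (file or block), a commonly quoted rule of thumb that does not come from the slides. The `blocks_per_file` parameter and the function name are hypothetical.

```python
# Rough NameNode heap estimate for an in-memory namespace.
# BYTES_PER_OBJECT is an assumed rule of thumb, not a figure from the slides.
BYTES_PER_OBJECT = 150


def namenode_heap_gib(num_files: int, blocks_per_file: float = 1.0) -> float:
    """Estimate the heap (GiB) needed to hold file and block objects in memory."""
    # Each file contributes one file object plus its block objects.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 2**30


if __name__ == "__main__":
    # File counts taken from the slides: regular users vs. the 4K+ node case.
    for files in (200_000_000, 500_000_000):
        print(f"{files:,} files -> ~{namenode_heap_gib(files):.0f} GiB of heap")
```

With small files, `blocks_per_file` stays close to 1, so the object count grows with the number of files rather than with the volume of data stored. That is one reading of why small files make the problem worse.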
  • 14.What is Apache Ozone?
      ● Object store for big data
      ● Scales in terms of both objects and IOPS
      ● The NameNode is not a bottleneck anymore
      ● A set of micro-services, each capable of doing its own job
      ● Leverages learnings from supporting HDFS across a large set of use cases
      ● Apache YARN, MapReduce, Spark and Hive are all tested and certified to work with Apache Ozone; no application changes are required to work with Ozone
      ● Supports K8s and CSI, and can run on K8s natively
      ● A spiritual successor to HDFS
  • 15.Ozone Architecture Overview [Figure labels: NameNode’ (x3), Ozone Manager (x3), Storage Container Manager, Datanode with Apache Ratis (x3)]
  • 16.Ozone Manager
      ● Namespace layer for Ozone
      ● Manages objects in a flat namespace: Volume/Bucket/Key
      ● LSM-based K-V store for metadata: LevelDB/RocksDB/…
      ● Benefits compared with the HDFS NameNode
          ○ Easy to manage and scale
          ○ 1B keys tested in a single OM
          ○ Scales independently of the block layer
          ○ Easily sharded based on bucket
          ○ No GC pressure
          ○ Not all in memory
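The flat Volume/Bucket/Key namespace on slide 16 can be illustrated with a toy ordered key-value table. This is only a sketch of the idea: the class name, the `/volume/bucket/key` string layout and the metadata dictionaries are hypothetical, not the actual Ozone Manager schema. A sorted Python list stands in for the key ordering that an LSM store such as RocksDB provides.

```python
import bisect


class FlatNamespace:
    """Toy ordered K-V metadata table (hypothetical, not the real OM schema)."""

    def __init__(self):
        self._keys = []    # full keys kept sorted, mimicking an LSM store's ordering
        self._values = {}  # full key -> metadata dict

    @staticmethod
    def full_key(volume, bucket, key):
        # One flat string per object; there is no directory tree to maintain.
        return f"/{volume}/{bucket}/{key}"

    def put(self, volume, bucket, key, meta):
        fk = self.full_key(volume, bucket, key)
        if fk not in self._values:
            bisect.insort(self._keys, fk)
        self._values[fk] = meta

    def get(self, volume, bucket, key):
        return self._values.get(self.full_key(volume, bucket, key))

    def list_keys(self, volume, bucket, prefix=""):
        """List keys in a bucket with a range scan over the sorted keys."""
        start = self.full_key(volume, bucket, prefix)
        i = bisect.bisect_left(self._keys, start)
        found = []
        while i < len(self._keys) and self._keys[i].startswith(start):
            found.append(self._keys[i])
            i += 1
        return found


if __name__ == "__main__":
    ns = FlatNamespace()
    ns.put("vol1", "logs", "2020/03/01/a.log", {"size": 10})
    ns.put("vol1", "logs", "2020/03/02/b.log", {"size": 20})
    print(ns.list_keys("vol1", "logs", "2020/03/01/"))
```

Because every key starts with its volume and bucket, all keys of one bucket sit in one contiguous key range. That property is what makes sharding by bucket, as the slide suggests, a natural split point in this sketch.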
  • 17.HDDS Storage Containers
      ● The HDDS datanode is a plug-in service running in Datanodes
      ● A container is the basic unit of replication (2-16GB)
      ● Fully distributed block metadata in an LSM-based K-V store
      ● No centralized block map in memory, unlike HDFS
      [Figure label: Key Value store]
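A minimal sketch of why replicating containers instead of blocks reduces what a central service must track. It assumes a 128 MiB block size (the usual HDFS default, not stated in the slides) and a hypothetical 5 GiB container, which falls inside the 2-16GB range given on slide 17. The `Container` class and `units_to_track` helper are illustrative names only, not HDDS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy container: block metadata stays with the container on the datanode."""
    container_id: int
    capacity: int                               # bytes
    used: int = 0
    blocks: dict = field(default_factory=dict)  # block_id -> length in bytes

    def put_block(self, block_id: str, length: int) -> bool:
        # Reject the block if it would overflow the container.
        if self.used + length > self.capacity:
            return False
        self.blocks[block_id] = length
        self.used += length
        return True


def units_to_track(total_bytes: int, unit_bytes: int) -> int:
    """Number of replication units needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // unit_bytes)


if __name__ == "__main__":
    one_pib = 2**50
    blocks = units_to_track(one_pib, 128 * 2**20)   # assumed HDFS-style block size
    containers = units_to_track(one_pib, 5 * 2**30)  # hypothetical container size
    print(f"blocks: {blocks:,}  containers: {containers:,}")
```

Under these assumed sizes, the number of replication units for one PiB drops by roughly a factor of 40, while per-block metadata remains in each container's local store rather than in a centralized in-memory map.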
  • 18.Open/Close Containers
      ● Container: