TiDB 申砾_Design and Architecture

2020-02-27 55浏览

  • 1.关注公众号回复help, 可获取更多经典学习资 料和文档,电子书
  • 2.TiDB Design And Architecture ShenLi PingCAP
  • 3. Shen Li (申砾)  Tech Lead of TiDB, VP of Engineering  Netease / 360 / PingCAP  Infrastructure software engineer
  • 4. Why we need a new database  The goal of TiDB  Design && Architecture  Storage Layer  Scheduler  SQL Layer  Spark integration  TiDB on Kubernetes
  • 5. From scratch  What’s wrong with the existing DBs?  RDBMS  NoSQL & Middleware NewSQL:F1 & Spanner RDBMS 1970s NoSQL NewSQL 2010 2015 Present MySQL PostgreSQL Oracle DB2 ... Redis HBase Cassandra MongoDB ... Google Spanner Google F1 TiDB
  • 6. Scalability  High Availability ● ACID Transaction  SQL A Distributed, Consistent, Scalable, SQL Database that supports the best features of both traditional RDBMS and NoSQL Open source, of course
  • 7. Data storage  Data distribution  Data replication  Auto balance  ACID Transaction  SQL at scale
  • 8.SQL Layer Storage Layer Applications MySQL Drivers(e.g. JDBC) MySQL Protocol TiDB RPC TiKV
  • 9.
  • 10. Good start! RocksDB is fast and stable.  Atomic batch write  Snapshot  However… It’s a locally embedded KV store.  Can’t tolerate machine failures  Scalability depends on the capacity of the disk
  • 11.Fault Tolerance  Use Raft to replicate data  Key features of Raft  Strongleader:leader does most of the work, issue all log updates  Leader election  Membership changes Implementation: Ported from etcd  Replicas are distributed across machines/racks/data-centers
  • 12.Fault Tolerance Raft Raft RocksDB Machine 1 RocksDB Machine 2 RocksDB Machine 3
  • 13.Scalability  What if we SPLIT data into many regions?  We got many Raft groups.  Region = Contiguous Keys  Hash partitioning or Range partitioning? Redis:Hash partitioning HBase:Range partitioning RangeScan:Select * from t where c > 10 and c < 100;
  • 14.Key:Byte Array Logical Key Space  A globally ordered map  Can’t use hash partitioning  Use range partitioning  Region 1 -> [a - d]  Region 2 -> [e - h] …  Region n -> [w - z] (-∞, +∞) Sorted Map  Data is stored/replicated/scheduled in regionsMeta:[Start_key, end_key)
  • 15. That’s simple  Logical split  Just Split && Move  Split safely using Raft Region 1 Region 1 Region 2
  • 16.Region 1* Region 2 Region 3 Node A Node B Region 1 Region 2 Region 2 Region 3 Region Reg1ion 3 Node C Node D
  • 17.Node B Region 1* Region 2 Region 3 Node A New Node E Region Reg1io^n 2 Region 2 Region 3 Region Reg1ion 3 Node C Node D
  • 18.Node B Region 1 Region 2 Region 3 Node A New Node E Region Reg1io*n 2 Region 1 Region 2 Region 3 Region Reg1ion 3 Node C Node D
  • 19.Node B Region Reg1io*n 2 Region 2 Region 3 Node A New Node E Region 1 Region 2 Region 3 Region Reg1ion 3 Node C Node D
  • 20.Raft Group RPC Store 1 Region 1 Region 3 Region 5 Region 4 TiKV node 1 Client RPC RPC Store 2 Region 1 Region 2 Region 4 Region 3 Store 3 Region 2 Region 5 Region 3 TiKV node 2 TiKV node 3 RPC Store 4 Region 1 Region 2 Region 5 Region 4 TiKV node 4 Placement Driver PD 1 PD 2 PD 3
  • 21. MVCC  Data layout  key1_version2 -> value  key1_version1 -> value  key2_version3 -> value  Lock-free snapshot reads  Transaction  Inspired by Google Percolator  ‘Almost’ decentralized 2-phase commit
  • 22.● Highly layered Transaction ● Raft for consistency and scalability MVCC ● No distributed file system RaftKV ○ For better performance and lower latency Local KV Storage (RocksDB)
  • 23.
  • 24. Provide the God’s view of the entire cluster  Store the metadata  Clients have cache of placement information. Placemen t Driver  Maintain the replication constraint  3 replicas, by default Raft  Data movement for balancing the workload Placemen t  It’s a cluster too, of course. Driver  Thanks to Raft. Raft Raft Placemen t Driver
  • 25.Node 1 Region A Region B Movement Node 2 Region C HeartBeat with Info Schedulin g Command PD Cluster Info Scheduling Strategy Confi g Admin
  • 26. Replica number in a raft group  Replica geo distribution  Read/Write workload  Leaders and followers  Tables and TiKV instances  Other customized scheduling strategy
  • 27.
  • 28.● SQL is simple and very productive ● We want to write code likethis:SELECT COUNT(*) FROM user WHERE age > 20 and age < 30;
  • 29. Mapping relational model to Key-Value model  Full-featured SQL layer  Cost-based optimizer (CBO)  Distributed execution engine
  • 30. Row Key:'>Key: