Delta Lake: Open Source Reliability for Data Lake with Apache Spark, 李潇 (Xiao Li)

2020-03-01

  • 1. The Delta Architecture: Delta Lake + Apache Spark Structured Streaming
  • 2.
  • 3. • Tech Lead and Engineering Manager at Databricks • Apache Spark Committer and PMC Member • Previously, IBM Master Inventor • Spark, Database Replication, Information Integration • Ph.D. from the University of Florida • GitHub: gatorsmile
  • 4. Delta Lake Joins the Linux Foundation!
  • 5.
  • 6. Dominique Brezinski (Apple Inc.), Michael Armbrust (Databricks), Spark + AI Summit. Timeline: 2017/06, 2017/10, 2018/06, 2019/04, 2019/10
  • 7. Agenda (items truncated in the source): Delta Lake …; Delta …; Delta …; Delta …; Delta Lake & Demo
  • 8. Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch or streaming
  • 9. Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch or streaming. Diagram labels: Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting
  • 10. Diagram labels: Events; Stream; Stream; Table (data gets written continuously); AI & Reporting
  • 11. Diagram labels: Events; Stream; Batch; Table (data gets written continuously); Table; Batch; AI & Reporting
  • 12. Diagram labels: Events; Stream; Batch; Table (data gets written continuously); Table; Batch; AI & Reporting
  • 13. Lambda. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch
  • 14. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting
  • 15. Diagram labels: Events; Stream; Stream; Unified View; AI & Reporting; Partition; Batch; Table (data gets written continuously); Batch
  • 16. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
  • 17. Lambda. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
  • 18. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge. The diagram is overlaid with rotated problem annotations whose text is garbled; legible fragments include "Different field types", "Concatenate sm…" and "CRIT…"
  • 19. Lambda. Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
  • 20. Lambda! Diagram labels: Events; Stream; Stream; Unified View; Batch; Table (data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
  • 21. Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch or streaming. Diagram labels: Kinesis; CSV, JSON, TXT…; ?; Data Lake; AI & Reporting; [keng] [die]
  • 22. Diagram labels: Kinesis; CSV, JSON, TXT…; ?; Data Lake; AI & Reporting; size labels 1M, 2M, 3M, 4M, 5M
  • 23. Structured Streaming + Delta
  • 24. Delta Lake
  • 25. Delta On Disk
        my_table/
          _delta_log/              Transaction Log
            00000.json             Table Versions
            00001.json
          date=2019-01-01/         (Optional) Partition Directories
            file-1.parquet         Data Files
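The layout on slide 25 can be explored with nothing more than the standard library. The sketch below is only an illustration, not Delta Lake's own reader: it assumes commit files are named as a zero-padded version number plus `.json`, as the slide shows, and the helper names `list_versions`, `list_data_files` and `build_example_table` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: commit files are named <zero-padded version>.json, as on the slide.
_COMMIT_NAME = re.compile(r"^(\d+)\.json$")


def list_versions(table_dir):
    """Return the sorted version numbers found under <table_dir>/_delta_log/."""
    log_dir = Path(table_dir) / "_delta_log"
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_NAME.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def list_data_files(table_dir):
    """Return Parquet data files relative to the table root, skipping the log."""
    root = Path(table_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*.parquet")
        if "_delta_log" not in p.relative_to(root).parts
    )


def build_example_table(parent_dir):
    """Recreate the slide's layout (with empty files) under parent_dir."""
    table = Path(parent_dir) / "my_table"
    (table / "_delta_log").mkdir(parents=True)
    (table / "_delta_log" / "00000.json").write_text("")
    (table / "_delta_log" / "00001.json").write_text("")
    (table / "date=2019-01-01").mkdir()
    (table / "date=2019-01-01" / "file-1.parquet").write_bytes(b"")
    return table


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = build_example_table(tmp)
        print(list_versions(table))    # [0, 1]
        print(list_data_files(table))  # ['date=2019-01-01/file-1.parquet']
```

Each JSON file in `_delta_log/` corresponds to one table version, while the partition directories hold only data files.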
  • 26. Table = result of a set of actions. Action types: • Change Metadata – name, schema, partitioning, etc. • Add File – adds a file (with optional statistics) • Remove File – removes a file. Result: current metadata, list of files, list of transactions, version
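The idea that a table is "the result of a set of actions" can be sketched as a fold over the ordered commits. This is a simplified model under stated assumptions: the dictionary shape of each action (`"type"`, `"path"`, `"metadata"`) and the names `TableState` and `replay` are invented for illustration, and the "list of transactions" part of the result is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TableState:
    version: int = -1                              # -1 means "no commits yet"
    metadata: dict = field(default_factory=dict)   # current metadata
    files: set = field(default_factory=set)        # current list of files


def replay(commits):
    """Fold an ordered list of commits (each a list of actions) into a TableState."""
    state = TableState()
    for version, actions in enumerate(commits):
        for action in actions:
            kind = action["type"]
            if kind == "metadata":
                # Change Metadata: the latest metadata replaces the previous one.
                state.metadata = dict(action["metadata"])
            elif kind == "add":
                # Add File: the file becomes part of the table.
                state.files.add(action["path"])
            elif kind == "remove":
                # Remove File: the file is no longer part of the table.
                state.files.discard(action["path"])
            else:
                raise ValueError(f"unknown action type: {kind!r}")
        state.version = version
    return state


if __name__ == "__main__":
    commits = [
        [
            {"type": "metadata", "metadata": {"name": "events"}},
            {"type": "add", "path": "1.parquet"},
            {"type": "add", "path": "2.parquet"},
        ],
        [
            {"type": "remove", "path": "1.parquet"},
            {"type": "remove", "path": "2.parquet"},
            {"type": "add", "path": "3.parquet"},
        ],
    ]
    state = replay(commits)
    print(state.version, sorted(state.files))  # 1 ['3.parquet']
```

Replaying only the first commit gives version 0 with `1.parquet` and `2.parquet`, so every earlier table version can be derived from a prefix of the log.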
  • 27. Atomicity. 000000.json: Add 1.parquet, Add 2.parquet. 000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
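One way to make a commit all-or-nothing on a local filesystem is to let only one writer create the file for a given version. The sketch below assumes that approach; it is not taken from the slides, and `try_commit` is a hypothetical helper. It relies on Python's exclusive-creation mode `"x"`, which raises `FileExistsError` when the file already exists. The action dictionaries are placeholders.

```python
import json
import tempfile
from pathlib import Path


def try_commit(log_dir, version, actions):
    """Try to publish `actions` as commit `version`. Return True if this writer won."""
    path = Path(log_dir) / f"{version:06d}.json"
    try:
        # Mode "x" refuses to open a file that already exists,
        # so at most one writer can create a given version.
        with open(path, "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
    except FileExistsError:
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        first = try_commit(log_dir, 0, [{"add": "1.parquet"}, {"add": "2.parquet"}])
        second = try_commit(log_dir, 0, [{"add": "9.parquet"}])
        print(first, second)  # True False
```

The losing writer leaves the existing commit untouched. A production log store would also have to hide partially written files from readers, which this sketch does not attempt.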
  • 28. 1. Record start version. 2. Record reads/writes. 3. Attempt commit, check for conflicts among transactions. 4. If someone else wins, check if anything you read has changed. 5. Try again.
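The five steps on slide 28 read like an optimistic concurrency loop, and they can be sketched against a toy in-memory log. Everything here is an assumption made for illustration: `InMemoryLog`, `ConflictError` and `commit` are hypothetical names, each commit is reduced to the set of file paths it touched, and a conflict is defined simply as a later commit touching a file the transaction read.

```python
class ConflictError(Exception):
    """Raised when a concurrent commit invalidates what the transaction read."""


class InMemoryLog:
    """A toy transaction log: commits[i] is the set of files touched by version i."""

    def __init__(self):
        self.commits = []

    def latest_version(self):
        return len(self.commits) - 1

    def try_append(self, version, touched):
        """Append only if `version` is the next free slot (stand-in for an atomic write)."""
        if version != len(self.commits):
            return False
        self.commits.append(frozenset(touched))
        return True


def commit(log, start_version, reads, writes, max_attempts=10):
    """Commit `writes` for a transaction that began at `start_version` and read `reads`.

    Returns the version number that was written.
    """
    reads = set(reads)               # step 2: the recorded reads
    checked = start_version          # step 1: the recorded start version
    for _ in range(max_attempts):
        latest = log.latest_version()
        # Step 4: inspect every commit that landed since we last looked.
        for version in range(checked + 1, latest + 1):
            if log.commits[version] & reads:
                raise ConflictError(f"version {version} changed files this transaction read")
        checked = latest
        # Step 3: attempt to claim the next version.
        if log.try_append(latest + 1, writes):
            return latest + 1
        # Step 5: someone else won the race, so loop and try again.
    raise ConflictError("gave up after too many attempts")


if __name__ == "__main__":
    log = InMemoryLog()
    log.try_append(0, {"1.parquet", "2.parquet"})
    log.try_append(1, {"4.parquet"})  # a concurrent writer that touches unrelated files
    print(commit(log, 0, reads={"1.parquet"}, writes={"3.parquet"}))  # 2
```

In this model a transaction that read `1.parquet` still commits after an unrelated concurrent write, but raises `ConflictError` if a later commit touched `1.parquet`.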