Delta Lake: Open Source Reliability for Data Lake with Apache Spark 李潇
2020-03-01
- 1.The Delta Architecture: Delta Lake + Apache Spark Structured Streaming
- 3.• Tech Lead and Engineering Manager at Databricks • Apache Spark Committer and PMC Member • Previously, IBM Master Inventor • Spark, Database Replication, Information Integration • Ph.D. from the University of Florida • GitHub: gatorsmile
- 4.Delta Lake Joins the Linux Foundation!
- 6.Dominique Brezinski (Apple Inc.), Michael Armbrust (Databricks). Spark + AI Summit. Timeline: 2017/06, 2017/10, 2018/06, 2019/04, 2019/10
- 7.Outline (most item text lost in extraction): Delta Lake; Delta Lake & Demo
- 8.Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming
- 9.Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming. Diagram: Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting
- 10.Diagram: Events; Stream; Stream; Table (Data gets written continuously); AI & Reporting
- 11.Diagram: Events; Stream; Batch; Table (Data gets written continuously); Table; Batch; AI & Reporting
- 12.Diagram: Events; Stream; Batch; Table (Data gets written continuously); Table; Batch; AI & Reporting
- 13.Diagram: Events; Stream; Stream; Unified View; Lambda; Batch; Table (Data gets written continuously); Batch
- 14.Diagram: Events; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting
- 15.Diagram: Events; Stream; Stream; Unified View; AI & Reporting; Partition; Batch; Table (Data gets written continuously); Batch
- 16.Diagram: Events; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
- 17.Diagram: Events; Lambda; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
- 18.Diagram: Events; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge. The pipeline is overlaid with rotated problem annotations whose text is not recoverable from the extraction.
- 19.Diagram: Lambda; Events; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
- 20.Diagram: Lambda; Events; Stream; Stream; Unified View; Batch; Table (Data gets written continuously); Batch; AI & Reporting; Update / Merge; Update / Merge
- 21.Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming. Diagram: ?; Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting
- 22.Diagram: ?; Kinesis; CSV, JSON, TXT…; Data Lake; AI & Reporting
- 23.Structured Streaming + Delta
- 24.Delta Lake
- 25.Delta On Disk: my_table/ contains _delta_log/ (Transaction Log) holding 00000.json and 00001.json (Table Versions), and date=2019-01-01/ ((Optional) Partition Directories) holding file-1.parquet (Data Files)
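The layout on this slide can be explored with a few lines of plain Python. The sketch below is only an illustration: it recreates the slide's example tree in a temporary directory and then separates log versions from data files. The `describe` helper is a hypothetical name, and the five-digit file names simply follow the slide rather than any particular Delta Lake release.

```python
import os
import tempfile

# Recreate the slide's example tree under a temporary directory.
root = tempfile.mkdtemp()
table = os.path.join(root, "my_table")
os.makedirs(os.path.join(table, "_delta_log"))
os.makedirs(os.path.join(table, "date=2019-01-01"))
for rel in ["_delta_log/00000.json",
            "_delta_log/00001.json",
            "date=2019-01-01/file-1.parquet"]:
    open(os.path.join(table, *rel.split("/")), "w").close()


def describe(table_dir):
    """Return (sorted table versions, sorted relative data file paths)."""
    log_dir = os.path.join(table_dir, "_delta_log")
    # Each JSON file in the transaction log corresponds to one table version.
    versions = sorted(int(name[:-len(".json")])
                      for name in os.listdir(log_dir)
                      if name.endswith(".json"))
    data_files = []
    for dirpath, dirnames, filenames in os.walk(table_dir):
        # Skip the log directory; everything else may hold data files.
        dirnames[:] = [d for d in dirnames if d != "_delta_log"]
        for name in filenames:
            if name.endswith(".parquet"):
                rel = os.path.relpath(os.path.join(dirpath, name), table_dir)
                data_files.append(rel.replace(os.sep, "/"))
    return versions, sorted(data_files)


versions, data_files = describe(table)
```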
- 26.Table = result of a set of actions. Action Types: • Change Metadata – name, schema, partitioning, etc. • Add File – adds a file (with optional statistics) • Remove File – removes a file. Result: Current Metadata, List of Files, List of Txns, Version
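The idea that a table is the result of replaying actions can be sketched in a few lines. This is a minimal illustration, not the Delta Lake implementation: the action keys (`metaData`, `add`, `remove`, `txn`) are loosely modelled on the action types named on the slide, and the `replay` function, field names, and sample values are assumptions made for the example.

```python
def replay(commits):
    """Fold an ordered list of commits (each a list of actions) into table state."""
    state = {"metadata": None, "files": {}, "txns": {}, "version": -1}
    for version, actions in enumerate(commits):
        for action in actions:
            if "metaData" in action:
                # Change Metadata: the latest metadata replaces the previous one.
                state["metadata"] = action["metaData"]
            elif "add" in action:
                # Add File: record the path with its optional statistics.
                add = action["add"]
                state["files"][add["path"]] = add.get("stats")
            elif "remove" in action:
                # Remove File: the path is no longer part of the table.
                state["files"].pop(action["remove"]["path"], None)
            elif "txn" in action:
                # Transaction marker: remember the last version seen per application.
                txn = action["txn"]
                state["txns"][txn["appId"]] = txn["version"]
        state["version"] = version
    return state


# Hypothetical two-commit log used only for illustration.
commits = [
    [{"metaData": {"name": "my_table", "partitionBy": ["date"]}},
     {"add": {"path": "1.parquet", "stats": {"numRecords": 10}}},
     {"add": {"path": "2.parquet"}}],
    [{"remove": {"path": "1.parquet"}},
     {"add": {"path": "3.parquet"}},
     {"txn": {"appId": "stream-1", "version": 7}}],
]
state = replay(commits)
```

After replaying both commits, `state` holds the four results listed on the slide: the current metadata, the list of files, the list of transactions, and the version.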
- 27.Atomicity. 000000.json: Add 1.parquet, Add 2.parquet. 000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
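One way to picture the atomicity shown here: a commit becomes visible only when its numbered log file appears, and that file appears all at once. The sketch below assumes a local POSIX-style filesystem and uses a temp file plus `os.link` to publish a commit; the `try_commit` and `live_files` helpers and the simplified action format are hypothetical, and real storage systems may need a different mechanism.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Publish `actions` as commit `version`; return False if that version exists."""
    target = os.path.join(log_dir, f"{version:06d}.json")
    # Write the whole commit to a temp file first so readers never see half of it.
    fd, tmp = tempfile.mkstemp(dir=log_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    try:
        os.link(tmp, target)  # fails with FileExistsError if the version is taken
        return True
    except FileExistsError:
        return False
    finally:
        os.remove(tmp)


def live_files(log_dir):
    """Replay every published commit, in order, into the current set of files."""
    files = set()
    names = sorted(n for n in os.listdir(log_dir)
                   if n.endswith(".json") and not n.startswith("."))
    for name in names:
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files


log_dir = tempfile.mkdtemp()
ok0 = try_commit(log_dir, 0, [{"add": "1.parquet"}, {"add": "2.parquet"}])
after_v0 = live_files(log_dir)
ok1 = try_commit(log_dir, 1, [{"remove": "1.parquet"},
                              {"remove": "2.parquet"},
                              {"add": "3.parquet"}])
after_v1 = live_files(log_dir)
# A second writer trying to claim version 1 is rejected.
dup = try_commit(log_dir, 1, [{"add": "4.parquet"}])
```

A reader sees either the two original files or the single replacement file, never a mixture, because the three actions of the second commit become visible together.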
- 28.1. Record start version 2. Record reads/writes 3. Attempt commit, check for conflicts among transactions 4. If someone else wins, check if anything you read has changed. 5. Try again.
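The five steps above can be sketched as a small optimistic commit loop. This is a simplified, single-threaded, in-memory illustration under stated assumptions: `TableLog`, `commit`, and `ConflictError` are hypothetical names, each commit is reduced to the set of file paths it touched, and a conflict means "a later commit touched something this transaction read".

```python
class ConflictError(Exception):
    """Raised when a winning commit changed something this transaction read."""


class TableLog:
    def __init__(self):
        self.commits = []  # commits[v] = set of paths touched by version v

    def latest_version(self):
        return len(self.commits) - 1

    def try_write(self, version, touched):
        # Succeeds only if `version` is exactly the next free slot in the log.
        if version != len(self.commits):
            return False
        self.commits.append(set(touched))
        return True


def commit(log, start_version, reads, writes, max_retries=10):
    checked = start_version  # versions up to here are known not to conflict
    for _ in range(max_retries):
        latest = log.latest_version()
        # Step 4: inspect commits that won since we last looked.
        for v in range(checked + 1, latest + 1):
            if log.commits[v] & reads:
                raise ConflictError(f"version {v} changed data that was read")
        checked = latest
        # Step 3: attempt to claim the next version; step 5: loop if we lose.
        if log.try_write(latest + 1, writes):
            return latest + 1
    raise RuntimeError("too many retries")


log = TableLog()
log.try_write(0, {"1.parquet", "2.parquet"})
start = log.latest_version()  # step 1: every writer below starts at version 0

# Writer B rewrites 1.parquet into 3.parquet and wins version 1.
v_b = commit(log, start, reads={"1.parquet"}, writes={"1.parquet", "3.parquet"})
# Writer A read only 2.parquet, which B did not touch, so it lands as version 2.
v_a = commit(log, start, reads={"2.parquet"}, writes={"4.parquet"})
# Writer C read 1.parquet, which B changed, so its commit is rejected.
try:
    commit(log, start, reads={"1.parquet"}, writes={"5.parquet"})
    conflicted = False
except ConflictError:
    conflicted = True
```

In this single-threaded demo the write always succeeds on the first attempt after the conflict check; the retry loop only matters when several writers race for the same version.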