Guo Yuepeng (郭跃鹏), Senior Principal Engineer at eBay - Apache Griffin: A Data Quality Solution for Distributed Systems
2020-02-27
- 1.Apache Griffin Data Quality Solution for both streaming and batch Guo Yuepeng (郭跃鹏), Senior Principal Engineer, Data Services, eBay guoyp@apache.org
- 2.Agenda • About us • Apache Griffin • Demo • What is coming • How to contribute • Q/A
- 3.eBay Marketplace at a Glance One of the world’s largest and most vibrant marketplaces
- 4.Velocity Stats
- 5.Mobile Velocity Stats
- 6.Big Data @
- 7.Apache Griffin • Platform approach • Common data quality dimensions • Extensible, pluggable, scalable • Trusted datasets
- 8.A story – problem One day, the personalization team found a large decrease in data quality metrics
- 9.A story – analysis For a large decrease in metrics, candidate causes include: • The reports are broken after streaming transfer due to minor changes in fields • We are missing data from the pipeline • Our data queue is not working P: We didn’t change anything. P: Hey Streaming team, could you check this issue from your side? S: Let’s check our metrics. S: Oh… ok, we need to add a new metric for this? S: Uh… it seems data is already lost, let me check with our upstream
- 10.A story – analysis continued S: Hi Mobile, can we temporarily switch/restore to the old version? M: What is your logic for data quality on your side? Show us your SQL… S: Uh… Select * from … a left outer join b on … where … and … … M: The mobile app will never send rq in version 4.1.5 S: Right! That’s the root cause.
- 11.A story – analysis continued • Each isolated system looks good from its own perspective. • Communication is always hard across teams. • It took us 1 week to find the root cause.
- 12.A story – conclusion • No unified view of data quality across multiple systems and teams • No platform approach to manage data quality • No systematic way to measure near real-time data quality
- 13.Apache Griffin • Data Quality Platform built on Hadoop and Spark ➢ Batch data ➢ Real-time data • A unified process to detect DQ issues ➢ Inaccurate ➢ Incomplete ➢ Invalid ➢ …… • An open source solution: https://github.com/apache/incubator-griffin
- 14.Griffin Goal A solution with all the below capabilities

  | Capability | Commercial DQ software | Open source DQ software | Apache Griffin |
  | --- | --- | --- | --- |
  | Support eBay’s scale | x | x | √ |
  | Data Quality measurement | √ | x | √ |
  | Support real-time data | x | x | √ |
  | Support unstructured data | x | x | √ |
  | Service based API | √ | x | √ |
  | Data Profiling | √ | √ | √ |
  | Pluggable measurement types | x | x | √ |

- 15.What is Data Quality?
- 16.Virtuous Cycle of Data Quality • Define the scope, dimensions, goals, thresholds, etc. • Measure data quality values • Analyze data quality results • Improve data quality
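
The cycle on this slide can be illustrated with a minimal sketch: define a goal with a threshold, take a measured value, and analyze the result to decide where improvement is needed. The `Goal` class, the `analyze` function, the dimension names, and all numbers are hypothetical and chosen only for illustration; they are not taken from Griffin.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """Define step: one data quality goal (hypothetical structure)."""
    dimension: str    # e.g. "accuracy"
    threshold: float  # minimum acceptable value in [0, 1]


def analyze(goal: Goal, measured: float) -> str:
    """Analyze step: compare a measured value against the goal's threshold."""
    return "ok" if measured >= goal.threshold else "needs improvement"


# Define: scope, dimensions, and thresholds (illustrative values).
goals = [Goal("accuracy", 0.95), Goal("completeness", 0.99)]

# Measure: in practice these values would come from a measurement job.
measured = {"accuracy": 0.97, "completeness": 0.90}

# Analyze: the output tells the team where to focus the improve step.
report = {g.dimension: analyze(g, measured[g.dimension]) for g in goals}
print(report)
```

Keeping thresholds next to the dimension they apply to makes the define step explicit, so a later measurement can be judged without asking the owning team what "good" means.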
- 17.Apache Griffin Architecture
- 18.Apache Griffin – Tech Stack
- 19.Apache Griffin – Measure insights • Uniform Data Quality DSL • DSL supports both streaming and batch • Configurable data source connectors Accuracy DSL example: Where: “$source.uid = $target.uid and $source.itemid = $target.itemid and $source.tmp > $target.tmp”
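
One way to read the accuracy rule on this slide is as a match predicate between source and target records, with accuracy being the fraction of source records that find a match. The sketch below is plain Python, not Griffin's implementation or API; the `accuracy` and `slide_rule` names, the sample records, and the treatment of an empty source are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def accuracy(source: List[Record], target: List[Record],
             rule: Callable[[Record, Record], bool]) -> float:
    """Fraction of source records with at least one matching target record."""
    if not source:
        # Assumption: an empty source is treated as fully accurate.
        return 1.0
    matched = sum(1 for s in source if any(rule(s, t) for t in target))
    return matched / len(source)


def slide_rule(s: Record, t: Record) -> bool:
    """The rule from the slide, written as a Python predicate."""
    return (s["uid"] == t["uid"]
            and s["itemid"] == t["itemid"]
            and s["tmp"] > t["tmp"])


# Hypothetical sample data.
source = [
    {"uid": 1, "itemid": "a", "tmp": 10},
    {"uid": 2, "itemid": "b", "tmp": 10},
    {"uid": 3, "itemid": "c", "tmp": 10},
    {"uid": 4, "itemid": "d", "tmp": 10},
]
target = [
    {"uid": 1, "itemid": "a", "tmp": 5},
    {"uid": 2, "itemid": "b", "tmp": 7},
    {"uid": 3, "itemid": "x", "tmp": 5},  # itemid differs, so no match
]

result = accuracy(source, target, slide_rule)
print(result)
```

The nested scan is quadratic and only suitable for small samples; at the volumes the deck describes, the same rule would be expressed as a distributed join.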
- 20.Apache Griffin – Accuracy Measure ~300M customer view item events per day
- 21.Apache Griffin – Validity Measure
- 22.Apache Griffin – Time Series Metrics Elasticsearch: • Offers aggregations • Visualization (Kibana, Grafana) • RESTful API to integrate with
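
As an illustration of the aggregation point, the sketch below builds the body of an Elasticsearch search request that averages a metric per day. The helper name `daily_average_query` and the field names `tmst` and `value.matched_ratio` are hypothetical, not taken from the slides. The `calendar_interval` parameter assumes Elasticsearch 7.2 or later; older versions use `interval` instead.

```python
import json


def daily_average_query(metric_field: str, time_field: str, days: int) -> dict:
    """Build a search body: daily average of one metric over recent days."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {"range": {time_field: {"gte": f"now-{days}d/d"}}},
        "aggs": {
            "per_day": {
                "date_histogram": {
                    "field": time_field,
                    "calendar_interval": "day",
                },
                "aggs": {"avg_value": {"avg": {"field": metric_field}}},
            }
        },
    }


# Hypothetical field names; adjust to the actual metric index mapping.
body = daily_average_query("value.matched_ratio", "tmst", 7)
payload = json.dumps(body)
print(payload)
```

The resulting JSON can be sent to the `_search` endpoint of a metrics index over REST, and the same date histogram is the kind of query that dashboards such as Kibana or Grafana issue.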
- 23.Apache Griffin Life is easier after Griffin …
- 24.Apache Griffin – Tech Challenges • Unified model for both streaming and batch • Stable • Easy adaptation • Scalable, extensible algorithms
- 25.Demo
- 26.Use Cases Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
- 27.What is coming • DSL to support more dimensions ✓ Completeness ✓ Consistency ✓ Anomaly detection • Provide more data source connectors ✓ Raw Hadoop data ✓ Hybrid data source connectors
- 28.How to Contribute • Community over code • Meritocracy
- 29.How to Contribute We are open source and PRs are welcome GitHub: https://github.com/apache/incubator-griffin Website: https://griffin.incubator.apache.org Contact: mailto://subscribe-dev@griffin.incubator.apache.org Apache Griffin JIRA: https://issues.apache.org/jira/browse/GRIFFIN Apache Griffin Wiki: https://cwiki.apache.org/confluence/display/GRIFFIN/Griffin
- 30.Q / A