Guo Yuepeng (郭跃鹏), Senior Staff Engineer at eBay - Apache Griffin: A Data Quality Solution for Distributed Systems

2020-02-27

  • 1. Apache Griffin: Data Quality Solution for both streaming and batch. Guo Yuepeng (郭跃鹏), Senior Staff Engineer, Data Services Department, eBay. guoyp@apache.org
  • 2. Agenda • About us • Apache Griffin • Demo • What is coming • How to contribute • Q/A
  • 3. eBay Marketplace at a Glance: One of the world’s largest and most vibrant marketplaces
  • 4. Velocity Stats
  • 5. Mobile Velocity Stats
  • 6. Big Data @
  • 7. Apache Griffin • Platform approach • Common data quality dimensions • Extensible, pluggable, scalable • Trusted datasets
  • 8. A story - problem: One day, the personalization team found a large decrease in data quality metrics
  • 9. A story - analysis: For a large decrease in metrics, candidates include: • The reports are broken after the streaming transfer due to minor changes in fields • We are missing data from the pipeline • Our data queue is not working. P: We didn’t change anything. P: Hey Streaming team, could you check this issue from your side? S: Let’s check our metrics. S: Oh… ok, we need to add a new metric for this? S: uh,… it seems data are already lost, let me check with our upstream
  • 10. A story – analysis continued. S: Hi Mobile, can we temporarily switch/restore to the old version? M: What is your logic for data quality on your side? Show us your SQL… M: uh, … Select * from … a left outer join b on … where … and … … M: The mobile app will never send rq in version 4.1.5. S: Right! That’s the root cause.
  • 11. A story – analysis continued • Each isolated system looks good from its own perspective. • Communication is always hard across teams. • We took 1 week to find the root cause.
  • 12. A story - conclusion • No unified view of data quality across multiple systems and teams • No platform approach to manage data quality • No systematic way to measure near real-time data quality
  • 13. Apache Griffin • Data Quality Platform built on Hadoop and Spark ➢ Batch data ➢ Real-time data • A unified process to detect DQ issues ➢ Inaccurate ➢ Incomplete ➢ Invalid ➢ …… • An open source solution: https://github.com/apache/incubator-griffin
  • 14. Griffin Goal: A solution with all of the capabilities below
      Capability                   | Commercial DQ software | Open source DQ software | Apache Griffin
      Support eBay’s scale         | x                      | x                       | √
      Data Quality measurement     | √                      | x                       | √
      Support real-time data       | x                      | x                       | √
      Support unstructured data    | x                      | x                       | √
      Service based API            | √                      | x                       | √
      Data Profiling               | √                      | √                       | √
      Pluggable measurement types  | x                      | x                       | √
  • 15. What is Data Quality?
  • 16. Virtuous Cycle of Data Quality • Define the scope, dimensions, goals, thresholds, etc. • Measure data quality values • Analyze data quality results • Improve data quality
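
The define, measure, and analyze steps of this cycle can be illustrated with a minimal Python sketch. The `Goal` type, the `measure_non_null` helper, the `uid` field, and the 0.95 threshold are all hypothetical names and values chosen for illustration; none of them comes from Apache Griffin.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


@dataclass
class Goal:
    """Define: one data quality dimension plus a pass threshold (hypothetical type)."""
    dimension: str
    threshold: float
    measure: Callable[[List[Record]], float]


def measure_non_null(field: str) -> Callable[[List[Record]], float]:
    """Build a measure: the fraction of records whose `field` is present and non-empty."""
    def _measure(records: List[Record]) -> float:
        if not records:
            return 1.0  # assumption: an empty dataset has nothing to violate
        ok = sum(1 for r in records if r.get(field))
        return ok / len(records)
    return _measure


def analyze(goal: Goal, records: List[Record]) -> dict:
    """Measure the value, then compare it with the goal's threshold."""
    value = goal.measure(records)
    return {
        "dimension": goal.dimension,
        "value": value,
        "passed": value >= goal.threshold,
    }


# Illustrative data: one of four records is missing its uid.
goal = Goal("non_null(uid)", 0.95, measure_non_null("uid"))
records = [{"uid": "u1"}, {"uid": "u2"}, {"uid": None}, {"uid": "u4"}]
report = analyze(goal, records)
```

Here `report["value"]` is 0.75, which is below the threshold, so the goal fails. The "improve" step is then a matter of fixing the upstream data and measuring again.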
  • 17. Apache Griffin Architecture
  • 18. Apache Griffin – Tech Stack
  • 19. Apache Griffin – Measure insights • Uniform Data Quality DSL • DSL supports both streaming and batch • Configurable data source connectors • Accuracy DSL example, where: “$source.uid = $target.uid and $source.itemid = $target.itemid and $source.tmp > $target.tmp”
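
To make the accuracy rule concrete, here is a minimal pure-Python sketch of one way such a rule could be evaluated: a source record counts as matched if at least one target record satisfies all three conditions. This is an illustration under assumptions, not Griffin's Spark-based implementation. The `accuracy` function, the matched/total ratio, and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, int]


def accuracy(source: List[Record], target: List[Record]) -> Tuple[int, int, float]:
    """Return (matched, total, matched / total) for the rule
    source.uid = target.uid and source.itemid = target.itemid
    and source.tmp > target.tmp."""
    # Index target `tmp` values by the equality part of the rule.
    by_key = defaultdict(list)
    for t in target:
        by_key[(t["uid"], t["itemid"])].append(t["tmp"])

    matched = 0
    for s in source:
        candidates = by_key.get((s["uid"], s["itemid"]), [])
        # Inequality part of the rule: some target row must have a smaller tmp.
        if any(s["tmp"] > tmp for tmp in candidates):
            matched += 1

    total = len(source)
    # Assumption: an empty source is treated as fully accurate.
    return matched, total, (matched / total if total else 1.0)


source = [
    {"uid": 1, "itemid": 10, "tmp": 105},
    {"uid": 1, "itemid": 11, "tmp": 106},
    {"uid": 2, "itemid": 10, "tmp": 107},
    {"uid": 3, "itemid": 12, "tmp": 108},
]
target = [
    {"uid": 1, "itemid": 10, "tmp": 100},
    {"uid": 1, "itemid": 11, "tmp": 110},  # same key, but tmp is not smaller
    {"uid": 2, "itemid": 10, "tmp": 101},
]
result = accuracy(source, target)
```

With this sample data, two of the four source records find a qualifying target record, so `result` is `(2, 4, 0.5)`.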
  • 20. Apache Griffin – Accuracy Measure: ~300M customer view-item events per day
  • 21. Apache Griffin – Validity Measure
  • 22. Apache Griffin – Time Series Metrics. Elasticsearch: • Offers aggregations • Visualization (Kibana, Grafana) • RESTful API to integrate with
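
As a rough sketch of how time series metrics and aggregations fit together, the Python below builds a metric document and an Elasticsearch `date_histogram` aggregation request body as plain dictionaries. The document layout, the field names (`name`, `tmst`, `value.matched`), and the measure name are assumptions made for illustration and are not taken from the slides. `fixed_interval` is the parameter name in recent Elasticsearch versions; older versions use `interval`.

```python
import json
from datetime import datetime, timezone


def metric_document(name: str, tmst: datetime, total: int, matched: int) -> dict:
    """One data quality measurement, keyed by measure name and timestamp (hypothetical layout)."""
    return {
        "name": name,
        "tmst": int(tmst.timestamp() * 1000),  # epoch milliseconds
        "value": {"total": total, "matched": matched},
    }


def hourly_average_query(name: str) -> dict:
    """Search body: average matched count per hour for one measure.
    Assumes `name` is mapped as a keyword field and `tmst` as a date field."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"term": {"name": name}},
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "tmst", "fixed_interval": "1h"},
                "aggs": {"avg_matched": {"avg": {"field": "value.matched"}}},
            }
        },
    }


doc = metric_document("accuracy_demo", datetime(2020, 2, 27, tzinfo=timezone.utc), 100, 97)
query = hourly_average_query("accuracy_demo")
body = json.dumps(query)  # JSON body that would be sent to the index's _search endpoint
```

Because both structures are plain JSON, the same documents can be charted in Kibana or Grafana without extra glue code.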
  • 23. Apache Griffin: Life is easier after Griffin …
  • 24. Apache Griffin – Tech Challenges • Unified model for both streaming and batch • Stable • Easy adaptation • Scalable, extensible algorithms
  • 25. Demo
  • 26. Use Cases: Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
  • 27. What is coming • DSL to support more dimensions ✓ Completeness ✓ Consistency ✓ Anomaly detection • Provide more data source connectors ✓ Raw Hadoop data ✓ Hybrid data source connectors
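
To illustrate what an anomaly detection dimension might look like, the sketch below flags a metric value that deviates sharply from its recent history, such as the sudden drop in the story earlier in the deck. The rolling mean and standard deviation approach, the `find_anomalies` name, the window size, and the sample values are all assumptions for illustration. The slides do not say which technique Griffin plans to use.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(values: List[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Return the indices of values lying more than k standard deviations
    from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Floor sigma so a perfectly flat history does not flag tiny wobbles.
        if abs(values[i] - mu) > k * max(sigma, 1e-3):
            anomalies.append(i)
    return anomalies


# Illustrative daily accuracy values with one sharp drop at index 6.
daily_accuracy = [0.98, 0.97, 0.98, 0.99, 0.98, 0.97, 0.71, 0.98]
anomalies = find_anomalies(daily_accuracy)
```

For this series, only index 6 (the 0.71 reading) is flagged. The following day is not flagged, because the drop itself inflates the standard deviation of that day's window.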
  • 28. How to Contribute • Community over code • Meritocracy
  • 29. How to Contribute: We are open source and PRs are welcome • GitHub: https://github.com/apache/incubator-griffin • Website: https://griffin.incubator.apache.org • Contact: mailto://subscribe-dev@griffin.incubator.apache.org • Apache Griffin JIRA: https://issues.apache.org/jira/browse/GRIFFIN • Apache Griffin Wiki: https://cwiki.apache.org/confluence/display/GRIFFIN/Griffin
  • 30. Q / A