Large-Scale Telematics Analytics in Apache Spark

2020-02-27

  • 1. Large-Scale Telematics Analytics in Apache Spark. Wayne Zhang, Uber; Neil Parker, Uber. #DS3SAIS
  • 2. Agenda: telematics introduction; engineering pipeline.
  • 3. Telematics via smartphone: wide availability, cheap, short upgrade cycle, but lower quality, and it measures phone motion. Source: Smartphone-Based Vehicle Telematics: A Ten-Year Anniversary.
  • 4. Core Pipeline: Sensor Data Collection → Preprocessing & Transformation → Vehicle Movement Inference → Driving Behavior Inference.
  • 5. Phone Sensor Data. GPS: absolute location, velocity, and time; low frequency (<= 1 point per second). IMU: relative motion of the phone; accelerometer gives 3D linear acceleration, gyroscope gives 3D angular velocity; high frequency (> 20 points per second).
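The frequency mismatch above (≤ 1 Hz GPS vs. > 20 Hz IMU) means the two streams have to be aligned before joint processing. A minimal sketch of one way to do this, with hypothetical sample formats (timestamps in seconds, one scalar IMU value per sample); the talk does not specify Uber's actual alignment scheme:

```python
# Sketch: align high-frequency IMU samples to low-frequency GPS fixes
# by averaging IMU values within a window around each GPS timestamp.
# Sample formats and the window size are illustrative assumptions.

def align_imu_to_gps(gps_times, imu_samples, window=0.5):
    """For each GPS timestamp, average the IMU values that fall
    within +/- window seconds of it (None if no sample is close)."""
    aligned = []
    for t in gps_times:
        vals = [v for (ts, v) in imu_samples if abs(ts - t) <= window]
        aligned.append(sum(vals) / len(vals) if vals else None)
    return aligned

gps_times = [0.0, 1.0, 2.0]
imu = [(0.1, 1.0), (0.2, 3.0), (1.05, 2.0), (2.4, 8.0), (2.5, 4.0)]
print(align_imu_to_gps(gps_times, imu))  # [2.0, 2.0, 6.0]
```

A production pipeline would more likely resample or interpolate per window in Spark, but the per-fix grouping idea is the same.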
  • 6. GPS Map-Matching
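Map-matching snaps noisy GPS fixes onto the road network. A toy sketch of the geometric core, nearest-segment projection, with hypothetical planar coordinates; real map-matchers (e.g. HMM-based ones) also use heading and path continuity, and the talk does not describe Uber's algorithm:

```python
# Sketch: naive map-matching by snapping a GPS point to the nearest
# road segment. Road geometry and coordinates are hypothetical.
import math

def _project(p, a, b):
    """Closest point to p on segment a-b (planar approximation)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return a
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))          # clamp to the segment
    return (ax + t * dx, ay + t * dy)

def snap_to_road(point, segments):
    """Return (segment_index, snapped_point) minimizing distance."""
    return min(
        ((i, _project(point, a, b)) for i, (a, b) in enumerate(segments)),
        key=lambda iq: math.dist(point, iq[1]),
    )

roads = [((0, 0), (10, 0)), ((0, 5), (10, 5))]
print(snap_to_road((3, 1), roads))  # (0, (3.0, 0.0))
```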
  • 7. Phone Re-Orientation
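Re-orientation rotates IMU readings from the phone's arbitrary frame toward a gravity-aligned frame. A 2-D sketch of the idea, correcting only pitch from a rest-time gravity estimate; the values and the simplification to one axis pair are assumptions, not the talk's method:

```python
# Sketch: re-orient accelerometer readings so that gravity points along
# +z. 2-D (x, z) version; a full solution estimates a 3-D rotation.
import math

def pitch_from_gravity(rest_samples):
    """Estimate pitch from (x, z) accelerometer samples taken at rest;
    a flat phone would read roughly (0, 9.81)."""
    gx = sum(s[0] for s in rest_samples) / len(rest_samples)
    gz = sum(s[1] for s in rest_samples) / len(rest_samples)
    return math.atan2(gx, gz)

def reorient(sample, pitch):
    """Rotate an (x, z) sample by -pitch so gravity aligns with +z."""
    x, z = sample
    c, s = math.cos(pitch), math.sin(pitch)
    return (x * c - z * s, x * s + z * c)

# Phone tilted 30 degrees: gravity leaks into the x axis at rest.
g, ang = 9.81, math.radians(30)
rest = [(g * math.sin(ang), g * math.cos(ang))] * 5
p = pitch_from_gravity(rest)
x, z = reorient(rest[0], p)
print(round(x, 6), round(z, 6))  # ~0.0 and ~9.81 after re-orientation
```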
  • 8. Long Stop Detection
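A long stop can be defined as a maximal run of low-speed GPS points lasting beyond some duration. A sketch under that assumption; the thresholds and sample format are illustrative, not values from the talk:

```python
# Sketch: flag "long stops" in a GPS trace. A stop is a maximal run of
# points with speed below speed_thresh; it is "long" if it spans at
# least min_duration seconds. All thresholds are illustrative.

def long_stops(points, speed_thresh=0.5, min_duration=60.0):
    """points: list of (timestamp_sec, speed_mps).
    Returns (start, end) timestamp pairs for each long stop."""
    stops, run_start, prev_t = [], None, None
    for t, speed in points:
        if speed < speed_thresh:
            if run_start is None:
                run_start = t
            prev_t = t
        else:
            if run_start is not None and prev_t - run_start >= min_duration:
                stops.append((run_start, prev_t))
            run_start = None
    if run_start is not None and prev_t - run_start >= min_duration:
        stops.append((run_start, prev_t))
    return stops

trace = [(0, 10.0), (30, 0.1), (60, 0.0), (120, 0.2), (150, 12.0)]
print(long_stops(trace))  # [(30, 120)]
```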
  • 9. Vehicle Movement
  • 10. Phone Mounting
  • 11. Engineering: pipeline (past and present) and problems. How big is our data? x PBs per year, x TBs of sensor data per day, millions of trips per day.
  • 12. Data Pipeline, Past (Streaming): Schema Topic (Kafka) → Samza Job → Schema Topic (Kafka). Real-time.
  • 13. Data Pipeline, Present (Batch): Hive Table (HDFS) → Spark Job (select logic in SparkSQL) → Hive Table (HDFS). Flexible.
  • 14. Data Pipeline, Actuality (λ): both paths run side by side with business logic on the JVM: Schema Topic (Kafka) → Samza Job → Schema Topic (Kafka), and Hive Table (HDFS) → Spark Job → Hive Table (HDFS).
  • 15. Data Pipeline, Present: Hive Table (HDFS) → Spark Job (select logic in SparkSQL) → Hive Table (HDFS).
  • 16. Data Pipeline: two input Hive Tables (HDFS) → Spark Job (join logic in SparkSQL) → output Hive Table (HDFS).
  • 17. Data Pipeline: multiple input and output Hive Tables (HDFS) around the Spark Job.
  • 18. Data Pipeline: a scheduler triggers the Spark Job every 24 hours, reading and writing Hive Tables (HDFS).
  • 19. Data Pipeline: Hive Table (HDFS) → Spark Job → Hive Table (HDFS).
  • 20. Data Pipeline: jobs chained, with one Spark Job's output Hive Table (HDFS) feeding the next Spark Job's input.
  • 21. Data Pipeline, Actuality.
  • 22. Eng Problems: data sources; OOM errors; too many files for the NameNode.
  • 23. Eng Problems, Data Sources: the join logic in SparkSQL doesn't work when one input is Thrift binaries on S3 instead of a Hive Table (HDFS).
  • 24. Eng Problems, Data Sources: Thrift Binaries (S3) → Spark Job → Hive Table (HDFS), using github.com/airbnb/airbnb-spark-thrift.
  • 25. Aside, Encode/Decode Invariant: converting between a Java Thrift class instance and a Spark SQL Row must round-trip.
  • 26. Aside, Encode/Decode Invariant: generate random data to test the round trip (ScalaCheck library).
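The invariant above says that encoding a record to a Spark SQL-style row and decoding it back must be the identity. The talk checks this with ScalaCheck over Thrift classes; as a language-neutral sketch, here Python's `random` module stands in for ScalaCheck, and the record type and codec are hypothetical:

```python
# Sketch of the encode/decode roundtrip property. The record schema,
# encode, and decode functions are stand-ins for the Thrift class and
# Spark SQL Row conversion from the slides.
import random

FIELDS = ("trip_id", "speed", "ok")

def encode(record):   # dict -> tuple (stand-in for a Row)
    return tuple(record[f] for f in FIELDS)

def decode(row):      # tuple -> dict
    return dict(zip(FIELDS, row))

def random_record(rng):
    return {
        "trip_id": rng.randrange(10**6),
        "speed": rng.uniform(0, 50),
        "ok": rng.choice([True, False]),
    }

rng = random.Random(42)
for _ in range(1000):
    rec = random_record(rng)
    assert decode(encode(rec)) == rec   # the roundtrip invariant
print("roundtrip invariant holds for 1000 random records")
```

Property-based testing like this catches schema-mapping bugs (dropped fields, type coercions) that a handful of hand-written fixtures would miss.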
  • 27. Eng Problems, Data Sources: the join works once the Thrift Binaries (S3) are first converted by a Spark Job into a Hive Table (HDFS), which is then joined in SparkSQL with the other Hive Table.
  • 28. Eng Problems, OOM Errors: when hitting OOM-related issues, increasing the number of shuffle partitions usually helps: set `spark.sql.shuffle.partitions=X` with X > 200 (the default). You might also tune the executor-cores and memory settings.
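For reference, these settings could live in a `spark-defaults.conf` (or be passed via `--conf` to `spark-submit`). The property names are real Spark configuration keys; the values are illustrative placeholders, since the slide only says X > 200:

```
# Illustrative values only; tune for your workload.
spark.sql.shuffle.partitions   2000    # default is 200
spark.executor.cores           4
spark.executor.memory          8g
```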
  • 29. Eng Problems, Too Many NameNode Files: more partitions create more output files, which burden the NameNode. Solution: merge files after the job runs, e.g. with parquet-tools (github.com/apache/parquet-mr/tree/master/parquet-tools).
  • 30. Looking Towards the Future
  • 31. Thank You! (And thanks to everyone at Uber who helped us.) Email us with any questions: actuaryzhang@uber.com, nwparker@uber.com.