Large-Scale Telematics Analytics in Apache Spark

2020-02-27

  • 1. Large-Scale Telematics Analytics in Apache Spark. Wayne Zhang, Uber; Neil Parker, Uber. #DS3SAIS
  • 2. Agenda: telematics introduction; engineering pipeline.
  • 3. Telematics via smartphone: wide availability, cheap, short upgrade cycle, but lower quality, and it measures phone motion. Source: Smartphone-Based Vehicle Telematics: A Ten-Year Anniversary.
  • 4. Core Pipeline: Sensor Data Collection → Preprocessing & Transformation → Vehicle Movement Inference → Driving Behavior Inference.
  • 5. Phone Sensor Data. GPS: absolute location, velocity, and time; low frequency (<= 1 point per second). IMU: relative motion of the phone; accelerometer gives 3D linear acceleration, gyroscope gives 3D angular velocity; high frequency (> 20 points per second).
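The frequency mismatch above (≤ 1 Hz GPS vs. > 20 Hz IMU) means the two streams have to be aligned before joint processing. A minimal sketch of one way to do this, with hypothetical sample formats (timestamps in seconds, one scalar IMU value per sample); the talk does not specify Uber's actual alignment scheme:

```python
# Sketch: align high-frequency IMU samples to low-frequency GPS fixes
# by averaging IMU values within a window around each GPS timestamp.
# Sample formats and the window size are illustrative assumptions.

def align_imu_to_gps(gps_times, imu_samples, window=0.5):
    """For each GPS timestamp, average the IMU values that fall
    within +/- window seconds of it (None if no sample is close)."""
    aligned = []
    for t in gps_times:
        vals = [v for (ts, v) in imu_samples if abs(ts - t) <= window]
        aligned.append(sum(vals) / len(vals) if vals else None)
    return aligned

gps_times = [0.0, 1.0, 2.0]
imu = [(0.1, 1.0), (0.2, 3.0), (1.05, 2.0), (2.4, 8.0), (2.5, 4.0)]
print(align_imu_to_gps(gps_times, imu))  # [2.0, 2.0, 6.0]
```

A production pipeline would more likely resample or interpolate per window in Spark, but the per-fix grouping idea is the same.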
  • 6. GPS Map-Matching
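Map-matching snaps noisy GPS fixes onto the road network. A toy sketch of the geometric core, nearest-segment projection, with hypothetical planar coordinates; real map-matchers (e.g. HMM-based ones) also use heading and path continuity, and the talk does not describe Uber's algorithm:

```python
# Sketch: naive map-matching by snapping a GPS point to the nearest
# road segment. Road geometry and coordinates are hypothetical.
import math

def _project(p, a, b):
    """Closest point to p on segment a-b (planar approximation)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return a
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))          # clamp to the segment
    return (ax + t * dx, ay + t * dy)

def snap_to_road(point, segments):
    """Return (segment_index, snapped_point) minimizing distance."""
    return min(
        ((i, _project(point, a, b)) for i, (a, b) in enumerate(segments)),
        key=lambda iq: math.dist(point, iq[1]),
    )

roads = [((0, 0), (10, 0)), ((0, 5), (10, 5))]
print(snap_to_road((3, 1), roads))  # (0, (3.0, 0.0))
```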
  • 7. Phone Re-Orientation
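Re-orientation rotates IMU readings from the phone's arbitrary frame toward a gravity-aligned frame. A 2-D sketch of the idea, correcting only pitch from a rest-time gravity estimate; the values and the simplification to one axis pair are assumptions, not the talk's method:

```python
# Sketch: re-orient accelerometer readings so that gravity points along
# +z. 2-D (x, z) version; a full solution estimates a 3-D rotation.
import math

def pitch_from_gravity(rest_samples):
    """Estimate pitch from (x, z) accelerometer samples taken at rest;
    a flat phone would read roughly (0, 9.81)."""
    gx = sum(s[0] for s in rest_samples) / len(rest_samples)
    gz = sum(s[1] for s in rest_samples) / len(rest_samples)
    return math.atan2(gx, gz)

def reorient(sample, pitch):
    """Rotate an (x, z) sample by -pitch so gravity aligns with +z."""
    x, z = sample
    c, s = math.cos(pitch), math.sin(pitch)
    return (x * c - z * s, x * s + z * c)

# Phone tilted 30 degrees: gravity leaks into the x axis at rest.
g, ang = 9.81, math.radians(30)
rest = [(g * math.sin(ang), g * math.cos(ang))] * 5
p = pitch_from_gravity(rest)
x, z = reorient(rest[0], p)
print(round(x, 6), round(z, 6))  # ~0.0 and ~9.81 after re-orientation
```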
  • 8. Long Stop Detection
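A long stop can be defined as a maximal run of low-speed GPS points lasting beyond some duration. A sketch under that assumption; the thresholds and sample format are illustrative, not values from the talk:

```python
# Sketch: flag "long stops" in a GPS trace. A stop is a maximal run of
# points with speed below speed_thresh; it is "long" if it spans at
# least min_duration seconds. All thresholds are illustrative.

def long_stops(points, speed_thresh=0.5, min_duration=60.0):
    """points: list of (timestamp_sec, speed_mps).
    Returns (start, end) timestamp pairs for each long stop."""
    stops, run_start, prev_t = [], None, None
    for t, speed in points:
        if speed < speed_thresh:
            if run_start is None:
                run_start = t
            prev_t = t
        else:
            if run_start is not None and prev_t - run_start >= min_duration:
                stops.append((run_start, prev_t))
            run_start = None
    if run_start is not None and prev_t - run_start >= min_duration:
        stops.append((run_start, prev_t))
    return stops

trace = [(0, 10.0), (30, 0.1), (60, 0.0), (120, 0.2), (150, 12.0)]
print(long_stops(trace))  # [(30, 120)]
```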
  • 9. Vehicle Movement
  • 10. Phone Mounting
  • 11. Engineering: pipeline (past and present) and problems. How big is our data? x PBs per year, x TBs of sensor data per day, millions of trips per day.
  • 12. Data Pipeline, Past (Streaming): Schema Topic (Kafka) → Samza Job → Schema Topic (Kafka). Real-time.
  • 13. Data Pipeline, Present (Batch): Hive Table (HDFS) → Spark Job (select logic in SparkSQL) → Hive Table (HDFS). Flexible.
  • 14. Data Pipeline, Actuality (λ): both paths run side by side with business logic on the JVM: Schema Topic (Kafka) → Samza Job → Schema Topic (Kafka), and Hive Table (HDFS) → Spark Job → Hive Table (HDFS).
  • 15. Data Pipeline, Present: Hive Table (HDFS) → Spark Job (select logic in SparkSQL) → Hive Table (HDFS).
  • 16. Data Pipeline: two input Hive Tables (HDFS) → Spark Job (join logic in SparkSQL) → output Hive Table (HDFS).
  • 17. Data Pipeline: multiple input and output Hive Tables (HDFS) around the Spark Job.
  • 18. Data Pipeline: a scheduler triggers the Spark Job every 24 hours, reading and writing Hive Tables (HDFS).
  • 19. Data Pipeline: Hive Table (HDFS) → Spark Job → Hive Table (HDFS).
  • 20. Data Pipeline: jobs chained, with one Spark Job's output Hive Table (HDFS) feeding the next Spark Job's input.
  • 21. Data Pipeline, Actuality.
  • 22. Eng Problems: data sources; OOM errors; too many files for the NameNode.
  • 23. Eng Problems, Data Sources: the join logic in SparkSQL doesn't work when one input is Thrift binaries on S3 instead of a Hive Table (HDFS).
  • 24. Eng Problems, Data Sources: Thrift Binaries (S3) → Spark Job → Hive Table (HDFS), using github.com/airbnb/airbnb-spark-thrift.
  • 25. Aside, Encode/Decode Invariant: converting between a Java Thrift class instance and a Spark SQL Row must round-trip.
  • 26. Aside, Encode/Decode Invariant: generate random data to test the round trip (ScalaCheck library).
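The invariant above says that encoding a record to a Spark SQL-style row and decoding it back must be the identity. The talk checks this with ScalaCheck over Thrift classes; as a language-neutral sketch, here Python's `random` module stands in for ScalaCheck, and the record type and codec are hypothetical:

```python
# Sketch of the encode/decode roundtrip property. The record schema,
# encode, and decode functions are stand-ins for the Thrift class and
# Spark SQL Row conversion from the slides.
import random

FIELDS = ("trip_id", "speed", "ok")

def encode(record):   # dict -> tuple (stand-in for a Row)
    return tuple(record[f] for f in FIELDS)

def decode(row):      # tuple -> dict
    return dict(zip(FIELDS, row))

def random_record(rng):
    return {
        "trip_id": rng.randrange(10**6),
        "speed": rng.uniform(0, 50),
        "ok": rng.choice([True, False]),
    }

rng = random.Random(42)
for _ in range(1000):
    rec = random_record(rng)
    assert decode(encode(rec)) == rec   # the roundtrip invariant
print("roundtrip invariant holds for 1000 random records")
```

Property-based testing like this catches schema-mapping bugs (dropped fields, type coercions) that a handful of hand-written fixtures would miss.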
  • 27. Eng Problems, Data Sources: the join works once the Thrift Binaries (S3) are first converted by a Spark Job into a Hive Table (HDFS), which is then joined in SparkSQL with the other Hive Table.
  • 28. Eng Problems, OOM Errors: when hitting OOM-related issues, increasing the number of shuffle partitions usually helps: set `spark.sql.shuffle.partitions=X` with X > 200 (the default). You might also tune the executor-cores and memory settings.
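For reference, these settings could live in a `spark-defaults.conf` (or be passed via `--conf` to `spark-submit`). The property names are real Spark configuration keys; the values are illustrative placeholders, since the slide only says X > 200:

```
# Illustrative values only; tune for your workload.
spark.sql.shuffle.partitions   2000    # default is 200
spark.executor.cores           4
spark.executor.memory          8g
```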
  • 29. Eng Problems, Too Many NameNode Files: more partitions create more output files, which burden the NameNode. Solution: merge files after the job runs, e.g. with parquet-tools (github.com/apache/parquet-mr/tree/master/parquet-tools).
  • 30. Looking Towards the Future
  • 31. Thank You! (And thanks to everyone at Uber who helped us.) Email us with any questions: actuaryzhang@uber.com, nwparker@uber.com.