Large-Scaled Telematics Analytics in Apache Spark
2020-02-27
- 1.Large-Scaled Telematics Analytics in Apache Spark. Wayne Zhang, Uber; Neil Parker, Uber. #DS3SAIS
- 2.Agenda • Telematics introduction • Eng pipeline
- 3.Telematics - Wide availability - Cheap - Short upgrade cycle - Lower quality - Measures phone motion. Source: Smartphone-based Vehicle Telematics: A Ten-Year Anniversary
- 4.Core Pipeline: Sensor Data Collection → Preprocessing & Transformation → Vehicle Movement Inference → Driving Behavior Inference
- 5.Phone Sensor Data • GPS: absolute location, velocity, and time; low frequency (<= 1 point per second) • IMU: relative motion of the phone; Accelerometer: 3D linear acceleration; Gyroscope: 3D angular velocity; high frequency (> 20 points per second)
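The two streams above differ in rate and reference frame: GPS is absolute but slow, IMU is relative but fast. A minimal sketch of the record shapes, with hypothetical field names (the slides do not show Uber's actual schema):

```python
from dataclasses import dataclass

@dataclass
class GpsPoint:
    # Absolute position and speed; arrives at <= 1 point per second
    timestamp_ms: int
    latitude: float
    longitude: float
    speed_mps: float

@dataclass
class ImuSample:
    # Relative phone motion; arrives at > 20 points per second
    timestamp_ms: int
    accel: tuple  # accelerometer: 3D linear acceleration (m/s^2)
    gyro: tuple   # gyroscope: 3D angular velocity (rad/s)
```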
- 6.GPS Map-Matching
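Map-matching snaps noisy GPS fixes onto the road network. Production matchers work over a road graph with probabilistic models; the geometric core, projecting one point onto one candidate road segment, can be sketched as follows (planar coordinates assumed for brevity, which is not how a real matcher handles latitude/longitude):

```python
def snap_to_segment(p, a, b):
    """Project point p onto segment (a, b); all arguments are (x, y) pairs.

    Returns the closest point on the segment. This is only the snapping
    step; choosing *which* segment to snap to is the hard part of
    map-matching and is not shown here.
    """
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Parameter t of the orthogonal projection, clamped to the segment
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return (ax + t * dx, ay + t * dy)
```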
- 7.Phone Re-Orientation
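Re-orientation rotates IMU readings out of the phone's arbitrary frame into a frame aligned with gravity, so acceleration components are comparable across trips regardless of how the phone sits. A sketch using Rodrigues' rotation formula, assuming the gravity direction has already been estimated (e.g. by averaging the accelerometer while the vehicle is stationary); the slides do not show Uber's actual method:

```python
import math

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reorient(v, gravity):
    """Rotate 3-vector v by the rotation that maps `gravity` onto +z.

    Uses Rodrigues' formula: v' = v + sin(t)(k x v) + (1-cos(t))(k x (k x v)),
    where k is the unit rotation axis and t the rotation angle.
    """
    norm = math.sqrt(_dot(gravity, gravity))
    g = tuple(c / norm for c in gravity)
    z = (0.0, 0.0, 1.0)
    axis = _cross(g, z)
    s = math.sqrt(_dot(axis, axis))  # sin of rotation angle
    c = _dot(g, z)                   # cos of rotation angle
    if s < 1e-9:
        # Gravity already (anti-)parallel to z: identity or a half-turn
        return v if c > 0 else (v[0], -v[1], -v[2])
    k = tuple(a / s for a in axis)   # unit rotation axis
    kv = _cross(k, v)
    kkv = _cross(k, kv)
    return tuple(v[i] + kv[i] * s + kkv[i] * (1 - c) for i in range(3))
```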
- 8.Long Stop Detection
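One simple way to detect long stops is to scan the GPS trace for spans where speed stays near zero for longer than a threshold. The thresholds below are illustrative, not values from the talk:

```python
def long_stops(points, speed_eps=0.5, min_duration_s=120):
    """Find long stops in a GPS trace.

    points: list of (timestamp_s, speed_mps), sorted by time.
    Returns (start_ts, end_ts) spans where speed stays below speed_eps
    for at least min_duration_s seconds.
    """
    stops, start, end = [], None, None
    for ts, speed in points:
        if speed < speed_eps:
            if start is None:
                start = ts
            end = ts
        else:
            if start is not None and end - start >= min_duration_s:
                stops.append((start, end))
            start = None
    if start is not None and end - start >= min_duration_s:
        stops.append((start, end))  # trace ends while stopped
    return stops
```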
- 9.Vehicle Movement
- 10.Phone Mounting
- 11.Engineering • Pipeline: past and present • Problems. How big is our data? x PBs per year; x TBs of sensor data per day; millions of trips per day
- 12.Data Pipeline - Past (Streaming): Input: Schema Topic (Kafka) → Transform: Samza Job → Output: Schema Topic (Kafka) • Realtime
- 13.Data Pipeline - Present (Batch): select logic in SparkSQL. Input: Hive Table (HDFS) → Transform: Spark Job → Output: Hive Table (HDFS) • Flexible
- 14.Data Pipeline - Actuality (λ): streaming path: Schema Topic (Kafka) → Samza Job (Business Logic, JVM) → Schema Topic (Kafka); batch path: Hive Table (HDFS) → Spark Job → Hive Table (HDFS)
- 15.Data Pipeline - Present: select logic in SparkSQL. Hive Table (HDFS) → Spark Job → Hive Table (HDFS)
- 16.Data Pipeline: join logic in SparkSQL. Two input Hive Tables (HDFS) → Spark Job → output Hive Table (HDFS)
- 17.Data Pipeline: two input Hive Tables (HDFS) → Spark Job → two output Hive Tables (HDFS)
- 18.Data Pipeline: Scheduler (every 24 hrs) triggers the Spark Job; inputs and outputs are Hive Tables (HDFS)
- 19.Data Pipeline: Hive Table (HDFS) → Spark Job → Hive Table (HDFS)
- 20.Data Pipeline: chained jobs, where one job's output is the next job's input: Hive Table (HDFS) → Spark Job → Hive Table (HDFS) → Spark Job → Hive Table (HDFS)
- 21.Data Pipeline - Actuality
- 22.Eng Problems • Data sources • OOM errors • Too many files for the Namenode
- 23.Eng Problems - Data Sources: the join logic in SparkSQL doesn't work when one input is Thrift binaries on S3 instead of a Hive Table (HDFS)
- 24.Eng Problems - Data Sources: Input: Thrift Binaries (S3) → Transform: Spark Job → Output: Hive Table (HDFS). See github.com/airbnb/airbnb-spark-thrift
- 25.Aside: Encode/Decode Invariant: Java Thrift class instance ↔ Spark SQL Row
- 26.Aside: Encode/Decode Invariant: generate random data to test the round trip (ScalaCheck library)
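The invariant above says that encoding a Thrift instance to a Spark SQL Row and decoding it back must return the original; the talk checks this with randomly generated data via ScalaCheck. The same property-test idea in stdlib Python, with a toy schema and codec standing in for the real Thrift/Row converters:

```python
import random
import string

# Toy schema standing in for a Thrift struct: (field name, field type)
SCHEMA = [("trip_id", str), ("speed_mps", float), ("n_points", int)]

def encode(record):
    """dict -> flat row, standing in for Thrift -> Spark SQL Row."""
    return tuple(record[name] for name, _ in SCHEMA)

def decode(row):
    """flat row -> dict, standing in for Spark SQL Row -> Thrift."""
    return {name: value for (name, _), value in zip(SCHEMA, row)}

def random_record(rng):
    """Generate one random record, mimicking a ScalaCheck generator."""
    gen = {str: lambda: "".join(rng.choices(string.ascii_letters, k=8)),
           float: lambda: rng.uniform(-1e6, 1e6),
           int: lambda: rng.randint(-10**9, 10**9)}
    return {name: gen[typ]() for name, typ in SCHEMA}

def check_roundtrip(trials=1000, seed=0):
    """Property: decode(encode(r)) == r for randomly generated r."""
    rng = random.Random(seed)
    for _ in range(trials):
        record = random_record(rng)
        assert decode(encode(record)) == record
    return True
```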
- 27.Eng Problems - Data Sources: the join logic in SparkSQL works once the Thrift Binaries (S3) are first converted by a Spark Job into a Hive Table (HDFS), which is then joined with the other input Hive Table (HDFS)
- 28.Eng Problems - OOM Errors • If hitting OOM-related issues, increasing the number of partitions usually helps: `spark.sql.shuffle.partitions=X`, where X > 200 (the default) • Might also tune executor-cores and memory settings
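For example, the settings might be raised like this; the values and the jar name are placeholders, and the right numbers depend on data volume and cluster size:

```shell
# Placeholder job and values: tune per workload, not copy-paste
spark-submit \
  --conf spark.sql.shuffle.partitions=2000 \
  --executor-cores 4 \
  --executor-memory 8g \
  telematics-job.jar
```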
- 29.Eng Problems - Namenode File Count • If we have more partitions, we create more files • Solution: merge files after the job runs: github.com/apache/parquet-mr/tree/master/parquet-tools
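With parquet-tools, the post-job compaction might look like the following (file names and jar name are placeholders). Note that the `merge` command concatenates row groups rather than rewriting them, so it reduces file count but not row-group size:

```shell
# Placeholder file names; "merge" takes input files followed by the output
hadoop jar parquet-tools.jar merge \
  part-00000.parquet part-00001.parquet \
  merged.parquet
```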
- 30.Looking Towards the Future
- 31.Thank You! (And thanks to everyone @ Uber who helped us.) Email us if you have any questions: actuaryzhang@uber.com, nwparker@uber.com