Driving Data Processing Blog

The data processing algorithm described here is the core of the Driver Scoring function of the Dashphone app. Ultimately it will run on AWS as a Lambda function that fires any time a data set is received from a Dashphone user. When driving data collected using the Dashphone or DriveCheck app is exported to the AWS server, an S3 trigger fires and an instance of this algorithm is launched as a Lambda to process the received data and determine new threshold values for what defines an unsafe driving event. The old threshold values saved to the user's profile are combined with the new values computed by the Lambda to produce an updated set of thresholds, which is returned to the user's profile. The next time the user's Dashphone app exports a data set to the AWS Server, the app pulls down the updated threshold values to store.
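
As a rough sketch, the wiring on the AWS side could look something like the handler below. This is not the production code: the event handling follows the standard S3 notification shape, and process_driving_data is a hypothetical stand-in for the algorithm described in the rest of this post.

```python
import json
import boto3

s3 = boto3.client("s3")

def process_driving_data(raw_text):
    # Placeholder for the algorithm described below; it would return
    # the new threshold values computed from this data set.
    return {}

def lambda_handler(event, context):
    # The S3 upload trigger passes the bucket and key of the new data file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Fetch the exported driving data set for processing.
    raw_text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    new_thresholds = process_driving_data(raw_text)
    return {"statusCode": 200, "body": json.dumps(new_thresholds)}
```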

The Lambda function itself is a Python program that accepts a file of collected driving data: a car's acceleration, speed, course, and other values captured by the DriveCheck or Dashphone app. Before export, the data is processed by the SensorManager function built into the app, which transforms it from the phone's reference frame to the car's reference frame. The received data is then processed to extract the values that define a bad driving event by the driver.
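
The exact export format isn't covered here; assuming a simple CSV layout with one row per sample, reading the file into per-axis lists might look like this (the column names are assumptions):

```python
import csv
import io

def parse_driving_data(raw_text):
    """Parse an exported data file into per-axis acceleration lists.

    Assumes a CSV layout with x/y/z acceleration columns; the real
    Dashphone export format may differ.
    """
    x_accel, y_accel, z_accel = [], [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        x_accel.append(float(row["x_accel"]))
        y_accel.append(float(row["y_accel"]))
        z_accel.append(float(row["z_accel"]))
    return x_accel, y_accel, z_accel
```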

The first step in this process is a pass over the entire file to extract significant statistics: the average, maximum, and minimum acceleration values for both X Acceleration and Y Acceleration across the entire data set.
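
In outline, and continuing the parsing sketch above, this first pass amounts to:

```python
def summarize(values):
    """Return (average, maximum, minimum) for one acceleration axis."""
    return sum(values) / len(values), max(values), min(values)

x_avg, x_max, x_min = summarize(x_accel)
y_avg, y_max, y_min = summarize(y_accel)
```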

The program then uses these collected statistics to normalize the data set and load the data into different data structures. The data is normalized by shifting the average X Acceleration and Y Acceleration to 0, which accounts for inaccuracies in data collection by the phone itself: the program runs through each acceleration value collected and shifts it down by the average acceleration value so that, in the end, the average acceleration for the entire data set is 0.
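
A minimal sketch of that shift:

```python
def normalize(values, average):
    """Shift each sample down by the axis average so the new mean is 0."""
    return [v - average for v in values]

x_norm = normalize(x_accel, x_avg)
y_norm = normalize(y_accel, y_avg)
```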

The normalized acceleration values are added to lists for their type of acceleration (X, Y, or Z) so that each individual value is stored, as well as to a histogram, which is used later to find outliers in the data set. The histogram is a Frequency vs. Acceleration Value histogram represented by a list of size ((Max Acceleration Value) - (Min Acceleration Value)) × (Decimal Accuracy Value) + 1. The Decimal Accuracy Value is a constant used throughout the program that controls how many decimal places the acceleration values are stored to; it only needs to be changed at its declaration to move the entire program to a higher or lower decimal accuracy. A Decimal Accuracy Value of 100 makes the program store all acceleration values to 2 decimal places, a value of 1000 to 3 decimal places, and so on. For example, in a data set with X Accelerations of -10, 0, and 10, the maximum value would be 10.00 and the minimum -10.00. With a Decimal Accuracy Value of 100, a list of size 2001 would be made, since (10 - (-10)) × 100 + 1 = 2001. Each index in this list corresponds to a value at the .01 decimal place, so index 0 corresponds to an acceleration of -10.00, index 1000 to an acceleration of 0.00, and index 2000 to an acceleration of 10.00.
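
A sketch of the sizing, using the constant and formula above:

```python
DECIMAL_ACCURACY = 100  # store acceleration values to 2 decimal places

def make_histogram(v_max, v_min):
    """Allocate a frequency histogram covering [v_min, v_max] in
    steps of 1 / DECIMAL_ACCURACY."""
    size = round((v_max - v_min) * DECIMAL_ACCURACY) + 1
    return [0] * size
```

With the example above, make_histogram(10.0, -10.0) returns a list of 2001 zeros.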

Once this histogram is created, the acceleration values, which have already been normalized, are loaded into it. This is done by subtracting the minimum value in the data set from each normalized value and multiplying the result by the Decimal Accuracy Value; the result becomes the index corresponding to that acceleration value, so the lowest value in the data set maps to index 0. The histogram value at that index is incremented by one to signify that another instance of that specific acceleration value has been found. This is repeated for every value in the data set: each normalized value is appended to the list for its axis (X, Y, or Z) and the count at its corresponding index is incremented, with X Acceleration and Y Acceleration each getting their own histogram. Using the example above, the histogram would be a long list of mostly zeros, except at indices 0, 1000, and 2000 (corresponding to -10, 0, and 10), each of which would hold the value 1, since each of those values occurs once.
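
The index arithmetic, in sketch form, reusing the constant declared above:

```python
def load_histogram(histogram, values, v_min):
    """Map each normalized value to its bucket and count it."""
    for v in values:
        # round() guards against floating-point drift in the product.
        index = round((v - v_min) * DECIMAL_ACCURACY)
        histogram[index] += 1
```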

[Figure: Histograms created using test data]

The program then runs through the histograms to find the outlier values of the data set. Starting at the beginning of the X Acceleration histogram, which corresponds to the minimum value in the data set, the program iterates through the buckets until it reaches the 5th percentile of the data (or any other percentage, since the value can be adjusted easily at the start of the program). For each bucket it encounters with a frequency of 1 or greater, it adds that X Acceleration value to a list of outliers for Hard X Acceleration, as many times as its frequency, and adds the value's corresponding Z Value to an outlier list of Hard Acceleration Z Values. The Z Value for each X Acceleration is easy to find: in the original lists of normalized values, each Z Value is held at the same index as its corresponding X Acceleration, so the program simply reads the Z Value stored at the same index as the X Acceleration currently being examined. This continues until 5% of the data set has been added to the outlier list. The process is then repeated from the end of the X Acceleration histogram to get outliers at the other extreme, and again for the start and end of the Y Acceleration histogram.
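
A sketch of the low-end walk (the high-end walk is symmetric). The bucket-to-position map is one possible way to recover each outlier's Z Value from the shared index; it reuses DECIMAL_ACCURACY from the earlier sketches.

```python
from collections import defaultdict

def collect_low_outliers(histogram, norm_values, z_values, v_min, fraction=0.05):
    """Walk the histogram from the low end until `fraction` of the samples
    have been collected, pairing each outlier with its Z value."""
    # Map each bucket back to the sample positions it holds, so the Z value
    # at the same index in the original lists can be looked up.
    positions = defaultdict(list)
    for i, v in enumerate(norm_values):
        positions[round((v - v_min) * DECIMAL_ACCURACY)].append(i)

    budget = len(norm_values) * fraction
    outlier_accels, outlier_zs = [], []
    for bucket in range(len(histogram)):
        if len(outlier_accels) >= budget:
            break
        for i in positions.get(bucket, ()):
            outlier_accels.append(v_min + bucket / DECIMAL_ACCURACY)
            outlier_zs.append(z_values[i])
    return outlier_accels, outlier_zs
```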

[Figure: Scatter plot of test data; outliers shown in blue]

We also looked for a correlation between X and Z Accelerations, and between Y and X Accelerations, since a car dips when braking abruptly and rises when accelerating abruptly, and in a turn drivers generally brake going in and accelerate coming out. There is some support for this, but much larger data sets would be needed to confirm it.
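
One quick way to test such a relationship is a Pearson correlation coefficient over the collected outlier pairs; the values below are purely illustrative sample data, not real measurements:

```python
import numpy as np

# Illustrative only: a few X Acceleration outliers and the Z Values
# recorded at the same sample indices.
x_outliers = [-3.2, -2.9, -3.5, 2.8, 3.1]
z_outliers = [0.6, 0.5, 0.7, -0.4, -0.5]

# A coefficient near -1 would support "hard braking dips the car,
# hard acceleration raises it".
r = np.corrcoef(x_outliers, z_outliers)[0, 1]
print(f"X-Z correlation: {r:.2f}")
```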

Finally, these outlier values are clustered using K-Means to find centroids for the outlier values. These centroids are the threshold values determined for this data set and, once averaged with the old threshold values, are what is returned to the AWS Server.
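
The post doesn't name a specific K-Means implementation; with scikit-learn, for example, the clustering step might look like the sketch below. The cluster count is an assumption here.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_thresholds(outlier_accels, outlier_zs, n_clusters=2):
    """Cluster the (acceleration, Z) outlier pairs; the centroids serve
    as candidate threshold values for this data set."""
    points = np.column_stack([outlier_accels, outlier_zs])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(points)
    return kmeans.cluster_centers_
```

The returned centroids would then be averaged with the thresholds stored on the user's profile before being written back.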

[Figure: Scatter plots of test data; outliers shown in blue, cluster centroids shown in red]