
Detecting Warranty Claim Anomalies - Under-fitting

One of the projects I undertook was aimed at developing a robust model to accurately detect anomalies in warranty claims. The primary goal was to identify unusual patterns that might indicate fraud or systematic issues, ensuring the integrity and efficiency of the warranty claims process. The challenge was that this project was an on-demand ad hoc report that the PA department needed as soon as possible. This put a time crunch on my workflow, and I needed to quickly create a model suitable for their needs.

** Many parts of this project, including the dataset, have been altered for privacy and security reasons. The concepts, techniques, and analysis were kept the same.


Data Extraction

The first step was to extract a comprehensive dataset from Hyundai's warranty claims Oracle database. Using PostgreSQL (the query was originally written in WinSQL, which uses MySQL syntax), I wrote a complex SQL query to gather detailed claim information along with aggregated statistics, ensuring the data was rich enough for feature engineering and model training.
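The production query can't be reproduced here, but the shape of the extraction — per-claim detail joined to per-part aggregate statistics — can be sketched with an in-memory SQLite table. The table and column names below are illustrative stand-ins, not the real schema:

```python
import sqlite3

# Illustrative schema -- the real warranty table and column names differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE warranty_claims (
        claim_id    INTEGER PRIMARY KEY,
        part_code   TEXT,
        claim_cost  REAL,
        repair_time REAL
    )
""")
conn.executemany(
    "INSERT INTO warranty_claims (part_code, claim_cost, repair_time) VALUES (?, ?, ?)",
    [("A100", 250.0, 1.5), ("A100", 1200.0, 4.0), ("B200", 90.0, 0.5)],
)

# Each claim row enriched with per-part aggregates -- the kind of
# combined detail + statistics the original query produced.
rows = conn.execute("""
    SELECT c.claim_id,
           c.part_code,
           c.claim_cost,
           c.repair_time,
           s.avg_cost,
           s.claim_count
    FROM warranty_claims c
    JOIN (
        SELECT part_code,
               AVG(claim_cost) AS avg_cost,
               COUNT(*)        AS claim_count
        FROM warranty_claims
        GROUP BY part_code
    ) s ON s.part_code = c.part_code
    ORDER BY c.claim_id
""").fetchall()
```

Joining against a grouped subquery like this keeps one row per claim while attaching the aggregate context the model later uses as features.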


Initial Approach and Challenges

Goal: Create a reliable model to detect anomalies in warranty claims data.

I initially created a model using linear regression with minimal features:

  • Features Used: claim_cost, repair_time

  • Model Type: Linear Regression

  • Dataset Size: 1000 samples (reduced for quicker experimentation, since this was an on-demand ad hoc report)
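The original script isn't shown here, but a minimal sketch of that first pass looks like the following. The synthetic data, and in particular the nonlinear relationship between the features and the target, are assumptions for illustration — they stand in for the reduced 1,000-row sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for the two features used in the first pass.
claim_cost = rng.lognormal(mean=6.0, sigma=0.8, size=n)
repair_time = rng.uniform(0.5, 8.0, size=n)

# Assumed target: a nonlinear failure-rate signal that a straight line under-fits.
failure_rate = 0.02 * np.log(claim_cost) * np.sqrt(repair_time) + rng.normal(0, 0.01, n)

X = np.column_stack([claim_cost, repair_time])
X_train, X_test, y_train, y_test = train_test_split(X, failure_rate, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"MSE: {mse:.5f}")
```

Because the true signal is nonlinear in the raw features, a plain linear fit on just these two columns leaves systematic error — the under-fitting described below.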


After running the script, the outcome showed that:

  • Mean Squared Error (MSE): 0.00347

  • The Problem: The model under-fitted the data, failing to capture the underlying complexity and variability in the warranty claims.

  • The scatter plot of the initial model showed significant deviation from the perfect prediction line, indicating that the model was not accurately predicting the parts failure rate.


Improved Approach with Feature Engineering and Hyperparameter Tuning

To address the underfitting issue, I implemented a more sophisticated approach using a RandomForestRegressor with feature engineering and hyperparameter tuning.


Updated Model: RandomForestRegressor with Hyperparameter Tuning


Features Used: A comprehensive set of features, including polynomial and interaction terms:

  • claim_cost, repair_time

  • Polynomial features: claim_cost^2, repair_time^2, log_claim_cost^2, cost_per_repair_time^2, etc.

  • Interaction terms: claim_cost × log_claim_cost, log_claim_cost × repair_time, etc.
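These derived columns can be built directly in pandas. The snippet below is a sketch using assumed column names and a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in data; the real frame came from the SQL extraction.
df = pd.DataFrame({
    "claim_cost":  [250.0, 1200.0, 90.0],
    "repair_time": [1.5, 4.0, 0.5],
})

# Derived base features.
df["log_claim_cost"] = np.log(df["claim_cost"])
df["cost_per_repair_time"] = df["claim_cost"] / df["repair_time"]

# Polynomial (squared) terms.
for col in ["claim_cost", "repair_time", "log_claim_cost", "cost_per_repair_time"]:
    df[f"{col}^2"] = df[col] ** 2

# Interaction terms.
df["claim_cost*log_claim_cost"] = df["claim_cost"] * df["log_claim_cost"]
df["log_claim_cost*repair_time"] = df["log_claim_cost"] * df["repair_time"]
```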


Dataset Size: Augmented to 5000 samples for better variability and complexity.
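One way to implement that augmentation — duplicating sampled records and jittering the numeric columns slightly — is sketched below. The helper name and noise scale are illustrative, not the project's exact values:

```python
import numpy as np
import pandas as pd

def augment(df: pd.DataFrame, target_size: int,
            noise_scale: float = 0.01, seed: int = 0) -> pd.DataFrame:
    """Duplicate rows (sampled with replacement) and add slight
    multiplicative noise to numeric columns for variability."""
    rng = np.random.default_rng(seed)
    extra = df.sample(n=target_size - len(df), replace=True,
                      random_state=seed).reset_index(drop=True)
    numeric = extra.select_dtypes("number").columns
    noise = 1 + rng.normal(0, noise_scale, (len(extra), len(numeric)))
    extra[numeric] = extra[numeric].to_numpy() * noise
    return pd.concat([df, extra], ignore_index=True)

# Grow a 1,000-row frame to 5,000 rows.
claims = pd.DataFrame({"claim_cost": np.arange(1.0, 1001.0),
                       "repair_time": np.linspace(0.5, 8.0, 1000)})
augmented = augment(claims, target_size=5000)
```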


Hyperparameters Tuned:

  • n_estimators: Number of trees in the forest

  • max_depth: Maximum depth of the trees

  • min_samples_split: Minimum number of samples required to split an internal node

  • min_samples_leaf: Minimum number of samples required to be at a leaf node
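A reduced version of that grid search might look like the following. The grid values and synthetic data are illustrative, not the ones used in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a nonlinear signal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.log1p(X[:, 0]) * np.sqrt(X[:, 1]) + rng.normal(0, 0.1, 300)

# A small illustrative grid over the four hyperparameters listed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV fits every combination in the grid (here 2×2×2×2 = 16) across each cross-validation fold and keeps the combination with the best mean score.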


Techniques and Best Practices Used:

  • Data Preprocessing: Filled missing values, handled categorical data, and normalized numerical features.

  • Feature Engineering: Created new features based on domain knowledge, including polynomial and interaction terms.

  • Data Augmentation: Increased dataset size by duplicating records and adding slight noise to introduce variability.

  • Model Selection: Used a tree-based model (RandomForestRegressor) suited to capturing complex patterns.

  • Hyperparameter Tuning: Employed GridSearchCV to find the best hyperparameters for the model.

  • Evaluation Metrics: Used Mean Squared Error (MSE) to evaluate model performance.


New Outcome:

  • Mean Squared Error (MSE): 0.00027

  • The updated model showed a substantial improvement in capturing the variability in the data, with the fitted regression line closely aligned with the perfect prediction line.

 

The enhanced model successfully detected anomalies in warranty claims, providing a more accurate and reliable tool for the warranty department.

For a deeper understanding of the model and the hyperparameter tuning process, here are the key mathematical concepts and calculations behind the RandomForest model:


1. Mean Squared Error (MSE)

The MSE is a common metric used to evaluate regression models. It measures the average squared difference between the actual and predicted values:

MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where:

  • n is the number of observations.

  • yᵢ is the actual value.

  • ŷᵢ is the predicted value.
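The formula is easy to verify by hand against scikit-learn's implementation (the values below are made up for the check):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.10, 0.25, 0.40])
y_pred = np.array([0.12, 0.20, 0.45])

# MSE = (1/n) * sum((y_i - yhat_i)^2)
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
```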


2. Random Forest Regressor

Random Forest is an ensemble learning method that combines multiple decision trees to improve the model's performance and reduce overfitting. The main calculations involve averaging predictions from multiple trees.
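That averaging can be checked directly in scikit-learn: a fitted forest's prediction is the mean of its individual trees' predictions. The data below is synthetic, purely to demonstrate the mechanism:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

forest_pred = forest.predict(X[:5])
# Average the per-tree predictions; each tree is in forest.estimators_.
tree_mean = np.mean([tree.predict(X[:5]) for tree in forest.estimators_], axis=0)
```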


3. Hyperparameter Tuning

Hyperparameter tuning involves searching for the optimal hyperparameters that minimize the MSE. The GridSearchCV method performs an exhaustive search over a specified parameter grid.


4. Feature Engineering Calculations


Feature engineering involved creating new features such as log_claim_cost and cost_per_repair_time:

  • log_claim_cost = log(claim_cost)

  • cost_per_repair_time = claim_cost / repair_time

This project demonstrated the importance of iterative model development, starting with a simple model and progressively improving it through feature engineering, data augmentation, and hyperparameter tuning. By employing best practices in machine learning, I was able to enhance the anomaly detection model, achieving a significant reduction in MSE and improving the overall accuracy of the predictions.


The implementation of this advanced anomaly detection model had a significant beneficial impact on Hyundai's PA (Product Assurance) department in several ways:


  • Improved Fraud Detection: By accurately identifying unusual patterns and anomalies in warranty claims, the model helped in early detection of potential fraud. This allowed the PA department to take proactive measures, reducing financial losses and maintaining the integrity of the warranty claims process.


  • Enhanced Efficiency: The model automated the process of detecting anomalies, significantly reducing the time and effort required for manual reviews. This allowed the PA department to allocate resources more effectively and focus on higher-value tasks.


  • Data-Driven Insights: The feature engineering and hyperparameter tuning provided deeper insights into the factors influencing warranty claims. These insights enabled the PA department to make more informed decisions regarding warranty policies, claim approvals, and part recalls.


  • Increased Accuracy: The improved model accuracy, reflected in the reduced Mean Squared Error (MSE) from 0.00347 to 0.00027, ensured that the predictions were more reliable. This accuracy minimized false positives and false negatives, leading to more precise anomaly detection.
