
Detecting Warranty Claim Anomalies - Under-fitting

One of the projects I undertook was aimed at developing a robust model to accurately detect anomalies in warranty claims. The primary goal was to identify unusual patterns that might indicate fraud or systematic issues, ensuring the integrity and efficiency of the warranty claims process. The challenge was that this project was an on-demand ad hoc report that the PA department needed as soon as possible. This put a time crunch on my workflow, and I needed to quickly create a model suitable for their needs.

** Many parts of this project, including the dataset, have been altered for privacy and security reasons. The concepts, techniques, and analysis were kept the same.


Data Extraction

The first step was to extract a comprehensive dataset from Hyundai's warranty claims Oracle database. Using PostgreSQL (the query was originally written in WinSQL, which uses MySQL syntax), I wrote a complex SQL query to gather detailed claim information along with aggregated statistics, ensuring the data was rich enough for feature engineering and model training.
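The production query can't be reproduced here, but the shape of the extraction — per-claim detail joined to per-part aggregate statistics — can be sketched with an in-memory SQLite table. The table and column names below are illustrative stand-ins, not the real schema:

```python
import sqlite3

# Illustrative schema -- the real warranty table and column names differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE warranty_claims (
        claim_id    INTEGER PRIMARY KEY,
        part_code   TEXT,
        claim_cost  REAL,
        repair_time REAL
    )
""")
conn.executemany(
    "INSERT INTO warranty_claims (part_code, claim_cost, repair_time) VALUES (?, ?, ?)",
    [("A100", 250.0, 1.5), ("A100", 1200.0, 4.0), ("B200", 90.0, 0.5)],
)

# Each claim row enriched with per-part aggregates -- the kind of
# combined detail + statistics the original query produced.
rows = conn.execute("""
    SELECT c.claim_id,
           c.part_code,
           c.claim_cost,
           c.repair_time,
           s.avg_cost,
           s.claim_count
    FROM warranty_claims c
    JOIN (
        SELECT part_code,
               AVG(claim_cost) AS avg_cost,
               COUNT(*)        AS claim_count
        FROM warranty_claims
        GROUP BY part_code
    ) s ON s.part_code = c.part_code
    ORDER BY c.claim_id
""").fetchall()
```

Joining against a grouped subquery like this keeps one row per claim while attaching the aggregate context the model later uses as features.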


Initial Approach and Challenges

Goal: Create a reliable model to detect anomalies in warranty claims data.

I initially created a model using linear regression with minimal features:

  • Features Used: claim_cost, repair_time

  • Model Type: Linear Regression

  • Dataset Size: 1000 samples (reduced for quicker experimentation, since this was an on-demand ad hoc report)
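The original script isn't shown here, but a minimal sketch of that first pass looks like the following. The synthetic data, and in particular the nonlinear relationship between the features and the target, are assumptions for illustration — they stand in for the reduced 1,000-row sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for the two features used in the first pass.
claim_cost = rng.lognormal(mean=6.0, sigma=0.8, size=n)
repair_time = rng.uniform(0.5, 8.0, size=n)

# Assumed target: a nonlinear failure-rate signal that a straight line under-fits.
failure_rate = 0.02 * np.log(claim_cost) * np.sqrt(repair_time) + rng.normal(0, 0.01, n)

X = np.column_stack([claim_cost, repair_time])
X_train, X_test, y_train, y_test = train_test_split(X, failure_rate, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"MSE: {mse:.5f}")
```

Because the true signal is nonlinear in the raw features, a plain linear fit on just these two columns leaves systematic error — the under-fitting described below.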


After running the script, the outcome showed that:

  • Mean Squared Error (MSE): 0.00347

  • The Problem: The model under-fitted the data, failing to capture the underlying complexity and variability in the warranty claims.

  • The scatter plot of the initial model showed significant deviation from the perfect prediction line, indicating that the model was not accurately predicting the parts failure rate.


Improved Approach with Feature Engineering and Hyperparameter Tuning

To address the underfitting issue, I implemented a more sophisticated approach using a RandomForestRegressor with feature engineering and hyperparameter tuning.


Updated Model: RandomForestRegressor with Hyperparameter Tuning


Features Used: A comprehensive set of features, including polynomial and interaction terms:

  • claim_cost, repair_time

  • Polynomial features: claim_cost^2, repair_time^2, log_claim_cost^2, cost_per_repair_time^2, etc.

  • Interaction terms: claim_cost × log_claim_cost, log_claim_cost × repair_time, etc.
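These derived columns can be built directly in pandas. The snippet below is a sketch using assumed column names and a small stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in data; the real frame came from the SQL extraction.
df = pd.DataFrame({
    "claim_cost":  [250.0, 1200.0, 90.0],
    "repair_time": [1.5, 4.0, 0.5],
})

# Derived base features.
df["log_claim_cost"] = np.log(df["claim_cost"])
df["cost_per_repair_time"] = df["claim_cost"] / df["repair_time"]

# Polynomial (squared) terms.
for col in ["claim_cost", "repair_time", "log_claim_cost", "cost_per_repair_time"]:
    df[f"{col}^2"] = df[col] ** 2

# Interaction terms.
df["claim_cost*log_claim_cost"] = df["claim_cost"] * df["log_claim_cost"]
df["log_claim_cost*repair_time"] = df["log_claim_cost"] * df["repair_time"]
```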


Dataset Size: Augmented to 5000 samples for better variability and complexity.
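One way to implement that augmentation — duplicating sampled records and jittering the numeric columns slightly — is sketched below. The helper name and noise scale are illustrative, not the project's exact values:

```python
import numpy as np
import pandas as pd

def augment(df: pd.DataFrame, target_size: int,
            noise_scale: float = 0.01, seed: int = 0) -> pd.DataFrame:
    """Duplicate rows (sampled with replacement) and add slight
    multiplicative noise to numeric columns for variability."""
    rng = np.random.default_rng(seed)
    extra = df.sample(n=target_size - len(df), replace=True,
                      random_state=seed).reset_index(drop=True)
    numeric = extra.select_dtypes("number").columns
    noise = 1 + rng.normal(0, noise_scale, (len(extra), len(numeric)))
    extra[numeric] = extra[numeric].to_numpy() * noise
    return pd.concat([df, extra], ignore_index=True)

# Grow a 1,000-row frame to 5,000 rows.
claims = pd.DataFrame({"claim_cost": np.arange(1.0, 1001.0),
                       "repair_time": np.linspace(0.5, 8.0, 1000)})
augmented = augment(claims, target_size=5000)
```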


Hyperparameters Tuned:

  • n_estimators: Number of trees in the forest

  • max_depth: Maximum depth of the trees

  • min_samples_split: Minimum number of samples required to split an internal node

  • min_samples_leaf: Minimum number of samples required to be at a leaf node
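A reduced version of that grid search might look like the following. The grid values and synthetic data are illustrative, not the ones used in the project:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a nonlinear signal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.log1p(X[:, 0]) * np.sqrt(X[:, 1]) + rng.normal(0, 0.1, 300)

# A small illustrative grid over the four hyperparameters listed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV fits every combination in the grid (here 2×2×2×2 = 16) across each cross-validation fold and keeps the combination with the best mean score.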


Techniques and Best Practices Used:

  • Data Preprocessing: Filled missing values, handled categorical data, and normalized numerical features.

  • Feature Engineering: Created new features based on domain knowledge, including polynomial and interaction terms.

  • Data Augmentation: Increased dataset size by duplicating records and adding slight noise to introduce variability.

  • Model Selection: Used a tree-based model (RandomForestRegressor) suited to capturing complex patterns.

  • Hyperparameter Tuning: Employed GridSearchCV to find the best hyperparameters for the model.

  • Evaluation Metrics: Used Mean Squared Error (MSE) to evaluate model performance.


New Outcome:

  • Mean Squared Error (MSE): 0.00027

  • The updated model showed a substantial improvement in capturing the variability in the data, with the fitted regression line closely aligned with the perfect prediction line.

 

The enhanced model successfully detected anomalies in warranty claims, providing a more accurate and reliable tool for the warranty department.

For a deeper understanding of the model and the hyperparameter tuning process, here are the key mathematical concepts and calculations behind the RandomForest model:


1. Mean Squared Error (MSE)

The MSE is a common metric used to evaluate regression models. It measures the average squared difference between the actual and predicted values:

MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

Where:

  • n is the number of observations.

  • yᵢ is the actual value.

  • ŷᵢ is the predicted value.
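The formula is easy to verify by hand against scikit-learn's implementation (the values below are made up for the check):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.10, 0.25, 0.40])
y_pred = np.array([0.12, 0.20, 0.45])

# MSE = (1/n) * sum((y_i - yhat_i)^2)
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
```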


2. Random Forest Regressor

Random Forest is an ensemble learning method that combines multiple decision trees to improve the model's performance and reduce overfitting. The main calculations involve averaging predictions from multiple trees.
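That averaging can be checked directly in scikit-learn: a fitted forest's prediction is the mean of its individual trees' predictions. The data below is synthetic, purely to demonstrate the mechanism:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.5, 200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

forest_pred = forest.predict(X[:5])
# Average the per-tree predictions; each tree is in forest.estimators_.
tree_mean = np.mean([tree.predict(X[:5]) for tree in forest.estimators_], axis=0)
```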


3. Hyperparameter Tuning

Hyperparameter tuning involves searching for the optimal hyperparameters that minimize the MSE. The GridSearchCV method performs an exhaustive search over a specified parameter grid.


4. Feature Engineering Calculations


Feature engineering involved creating new features such as log_claim_cost and cost_per_repair_time:

  • log_claim_cost = log(claim_cost)

  • cost_per_repair_time = claim_cost / repair_time

This project demonstrated the importance of iterative model development, starting with a simple model and progressively improving it through feature engineering, data augmentation, and hyperparameter tuning. By employing best practices in machine learning, I was able to enhance the anomaly detection model, achieving a significant reduction in MSE and improving the overall accuracy of the predictions.


The implementation of this advanced anomaly detection model had a significant beneficial impact on Hyundai's PA (Product Assurance) department in several ways:


  • Improved Fraud Detection: By accurately identifying unusual patterns and anomalies in warranty claims, the model helped in early detection of potential fraud. This allowed the PA department to take proactive measures, reducing financial losses and maintaining the integrity of the warranty claims process.


  • Enhanced Efficiency: The model automated the process of detecting anomalies, significantly reducing the time and effort required for manual reviews. This allowed the PA department to allocate resources more effectively and focus on higher-value tasks.


  • Data-Driven Insights: The feature engineering and hyperparameter tuning provided deeper insights into the factors influencing warranty claims. These insights enabled the PA department to make more informed decisions regarding warranty policies, claim approvals, and part recalls.


  • Increased Accuracy: The improved model accuracy, reflected in the reduced Mean Squared Error (MSE) from 0.00347 to 0.00027, ensured that the predictions were more reliable. This accuracy minimized false positives and false negatives, leading to more precise anomaly detection.
