Credit Card Fraud Detection: In-Depth Study: Data Preparation

Nikhil Thapa
5 min read · Dec 25, 2020

Overview:

The dataset was taken from Kaggle. This Credit Card Fraud Detection dataset contains credit card transactions made in September 2013 by European cardholders. It covers only two days of transactions and is highly imbalanced: 492 frauds out of 284,807 transactions, which means fraud accounts for just 0.17% of all transactions.

Photo by Avery Evans on Unsplash

Due to confidentiality issues, the original features and background information are not disclosed to us. A PCA transformation was performed on the input variables. PCA (Principal Component Analysis) identifies the hyperplane that lies closest to the data and then projects the data onto it; this is how features V1, V2, V3, …, V28 were obtained.
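For intuition, here is a minimal sketch of how such a projection can be produced with scikit-learn. The random matrix is purely a hypothetical stand-in, since the dataset's real raw features and pipeline are confidential:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
raw_inputs = rng.normal(size=(1000, 28))  # hypothetical stand-in for the confidential raw features

pca = PCA(n_components=28)
projected = pca.fit_transform(raw_inputs)  # these columns play the role of V1, ..., V28
print(projected.shape)  # (1000, 28)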

The only features left untransformed are ‘Time’ and ‘Amount’. ‘Time’ is the number of seconds elapsed between each transaction and the first transaction in the dataset. ‘Amount’ is the transaction amount, which can be used for example-dependent cost-sensitive learning.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

df = pd.read_csv("creditcard.csv")
df.head()

As we can observe, the ‘Time’ and ‘Amount’ features need to be scaled before we proceed to build machine learning models.

Feature Scaling:

The goal is to bring all features onto roughly the same scale, so that each feature is equally important and easier for most machine learning algorithms to process.

There are two primary Feature Scaling methods:

1. Standardization

2. Normalization

Standardization: commonly referred to as Z-score normalization. The feature is rescaled so that its mean and standard deviation become 0 and 1, respectively.

X_standardized = (X - mean(X)) / SD(X)

This technique rescales features toward a standard distribution and is useful for algorithms optimized with gradient descent, such as regression and neural networks, which weight their inputs. It also helps algorithms that rely on distance measures (e.g., KNN).
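As a quick illustration, scikit-learn's StandardScaler implements exactly this formula; here it is applied to the ‘Amount’ column (for demonstration only, since we scale the real features with a robust scaler below):

from sklearn.preprocessing import StandardScaler

std = StandardScaler()
amount_z = std.fit_transform(df[['Amount']])  # z-scores of the Amount column
print(amount_z.mean(), amount_z.std())        # approximately 0 and 1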

Max-Min Normalization:

It is a technique to rescale features so that their values fall between 0 and 1: for every feature, the minimum value is transformed into 0 and the maximum value into 1.

X_norm = (X - min(X)) / (max(X) - min(X))
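Likewise, scikit-learn's MinMaxScaler implements this formula; a minimal sketch on the same column:

from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
amount_mm = mm.fit_transform(df[['Amount']])  # every value mapped into [0, 1]
print(amount_mm.min(), amount_mm.max())       # 0.0 and 1.0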

What’s best for the credit card dataset:

Features produced by max-min normalization have a smaller standard deviation than those produced by standardization. Normalization scales most of the data into a small interval, which means all features end up on a small scale, but it does not handle outliers well.

Standardization, by contrast, is more robust to outliers. It transforms the distribution of an input variable toward a standard Gaussian, although the result can still become skewed or biased if the input variable contains outlier values.
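A toy example (made-up numbers, not from the dataset) makes the difference concrete: one extreme value squeezes the min-max-scaled data into a narrow band, while the z-scores keep more of their spread:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is the outlier
print(MinMaxScaler().fit_transform(x).ravel())    # ordinary values crowd near 0
print(StandardScaler().fit_transform(x).ravel())  # mean and std are still pulled by the outlier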

To overcome this, the median and interquartile range can be used when standardizing numerical input variables, a technique referred to as robust scaling.

Robust scaling computes the median and the 25th and 75th percentiles of each variable. The median is then subtracted from each value, and the result is divided by the interquartile range (IQR):

value = (value - median) / (p75 - p25)

This results in a variable with a median of 0 and an interquartile range of 1.
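As a sanity check, the formula can be applied by hand to the ‘Amount’ column; the RobustScaler used in the next cell computes the same thing:

median = df['Amount'].median()
p25, p75 = df['Amount'].quantile([0.25, 0.75])
amount_robust = (df['Amount'] - median) / (p75 - p25)  # matches RobustScaler's output
print(amount_robust.median())  # 0.0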

from sklearn.preprocessing import StandardScaler, RobustScaler  # RobustScaler is less prone to outliers

std_scaler = StandardScaler()  # shown for comparison; not used below
rob_scaler = RobustScaler()
df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1, 1))
df.drop(['Time', 'Amount'], axis=1, inplace=True)
df.head()

Imbalanced Classes:

print('Classes Count in Credit card Fraud Dataset\n',
      pd.value_counts(df['Class'], sort=True).sort_index())
credit_classes = pd.value_counts(df['Class'], sort=True).sort_index()
credit_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

One approach to addressing imbalanced datasets is to resample them so the minority class is better represented. Before we proceed to the sampling techniques, we need to create the independent and dependent features.

# Create independent and dependent features
columns = df.columns.tolist()
# Filter the columns to drop the target
columns = [c for c in columns if c not in ['Class']]
# print(columns)
# Store the variable we are predicting
target = 'Class'
# Define the X and Y dataframes
X = df[columns]
Y = df[target]
print(X.shape)
print(Y.shape)

There are two sampling techniques to deal with imbalanced datasets:

1. Under Sampling

2. Over Sampling

Under Sampling:

Undersampling refers to a group of techniques designed to balance the class distribution for a classification dataset that has a skewed class distribution. It removes examples from the training dataset that belong to the majority class in order to better balance the class distribution.

from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split

# NearMiss keeps the majority-class samples closest to the minority class
nm = NearMiss()
X_us, y_us = nm.fit_resample(X, Y)
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(
    X_us, y_us, test_size=0.2, random_state=42)

Over Sampling:

It addresses an imbalanced dataset by oversampling the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset.

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# sampling_strategy=1.0 duplicates minority examples until the classes are balanced
ros = RandomOverSampler(sampling_strategy=1.0)
X_os, y_os = ros.fit_resample(X, Y)
X_train_os, X_test_os, y_train_os, y_test_os = train_test_split(
    X_os, y_os, test_size=0.2, random_state=42)
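One simple way to sanity-check the result is to count the labels after resampling:

from collections import Counter

print(Counter(y_os))  # both classes should now be the same size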

In the next part, we will look at which metrics are well suited to the credit card fraud detection problem, work through a few supervised machine learning models, and perform hyperparameter tuning to get the best results.

Click the link below to learn more about metrics for the credit card fraud detection problem:

Credit Card Fraud Detection: In-Depth Study: Evaluating the Classification Model

