Abstract

Anomaly detection is the identification of events that deviate from what is considered normal. Identifying such irregular events is very important in applications such as fraud detection.

Anomalies in credit card transactions have to be detected quickly and accurately, without too much interaction with the client. Autoencoders are one of the most successful techniques used to automate anomaly detection.

Here, I use an autoencoder to detect 83% of the fraudulent credit card transactions, with a false-positive rate of only about 12%.

Data

I use the Credit Card Fraud Detection dataset, which contains transactions made with credit cards by European cardholders in September 2013. It includes 284,807 transactions, of which 492 are fraudulent. The data columns are anonymised by projecting the original data onto 28 dimensions using PCA. Hence the columns V1 to V28 have no directly interpretable meaning.

I log-transform the Time and Amount values to shorten the long right tails of their distributions.
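A minimal sketch of such a transform in Python. The text does not say which logarithm variant is used; `log1p` (the log of 1 + x) is an assumption made here so that zero values stay finite, and the function name is hypothetical. The amounts are the five Amount values from the sample rows of the data table in this section.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value (assumed variant; keeps zeros finite)."""
    return [math.log1p(v) for v in values]

# Amount values taken from the five sample rows of the data table.
amounts = [1.00, 83.00, 103.93, 1.00, 50.52]
transformed = log_transform(amounts)

# The ratio between the largest and smallest value shrinks after the transform.
print(max(amounts) / min(amounts))
print(max(transformed) / min(transformed))
```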

In this application, I take advantage of the fact that PCA orders its components by explained variance and use only the first five components, which carry the largest share of the variance in the data.
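The share of variance carried by the leading components can be checked directly. The sketch below assumes the V columns are unscaled PCA scores, so that each column's variance equals the corresponding eigenvalue; the function name and the toy columns are hypothetical and are not the real data.

```python
from statistics import pvariance

def explained_share(columns, n_first):
    """Fraction of the total variance carried by the first n_first columns.

    `columns` is a list of columns in component order (V1, V2, ...).
    """
    variances = [pvariance(col) for col in columns]
    return sum(variances[:n_first]) / sum(variances)

# Toy stand-in for three PCA columns with variances 4, 1 and 0.25.
toy_columns = [
    [-2.0, 2.0, -2.0, 2.0],
    [-1.0, 1.0, -1.0, 1.0],
    [-0.5, 0.5, -0.5, 0.5],
]
print(explained_share(toy_columns, n_first=1))
```

On the real data the same function would be called with the 28 V columns and `n_first=5`.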

  Time     V1     V2     V3     V4    V25    V26    V27    V28  Amount  Class
 84537  -0.55   0.05   1.66  -2.56  -0.41  -1.11   0.20   0.17    1.00      0
160382   0.18   0.93  -1.15   0.91   1.03  -0.30   0.13   0.29   83.00      0
 62482   1.09  -0.10   0.12   0.23   0.02   0.59  -0.12   0.02  103.93      0
153741  -0.71  -0.20   2.16  -2.33  -0.27  -0.33   0.30  -0.12    1.00      0
136147  -0.80   0.36   0.66  -2.27   0.89   0.80  -0.18   0.01   50.52      0

The dataset is highly unbalanced, with only 0.17% of transactions marked as fraudulent. An autoencoder, however, needs to be trained only on genuine transactions, which makes it a natural choice for this problem.
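The imbalance figure and the selection of training rows can be sketched as follows. The counts come from the text; the miniature `rows` list and the helper name `genuine_only` are hypothetical and only illustrate the filtering step.

```python
N_TRANSACTIONS = 284_807
N_FRAUD = 492

# Share of transactions labelled as fraudulent.
fraud_share = N_FRAUD / N_TRANSACTIONS
print(f"{fraud_share:.2%}")

def genuine_only(rows, label_key="Class"):
    """Keep rows labelled 0 (genuine); these form the training set."""
    return [row for row in rows if row[label_key] == 0]

# Hypothetical miniature dataset; only the Class column matters here.
rows = [
    {"V1": -0.55, "Class": 0},
    {"V1": 0.18, "Class": 0},
    {"V1": 3.20, "Class": 1},
]
train_rows = genuine_only(rows)
print(len(train_rows))
```

The labels are used here only to pick the training rows; they are not passed to the network itself.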

Autoencoders

I use an autoencoder, a neural network that consists of several fully connected layers and whose output has the same dimension as its input.

https://dev.to/kayis/building-an-autoencoder-for-generative-models-3e1i

The training phase for this kind of neural network is unsupervised, i.e. we do not use the labels of the data. We train the network in such a way that the output is as similar to the input as possible. The dissimilarity is measured as a reconstruction error, such as the Euclidean distance between the input and the output.
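To make the training idea concrete, here is a deliberately small sketch in Python with NumPy. It is not the network used in this analysis: it has a single hidden layer rather than several, and the layer sizes, learning rate, epoch count, and synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def init_params(n_inputs, n_hidden, rng):
    """Small random weights for one encoder layer and one decoder layer."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_inputs, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, n_inputs)),
        "b2": np.zeros(n_inputs),
    }

def reconstruct(params, X):
    """Encode X into the hidden layer, then decode back to input space."""
    H = np.tanh(X @ params["W1"] + params["b1"])
    return H, H @ params["W2"] + params["b2"]

def mse(params, X):
    """Mean squared Euclidean distance between inputs and reconstructions."""
    _, X_hat = reconstruct(params, X)
    return float(np.mean(np.sum((X_hat - X) ** 2, axis=1)))

def train(params, X, epochs=1000, lr=0.01):
    """Full-batch gradient descent on the reconstruction error; no labels used."""
    n = X.shape[0]
    for _ in range(epochs):
        H, X_hat = reconstruct(params, X)
        g_out = 2.0 * (X_hat - X) / n                      # gradient w.r.t. the output
        g_pre = (g_out @ params["W2"].T) * (1.0 - H ** 2)  # back through tanh
        params["W2"] -= lr * (H.T @ g_out)
        params["b2"] -= lr * g_out.sum(axis=0)
        params["W1"] -= lr * (X.T @ g_pre)
        params["b1"] -= lr * g_pre.sum(axis=0)
    return params

# Synthetic stand-in data: five standardised features driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params = init_params(n_inputs=5, n_hidden=3, rng=rng)
loss_before = mse(params, X)
train(params, X)
loss_after = mse(params, X)
print(loss_before, loss_after)
```

The only training signal is the difference between `X_hat` and `X`, which is what makes the procedure unsupervised.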

Feeding the network data unlike anything it saw during training should increase the reconstruction error. Thresholding the error then allows us to decide whether a transaction is genuine or not.
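The scoring step described above can be sketched in a few lines. The input and reconstruction vectors, the function names, and the threshold value of 1.0 are hypothetical; in practice the reconstructions would come from the trained network.

```python
import math

def reconstruction_error(x, x_hat):
    """Euclidean distance between an input vector and its reconstruction."""
    return math.dist(x, x_hat)

def flag_suspicious(errors, threshold):
    """True where the reconstruction error exceeds the threshold."""
    return [err > threshold for err in errors]

# Hypothetical inputs and reconstructions: the second pair is poorly rebuilt.
pairs = [
    ([0.2, -0.1, 0.4], [0.25, -0.05, 0.35]),
    ([3.0, -2.5, 1.0], [0.4, -0.2, 0.3]),
]
errors = [reconstruction_error(x, x_hat) for x, x_hat in pairs]
flags = flag_suspicious(errors, threshold=1.0)  # threshold value is illustrative
print(errors, flags)
```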

Analysis

I fit a two-component Gaussian mixture model to the log of the error score to separate the transactions into two groups, fraudulent and genuine.
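As an illustration of what fitting such a mixture involves, the sketch below runs a plain expectation-maximisation (EM) loop for two one-dimensional Gaussians. It is not the fitting routine used in this analysis; the initialisation, the fixed iteration count, and the synthetic log-error values are assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def fit_two_gaussians(xs, iterations=200):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, sds), each a list with one entry per component.
    """
    xs = sorted(xs)
    half = len(xs) // 2
    # Start from the lower and upper halves of the sorted data.
    means = [fmean(xs[:half]), fmean(xs[half:])]
    sds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        comps = [NormalDist(m, s) for m, s in zip(means, sds)]
        # E-step: probability that each point belongs to component 0.
        resp0 = []
        for x in xs:
            p0 = weights[0] * comps[0].pdf(x)
            p1 = weights[1] * comps[1].pdf(x)
            total = p0 + p1
            resp0.append(p0 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            mass = max(sum(resp), 1e-12)
            means[k] = sum(r * x for r, x in zip(resp, xs)) / mass
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / mass
            sds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = mass / len(xs)
    return weights, means, sds

# Synthetic log-error scores: a large low-error group and a small high-error group.
rnd = random.Random(0)
log_errors = [rnd.gauss(0.0, 1.0) for _ in range(500)]
log_errors += [rnd.gauss(5.0, 1.0) for _ in range(100)]

weights, means, sds = fit_two_gaussians(log_errors)
print(weights, means, sds)
```

Each transaction can then be assigned to the component with the higher weighted density, or the fitted parameters can be used to derive a cut-off.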

Number of iterations of the mixture fit: 224

Then I use the 45th percentile (a value chosen empirically) of the Gaussian distribution with the parameters found in the mixture-model stage as the threshold. As a result, 83% of the fraudulent transactions are correctly identified (407 of 492), with a false-positive rate of about 12% (35,460 of 284,315 genuine transactions).
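Turning fitted Gaussian parameters into a threshold is a single quantile lookup. In the sketch below, the mean and standard deviation are hypothetical placeholders on the log-error scale, not the values fitted in this analysis, and the function name is an assumption.

```python
from statistics import NormalDist

def quantile_threshold(mean, sd, q=0.45):
    """q-quantile of a Gaussian with the given mean and standard deviation."""
    return NormalDist(mean, sd).inv_cdf(q)

# Hypothetical component parameters on the log-error scale.
threshold = quantile_threshold(mean=1.5, sd=0.8)
print(threshold)
```

Because 0.45 is just below the median, the threshold sits slightly below the component mean, at roughly the mean minus 0.126 standard deviations.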

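Confusion matrix (rows: actual class, columns: predicted class; 1 = fraudulent)

            predicted 0   predicted 1
actual 0         248855         35460
actual 1             85           407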
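The reported rates follow from these four counts. The sketch assumes that rows are the actual class and columns the predicted class, which is consistent with the second row summing to the 492 known fraudulent transactions.

```python
# Counts from the confusion matrix (rows: actual class, columns: predicted class).
tn, fp = 248_855, 35_460
fn, tp = 85, 407

recall = tp / (tp + fn)               # share of fraudulent transactions caught
false_positive_rate = fp / (fp + tn)  # share of genuine transactions flagged
precision = tp / (tp + fp)            # share of flagged transactions that are fraud

print(f"recall={recall:.3f} fpr={false_positive_rate:.3f} precision={precision:.3f}")
```

With a dataset this unbalanced, precision is worth reporting next to the other two rates, since even a moderate false-positive rate translates into many flagged genuine transactions.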

The AUC score, commonly used to evaluate binary classification models, is high at 0.851.
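For reference, AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher anomaly score than a randomly chosen genuine one. The sketch below computes it that way; the labels and scores are hypothetical, and the pairwise loop is written for clarity, not speed.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and anomaly scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

On a dataset of this size the pairwise loop would be slow; a rank-based formulation gives the same value much faster.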

Conclusion

Autoencoders can be successfully trained to detect fraudulent credit card transactions, as long as we have enough genuine transaction data to train the network.