AMEX competition EDA¶

This EDA analyzes the data and provides insights that are useful for designing a machine learning pipeline and selecting a model.

In [11]:
import pandas as pd
import numpy as np
import pickle, gc
import os
import polars as pl
from matplotlib import pyplot as plt
In [4]:
# create the input directory (INPUT_DIR)
source_path = os.path.dirname(os.path.abspath('__file__'))
# source_path = '/scratch/bell/sido/m&a'
INPUT_DIR = os.path.join(source_path, 'data')

# if INPUT_DIR has not been created yet, create it
if not os.path.isdir(INPUT_DIR):
    os.mkdir(INPUT_DIR)

# create the output directory (OUTPUT_DIR)
OUTPUT_DIR = os.path.join(source_path, 'outputs')

# if OUTPUT_DIR has not been created yet, create it
if not os.path.isdir(OUTPUT_DIR):
    os.mkdir(OUTPUT_DIR)
In [5]:
# Run this cell once to move the csv/xlsx files into the `data` directory, then comment it out.
# Requires `from pathlib import Path` and `import shutil`.

# unique_dir_names = []
# for f in Path(f'{source_path}').rglob('*.csv'):
#     unique_dir_names.append(f)
# for g in Path(f'{source_path}').rglob('*.xlsx'):
#     unique_dir_names.append(g)

# for file in list(set(unique_dir_names)):
#     print(f'moved file: {file}')
#     shutil.move(f'{file}', f'{INPUT_DIR}')
In [6]:
# Pandas helper to read csv files without having to specify the directory
def read_csv(name, **kwargs):
    path = os.path.join(INPUT_DIR, name + '.csv')
    print(f'Load: {path}')
    return pd.read_csv(path, **kwargs)

# Polars helper to read csv files without having to specify the directory
def read_csvpl(name, **kwargs):
    path = os.path.join(INPUT_DIR, name + '.csv')
    print(f'Load: {path}')
    return pl.read_csv(path, **kwargs)

# Pandas helper to read Feather files (note the .ftr extension, not .csv)
def read_csvfeather(name, **kwargs):
    path = os.path.join(INPUT_DIR, name + '.ftr')
    print(f'Load: {path}')
    return pd.read_feather(path, **kwargs)

The labels¶

We start by reading the labels of the training data. There are neither missing values nor duplicated customer_IDs. Of the 458,913 customer_IDs, 340,085 (74 %) have a label of 0 (good customer, no default) and 118,828 (26 %) have a label of 1 (bad customer, default).

We know that the good customers have been subsampled by a factor of 20; this means that in reality there are 6.8 million good customers. 98 % of the customers are good; 2 % are bad.

Insight:

  • The classes are imbalanced. A StratifiedKFold for cross-validation is recommended (a minimal split sketch follows this list).
  • Because the classes are imbalanced, accuracy would be a bad metric for evaluating a classifier. The competition metric is the mean of the normalized Gini coefficient (a linear transform of the ROC AUC) and the default rate captured at 4 %, a recall-like quantity.
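As a minimal sketch of the recommended stratified split, assuming a pandas per-customer feature matrix X and a pandas label Series y aligned with it (building X is not shown in this notebook):

from sklearn.model_selection import StratifiedKFold

# X, y: per-customer features and labels (built elsewhere)
# 5 stratified folds preserve the 74 % / 26 % class ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (idx_tr, idx_va) in enumerate(skf.split(X, y)):
    X_tr, y_tr = X.iloc[idx_tr], y.iloc[idx_tr]
    X_va, y_va = X.iloc[idx_va], y.iloc[idx_va]
    print(f'Fold {fold}: positive rate train={y_tr.mean():.3f}, valid={y_va.mean():.3f}')
    # fit the model on (X_tr, y_tr) and evaluate on (X_va, y_va) here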
In [12]:
train_labels = read_csvpl("train_labels")
train_labels.head(2)
Load: /Users/satoshiido/Documents/idsts2670.github.io/assets/jupyter/data/train_labels.csv
Out[12]:
shape: (2, 2)
customer_ID          target
str                  i64
"0000099d6bd597…     0
"00000fd6641609…     0
In [ ]:
# Check for missing values and duplicated customer_IDs (train_labels is a polars DataFrame)
train_labels.null_count().to_numpy().sum() > 0, train_labels['customer_ID'].is_duplicated().any()
In [9]:
# the value_counts arithmetic below uses the pandas API, so convert the polars frame first
tl = train_labels.to_pandas()
label_stats = pd.DataFrame({'absolute': tl.target.value_counts(),
              'relative': tl.target.value_counts() / len(tl)})
label_stats['absolute upsampled'] = label_stats.absolute * np.array([20, 1])
label_stats['relative upsampled'] = label_stats['absolute upsampled'] / label_stats['absolute upsampled'].sum()
label_stats
Out[9]:
        absolute  relative  absolute upsampled  relative upsampled
target
0         340085  0.741066             6801700             0.98283
1         118828  0.258934              118828             0.01717

The data¶

The dataset of this competition is considerable in size. If you read the original csv files, the data barely fits into memory. That's why we read the data from @munumbutt's AMEX-Feather-Dataset. In this Feather file, the floating-point precision has been reduced from 64 bit to 16 bit, and reading a Feather file is faster than reading a csv file because Feather is a binary format.

There are 5.5 million rows of training data and 11 million rows of test data.

In [ ]:
%%time
train = pd.read_feather('../input/amexfeather/train_data.ftr')
test = pd.read_feather('../input/amexfeather/test_data.ftr')
with pd.option_context("display.min_rows", 6):
    display(train)
    display(test)

The target column of the train dataframe corresponds to the target column of train_labels.csv. In the csv file of the train data, there is no target column; it has been joined into the Feather file as a convenience.

S_2 is the statement date. All train statement dates are between March of 2017 and March of 2018 (13 months), and no statement dates are missing. All test statement dates are between April of 2018 and October of 2019. This means that the statement dates of train and test don't overlap:

In [ ]:
print('Train statement dates: ', train.S_2.min(), train.S_2.max(), train.S_2.isna().any())
print('Test statement dates: ',  test.S_2.min(), test.S_2.max(), test.S_2.isna().any())

Insight:

  • The test data come from a different phase in the economic cycle than the training data. Our models have no way of learning the effect of the economic cycle.
In [ ]:
print(f'Train data memory usage: {train.memory_usage().sum() / 1e9} GBytes')
print(f'Test data memory usage:  {test.memory_usage().sum() / 1e9} GBytes')

The training data takes 2.2 GBytes of RAM. The test data is twice the size of the training data.

Insight:

  • With that much data, we need to have an eye on memory efficiency. Avoid keeping unnecessary copies of the data in memory, and avoid keeping unnecessary copies of models!
  • Whereas most machine learning algorithms expect the whole training data to be in memory, we don't need to load all the test data at once. The test data can be processed in batches (see the sketch after this list).
  • You may want to separate training and inference code into two notebooks so that you never have training and test data in memory at the same time.
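A rough sketch of batched inference under stated assumptions: make_features is a placeholder that aggregates the statements of a slice of customers into one row per customer, and model is any fitted classifier with a predict_proba method; neither is defined in this notebook.

def predict_in_batches(model, make_features, test, n_batches=10):
    """Score the test set one slice of customers at a time, so the engineered
    per-customer features never exist for all customers simultaneously."""
    customers = test.customer_ID.unique()
    preds = []
    for batch in np.array_split(customers, n_batches):
        chunk = test[test.customer_ID.isin(batch)]
        features = make_features(chunk)                    # one row per customer
        proba = model.predict_proba(features)[:, 1]
        preds.append(pd.Series(proba, index=features.index, name='prediction'))
        del chunk, features
        gc.collect()
    return pd.concat(preds)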

The info function shows that most of the other features have missing values:

In [ ]:
train.info(max_cols=200, show_counts=True)

Insight:

  • There are many columns with missing values: Dropping all columns which have missing values is not a sensible strategy.
  • There are many rows with missing values: Dropping all rows which have missing values is not a sensible strategy.
  • Many decision-tree based algorithms can deal with missing values. If we choose such a model, we don't need to change the missing values.
  • Neural networks and other estimators cannot deal with missing values. If we choose such a model, we need to impute values. See this guide for an overview of the many imputation options.
  • Most features are stored as 16-bit floats. The original data (in the csv files) has higher precision, and rounding to 16 bits loses some information. To make this loss tangible: every float16 number between 1 and 2 is a multiple of 1/1024, so the spacing between adjacent values is about 0.001, i.e. roughly three significant digits after the decimal point. This precision is enough to start the competition; maybe we'll have to switch to higher precision towards the end (a short numpy demonstration follows this list).
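A small demonstration of the float16 spacing with numpy:

# Machine epsilon of float16: the gap between 1.0 and the next representable value
print(np.finfo(np.float16).eps)          # 0.000977 (= 1/1024)

# A value halfway between two representable numbers is rounded away
a = np.float16(1.0)
b = np.float16(1.0 + 1/2048)
print(a == b)                            # True: 1 + 1/2048 is rounded back to 1.0
print(float(np.float16(1 + 1/1024)))     # 1.0009765625, the next float16 above 1.0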

Counting the statements per customer¶

Now we can count how many rows (credit card statements) there are per customer. We see that 80 % of the customers have 13 statements; the other 20 % of the customers have between 1 and 12 statements.

Insight: Our model will have to deal with a variable-sized input per customer (unless we simplify our life and look only at the most recent statement as @inversion suggests here or at the average over all statements).
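As a minimal sketch of the "most recent statement only" simplification mentioned above (plain pandas, one row per customer):

# Keep only the chronologically last statement of each customer
train_last = (train
              .sort_values(['customer_ID', 'S_2'])
              .groupby('customer_ID', sort=False)
              .tail(1)
              .set_index('customer_ID'))
print(train_last.shape)   # one row per customer_ID

# The "average over all statements" variant (numeric columns only):
# train_mean = train.groupby('customer_ID').mean(numeric_only=True)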

In [ ]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
train_sc = train.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Train statements per customer')
ax1.pie(train_sc, labels=train_sc.index)
ax1.set_title(train_sc.name)
test_sc = test.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Test statements per customer')
ax2.pie(test_sc, labels=test_sc.index)
ax2.set_title(test_sc.name)
plt.show()

# display(train.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Train statements per customer'))
# display(test.customer_ID.value_counts().value_counts().sort_index(ascending=False).rename('Test statements per customer'))

Let's find out when these customers got their last statement. The histogram of the last statement dates shows that every train customer got their last statement in March of 2018. The first four Saturdays (March 3, 10, 17, 24) have more statements than an average day.

The test customers are split in two: half of them got their last statement in April of 2019 and half in October of 2019. As was discussed here, the April 2019 data is used for the public leaderboard and the October 2019 data is used for the private leaderboard.

In [ ]:
temp = train.S_2.groupby(train.customer_ID).max()
plt.figure(figsize=(16, 4))
plt.hist(temp, bins=pd.date_range("2018-03-01", "2018-04-01", freq="d"),
         rwidth=0.8, color='#ffd700')
plt.title('When did the train customers get their last statements?', fontsize=20)
plt.xlabel('Last statement date per customer')
plt.ylabel('Count')
plt.gca().set_facecolor('#0057b8')
plt.show()
del temp

temp = test.S_2.groupby(test.customer_ID).max()
plt.figure(figsize=(16, 4))
plt.hist(temp, bins=pd.date_range("2019-04-01", "2019-11-01", freq="d"),
         rwidth=0.74, color='#ffd700')
plt.title('When did the test customers get their last statements?', fontsize=20)
plt.xlabel('Last statement date per customer')
plt.ylabel('Count')
plt.gca().set_facecolor('#0057b8')
plt.show()
del temp

Insight: Although the data are a kind of time series, we cannot cross-validate with a TimeSeriesSplit, because the last statements of all training customers fall in the same month.

For most customers, the first and last statements are about a year apart. Together with the fact that we typically have 13 statements per customer, this indicates that the customers get one credit card statement every month.

In [ ]:
temp = train.S_2.groupby(train.customer_ID).agg(['max', 'min'])
plt.figure(figsize=(16, 3))
plt.hist((temp['max'] - temp['min']).dt.days, bins=400, color='#ffd700')
plt.xlabel('days')
plt.ylabel('count')
plt.title('Number of days between first and last statement of customer (train)', fontsize=20)
plt.gca().set_facecolor('#0057b8')
plt.show()

temp = test.S_2.groupby(test.customer_ID).agg(['max', 'min'])
plt.figure(figsize=(16, 3))
plt.hist((temp['max'] - temp['min']).dt.days, bins=400, color='#ffd700')
plt.xlabel('days')
plt.ylabel('count')
plt.title('Number of days between first and last statement of customer (test)', fontsize=20)
plt.gca().set_facecolor('#0057b8')
plt.show()
del temp

If we color every statement (i.e., every row of train or test) according to the dataset it belongs to (training, public lb, private lb), we see that every dataset covers thirteen months. Train and test don't overlap, but the public and private lb periods do overlap.

In [ ]:
temp = pd.concat([train[['customer_ID', 'S_2']], test[['customer_ID', 'S_2']]], axis=0)
temp.set_index('customer_ID', inplace=True)
temp['last_month'] = temp.groupby('customer_ID').S_2.max().dt.month
last_month = temp['last_month'].values

plt.figure(figsize=(16, 4))
plt.hist([temp.S_2[temp.last_month == 3],   # ending 03/18 -> training
          temp.S_2[temp.last_month == 4],   # ending 04/19 -> public lb
          temp.S_2[temp.last_month == 10]], # ending 10/19 -> private lb
         bins=pd.date_range("2017-03-01", "2019-11-01", freq="MS"),
         label=['Training', 'Public leaderboard', 'Private leaderboard'],
         stacked=True)
plt.xticks(pd.date_range("2017-03-01", "2019-11-01", freq="QS"))
plt.xlabel('Statement date')
plt.ylabel('Count')
plt.title('The three datasets over time', fontsize=20)
plt.legend()
plt.show()

Now we'll look at the distribution of missing values over time. B_29 is the most interesting. Given that each of the three datasets has almost half a million customers, we see that until May of 2019 fewer than a tenth of the customers have a value for B_29; the other nine tenths are missing. Starting in June of 2019, we have B_29 data for almost every customer.

Insight: The distribution of the missing B_29 values differs between the train and test datasets. Whereas more than 90 % are missing in the training and public leaderboard data, during the last five months of the private leaderboard period we have B_29 data for almost every customer. If we use this feature, we should be prepared for surprises in the private leaderboard. Is it better to drop the feature?
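To put rough numbers on this, a small sketch that computes the fraction of statements with a missing B_29 per dataset; it reuses the last_month array built two cells above, and per-statement rates are only a proxy for per-customer rates:

b29 = pd.concat([train[['B_29']], test[['B_29']]], axis=0)
b29['dataset'] = pd.Series(last_month).map(
    {3: 'training', 4: 'public lb', 10: 'private lb'}).to_numpy()
print(b29.groupby('dataset').B_29.apply(lambda s: s.isna().mean()).round(3))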

In [ ]:
for f in [ 'B_29', 'S_9','D_87']:#, 'D_88', 'R_26', 'R_27', 'D_108', 'D_110', 'D_111', 'B_39', 'B_42']:
    temp = pd.concat([train[[f, 'S_2']], test[[f, 'S_2']]], axis=0)
    temp['last_month'] = last_month
    temp['has_f'] = ~temp[f].isna() 

    plt.figure(figsize=(16, 4))
    plt.hist([temp.S_2[temp.has_f & (temp.last_month == 3)],   # ending 03/18 -> training
              temp.S_2[temp.has_f & (temp.last_month == 4)],   # ending 04/19 -> public lb
              temp.S_2[temp.has_f & (temp.last_month == 10)]], # ending 10/19 -> private lb
             bins=pd.date_range("2017-03-01", "2019-11-01", freq="MS"),
             label=['Training', 'Public leaderboard', 'Private leaderboard'],
             stacked=True)
    plt.xticks(pd.date_range("2017-03-01", "2019-11-01", freq="QS"))
    plt.xlabel('Statement date')
    plt.ylabel(f'Count of {f} non-null values')
    plt.title(f'{f} non-null values over time', fontsize=20)
    plt.legend()
    plt.show()

The categorical features¶

According to the data description, there are eleven categorical features. We plot histograms for target=0 and target=1. For the ten features which have missing values, the missing values are represented by the rightmost bar of the histogram.

In [ ]:
cat_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
plt.figure(figsize=(16, 16))
for i, f in enumerate(cat_features):
    plt.subplot(4, 3, i+1)
    temp = pd.DataFrame(train[f][train.target == 0].value_counts(dropna=False, normalize=True).sort_index().rename('count'))
    temp.index.name = 'value'
    temp.reset_index(inplace=True)
    plt.bar(temp.index, temp['count'], alpha=0.5, label='target=0')
    temp = pd.DataFrame(train[f][train.target == 1].value_counts(dropna=False, normalize=True).sort_index().rename('count'))
    temp.index.name = 'value'
    temp.reset_index(inplace=True)
    plt.bar(temp.index, temp['count'], alpha=0.5, label='target=1')
    plt.xlabel(f)
    plt.ylabel('frequency')
    plt.legend()
    plt.xticks(temp.index, temp.value)
plt.suptitle('Categorical features', fontsize=20, y=0.93)
plt.show()
del temp

Insight:

  • Every feature has at most eight categories (including a nan category). One-hot encodings are feasible (see the sketch after this list).
  • The distributions for target=0 and target=1 differ. This means that every feature gives some information about the target.
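A minimal one-hot encoding sketch with pandas; dummy_na=True gives the nan category its own indicator column. Whether to encode per statement or after aggregating per customer is a pipeline choice not made here:

# One-hot encode the eleven categorical features; the nan category gets its own column
train_cat_ohe = pd.get_dummies(train[cat_features], columns=cat_features, dummy_na=True)
print(train_cat_ohe.shape)   # at most ~8 indicator columns per original feature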

The binary features¶

Two features are binary:

  • B_31 is always 0 or 1.
  • D_87 is always 1 or missing.
In [ ]:
bin_features = ['B_31', 'D_87']
plt.figure(figsize=(16, 4))
for i, f in enumerate(bin_features):
    plt.subplot(1, 2, i+1)
    temp = pd.DataFrame(train[f][train.target == 0].value_counts(dropna=False, normalize=True).sort_index().rename('count'))
    temp.index.name = 'value'
    temp.reset_index(inplace=True)
    plt.bar(temp.index, temp['count'], alpha=0.5, label='target=0')
    temp = pd.DataFrame(train[f][train.target == 1].value_counts(dropna=False, normalize=True).sort_index().rename('count'))
    temp.index.name = 'value'
    temp.reset_index(inplace=True)
    plt.bar(temp.index, temp['count'], alpha=0.5, label='target=1')
    plt.xlabel(f)
    plt.ylabel('frequency')
    plt.legend()
    plt.xticks(temp.index, temp.value)
plt.suptitle('Binary features', fontsize=20)
plt.show()
del temp

Insight: If you impute missing values for D_87, don't fall into the trap of imputing the mean: since the non-missing values are all 1, the feature would become constant and therefore useless...
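If a model that cannot handle NaNs is used, a more sensible imputation is a constant that preserves the present/absent distinction; a small sketch (D_87_filled is an illustrative column name):

# Encode "missing" as 0 and "present" as 1 instead of imputing the mean
train['D_87_filled'] = train['D_87'].fillna(0).astype('int8')
train['D_87_filled'].value_counts()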

The numerical features¶

If we plot histograms of the 175 numerical features, we see that they have all kinds of distributions:

In [ ]:
cont_features = sorted([f for f in train.columns if f not in cat_features + bin_features + ['customer_ID', 'target', 'S_2']])
print(len(cont_features))
# print(cont_features)
ncols = 4
for i, f in enumerate(cont_features):
    if i % ncols == 0: 
        if i > 0: plt.show()
        plt.figure(figsize=(16, 3))
        if i == 0: plt.suptitle('Continuous features', fontsize=20, y=1.02)
    plt.subplot(1, ncols, i % ncols + 1)
    plt.hist(train[f], bins=200)
    plt.xlabel(f)
plt.show()

Insight: Histograms with white space at the left or right end can indicate that the data contain outliers. We will have to deal with these outliers. But are these values really outliers? Maybe they are, but they could just as well be legitimate traces of rare events. We do not know...
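If we do decide to treat the extreme values as outliers, one common mitigation is quantile clipping; a hedged sketch (the 1 % / 99 % bounds are an arbitrary choice, and the same train-derived bounds should be applied to the test data):

# Clip every continuous feature to its 1st..99th percentile range, computed on train only
lower = train[cont_features].quantile(0.01)
upper = train[cont_features].quantile(0.99)
train_clipped = train[cont_features].clip(lower, upper, axis=1)
# test_clipped = test[cont_features].clip(lower, upper, axis=1)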

The artificial noise in the data¶

The data contain artificial noise. To understand the noise, we need to analyze the data at full precision. Because we cannot load the whole dataset at full precision into memory, we load only two interesting columns. For a complete analysis, see @raddar's discussion post.

In [ ]:
def read_columns(name, features):
    """Read the specified columns of the train/test csv at full precision."""
    chunksize = 1000000
    chunklist = []
    with pd.read_csv(f"../input/amex-default-prediction/{name}_data.csv", chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            chunk = chunk[features]  # keep only the selected columns
            chunklist.append(chunk)
            print(i, end=' ')
            if i == 5: break         # stop after six chunks (six million rows)
        print()
    df = pd.concat(chunklist, axis=0)
    return df

df = read_columns('train', ['B_19', 'S_13'])
df.info()

Let's look at B_19. All values are between 0 and 1.01. To get a high-resolution histogram, we spread it over eleven diagrams. The histograms show only rectangles of width 0.01, which indicates that the values are uniformly distributed within intervals of width 0.01, with every interval having a different probability. This means that B_19 originally had some other range but was scaled, rounded, and had uniform noise added, as if the following function had been applied:

def anonymize(data):
    data -= data.min()
    data /= data.max()
    data = data.round(2)
    rng = np.random.default_rng()
    data += rng.uniform(0, 0.01, len(data))
    return data
In [ ]:
y = df.B_19
for i in np.linspace(0, 1, 11):
    plt.figure(figsize=(16, 3))
    plt.hist(y, bins=np.linspace(i, i+0.1, 101), rwidth=0.8, color='m')
    plt.xticks(np.linspace(i, i+0.1, 11))
    plt.title(f"B_19 histogram, part {int(i*10+1)}")
    plt.show()

Insight: We don't care about the scaling and we cannot undo the rounding, but we should remove the artificial noise, e.g. by applying a function such as x['B_19'] = x['B_19'].apply(lambda t: np.floor(t*100)).
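The same denoising can also be done without a Python-level apply; a vectorized sketch on the full-precision df loaded above (the result is the 0.01-wide bucket index, kept as float so any NaNs survive):

# Map each noisy value back to its 0.01-wide bucket
df['B_19_denoised'] = np.floor(df['B_19'] * 100).astype('float32')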

S_13 looks similar: the rectangles again have width 0.01, but they start at multiples of 1/1034. This means that S_13 originally had integer values in the range 0..1034 and was anonymized with

def anonymize(data):
    data -= data.min()
    data /= data.max() # divides by 1034
    rng = np.random.default_rng()
    data += rng.uniform(0, 0.01, len(data))
    return data

Unfortunately some rectangles overlap (e.g. between 0.68 and 0.7), so we cannot fully remove the noise.

In [ ]:
y = df.S_13
for i in np.linspace(0, 1, 11):
    plt.figure(figsize=(16, 3))
    plt.hist(y, bins=np.linspace(i, i+0.1, 101), rwidth=0.8, color='c')
    plt.xticks(np.linspace(i, i+0.1, 11))
    plt.title(f"S_13 histogram, part {int(i*10+1)}")
    plt.show()