Shape Classification

Figure: 4 car classes: (a) Sedan, (b) Pickup, (c) Minivan, (d) SUV. Source: http://biomecis.uta.edu/shape_data.htm

Project 59: Shape Classification

Authors: Julen Etxaniz and Ibon Urbina

Subject: Machine Learning and Neural Networks

Date: 25/10/2020

Objective: The goal of the project is to compare several classification algorithms on the plane and car shape datasets.

Contents:

1. Importing the libraries
2. Reading the datasets
3. Preprocessing the datasets
4. Dividing train and test data
5. Scaling the data
6. Classification
7. Validation
8. Feature Selection
9. Feature Engineering
10. Pipeline Optimization

1. Importing the libraries

We start by importing all relevant libraries to be used in the notebook.

# Reading data
from os import listdir
from scipy.io import loadmat
from re import findall

# Preprocessing
import pandas as pd
import numpy as np

# Scaling
from sklearn.preprocessing import StandardScaler

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Validation
from sklearn.metrics import accuracy_score

# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Feature Extraction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# LinearDiscriminantAnalysis is already imported in the Classification block above

# Pipeline Optimization
from tpot import TPOTClassifier

# Plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Render plots inline (as static images) in the notebook
%matplotlib inline

# Images
from IPython.display import Image

import warnings
warnings.filterwarnings('ignore')

2. Reading the datasets

We read the plane and car datasets.

We use this function to read all the .mat files in the given directory.

def read_mats(directory):
    # directory must end with a path separator, e.g. "shape_data/plane_data/"
    mats = []
    mats_file_name = []

    # Sort the file names so the samples are always read in the same order
    for file in sorted(listdir(directory)):
        mats.append(loadmat(directory + file))
        # Keep the file name: it encodes the class and sample numbers
        mats_file_name.append(file)

    return mats, mats_file_name

2.1. Reading the plane dataset

We read the 210 files that contain the instances of the plane classification problem.

We collect all the instances in a single list called “plane_mats”. It is converted to a DataFrame in the preprocessing step.

plane_dir = "shape_data/plane_data/"
plane_mats, plane_mats_file_name = read_mats(plane_dir)

We check that the dataset was read correctly by looking at the number of samples.

print('The number of samples in the plane dataset is', len(plane_mats))
The number of samples in the plane dataset is 210

2.2. Reading the car dataset

We read the 120 files that contain the instances of the car classification problem.

We collect all the instances in a single list called “car_mats”. It is converted to a DataFrame in the preprocessing step.

car_dir = "shape_data/car_data/"
car_mats, car_mats_file_name = read_mats(car_dir)

We check that the dataset was read correctly by looking at the number of samples.

print('The number of samples in the car dataset is', len(car_mats))
The number of samples in the car dataset is 120

3. Preprocessing the datasets

Create dataframe

pandas DataFrames are a convenient way to represent data because they are flexible and make the information easy to manage. In the next cell we convert the list that holds the data we read into a DataFrame.

def get_dataframe(mats):
    df = pd.DataFrame(mats)
    # Remove unnecessary columns
    df = df.drop(['__header__', '__version__', '__globals__'], axis=1)
    return df

Get class and sample numbers

# mats_file_name holds the file names in the order they were read.
# Split each name into two values: the class number and the sample number.
def get_samples_classes(mats_file_name):
    class_n = []
    sample_n = []

    for name in mats_file_name:
        # Each file name contains two numbers: the class first, then the sample
        numbers = findall(r'\d+', str(name))
        class_n.append(int(numbers[0]))
        sample_n.append(int(numbers[1]))

    return class_n, sample_n

# Add sample and class numbers to the dataframe (modifies df in place)
def add_samples_classes(df, class_n, sample_n):
    df['Class'] = class_n
    df['Sample'] = sample_n
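
As a quick illustration of the parsing step, the sketch below applies the same `findall(r'\d+', ...)` pattern to a few file names that follow the ClassX_SampleY.mat convention used by the database. The names in `example_names` are made up for illustration; the real list comes from `read_mats`.

```python
from re import findall

# Hypothetical file names following the ClassX_SampleY.mat convention
example_names = ["Class1_Sample1.mat", "Class1_Sample10.mat", "Class7_Sample9.mat"]

parsed = []
for name in example_names:
    # findall returns every run of digits as a string, in order of appearance
    class_str, sample_str = findall(r'\d+', name)
    parsed.append((int(class_str), int(sample_str)))

print(parsed)
```

The tuple unpacking assumes exactly two numbers per file name and raises an error otherwise, which can be a useful sanity check.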

Check if classes are balanced

# Print the number of samples in each class
def print_class_count(df):
    print("Quantity of samples in each class:")
    print(df['Class'].value_counts())

Add another feature

# Calculate the perimeter (number of points) and add it to dataframe
def add_perimeter(df):
    length_list = []
    for i in range(len(df)):
        length_list.append(len(df['x'][i]))

    df['Perimeter_length'] = length_list
    
    return df

Changing how x feature is represented

Classifiers expect each feature to be a single number, not a sequence. In our case, x is an array of (x, y) coordinates, so we separate the x and y values and turn every contour point into two numeric features, xJ and yJ. Because the contours have different lengths, each one is cut to the length of the shortest contour.

# Calculate the minimum perimeter length
def min_length(df):
    return min(df['Perimeter_length'])

# Separate x and y coordinates, keeping only the first min_length points of each contour
def separate_coordinates(df, min_length):
    x_coordinates = []
    y_coordinates = []

    for i in range(len(df['x'])):
        # np.resize to a smaller size truncates: points beyond min_length are dropped
        x_coordinates.append(np.resize((df['x'][i])[:,0], (min_length, 1)))
        y_coordinates.append(np.resize((df['x'][i])[:,1], (min_length, 1)))

    return x_coordinates, y_coordinates

# Stack the per-sample columns side by side: result shape is (min_length, n_samples)
def get_stacks(x_coordinates, y_coordinates):
    x_stack = np.hstack(x_coordinates)
    y_stack = np.hstack(y_coordinates)

    return x_stack, y_stack

# Insert one column per point (x0, y0, x1, y1, ...) in the DataFrame
def insert_columns(df, x_stack, y_stack):
    for i in range(len(x_stack)):
        stringX = "x" + str(i)
        stringY = "y" + str(i)
        df[stringX] = x_stack[i]
        df[stringY] = y_stack[i]

    return df
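
Note that `np.resize` with a target size smaller than the input keeps only the first values, so the tail of every longer contour is discarded. The sketch below shows this on a toy contour and, as an alternative that is not used in this notebook, picks the same number of points spread evenly along the whole contour. The toy array and the `resample_contour` helper are illustrative assumptions.

```python
import numpy as np

# Toy contour with 10 (x, y) points; the real contours have hundreds of points
contour = np.column_stack((np.arange(10), np.arange(10) * 10))
min_length = 4

# What separate_coordinates does: keep the first min_length x values
truncated_x = np.resize(contour[:, 0], (min_length, 1))
print(truncated_x.ravel())

def resample_contour(points, n_points):
    """Pick n_points rows spread evenly from the first to the last point."""
    idx = np.linspace(0, len(points) - 1, n_points).round().astype(int)
    return points[idx]

# Alternative (not used here): sample the whole contour at even intervals
resampled = resample_contour(contour, min_length)
print(resampled[:, 0])
```

Truncation keeps the code simple, while even resampling would preserve the overall outline of longer contours; which one classifies better is an empirical question this sketch does not answer.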

Preparing data for classification

To train the classifiers, we need to separate the features and the classes into two different sets.

# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'
# The class labels are returned separately as a single Series.
def get_features_target(df):
    features = df.drop(columns=['x', 'Class', 'Sample'])
    target = df['Class']
    
    return features, target

3.1. Preprocessing the plane dataset

In this problem there are seven classes that correspond to the 7 types of planes: (a) Mirage, (b) Eurofighter, (c) F-14 wings closed, (d) F-14 wings opened, (e) Harrier, (f) F-22, (g) F-15. In the database, the files are named “ClassX_SampleY.mat”, where X is the class number and Y is the sample number.

Here is the correspondence between class number and class name (plane model name):

  • 1 = Mirage
  • 2 = Eurofighter
  • 3 = F-14 wings closed
  • 4 = F-14 wings opened
  • 5 = Harrier
  • 6 = F-22
  • 7 = F-15

Image(filename='shape_plane.png')
7 types of planes: (a) Mirage, (b) Eurofighter, (c) F-14 wings closed, (d) F-14 wings opened, (e) Harrier, (f) F-22, (g) F-15

Create dataframe

plane_df = get_dataframe(plane_mats)
plane_df

                                                     x
0    [[64, 235], [65, 234], [66, 234], [67, 234], [...
1    [[60, 139], [61, 138], [62, 137], [63, 137], [...
2    [[60, 219], [61, 218], [62, 217], [63, 217], [...
3    [[54, 201], [55, 200], [55, 199], [56, 198], [...
4    [[64, 275], [65, 274], [66, 274], [67, 274], [...
..                                                 ...
205  [[33, 234], [34, 233], [35, 232], [36, 231], [...
206  [[21, 155], [22, 154], [23, 153], [24, 152], [...
207  [[45, 324], [46, 323], [47, 322], [48, 321], [...
208  [[70, 255], [71, 254], [72, 254], [73, 253], [...
209  [[48, 233], [49, 232], [49, 231], [50, 230], [...

210 rows × 1 columns

Get class and sample numbers

plane_class_n, plane_sample_n = get_samples_classes(plane_mats_file_name)
print("This is what our class_n looks like: \n")
np.array(plane_class_n)
This is what our class_n looks like: 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7])
print("This is what our sample_n looks like: \n")
np.array(plane_sample_n)
This is what our sample_n looks like: 

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23,
       24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11,
       12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27,
       28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,
        4,  5,  6,  7,  8,  9])
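
The sample numbers come out as 1, 10, 11, ..., 19, 2, 20, ... because `sorted` compares the file names as strings, character by character. This does not affect correctness, since the labels are parsed from the same names, but it determines which samples end up in the train and test halves later. A minimal sketch with made-up file names, including a numeric sort key (`numeric_key` is an illustrative helper, not part of the notebook):

```python
from re import findall

# Hypothetical file names following the ClassX_SampleY.mat convention
names = ["Class1_Sample2.mat", "Class1_Sample10.mat", "Class1_Sample1.mat"]

# Default string ordering: "Sample10" sorts before "Sample2"
lexicographic = sorted(names)
print(lexicographic)

def numeric_key(name):
    """Sort by (class number, sample number) as integers."""
    return tuple(int(n) for n in findall(r'\d+', name))

# Numeric ordering: 1, 2, 10
numeric = sorted(names, key=numeric_key)
print(numeric)
```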

Let's add those lists to the plane DataFrame.

add_samples_classes(plane_df, plane_class_n, plane_sample_n)
print("This is, finally, what our plane dataFrame looks like: \n")
plane_df
This is, finally, what our plane dataFrame looks like: 

                                                     x  Class  Sample
0    [[64, 235], [65, 234], [66, 234], [67, 234], [...      1       1
1    [[60, 139], [61, 138], [62, 137], [63, 137], [...      1      10
2    [[60, 219], [61, 218], [62, 217], [63, 217], [...      1      11
3    [[54, 201], [55, 200], [55, 199], [56, 198], [...      1      12
4    [[64, 275], [65, 274], [66, 274], [67, 274], [...      1      13
..                                                 ...    ...     ...
205  [[33, 234], [34, 233], [35, 232], [36, 231], [...      7       5
206  [[21, 155], [22, 154], [23, 153], [24, 152], [...      7       6
207  [[45, 324], [46, 323], [47, 322], [48, 321], [...      7       7
208  [[70, 255], [71, 254], [72, 254], [73, 253], [...      7       8
209  [[48, 233], [49, 232], [49, 231], [50, 230], [...      7       9

210 rows × 3 columns

Are the classes balanced? Yes

The description of the database states that each class has 30 samples, but we count them to make sure.

print_class_count(plane_df)
Quantity of samples in each class:
7    30
6    30
5    30
4    30
3    30
2    30
1    30
Name: Class, dtype: int64

Add another feature

The only feature descriptor of the shapes is x, which holds the Cartesian coordinates of each point on the perimeter of the shape. The number of points in each contour is not stored as a separate feature: it is only implicit in the length of each x sample, so we make it explicit.

plane_df = add_perimeter(plane_df)
print("This is what our plane dataFrame looks like: \n")
plane_df
This is what our plane dataFrame looks like: 

                                                     x  Class  Sample  Perimeter_length
0    [[64, 235], [65, 234], [66, 234], [67, 234], [...      1       1              1433
1    [[60, 139], [61, 138], [62, 137], [63, 137], [...      1      10              1540
2    [[60, 219], [61, 218], [62, 217], [63, 217], [...      1      11              1587
3    [[54, 201], [55, 200], [55, 199], [56, 198], [...      1      12              1511
4    [[64, 275], [65, 274], [66, 274], [67, 274], [...      1      13              1489
..                                                 ...    ...     ...               ...
205  [[33, 234], [34, 233], [35, 232], [36, 231], [...      7       5              1801
206  [[21, 155], [22, 154], [23, 153], [24, 152], [...      7       6              1943
207  [[45, 324], [46, 323], [47, 322], [48, 321], [...      7       7              1876
208  [[70, 255], [71, 254], [72, 254], [73, 253], [...      7       8              1661
209  [[48, 233], [49, 232], [49, 231], [50, 230], [...      7       9              1844

210 rows × 4 columns

Changing how x feature is represented

We now apply these steps to the plane data: we find the minimum perimeter length, keep that many points from each contour, and add one xJ and one yJ column per point.

min_len = min_length(plane_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(plane_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
plane_df = insert_columns(plane_df, x_stack, y_stack)
plane_df
890

                                                     x  Class  Sample  Perimeter_length    x0    y0    x1    y1    x2    y2  ...  x885  y885  x886  y886  x887  y887  x888  y888  x889  y889
0    [[64, 235], [65, 234], [66, 234], [67, 234], [...      1       1              1433    64   235    65   234    66   234  ...   471   264   471   265   471   266   471   267   471   268
1    [[60, 139], [61, 138], [62, 137], [63, 137], [...      1      10              1540    60   139    61   138    62   137  ...   560   304   559   303   558   303   557   302   556   301
2    [[60, 219], [61, 218], [62, 217], [63, 217], [...      1      11              1587    60   219    61   218    62   217  ...   564   246   563   246   562   246   561   246   560   246
3    [[54, 201], [55, 200], [55, 199], [56, 198], [...      1      12              1511    54   201    55   200    55   199  ...   502   227   501   228   500   228   499   228   498   228
4    [[64, 275], [65, 274], [66, 274], [67, 274], [...      1      13              1489    64   275    65   274    66   274  ...   490   234   490   235   490   236   490   237   491   238
..                                                 ...    ...     ...               ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
205  [[33, 234], [34, 233], [35, 232], [36, 231], [...      7       5              1801    33   234    34   233    35   232  ...   533   202   533   203   534   204   534   205   534   206
206  [[21, 155], [22, 154], [23, 153], [24, 152], [...      7       6              1943    21   155    22   154    23   153  ...   586   260   585   259   584   258   583   259   582   259
207  [[45, 324], [46, 323], [47, 322], [48, 321], [...      7       7              1876    45   324    46   323    47   322  ...   597   157   597   158   597   159   597   160   596   161
208  [[70, 255], [71, 254], [72, 254], [73, 253], [...      7       8              1661    70   255    71   254    72   254  ...   531   296   530   297   529   298   528   299   528   300
209  [[48, 233], [49, 232], [49, 231], [50, 230], [...      7       9              1844    48   233    49   232    49   231  ...   579   195   578   195   577   195   576   195   575   195

210 rows × 1784 columns

Preparing data for classification

plane_features, plane_target = get_features_target(plane_df)
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'  (J -> [0, 889])
plane_features

     Perimeter_length    x0    y0    x1    y1    x2    y2    x3    y3    x4  ...  x885  y885  x886  y886  x887  y887  x888  y888  x889  y889
0                1433    64   235    65   234    66   234    67   234    68  ...   471   264   471   265   471   266   471   267   471   268
1                1540    60   139    61   138    62   137    63   137    64  ...   560   304   559   303   558   303   557   302   556   301
2                1587    60   219    61   218    62   217    63   217    64  ...   564   246   563   246   562   246   561   246   560   246
3                1511    54   201    55   200    55   199    56   198    57  ...   502   227   501   228   500   228   499   228   498   228
4                1489    64   275    65   274    66   274    67   274    68  ...   490   234   490   235   490   236   490   237   491   238
..                ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
205              1801    33   234    34   233    35   232    36   231    37  ...   533   202   533   203   534   204   534   205   534   206
206              1943    21   155    22   154    23   153    24   152    25  ...   586   260   585   259   584   258   583   259   582   259
207              1876    45   324    46   323    47   322    48   321    49  ...   597   157   597   158   597   159   597   160   596   161
208              1661    70   255    71   254    72   254    73   253    74  ...   531   296   530   297   529   298   528   299   528   300
209              1844    48   233    49   232    49   231    50   230    51  ...   579   195   578   195   577   195   576   195   575   195

210 rows × 1781 columns

All the class labels are stored in a single Series.

plane_target
0      1
1      1
2      1
3      1
4      1
      ..
205    7
206    7
207    7
208    7
209    7
Name: Class, Length: 210, dtype: int64

3.2. Preprocessing the car dataset

In this problem there are four classes, one per car type: (a) sedan, (b) pickup, (c) minivan, and (d) SUV. In the database, the files are named “ClassX_SampleY.mat”, where X is the class number and Y is the sample number.

Here is the correspondence between class number and class name (car type):

  • 1 = sedan
  • 2 = pickup
  • 3 = minivan
  • 4 = SUV

Image(filename='shape_car.png')
4 types of cars: (a) sedan, (b) pickup, (c) minivan, or (d) SUV

Create dataframe

car_df = get_dataframe(car_mats)
car_df

                                                     x
0    [[113, 181], [114, 180], [114, 179], [114, 178...
1    [[98, 180], [99, 179], [99, 178], [100, 177], ...
2    [[70, 180], [71, 180], [72, 179], [73, 178], [...
3    [[54, 184], [55, 183], [56, 183], [57, 183], [...
4    [[44, 180], [45, 179], [46, 179], [47, 178], [...
..                                                 ...
115  [[101, 182], [102, 182], [103, 182], [104, 182...
116  [[46, 180], [47, 180], [48, 179], [48, 178], [...
117  [[31, 173], [32, 173], [33, 174], [34, 174], [...
118  [[20, 170], [21, 171], [22, 170], [23, 170], [...
119  [[36, 175], [37, 174], [37, 173], [37, 172], [...

120 rows × 1 columns

Get class and sample numbers

At this point, the only attribute in our car DataFrame is x, which holds the Cartesian coordinates of each point on the perimeter of the shape. We need to add more information, namely the class value and the sample number.

# car_mats_file_name holds the names of the files in the order they were read.
# Let's split that list in two: one with the class numbers and one with the sample numbers.
car_class_n, car_sample_n = get_samples_classes(car_mats_file_name)
print("This is what our class_n looks like: \n")
np.array(car_class_n)
This is what our class_n looks like: 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4])
print("This is what our sample_n looks like: \n")
np.array(car_sample_n)
This is what our sample_n looks like: 

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12,
       13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28,
       29,  3, 30,  4,  5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19,  2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,
        5,  6,  7,  8,  9,  1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29,  3, 30,  4,  5,  6,  7,  8,
        9])

Let's add those lists to the car DataFrame.

add_samples_classes(car_df, car_class_n, car_sample_n)
print("This is, finally, what our car dataFrame looks like: \n")
car_df
This is, finally, what our car dataFrame looks like: 

                                                     x  Class  Sample
0    [[113, 181], [114, 180], [114, 179], [114, 178...      1       1
1    [[98, 180], [99, 179], [99, 178], [100, 177], ...      1      10
2    [[70, 180], [71, 180], [72, 179], [73, 178], [...      1      11
3    [[54, 184], [55, 183], [56, 183], [57, 183], [...      1      12
4    [[44, 180], [45, 179], [46, 179], [47, 178], [...      1      13
..                                                 ...    ...     ...
115  [[101, 182], [102, 182], [103, 182], [104, 182...      4       5
116  [[46, 180], [47, 180], [48, 179], [48, 178], [...      4       6
117  [[31, 173], [32, 173], [33, 174], [34, 174], [...      4       7
118  [[20, 170], [21, 171], [22, 170], [23, 170], [...      4       8
119  [[36, 175], [37, 174], [37, 173], [37, 172], [...      4       9

120 rows × 3 columns

Are the classes balanced? Yes

The description of the database states that each class has 30 samples, but we count them to make sure.

print_class_count(car_df)
Quantity of samples in each class:
4    30
3    30
2    30
1    30
Name: Class, dtype: int64

Let’s add another feature to our database

As mentioned before, the only feature descriptor of the shapes is x, which holds the Cartesian coordinates of each point on the perimeter of the shape. The number of points in each contour is not stored as a separate feature: it is only implicit in the length of each x sample, so we make it explicit.

car_df = add_perimeter(car_df)
print("This is what our car dataFrame looks like: \n")
car_df
This is what our car dataFrame looks like: 

                                                     x  Class  Sample  Perimeter_length
0    [[113, 181], [114, 180], [114, 179], [114, 178...      1       1               310
1    [[98, 180], [99, 179], [99, 178], [100, 177], ...      1      10               331
2    [[70, 180], [71, 180], [72, 179], [73, 178], [...      1      11               344
3    [[54, 184], [55, 183], [56, 183], [57, 183], [...      1      12               334
4    [[44, 180], [45, 179], [46, 179], [47, 178], [...      1      13               322
..                                                 ...    ...     ...               ...
115  [[101, 182], [102, 182], [103, 182], [104, 182...      4       5               373
116  [[46, 180], [47, 180], [48, 179], [48, 178], [...      4       6               358
117  [[31, 173], [32, 173], [33, 174], [34, 174], [...      4       7               374
118  [[20, 170], [21, 171], [22, 170], [23, 170], [...      4       8               356
119  [[36, 175], [37, 174], [37, 173], [37, 172], [...      4       9               333

120 rows × 4 columns

Changing how x feature is represented

We now apply the same steps to the car data: we find the minimum perimeter length, keep that many points from each contour, and add one xJ and one yJ column per point.

min_len = min_length(car_df)
print(min_len)
x_coordinates, y_coordinates = separate_coordinates(car_df, min_len)
x_stack, y_stack = get_stacks(x_coordinates, y_coordinates)
car_df = insert_columns(car_df, x_stack, y_stack)
car_df
272

                                                     x  Class  Sample  Perimeter_length    x0    y0    x1    y1    x2    y2  ...  x267  y267  x268  y268  x269  y269  x270  y270  x271  y271
0    [[113, 181], [114, 180], [114, 179], [114, 178...      1       1               310   113   181   114   180   114   179  ...   150   189   149   189   148   190   147   191   146   191
1    [[98, 180], [99, 179], [99, 178], [100, 177], ...      1      10               331    98   180    99   179    99   178  ...   140   188   139   188   138   189   139   190   138   190
2    [[70, 180], [71, 180], [72, 179], [73, 178], [...      1      11               344    70   180    71   180    72   179  ...   131   186   130   187   129   187   128   187   127   187
3    [[54, 184], [55, 183], [56, 183], [57, 183], [...      1      12               334    54   184    55   183    56   183  ...   108   186   107   187   106   187   105   187   104   188
4    [[44, 180], [45, 179], [46, 179], [47, 178], [...      1      13               322    44   180    45   179    46   179  ...    84   189    83   189    82   190    81   191    82   192
..                                                 ...    ...     ...               ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
115  [[101, 182], [102, 182], [103, 182], [104, 182...      4       5               373   101   182   102   182   103   182  ...   186   188   185   188   184   188   183   188   182   188
116  [[46, 180], [47, 180], [48, 179], [48, 178], [...      4       6               358    46   180    47   180    48   179  ...   131   186   130   186   129   186   128   186   127   186
117  [[31, 173], [32, 173], [33, 174], [34, 174], [...      4       7               374    31   173    32   173    33   174  ...   111   187   110   188   109   188   108   189   107   189
118  [[20, 170], [21, 171], [22, 170], [23, 170], [...      4       8               356    20   170    21   171    22   170  ...    76   189    75   189    74   189    73   189    72   189
119  [[36, 175], [37, 174], [37, 173], [37, 172], [...      4       9               333    36   175    37   174    37   173  ...    76   186    75   187    74   188    74   189    73   188

120 rows × 548 columns

Preparing data for classification

car_features, car_target = get_features_target(car_df)
# The selected features are: 'Perimeter_length', 'xJ' and 'yJ'  (J -> [0, 271])
car_features

     Perimeter_length    x0    y0    x1    y1    x2    y2    x3    y3    x4  ...  x267  y267  x268  y268  x269  y269  x270  y270  x271  y271
0                 310   113   181   114   180   114   179   114   178   114  ...   150   189   149   189   148   190   147   191   146   191
1                 331    98   180    99   179    99   178   100   177   101  ...   140   188   139   188   138   189   139   190   138   190
2                 344    70   180    71   180    72   179    73   178    72  ...   131   186   130   187   129   187   128   187   127   187
3                 334    54   184    55   183    56   183    57   183    58  ...   108   186   107   187   106   187   105   187   104   188
4                 322    44   180    45   179    46   179    47   178    48  ...    84   189    83   189    82   190    81   191    82   192
..                ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
115               373   101   182   102   182   103   182   104   182   105  ...   186   188   185   188   184   188   183   188   182   188
116               358    46   180    47   180    48   179    48   178    48  ...   131   186   130   186   129   186   128   186   127   186
117               374    31   173    32   173    33   174    34   174    35  ...   111   187   110   188   109   188   108   189   107   189
118               356    20   170    21   171    22   170    23   170    24  ...    76   189    75   189    74   189    73   189    72   189
119               333    36   175    37   174    37   173    37   172    38  ...    76   186    75   187    74   188    74   189    73   188

120 rows × 545 columns

All the class labels are stored in a single Series.

car_target
0      1
1      1
2      1
3      1
4      1
      ..
115    4
116    4
117    4
118    4
119    4
Name: Class, Length: 120, dtype: int64

4. Dividing train and test data

To evaluate the accuracy of the classifiers, we split each dataset into two sets: train and test. Rows at even positions go to the train set and rows at odd positions go to the test set, so each set has the same number of samples of each class (15).

Divide train and test features

def train_test_features(features):
    train_features = features[0::2]
    test_features = features[1::2]

    return train_features, test_features

Divide train and test target

def train_test_target(target):
    train_target = target[0::2]
    test_target = target[1::2]

    return train_target, test_target
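
Taking every second row works here because the rows are ordered by class and each class has an even number of samples (30), so both halves receive 15 samples per class. The small sketch below checks this property on a made-up label list; `labels` is a stand-in for the real `Class` column.

```python
from collections import Counter

# Stand-in labels: 4 classes with 30 samples each, ordered by class
labels = [c for c in range(1, 5) for _ in range(30)]

# Same slicing as train_test_features / train_test_target
train_labels = labels[0::2]
test_labels = labels[1::2]

print(Counter(train_labels))
print(Counter(test_labels))
```

If the class sizes were odd, or the rows were not grouped by class, this slicing would no longer guarantee a balanced split.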

4.1. Dividing the plane data

plane_train_features, plane_test_features = train_test_features(plane_features)
plane_train_target, plane_test_target = train_test_target(plane_target)

4.2. Dividing the car data

car_train_features, car_test_features = train_test_features(car_features)
car_train_target, car_test_target = train_test_target(car_target)

5. Scaling the data

5.1. Scaling the plane data

plane_scaler = StandardScaler()
plane_scaler = plane_scaler.fit(plane_train_features)
plane_train_features_scaled = plane_scaler.transform(plane_train_features)
plane_test_features_scaled = plane_scaler.transform(plane_test_features)
plane_train_features_scaled
array([[-0.73999423, -0.24373978, -0.30372496, ...,  0.0414752 ,
        -0.87300239,  0.0528201 ],
       [ 0.13049222, -0.35786543, -0.62363589, ..., -0.33028419,
         0.5277887 , -0.33576091],
       [-0.42345371, -0.24373978,  0.49605237, ..., -0.48960965,
        -0.55821787, -0.47706309],
       ...,
       [ 2.30670835, -1.09968214,  2.01562929, ..., -2.29529812,
         0.59074561, -2.29632872],
       [ 2.14278557, -1.4705905 , -1.90327961, ..., -0.10014743,
         0.87405167, -0.10614486],
       [ 0.54877792, -0.0725513 ,  0.0961637 , ...,  0.6079657 ,
         0.02413348,  0.61802884]])
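
`StandardScaler` standardizes each feature column to zero mean and unit variance, using the mean and the population standard deviation learned in `fit`. Fitting on the training rows only, as done above, keeps information from the test set out of the preprocessing. The following NumPy sketch reproduces the computation by hand; the toy arrays are made up for illustration.

```python
import numpy as np

# Toy data: 3 training rows and 1 test row with 2 features each
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 70.0]])

# Statistics are computed from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

The scaled test row is not guaranteed to have zero mean or unit variance, which is expected: it is expressed in units of the training distribution.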

5.2. Scaling the car data

car_scaler = StandardScaler()
car_scaler = car_scaler.fit(car_train_features)
car_train_features_scaled = car_scaler.transform(car_train_features)
car_test_features_scaled = car_scaler.transform(car_test_features)
car_train_features_scaled
array([[-1.03401769,  0.63862483,  0.65780993, ...,  0.88122895,
        -0.64015655,  0.87599671],
       [-0.75473846, -0.16471601,  0.58205447, ...,  0.67629199,
        -0.8871205 ,  0.67008171],
       [-0.93544855, -0.65045698,  0.58205447, ...,  0.88122895,
        -1.47203511,  0.92747546],
       ...,
       [-0.55760017,  0.73203656,  0.12752169, ...,  0.72752623,
        -0.14622865,  0.72156046],
       [-0.63974112, -0.61309229,  0.58205447, ...,  0.62505775,
        -0.8871205 ,  0.61860296],
       [-0.65616931, -1.09883327, -0.17550015, ...,  0.77876047,
        -1.60201613,  0.77303921]])

6. Classification

Defining the classifiers

We define the three classifiers used.

def get_classifiers():
    dt  = DecisionTreeClassifier()
    lda = LinearDiscriminantAnalysis()
    lg  = LogisticRegression(max_iter=2000)
    return dt, lda, lg

Learning the classifiers

We use the training data to fit the three classifiers.

def fit_classifiers(dt, lda, lg, train_features, train_target):
    dt.fit(train_features, train_target)
    lda.fit(train_features, train_target)
    lg.fit(train_features, train_target)

Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

def predict_classifiers(dt, lda, lg, test_features):
    dt_test_predictions = dt.predict(test_features)
    lda_test_predictions = lda.predict(test_features)
    lg_test_predictions = lg.predict(test_features)
    
    return dt_test_predictions, lda_test_predictions, lg_test_predictions

6.1. Classification for the plane data

Not scaled data

Defining the classifiers

We define the three classifiers used.

plane_dt, plane_lda, plane_lg = get_classifiers()

Learning the classifiers

We use the training data to fit the three classifiers.

fit_classifiers(plane_dt, plane_lda, plane_lg, plane_train_features, plane_train_target)

Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions = \
predict_classifiers(plane_dt, plane_lda, plane_lg, plane_test_features)

Scaled data

Learning the classifiers

We refit the same three classifiers on the scaled training data.

fit_classifiers(plane_dt, plane_lda, plane_lg, plane_train_features_scaled, plane_train_target)

Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled = \
predict_classifiers(plane_dt, plane_lda, plane_lg, plane_test_features_scaled)

6.2. Classification for the car data

Not scaled data

Defining the classifiers

We define the three classifiers used.

car_dt, car_lda, car_lg = get_classifiers()

Learning the classifiers

We use the training data to fit the three classifiers.

fit_classifiers(car_dt, car_lda, car_lg, car_train_features, car_train_target)

Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions = \
predict_classifiers(car_dt, car_lda, car_lg, car_test_features)

Scaled data

Learning the classifiers

We refit the same three classifiers on the scaled training data.

fit_classifiers(car_dt, car_lda, car_lg, car_train_features_scaled, car_train_target)

Using the classifier for predictions

We predict the class of the samples in the test data with the three classifiers.

car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled = \
predict_classifiers(car_dt, car_lda, car_lg, car_test_features_scaled)

7. Validation

Computing the accuracy

We compute the accuracy using the three classifiers and print it.

def print_accuracies(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    dt_acc =  accuracy_score(test_target, dt_test_predictions)
    lda_acc =  accuracy_score(test_target, lda_test_predictions)
    lg_acc =  accuracy_score(test_target, lg_test_predictions)
    print("Accuracy for the decision tree :", dt_acc)
    print("Accuracy for LDA :", lda_acc)
    print("Accuracy for logistic regression:", lg_acc)

Computing the confusion matrices

We compute and print the confusion matrices for the three classifiers. The commented-out to_latex() calls can be enabled to generate the LaTeX code for the written report.

def print_confusion_matrices(test_target, dt_test_predictions, lda_test_predictions, lg_test_predictions):
    print("Confusion matrix decision tree")
    cm_dt = pd.crosstab(test_target, dt_test_predictions)
    print(cm_dt)
    print()
    #print(cm_dt.to_latex())
    
    print("Confusion matrix LDA")
    cm_lda = pd.crosstab(test_target, lda_test_predictions)
    print(cm_lda)
    print()
    #print(cm_lda.to_latex())
    
    print("Confusion matrix Logistic regression")
    cm_lg = pd.crosstab(test_target, lg_test_predictions)
    print(cm_lg)
    print()
    #print(cm_lg.to_latex())

7.1. Validation for the plane data

Not scaled data

Computing the accuracy

We compute the accuracy using the three classifiers and print it. Accuracy is a suitable measure here because the classes in both datasets are balanced.

print_accuracies(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)
Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428

Computing the confusion matrices

We compute and print the confusion matrices for the three classifiers. Rows are the true classes and columns are the predicted classes.

print_confusion_matrices(plane_test_target, plane_dt_test_predictions, plane_lda_test_predictions, plane_lg_test_predictions)
Confusion matrix decision tree
col_0  1   2   3   4   5   6   7
Class                           
1      8   2   1   1   0   1   2
2      0  10   0   0   1   0   4
3      2   0  10   0   0   2   1
4      0   1   0  14   0   0   0
5      0   0   0   0  14   1   0
6      0   0   0   0   0  15   0
7      2   0   0   0   0   0  13

Confusion matrix LDA
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   0   0   0   2   0
2       0  12   0   0   0   0   3
3       0   0  15   0   0   0   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   0   0   0  15

Confusion matrix Logistic regression
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   1   0   0   1   0
2       1  11   0   0   0   0   3
3       0   0  15   0   0   0   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   0   0   0  15
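
As a worked check, the accuracy can be recovered from a confusion matrix as the sum of the diagonal divided by the total number of test samples. The sketch below copies the LDA matrix for the unscaled plane data from the output above; the variable names are ours.

```python
import numpy as np

# LDA confusion matrix for the unscaled plane data (rows: true class, columns: predicted class)
cm_lda = np.array([
    [13,  0,  0,  0,  0,  2,  0],
    [ 0, 12,  0,  0,  0,  0,  3],
    [ 0,  0, 15,  0,  0,  0,  0],
    [ 0,  0,  0, 15,  0,  0,  0],
    [ 0,  0,  0,  0, 15,  0,  0],
    [ 0,  0,  0,  0,  0, 15,  0],
    [ 0,  0,  0,  0,  0,  0, 15],
])

# Accuracy: correctly classified samples (diagonal) over all samples
accuracy = np.trace(cm_lda) / cm_lda.sum()

# Recall of each class: diagonal entry over the row total
recall_per_class = np.diag(cm_lda) / cm_lda.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

The result, 100 / 105, agrees with the LDA accuracy printed by print_accuracies, and the row totals confirm that each class has 15 test samples.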

Scaled data

Computing the accuracy

We compute the accuracy using the three classifiers and print it.

print_accuracies(plane_test_target, plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled)
Accuracy for the decision tree : 0.819047619047619
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239

Computing the confusion matrices

We compute and print the confusion matrices for the three classifiers. Uncommenting the to_latex lines in the function also generates the LaTeX code to insert them in our written report.

print_confusion_matrices(plane_test_target, plane_dt_test_predictions_scaled, plane_lda_test_predictions_scaled, plane_lg_test_predictions_scaled)
Confusion matrix decision tree
col_0  1   2   3   4   5   6   7
Class                           
1      9   1   2   0   0   1   2
2      0  10   0   0   1   0   4
3      1   0  11   0   0   2   1
4      0   1   0  14   0   0   0
5      0   0   0   0  14   1   0
6      0   0   0   0   0  15   0
7      2   0   0   0   0   0  13

Confusion matrix LDA
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   0   0   0   2   0
2       0  12   0   0   0   0   3
3       0   0  15   0   0   0   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   0   0   0  15

Confusion matrix Logistic regression
col_0   1   2   3   4   5   6   7
Class                            
1      13   0   1   0   0   1   0
2       1  11   0   0   0   0   3
3       0   0  14   0   0   1   0
4       0   0   0  15   0   0   0
5       0   0   0   0  15   0   0
6       0   0   0   0   0  15   0
7       0   0   0   1   0   0  14

7.2. Validation for the car data

Not scaled data

Computing the accuracy

We compute the accuracy using the three classifiers and print it.

print_accuracies(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)
Accuracy for the decision tree : 0.75
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667

Computing the confusion matrices

We compute and print the confusion matrices for the three classifiers. Uncommenting the to_latex lines in the function also generates the LaTeX code to insert them in our written report.

print_confusion_matrices(car_test_target, car_dt_test_predictions, car_lda_test_predictions, car_lg_test_predictions)
Confusion matrix decision tree
col_0   1   2   3  4
Class               
1      13   0   0  2
2       1  10   1  3
3       0   0  15  0
4       3   4   1  7

Confusion matrix LDA
col_0   1   2   3   4
Class                
1      13   0   0   2
2       0  15   0   0
3       0   1  13   1
4       0   0   2  13

Confusion matrix Logistic regression
col_0   1   2   3   4
Class                
1      14   0   1   0
2       0  14   1   0
3       0   1  12   2
4       0   0   3  12

Scaled data

Computing the accuracy

We compute the accuracy using the three classifiers and print it.

print_accuracies(car_test_target, car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled)
Accuracy for the decision tree : 0.7833333333333333
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

Computing the confusion matrices

We compute and print the confusion matrices for the three classifiers. Uncommenting the to_latex lines in the function also generates the LaTeX code to insert them in our written report.

print_confusion_matrices(car_test_target, car_dt_test_predictions_scaled, car_lda_test_predictions_scaled, car_lg_test_predictions_scaled)
Confusion matrix decision tree
col_0   1   2   3  4
Class               
1      13   1   0  1
2       0  11   1  3
3       0   0  15  0
4       2   3   2  8

Confusion matrix LDA
col_0   1   2   3   4
Class                
1      13   0   0   2
2       0  15   0   0
3       0   1  13   1
4       0   0   2  13

Confusion matrix Logistic regression
col_0   1   2   3   4
Class                
1      14   0   0   1
2       0  15   0   0
3       0   1  14   0
4       0   0   3  12

8. Feature Selection

8.1. Feature Selection with SelectKBest

At this point we have a large number of features: two (x and y) for each point of the contour perimeter. We normalized the length of each plane image to 890 points, so 890 * 2 = 1780 features represent each image. We normalized the length of each car image to 272 points, so 272 * 2 = 544 features represent each image. Let’s try to reduce the number of features.

We use SelectKBest with the f_classif score function (the ANOVA F-value) for our feature selection process.

Feature selection

def select_train_test(train_features, train_target, test_features, k):
    # Fit the selector on the training data only, then apply it to both sets
    select_best = SelectKBest(f_classif, k=k).fit(train_features, train_target)
    reduced_train = select_best.transform(train_features)
    reduced_test = select_best.transform(test_features)
    # Return the reduced matrices, not the original feature sets
    return reduced_train, reduced_test
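
To show how SelectKBest with f_classif behaves, here is a minimal sketch on synthetic data (the data, the shift of 3.0 and the choice of informative columns 2 and 4 are assumptions made for illustration). Only the two columns whose mean depends on the class get a high F-score, so they are the ones kept for k=2.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 noise features, two classes with 100 samples each
rng = np.random.default_rng(0)
n_samples = 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, 6))

# Make columns 2 and 4 informative by shifting their mean for class 1
X[:, 2] += 3.0 * y
X[:, 4] -= 3.0 * y

# Keep the 2 features with the highest ANOVA F-value
selector = SelectKBest(f_classif, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())
X_reduced = selector.transform(X)

print(selected)
print(X_reduced.shape)
```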

Plane Dataset

Feature selection

As always, we work with three pieces of data: training features, test features and class target. We select the k best features for k = 100, 200, 300, 400 and 500, for both the unscaled and the scaled data:

# Not Scaled data
plane_reduced_train_100, plane_reduced_test_100 = select_train_test(plane_train_features, plane_train_target, plane_test_features, k=100)
plane_reduced_train_200, plane_reduced_test_200 = select_train_test(plane_train_features, plane_train_target, plane_test_features, k=200)
plane_reduced_train_300, plane_reduced_test_300 = select_train_test(plane_train_features, plane_train_target, plane_test_features, k=300)
plane_reduced_train_400, plane_reduced_test_400 = select_train_test(plane_train_features, plane_train_target, plane_test_features, k=400)
plane_reduced_train_500, plane_reduced_test_500 = select_train_test(plane_train_features, plane_train_target, plane_test_features, k=500)
# Scaled data
plane_reduced_train_100_scaled, plane_reduced_test_100_scaled = \
    select_train_test(plane_train_features_scaled, plane_train_target, plane_test_features_scaled, k=100)
plane_reduced_train_200_scaled, plane_reduced_test_200_scaled = \
    select_train_test(plane_train_features_scaled, plane_train_target, plane_test_features_scaled, k=200)
plane_reduced_train_300_scaled, plane_reduced_test_300_scaled = \
    select_train_test(plane_train_features_scaled, plane_train_target, plane_test_features_scaled, k=300)
plane_reduced_train_400_scaled, plane_reduced_test_400_scaled = \
    select_train_test(plane_train_features_scaled, plane_train_target, plane_test_features_scaled, k=400)
plane_reduced_train_500_scaled, plane_reduced_test_500_scaled = \
    select_train_test(plane_train_features_scaled, plane_train_target, plane_test_features_scaled, k=500)

Define classifiers

It is better to create a separate set of classifiers for each number of features so that we do not mix them up:

plane_dt_100, plane_lda_100, plane_lg_100 = get_classifiers()
plane_dt_200, plane_lda_200, plane_lg_200 = get_classifiers()
plane_dt_300, plane_lda_300, plane_lg_300 = get_classifiers()
plane_dt_400, plane_lda_400, plane_lg_400 = get_classifiers()
plane_dt_500, plane_lda_500, plane_lg_500 = get_classifiers()

Fit Classifiers

# Not Scaled data
fit_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_train_100, plane_train_target)
fit_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_train_200, plane_train_target)
fit_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_train_300, plane_train_target)
fit_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_train_400, plane_train_target)
fit_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_train_500, plane_train_target)

Predict classifiers

# Not Scaled data
plane_dt_test_predictions_100, plane_lda_test_predictions_100, plane_lg_test_predictions_100 = \
predict_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_test_100)

plane_dt_test_predictions_200, plane_lda_test_predictions_200, plane_lg_test_predictions_200 = \
predict_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_test_200)

plane_dt_test_predictions_300, plane_lda_test_predictions_300, plane_lg_test_predictions_300 = \
predict_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_test_300)

plane_dt_test_predictions_400, plane_lda_test_predictions_400, plane_lg_test_predictions_400 = \
predict_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_test_400)

plane_dt_test_predictions_500, plane_lda_test_predictions_500, plane_lg_test_predictions_500 = \
predict_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_test_500)

Fitting classifiers scaled

# Scaled data
fit_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_train_100_scaled, plane_train_target)
fit_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_train_200_scaled, plane_train_target)
fit_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_train_300_scaled, plane_train_target)
fit_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_train_400_scaled, plane_train_target)
fit_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_train_500_scaled, plane_train_target)

Predicting classifiers scaled

# Scaled data
plane_dt_test_predictions_100_scaled, plane_lda_test_predictions_100_scaled, plane_lg_test_predictions_100_scaled = \
predict_classifiers(plane_dt_100, plane_lda_100, plane_lg_100, plane_reduced_test_100_scaled)

plane_dt_test_predictions_200_scaled, plane_lda_test_predictions_200_scaled, plane_lg_test_predictions_200_scaled = \
predict_classifiers(plane_dt_200, plane_lda_200, plane_lg_200, plane_reduced_test_200_scaled)

plane_dt_test_predictions_300_scaled, plane_lda_test_predictions_300_scaled, plane_lg_test_predictions_300_scaled = \
predict_classifiers(plane_dt_300, plane_lda_300, plane_lg_300, plane_reduced_test_300_scaled)

plane_dt_test_predictions_400_scaled, plane_lda_test_predictions_400_scaled, plane_lg_test_predictions_400_scaled = \
predict_classifiers(plane_dt_400, plane_lda_400, plane_lg_400, plane_reduced_test_400_scaled)

plane_dt_test_predictions_500_scaled, plane_lda_test_predictions_500_scaled, plane_lg_test_predictions_500_scaled = \
predict_classifiers(plane_dt_500, plane_lda_500, plane_lg_500, plane_reduced_test_500_scaled)

Calculating accuracy

Let’s see how the accuracy evolves depending on the number of features selected:

print("100 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_100, plane_lda_test_predictions_100, plane_lg_test_predictions_100)

print("\n")

print("100 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_100_scaled, plane_lda_test_predictions_100_scaled, plane_lg_test_predictions_100_scaled)
100 features + Not scaled: 

Accuracy for the decision tree : 0.780952380952381
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


100 features + Scaled: 

Accuracy for the decision tree : 0.819047619047619
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239

print("200 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_200, plane_lda_test_predictions_200, plane_lg_test_predictions_200)

print("\n")

print("200 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_200_scaled, plane_lda_test_predictions_200_scaled, plane_lg_test_predictions_200_scaled)
200 features + Not scaled: 

Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


200 features + Scaled: 

Accuracy for the decision tree : 0.8285714285714286
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239

print("300 features + Not Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_300, plane_lda_test_predictions_300, plane_lg_test_predictions_300)

print("\n")

print("300 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_300_scaled, plane_lda_test_predictions_300_scaled, plane_lg_test_predictions_300_scaled)
300 features + Not Scaled: 

Accuracy for the decision tree : 0.7428571428571429
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


300 features + Scaled: 

Accuracy for the decision tree : 0.8095238095238095
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239

print("400 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_400, plane_lda_test_predictions_400, plane_lg_test_predictions_400)

print("\n")

print("400 features + Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_400_scaled, plane_lda_test_predictions_400_scaled, plane_lg_test_predictions_400_scaled)
400 features + Not scaled: 

Accuracy for the decision tree : 0.780952380952381
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


400 features + Scaled: 

Accuracy for the decision tree : 0.7523809523809524
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239

print("500 features + Not scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_500, plane_lda_test_predictions_500, plane_lg_test_predictions_500)

print("\n")

print("500 features+ Scaled: \n")
print_accuracies(plane_test_target, plane_dt_test_predictions_500_scaled, plane_lda_test_predictions_500_scaled, plane_lg_test_predictions_500_scaled)
500 features + Not scaled: 

Accuracy for the decision tree : 0.7904761904761904
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9428571428571428


500 features+ Scaled: 

Accuracy for the decision tree : 0.780952380952381
Accuracy for LDA : 0.9523809523809523
Accuracy for logistic regression: 0.9238095238095239
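
The sweep over the number of selected features can also be written as a loop. The following is only a sketch of that idea on synthetic data from make_classification, not the code used for the results above; the values of k, the single logistic regression classifier and the accuracies dictionary are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a shape dataset: 40 features, 5 of them informative
X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# One pipeline (selector + classifier) per value of k, results stored by k
accuracies = {}
for k in (5, 10, 20):
    model = make_pipeline(SelectKBest(f_classif, k=k), LogisticRegression(max_iter=2000))
    model.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, model.predict(X_test))

for k, acc in accuracies.items():
    print(f"{k} features: {acc}")
```

Wrapping the selector and the classifier in one pipeline guarantees that the test data pass through the same selection that was fitted on the training data.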

Car Dataset

Feature selection

As always, we work with three pieces of data: training features, test features and class target. We select the k best features for k = 100, 200, 300, 400 and 500, for both the unscaled and the scaled data:

# Not Scaled data
car_reduced_train_100, car_reduced_test_100 = select_train_test(car_train_features, car_train_target, car_test_features, k=100)
car_reduced_train_200, car_reduced_test_200 = select_train_test(car_train_features, car_train_target, car_test_features, k=200)
car_reduced_train_300, car_reduced_test_300 = select_train_test(car_train_features, car_train_target, car_test_features, k=300)
car_reduced_train_400, car_reduced_test_400 = select_train_test(car_train_features, car_train_target, car_test_features, k=400)
car_reduced_train_500, car_reduced_test_500 = select_train_test(car_train_features, car_train_target, car_test_features, k=500)
# Scaled data
car_reduced_train_100_scaled, car_reduced_test_100_scaled = \
    select_train_test(car_train_features_scaled, car_train_target, car_test_features_scaled, k=100)
car_reduced_train_200_scaled, car_reduced_test_200_scaled = \
    select_train_test(car_train_features_scaled, car_train_target, car_test_features_scaled, k=200)
car_reduced_train_300_scaled, car_reduced_test_300_scaled = \
    select_train_test(car_train_features_scaled, car_train_target, car_test_features_scaled, k=300)
car_reduced_train_400_scaled, car_reduced_test_400_scaled = \
    select_train_test(car_train_features_scaled, car_train_target, car_test_features_scaled, k=400)
car_reduced_train_500_scaled, car_reduced_test_500_scaled = \
    select_train_test(car_train_features_scaled, car_train_target, car_test_features_scaled, k=500)

Define classifiers

It is better to create a separate set of classifiers for each number of features so that we do not mix them up:

car_dt_100, car_lda_100, car_lg_100 = get_classifiers()
car_dt_200, car_lda_200, car_lg_200 = get_classifiers()
car_dt_300, car_lda_300, car_lg_300 = get_classifiers()
car_dt_400, car_lda_400, car_lg_400 = get_classifiers()
car_dt_500, car_lda_500, car_lg_500 = get_classifiers()

Fit Classifiers

# Not Scaled data
fit_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_train_100, car_train_target)
fit_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_train_200, car_train_target)
fit_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_train_300, car_train_target)
fit_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_train_400, car_train_target)
fit_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_train_500, car_train_target)

Predict classifiers

# Not Scaled data
car_dt_test_predictions_100, car_lda_test_predictions_100, car_lg_test_predictions_100 = \
predict_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_test_100)

car_dt_test_predictions_200, car_lda_test_predictions_200, car_lg_test_predictions_200 = \
predict_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_test_200)

car_dt_test_predictions_300, car_lda_test_predictions_300, car_lg_test_predictions_300 = \
predict_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_test_300)

car_dt_test_predictions_400, car_lda_test_predictions_400, car_lg_test_predictions_400 = \
predict_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_test_400)

car_dt_test_predictions_500, car_lda_test_predictions_500, car_lg_test_predictions_500 = \
predict_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_test_500)

Fitting classifiers scaled

# Scaled data
fit_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_train_100_scaled, car_train_target)
fit_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_train_200_scaled, car_train_target)
fit_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_train_300_scaled, car_train_target)
fit_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_train_400_scaled, car_train_target)
fit_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_train_500_scaled, car_train_target)

Predicting classifiers scaled

# Scaled data
car_dt_test_predictions_100_scaled, car_lda_test_predictions_100_scaled, car_lg_test_predictions_100_scaled = \
predict_classifiers(car_dt_100, car_lda_100, car_lg_100, car_reduced_test_100_scaled)

car_dt_test_predictions_200_scaled, car_lda_test_predictions_200_scaled, car_lg_test_predictions_200_scaled = \
predict_classifiers(car_dt_200, car_lda_200, car_lg_200, car_reduced_test_200_scaled)

car_dt_test_predictions_300_scaled, car_lda_test_predictions_300_scaled, car_lg_test_predictions_300_scaled = \
predict_classifiers(car_dt_300, car_lda_300, car_lg_300, car_reduced_test_300_scaled)

car_dt_test_predictions_400_scaled, car_lda_test_predictions_400_scaled, car_lg_test_predictions_400_scaled = \
predict_classifiers(car_dt_400, car_lda_400, car_lg_400, car_reduced_test_400_scaled)

car_dt_test_predictions_500_scaled, car_lda_test_predictions_500_scaled, car_lg_test_predictions_500_scaled = \
predict_classifiers(car_dt_500, car_lda_500, car_lg_500, car_reduced_test_500_scaled)

Calculating accuracy

Let’s see how the accuracy evolves depending on the number of features selected:

print("100 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_100, car_lda_test_predictions_100, car_lg_test_predictions_100)

print("\n")

print("100 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_100_scaled, car_lda_test_predictions_100_scaled, car_lg_test_predictions_100_scaled)
100 features + Not scaled: 

Accuracy for the decision tree : 0.8666666666666667
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


100 features + Scaled: 

Accuracy for the decision tree : 0.7166666666666667
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

print("200 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_200, car_lda_test_predictions_200, car_lg_test_predictions_200)

print("\n")

print("200 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_200_scaled, car_lda_test_predictions_200_scaled, car_lg_test_predictions_200_scaled)
200 features + Not scaled: 

Accuracy for the decision tree : 0.7
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


200 features + Scaled: 

Accuracy for the decision tree : 0.8166666666666667
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

print("300 features + Not Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_300, car_lda_test_predictions_300, car_lg_test_predictions_300)

print("\n")

print("300 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_300_scaled, car_lda_test_predictions_300_scaled, car_lg_test_predictions_300_scaled)
300 features + Not Scaled: 

Accuracy for the decision tree : 0.8333333333333334
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


300 features + Scaled: 

Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

print("400 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_400, car_lda_test_predictions_400, car_lg_test_predictions_400)

print("\n")

print("400 features + Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_400_scaled, car_lda_test_predictions_400_scaled, car_lg_test_predictions_400_scaled)
400 features + Not scaled: 

Accuracy for the decision tree : 0.8
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


400 features + Scaled: 

Accuracy for the decision tree : 0.8833333333333333
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

print("500 features + Not scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_500, car_lda_test_predictions_500, car_lg_test_predictions_500)

print("\n")

print("500 features+ Scaled: \n")
print_accuracies(car_test_target, car_dt_test_predictions_500_scaled, car_lda_test_predictions_500_scaled, car_lg_test_predictions_500_scaled)
500 features + Not scaled: 

Accuracy for the decision tree : 0.7666666666666667
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.8666666666666667


500 features+ Scaled: 

Accuracy for the decision tree : 0.8333333333333334
Accuracy for LDA : 0.9
Accuracy for logistic regression: 0.9166666666666666

8.2. Feature Selection with RandomForest

Apart from the SelectKBest algorithm, there is an ensemble of decision trees called RandomForest that can be used as a feature selector. Let’s see how this algorithm performs compared to the one above.
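
Before applying it to the shape data, here is a minimal sketch of the mechanism on synthetic data (the data, the shift of 4.0, the informative columns 1 and 5, and n_estimators=50 are assumptions made for illustration). SelectFromModel fits the forest, reads its feature_importances_ and, with its default threshold for this kind of estimator, keeps the features whose importance is at least the mean importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 noise features, two classes with 150 samples each
rng = np.random.default_rng(0)
n_samples = 300
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, 8))

# Make columns 1 and 5 informative by shifting their mean for class 1
X[:, 1] += 4.0 * y
X[:, 5] -= 4.0 * y

# Fit the forest inside the selector and keep the most important features
forest = RandomForestClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(forest).fit(X, y)
mask = selector.get_support()
X_selected = selector.transform(X)

print(np.flatnonzero(mask))
print(X_selected.shape)
```

Unlike SelectKBest, the number of selected features is not fixed in advance: it depends on how the importances are distributed, which in turn depends on the number of trees and on the random seed.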

Plane dataset

## We run 10 trials and then take the mean of 1) the number of trees and 2) the accuracy
for j in (range(1, 11)):
    max = 0
    n_estim = 0
    result1 = 0
    result2 = 0
    result3 = 0
    classifier = " "
    selected_features_final = []
    
    ## We measure the best accuracies obtained with 1 to 59 trees, with one constraint:
    ## we only move on to a larger number of trees while at least one classifier does not
    ## get worse than with the previous number of trees.
    for i in (range(1, 60)):
        sel = SelectFromModel(RandomForestClassifier(n_estimators = i))
        sel.fit(plane_train_features, plane_train_target)
        # Obtain the selected features by the RandomForestClassifier
        selectedFeaturesBoolean= sel.get_support()
    
        # Take the selected features from each train and test sets
        selectedFeaturesTrainNames = plane_train_features.columns[(selectedFeaturesBoolean)]
        selectedFeaturesTestNames = plane_test_features.columns[(selectedFeaturesBoolean)]
    
        selectedTrainData = plane_train_features[selectedFeaturesTrainNames]
        selectedTestData = plane_test_features[selectedFeaturesTestNames]
        np.array(selectedTestData).shape
        
        # Create the classifiers we are going to use
        dt  = DecisionTreeClassifier()
        lda = LinearDiscriminantAnalysis()
        lg  = LogisticRegression(max_iter=2000)
    
        # Fit the classifiers with the selected features in the train set
        dt.fit(selectedTrainData, plane_train_target)
        lda.fit(selectedTrainData, plane_train_target)
        lg.fit(selectedTrainData, plane_train_target)
    
        # Predict labels with the selected features in the test set
        dt_selected_prediction = dt.predict(selectedTestData)
        lda_selected_prediction = lda.predict(selectedTestData)
        lg_selected_prediction = lg.predict(selectedTestData)
    
        # If all three classifiers score worse than with the previous
        # number of trees -> stop the process
        if (accuracy_score(plane_test_target, dt_selected_prediction)  < result1 and
            accuracy_score(plane_test_target, lda_selected_prediction) < result2 and
            accuracy_score(plane_test_target, lg_selected_prediction)  < result3):
            break
        else:  
            result1 = accuracy_score(plane_test_target, dt_selected_prediction)
            result2 = accuracy_score(plane_test_target, lda_selected_prediction)
            result3 = accuracy_score(plane_test_target, lg_selected_prediction)
    
            if (result1>max or result2>max or result3>max):
                n_estim = i
                selected_features_final = plane_train_features.columns[(selectedFeaturesBoolean)]
                if result1>max:
                    max = result1
                    classifier = "DT"
                elif result2>max:
                    max = result2
                    classifier = "LDA"
                elif result3>max:
                    max = result3
                    classifier = "LG"         
    print("TRIAL: " + str(j) + "\n")
    print("CLASSIFIER: " + classifier + ", " + "TREE_QUANTITY: " + str(n_estim) + ", "
          + "MAX_ACCURACY:" + str(max) + "\n" + "FEATURE_QUANTITY: " + str(len(selected_features_final)) + "\n")
    print("SELECTED_FEATURES: " + str(selected_features_final) + "\n")
    print("\n")
          
          
    #(classifier, n_estim, max, selected_features_final)
    
TRIAL: 1

CLASSIFIER: LDA, TREE_QUANTITY: 19, MAX_ACCURACY:0.9714285714285714
FEATURE_QUANTITY: 202

SELECTED_FEATURES: Index(['x3', 'y6', 'x14', 'x16', 'y24', 'x25', 'x29', 'x37', 'x38', 'x41',
       ...
       'y837', 'x846', 'x851', 'y852', 'x854', 'y855', 'x863', 'x873', 'x875',
       'x880'],
      dtype='object', length=306)



TRIAL: 2

CLASSIFIER: LDA, TREE_QUANTITY: 3, MAX_ACCURACY:0.9333333333333333
FEATURE_QUANTITY: 33

SELECTED_FEATURES: Index(['x9', 'x57', 'y134', 'y160', 'y216', 'x224', 'x225', 'y225', 'x231',
       'y234', 'y245', 'y298', 'y317', 'y325', 'y332', 'y334', 'x339', 'x370',
       'x375', 'x386', 'x396', 'x442', 'x443', 'x452', 'y516', 'y597', 'y598',
       'x609', 'x621', 'y622', 'x629', 'x644', 'x647', 'y650', 'x659', 'y662',
       'y672', 'y710', 'x730', 'y735', 'x750', 'x761', 'x763', 'x799', 'x802',
       'x813'],
      dtype='object')



TRIAL: 3

CLASSIFIER: LDA, TREE_QUANTITY: 21, MAX_ACCURACY:0.9714285714285714
FEATURE_QUANTITY: 221

SELECTED_FEATURES: Index(['x5', 'y5', 'y9', 'x13', 'x19', 'x26', 'x27', 'x28', 'x32', 'x38',
       ...
       'x826', 'x827', 'y831', 'y837', 'y839', 'x843', 'y844', 'x846', 'x848',
       'x889'],
      dtype='object', length=270)



TRIAL: 4

CLASSIFIER: LDA, TREE_QUANTITY: 6, MAX_ACCURACY:0.9714285714285714
FEATURE_QUANTITY: 65

SELECTED_FEATURES: Index(['x4', 'y8', 'y15', 'x17', 'x26', 'x28', 'x36', 'x49', 'x51', 'x62',
       ...
       'x772', 'y772', 'x791', 'x820', 'x823', 'x829', 'x837', 'y837', 'x844',
       'x871'],
      dtype='object', length=126)



TRIAL: 5

CLASSIFIER: LDA, TREE_QUANTITY: 18, MAX_ACCURACY:0.9619047619047619
FEATURE_QUANTITY: 196

SELECTED_FEATURES: Index(['Perimeter_length', 'x1', 'x22', 'x30', 'x44', 'x48', 'x53', 'y57',
       'x68', 'x82',
       ...
       'x818', 'x820', 'y821', 'y823', 'y826', 'x836', 'x857', 'y861', 'y870',
       'y885'],
      dtype='object', length=192)



TRIAL: 6

CLASSIFIER: LDA, TREE_QUANTITY: 4, MAX_ACCURACY:0.9619047619047619
FEATURE_QUANTITY: 40

SELECTED_FEATURES: Index(['x0', 'x21', 'y22', 'y34', 'x53', 'x81', 'y87', 'y116', 'x130', 'y157',
       'x160', 'y162', 'x176', 'y190', 'x194', 'y195', 'x199', 'y215', 'x217',
       'x256', 'y283', 'y315', 'y318', 'y325', 'x337', 'y338', 'y344', 'y361',
       'x366', 'y368', 'x373', 'x374', 'x376', 'x392', 'x403', 'x421', 'x424',
       'y427', 'x438', 'x494', 'x504', 'x515', 'y515', 'x535', 'y535', 'x554',
       'x566', 'x587', 'x610', 'y620', 'x650', 'y653', 'y674', 'y684', 'x692',
       'x700', 'x711', 'y714', 'x717', 'y735', 'x743', 'y745', 'y751', 'x756',
       'x762', 'y806', 'x824', 'x855'],
      dtype='object')



TRIAL: 7

CLASSIFIER: LDA, TREE_QUANTITY: 5, MAX_ACCURACY:0.9714285714285714
FEATURE_QUANTITY: 61

SELECTED_FEATURES: Index(['Perimeter_length', 'x1', 'y16', 'x23', 'x26', 'x33', 'x39', 'y39',
       'x41', 'x48',
       ...
       'x804', 'y808', 'x815', 'x824', 'y834', 'y854', 'x859', 'x873', 'x880',
       'y889'],
      dtype='object', length=238)



TRIAL: 8

CLASSIFIER: LDA, TREE_QUANTITY: 4, MAX_ACCURACY:0.9714285714285714
FEATURE_QUANTITY: 43

SELECTED_FEATURES: Index(['x6', 'x27', 'y27', 'x155', 'y172', 'y173', 'y205', 'y215', 'y218',
       'y286', 'x297', 'y314', 'x322', 'x327', 'x328', 'y337', 'y354', 'x362',
       'y374', 'x390', 'x408', 'x428', 'x443', 'y448', 'y464', 'x469', 'x474',
       'y481', 'x505', 'x506', 'x512', 'y525', 'x526', 'y527', 'x560', 'x565',
       'y581', 'y658', 'x661', 'x670', 'x676', 'y677', 'x680', 'x696', 'x710',
       'y710', 'x728', 'x734', 'y768', 'y782', 'y797', 'y809', 'x820', 'x827',
       'x828', 'y829', 'y835', 'y863'],
      dtype='object')



TRIAL: 9

CLASSIFIER: LDA, TREE_QUANTITY: 6, MAX_ACCURACY:0.9619047619047619
FEATURE_QUANTITY: 70

SELECTED_FEATURES: Index(['x15', 'x21', 'x23', 'x24', 'y41', 'x43', 'y46', 'x48', 'x53', 'x55',
       ...
       'x789', 'x813', 'y824', 'y825', 'x827', 'y828', 'x834', 'y864', 'x877',
       'x882'],
      dtype='object', length=149)



TRIAL: 10

CLASSIFIER: LDA, TREE_QUANTITY: 4, MAX_ACCURACY:0.9619047619047619
FEATURE_QUANTITY: 44

SELECTED_FEATURES: Index(['x3', 'y6', 'y7', 'x32', 'x38', 'y40', 'x41', 'x54', 'x63', 'x81',
       ...
       'x813', 'x814', 'y816', 'x817', 'x824', 'x829', 'y830', 'x851', 'x852',
       'x869'],
      dtype='object', length=157)

Car dataset

## We run 10 trials and then take the mean of 1) the number of trees and 2) the accuracy
for j in (range(1, 11)):
    max = 0
    n_estim = 0
    result1 = 0
    result2 = 0
    result3 = 0
    classifier = " "
    selected_features_final = []
    
    ## We measure the best accuracies obtained with 1 to 59 trees, with one constraint:
    ## we only move on to a larger number of trees while at least one classifier does not
    ## get worse than with the previous number of trees.
    for i in (range(1, 60)):
        sel = SelectFromModel(RandomForestClassifier(n_estimators = i))
        sel.fit(car_train_features, car_train_target)
        # Obtain the selected features by the RandomForestClassifier
        selectedFeaturesBoolean= sel.get_support()
    
        # Take the selected features from each train and test sets
        selectedFeaturesTrainNames = car_train_features.columns[(selectedFeaturesBoolean)]
        selectedFeaturesTestNames = car_test_features.columns[(selectedFeaturesBoolean)]
    
        selectedTrainData = car_train_features[selectedFeaturesTrainNames]
        selectedTestData = car_test_features[selectedFeaturesTestNames]
        np.array(selectedTestData).shape
    
        # Create the classifiers we are going to use
        dt  = DecisionTreeClassifier()
        lda = LinearDiscriminantAnalysis()
        lg  = LogisticRegression(max_iter=2000)
    
        # Fit the classifiers with the selected features in the train set
        dt.fit(selectedTrainData, car_train_target)
        lda.fit(selectedTrainData, car_train_target)
        lg.fit(selectedTrainData, car_train_target)
    
        # Predict labels with the selected features in the test set
        dt_selected_prediction = dt.predict(selectedTestData)
        lda_selected_prediction = lda.predict(selectedTestData)
        lg_selected_prediction = lg.predict(selectedTestData)
        
        # If all three accuracies are lower than in the previous iteration,
        # stop the search
        if (accuracy_score(car_test_target, dt_selected_prediction)  < result1 and
            accuracy_score(car_test_target, lda_selected_prediction) < result2 and
            accuracy_score(car_test_target, lg_selected_prediction)  < result3):
            break
        else:  
            result1 = accuracy_score(car_test_target, dt_selected_prediction)
            result2 = accuracy_score(car_test_target, lda_selected_prediction)
            result3 = accuracy_score(car_test_target, lg_selected_prediction)
    
            if (result1>max or result2>max or result3>max):
                n_estim = i
                selected_features_final = car_train_features.columns[(selectedFeaturesBoolean)]
                if result1>max:
                    max = result1
                    classifier = "DT"
                elif result2>max:
                    max = result2
                    classifier = "LDA"
                elif result3>max:
                    max = result3
                    classifier = "LG"  
    
    print("TRIAL: " + str(j) + "\n")
    print("CLASSIFIER: " + classifier + ", " + "TREE_QUANTITY: " + str(n_estim) + ", "
          + "MAX_ACCURACY:" + str(max) + "\n" + "FEATURE_QUANTITY: " + str(len(selected_features_final)) + "\n")
    # Print the features of the best iteration, so the list matches FEATURE_QUANTITY
    print("SELECTED_FEATURES: " + str(selected_features_final) + "\n")
    print("\n")
          
          
    #(classifier, n_estim, max, selected_features_final)
    
TRIAL: 1

CLASSIFIER: LG, TREE_QUANTITY: 5, MAX_ACCURACY:0.9
FEATURE_QUANTITY: 39

SELECTED_FEATURES: Index(['Perimeter_length', 'y1', 'x13', 'y13', 'y16', 'y20', 'x35', 'y41',
       'y43', 'x45', 'y45', 'x51', 'y51', 'x53', 'x57', 'y70', 'y80', 'y81',
       'y82', 'y83', 'y86', 'y90', 'y105', 'y108', 'y109', 'y111', 'y113',
       'y114', 'x124', 'y144', 'y146', 'y154', 'y157', 'y160', 'y163', 'x165',
       'y171', 'x176', 'y177', 'y179', 'y184', 'x185', 'y185', 'y186', 'y187',
       'y188', 'x189', 'y189', 'x191', 'y191', 'y194', 'y195', 'y197', 'y202',
       'y223', 'x251', 'y254', 'x256', 'x269', 'y270'],
      dtype='object')



TRIAL: 2

CLASSIFIER: DT, TREE_QUANTITY: 6, MAX_ACCURACY:0.8666666666666667
FEATURE_QUANTITY: 51

SELECTED_FEATURES: Index(['x1', 'y16', 'y17', 'x23', 'y28', 'y31', 'x33', 'y35', 'y39', 'y40',
       'x42', 'y42', 'y44', 'y59', 'y63', 'x66', 'y72', 'y75', 'y84', 'y87',
       'y98', 'y102', 'y121', 'x122', 'x129', 'y137', 'y142', 'y146', 'x147',
       'y166', 'y171', 'x172', 'y173', 'y176', 'y180', 'y184', 'y185', 'y191',
       'y192', 'y195', 'y200', 'y210', 'y214', 'x221', 'y223', 'y227', 'y244',
       'y247', 'y252', 'x255', 'y271'],
      dtype='object')



TRIAL: 3

CLASSIFIER: LG, TREE_QUANTITY: 3, MAX_ACCURACY:0.9
FEATURE_QUANTITY: 25

SELECTED_FEATURES: Index(['x1', 'y3', 'x4', 'x5', 'x8', 'y9', 'x14', 'y16', 'y18', 'y19',
       ...
       'y241', 'y248', 'y249', 'x251', 'x264', 'y266', 'x267', 'y267', 'y269',
       'y271'],
      dtype='object', length=106)



TRIAL: 4

CLASSIFIER: LG, TREE_QUANTITY: 16, MAX_ACCURACY:0.9333333333333333
FEATURE_QUANTITY: 106

SELECTED_FEATURES: Index(['y1', 'y7', 'y8', 'y9', 'y10', 'x11', 'y12', 'y14', 'y17', 'y18',
       ...
       'y258', 'x260', 'x261', 'x262', 'x263', 'y266', 'y267', 'y269', 'y270',
       'y271'],
      dtype='object', length=148)



TRIAL: 5

CLASSIFIER: LG, TREE_QUANTITY: 4, MAX_ACCURACY:0.8666666666666667
FEATURE_QUANTITY: 33

SELECTED_FEATURES: Index(['x3', 'y10', 'y18', 'y23', 'y27', 'y50', 'y53', 'y56', 'y105', 'y107',
       'y111', 'y125', 'x152', 'y154', 'y155', 'y156', 'y164', 'y171', 'y172',
       'y173', 'y174', 'y175', 'y176', 'x179', 'y181', 'y182', 'x185', 'y187',
       'x194', 'y209', 'x226', 'y237', 'x251', 'y254', 'y259', 'x269'],
      dtype='object')



TRIAL: 6

CLASSIFIER: DT, TREE_QUANTITY: 1, MAX_ACCURACY:0.8333333333333334
FEATURE_QUANTITY: 10

SELECTED_FEATURES: Index(['y0', 'y43', 'y64', 'y112', 'y150', 'y155', 'y161', 'y162', 'y173',
       'y177', 'y178', 'y222', 'x263', 'y267'],
      dtype='object')



TRIAL: 7

CLASSIFIER: LG, TREE_QUANTITY: 13, MAX_ACCURACY:0.9333333333333333
FEATURE_QUANTITY: 76

SELECTED_FEATURES: Index(['Perimeter_length', 'y0', 'y2', 'x8', 'y11', 'x13', 'y13', 'y17', 'x18',
       'y18',
       ...
       'x231', 'y247', 'x249', 'x254', 'x260', 'x261', 'y262', 'x263', 'x266',
       'x269'],
      dtype='object', length=135)



TRIAL: 8

CLASSIFIER: LG, TREE_QUANTITY: 12, MAX_ACCURACY:0.9333333333333333
FEATURE_QUANTITY: 80

SELECTED_FEATURES: Index(['y0', 'x7', 'x11', 'x14', 'y15', 'y23', 'x27', 'y30', 'y35', 'y38',
       'y42', 'y43', 'y56', 'y58', 'y64', 'x66', 'y67', 'y70', 'x75', 'y78',
       'y80', 'y82', 'y87', 'y90', 'y91', 'y97', 'y98', 'x99', 'y106', 'x112',
       'y118', 'y120', 'y121', 'y124', 'x127', 'x131', 'y135', 'x136', 'y137',
       'y138', 'x139', 'y141', 'y143', 'y146', 'x148', 'x149', 'y149', 'y150',
       'y152', 'y153', 'y155', 'y156', 'x158', 'y164', 'x168', 'y168', 'y170',
       'y171', 'y173', 'x174', 'y174', 'x175', 'y175', 'y179', 'y182', 'y184',
       'y186', 'y187', 'y188', 'y192', 'y194', 'x198', 'y200', 'y206', 'y207',
       'y209', 'y210', 'y216', 'y221', 'x231', 'y239', 'y243', 'x244', 'y250',
       'y252', 'y253', 'y254', 'x258', 'x266', 'x271'],
      dtype='object')



TRIAL: 9

CLASSIFIER: DT, TREE_QUANTITY: 3, MAX_ACCURACY:0.8333333333333334
FEATURE_QUANTITY: 23

SELECTED_FEATURES: Index(['y12', 'y16', 'y28', 'y35', 'x77', 'y87', 'y99', 'x127', 'y128', 'y141',
       'y142', 'x143', 'y143', 'y146', 'y162', 'y168', 'y174', 'y180', 'y197',
       'y202', 'y208', 'x218', 'y222', 'y226', 'x240', 'x261'],
      dtype='object')



TRIAL: 10

CLASSIFIER: LG, TREE_QUANTITY: 15, MAX_ACCURACY:0.9333333333333333
FEATURE_QUANTITY: 103

SELECTED_FEATURES: Index(['Perimeter_length', 'y0', 'y3', 'y6', 'x13', 'y13', 'x14', 'y14', 'y15',
       'y18',
       ...
       'y246', 'y247', 'x250', 'x254', 'y256', 'x258', 'y259', 'x262', 'y262',
       'y270'],
      dtype='object', length=112)
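The comment at the top of the loop mentions taking the mean over the trials, but the loop itself only prints each trial. A minimal sketch of that last step follows, with the values copied by hand from the ten car trials printed above. The list name `car_trials` is ours and is not part of the notebook.

```python
from statistics import mean

# (classifier, tree quantity, best test accuracy) copied from the car trials above
car_trials = [
    ("LG", 5, 0.9),
    ("DT", 6, 0.8666666666666667),
    ("LG", 3, 0.9),
    ("LG", 16, 0.9333333333333333),
    ("LG", 4, 0.8666666666666667),
    ("DT", 1, 0.8333333333333334),
    ("LG", 13, 0.9333333333333333),
    ("LG", 12, 0.9333333333333333),
    ("DT", 3, 0.8333333333333334),
    ("LG", 15, 0.9333333333333333),
]

# Average the tree quantity and the best accuracy over the ten trials
mean_trees = mean(n_trees for _, n_trees, _ in car_trials)
mean_accuracy = mean(acc for _, _, acc in car_trials)

print("Mean tree quantity:", mean_trees)
print("Mean best accuracy:", round(mean_accuracy, 4))
```

Since every trial picks its best accuracy on the same 60 test samples, this mean is probably an optimistic estimate.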

9. Feature engineering

9.1. Using PCA

PCA for the plane data

Define and Fit

plane_pca = PCA(n_components=3)
plane_pca.fit(plane_train_features)
PCA(n_components=3)

Transform

plane_train_features_trans = plane_pca.transform(plane_train_features)
plane_test_features_trans = plane_pca.transform(plane_test_features)
print(plane_test_features.shape, plane_test_features_trans.shape)
(105, 1781) (105, 3)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('PCA dim. 1')
ax.set_ylabel('PCA dim. 2')
ax.set_zlabel('PCA dim. 3')

for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]
Figure: PCA for the plane data
plt.close(fig)
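The plotting cells in this section use `Axes3D(fig, ...)` and the private `_sizes` attribute of the legend handles, which worked with the matplotlib version available when the notebook was written. As a hedged alternative for more recent matplotlib releases, the sketch below builds the same kind of plot with `fig.add_subplot(projection="3d")` and the `markerscale` argument of `legend`. The data is random and only stands in for the transformed features; `points` and `labels` are hypothetical names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for three transformed dimensions and four class labels
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 3))
labels = rng.integers(1, 5, size=60)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=48, azim=134)

# One scatter call per class so that each class gets its own legend entry
for cls in np.unique(labels):
    mask = labels == cls
    ax.scatter(points[mask, 0], points[mask, 1], points[mask, 2], s=10, label=str(cls))

ax.set_xlabel("dim. 1")
ax.set_ylabel("dim. 2")
ax.set_zlabel("dim. 3")

# markerscale enlarges the legend markers without touching private attributes
legend = ax.legend(markerscale=2)
n_entries = len(legend.get_texts())
plt.close(fig)
```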

Accuracy

plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.6190476190476191
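One possible reason for this low accuracy is that three principal components keep only part of the information in the original features. A way to check this is to look at the cumulative explained variance of the components. The sketch below does so on synthetic data from `make_classification`, which is an assumption and not the plane dataset, so the numbers it prints say nothing about our data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for a shape dataset: 200 samples, 50 features, 4 classes
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Fit more components than we plot and accumulate their variance ratios
pca = PCA(n_components=10).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k in (3, 10):
    print(f"{k} components keep {cumulative[k - 1]:.1%} of the variance")
```

Note that PCA is unsupervised: a high explained variance does not guarantee that the classes are well separated in the projected space.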

PCA for the car data

Define and Fit

car_pca = PCA(n_components=3)
car_pca.fit(car_train_features)
PCA(n_components=3)

Transform

car_train_features_trans = car_pca.transform(car_train_features)
car_test_features_trans = car_pca.transform(car_test_features)
print(car_test_features.shape, car_test_features_trans.shape)
(60, 545) (60, 3)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans[car_test_target==i, 0], car_test_features_trans[car_test_target==i, 1], \
               car_test_features_trans[car_test_target==i, 2], label=str(i))

lgnd = plt.legend()
ax.set_xlabel('PCA dim. 1')
ax.set_ylabel('PCA dim. 2')
ax.set_zlabel('PCA dim. 3')

for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]
Figure: PCA for the car data
plt.close(fig)

Accuracy

car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.5166666666666667
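Three components are convenient for plotting, but nothing forces us to use the same number for classification. A minimal sketch of how the number of components could be varied, with scaling, PCA and LDA chained in one scikit-learn pipeline, is shown below. It runs on synthetic data (an assumption), and the candidate values 3, 10 and 20 are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data: 4 classes, 50 features
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scale, project with PCA and classify with LDA for several component counts
scores = {}
for k in (3, 10, 20):
    model = make_pipeline(StandardScaler(), PCA(n_components=k),
                          LinearDiscriminantAnalysis())
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)
    print("n_components =", k, "test accuracy =", scores[k])
```

The pipeline fits the scaler and the PCA on the train split only, so the test split does not influence the projection.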

9.2. Using LDA

LDA for the plane data

Not scaling the data

Define and Fit

plane_lda = LinearDiscriminantAnalysis(n_components=3)
plane_lda.fit(plane_train_features, plane_train_target)
LinearDiscriminantAnalysis(n_components=3)

Transform

plane_train_features_trans = plane_lda.transform(plane_train_features)
plane_test_features_trans = plane_lda.transform(plane_test_features)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans[plane_test_target==i, 0], plane_test_features_trans[plane_test_target==i, 1], \
               plane_test_features_trans[plane_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]
Figure: LDA for the plane data
plt.close(fig)

Accuracy

plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans, plane_train_target)
plane_lda_test_predictions = plane_lda.predict(plane_test_features_trans)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.9333333333333333

Scaling the data

plane_lda = LinearDiscriminantAnalysis(n_components=3)
plane_lda.fit(plane_train_features_scaled, plane_train_target)
LinearDiscriminantAnalysis(n_components=3)

Transform

plane_train_features_trans_scaled = plane_lda.transform(plane_train_features_scaled)
plane_test_features_trans_scaled = plane_lda.transform(plane_test_features_scaled)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 8):
    ax.scatter(plane_test_features_trans_scaled[plane_test_target==i, 0], plane_test_features_trans_scaled[plane_test_target==i, 1], \
               plane_test_features_trans_scaled[plane_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(7):
    lgnd.legendHandles[i]._sizes = [30]
Figure: LDA for the plane data scaled
plt.close(fig)

Accuracy

plane_lda = LinearDiscriminantAnalysis()
plane_lda.fit(plane_train_features_trans_scaled, plane_train_target)
plane_lda_test_predictions_scaled = plane_lda.predict(plane_test_features_trans_scaled)
lda_acc =  accuracy_score(plane_test_target, plane_lda_test_predictions_scaled)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.9523809523809523

LDA for the car data

Not scaling the data

car_lda = LinearDiscriminantAnalysis(n_components=3)
car_lda.fit(car_train_features, car_train_target)

Transform

car_train_features_trans = car_lda.transform(car_train_features)
car_test_features_trans = car_lda.transform(car_test_features)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans[car_test_target==i, 0], car_test_features_trans[car_test_target==i, 1], \
               car_test_features_trans[car_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]
Figure: LDA for the car data
plt.close(fig)

Accuracy

car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans, car_train_target)
car_lda_test_predictions = car_lda.predict(car_test_features_trans)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.9

Scaling the data

car_lda = LinearDiscriminantAnalysis(n_components=3)
car_lda.fit(car_train_features_scaled, car_train_target)
LinearDiscriminantAnalysis(n_components=3)

Transform

car_train_features_trans_scaled = car_lda.transform(car_train_features_scaled)
car_test_features_trans_scaled = car_lda.transform(car_test_features_scaled)

Plotting

fig = plt.figure()
ax = Axes3D(fig, elev=48, azim=134)

for i in range(1, 5):
    ax.scatter(car_test_features_trans_scaled[car_test_target==i, 0], car_test_features_trans_scaled[car_test_target==i, 1], \
               car_test_features_trans_scaled[car_test_target==i, 2], label=str(i))

ax.set_xlabel('LDA dim. 1')
ax.set_ylabel('LDA dim. 2')
ax.set_zlabel('LDA dim. 3')

lgnd = plt.legend()
for i in range(4):
    lgnd.legendHandles[i]._sizes = [30]
Figure: LDA for the car data scaled
plt.close(fig)

Accuracy

car_lda = LinearDiscriminantAnalysis()
car_lda.fit(car_train_features_trans_scaled, car_train_target)
car_lda_test_predictions_scaled = car_lda.predict(car_test_features_trans_scaled)
lda_acc =  accuracy_score(car_test_target, car_lda_test_predictions_scaled)
print("Accuracy for LDA :", lda_acc)
Accuracy for LDA : 0.9
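When LDA is used for dimensionality reduction, the number of components is limited to the minimum of the number of features and the number of classes minus one. With the 4 car classes, `n_components=3` is therefore already the maximum, while the 7 plane classes would allow up to 6. The sketch below illustrates the limit on synthetic 4-class data (an assumption). In recent scikit-learn releases, asking for more components raises a `ValueError`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 4-class data, so at most 4 - 1 = 3 discriminant components exist
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Three components is the maximum for four classes
lda = LinearDiscriminantAnalysis(n_components=3).fit(X, y)
projected = lda.transform(X)
print(projected.shape)

# Asking for a fourth component is expected to be rejected
try:
    LinearDiscriminantAnalysis(n_components=4).fit(X, y)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print("n_components=4 rejected:", err)
```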

10. Pipeline Optimization

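We use TPOT, an automated machine learning tool that searches over scikit-learn pipelines with genetic programming, to generate an optimized pipeline and compare its accuracy with ours. This gives us a reference point for how good our classifiers are.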

10.1. TPOT for the plane data

Fit

plane_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
plane_tpot.fit(plane_train_features, plane_train_target)
plane_tpot.fitted_pipeline_.steps
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.



Generation 1 - Current best internal CV score: 0.7714285714285715
Generation 2 - Current best internal CV score: 0.9047619047619048
Generation 3 - Current best internal CV score: 0.9047619047619048
Generation 4 - Current best internal CV score: 0.9523809523809523
Generation 5 - Current best internal CV score: 0.9523809523809523
Best pipeline: ExtraTreesClassifier(LinearSVC(input_matrix, C=5.0, dual=True, loss=squared_hinge, penalty=l2, tol=0.001), bootstrap=False, criterion=entropy, max_features=0.8500000000000001, min_samples_leaf=11, min_samples_split=10, n_estimators=100)





[('stackingestimator',
  StackingEstimator(estimator=LinearSVC(C=5.0, random_state=16, tol=0.001))),
 ('extratreesclassifier',
  ExtraTreesClassifier(criterion='entropy', max_features=0.8500000000000001,
                       min_samples_leaf=11, min_samples_split=10,
                       random_state=16))]

Accuracy

plane_tpot_test_accuracy = plane_tpot.score(plane_test_features, plane_test_target)
print('The test accuracy obtained by the TPOT pipeline is:', plane_tpot_test_accuracy)
The test accuracy obtained by the TPOT pipeline is: 0.9238095238095239

10.2. TPOT for the car data

Fit

car_tpot = TPOTClassifier(generations=5, population_size=10, verbosity=2, random_state=16)
car_tpot.fit(car_train_features, car_train_target)
car_tpot.fitted_pipeline_.steps
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.



Generation 1 - Current best internal CV score: 0.8
Generation 2 - Current best internal CV score: 0.8
Generation 3 - Current best internal CV score: 0.8
Generation 4 - Current best internal CV score: 0.8
Generation 5 - Current best internal CV score: 0.8
Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.5, max_depth=4, max_features=0.4, min_samples_leaf=8, min_samples_split=8, n_estimators=100, subsample=0.5)





[('gradientboostingclassifier',
  GradientBoostingClassifier(learning_rate=0.5, max_depth=4, max_features=0.4,
                             min_samples_leaf=8, min_samples_split=8,
                             random_state=16, subsample=0.5))]

Accuracy

car_tpot_test_accuracy = car_tpot.score(car_test_features, car_test_target)
print('The test accuracy obtained by the TPOT pipeline is:', car_tpot_test_accuracy)
The test accuracy obtained by the TPOT pipeline is: 0.8166666666666667
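The best car pipeline reported above is a single `GradientBoostingClassifier`, so it can be rebuilt with scikit-learn alone, without running the TPOT search again. The sketch below copies the hyperparameters from the printed pipeline and fits the model on synthetic 4-class data. The data is an assumption and stands in for `car_train_features` and `car_test_features`, so the accuracy it prints is not comparable with the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (4 classes)
X, y = make_classification(n_samples=200, n_features=30, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameters copied from the best pipeline printed by TPOT
car_like_model = GradientBoostingClassifier(
    learning_rate=0.5, max_depth=4, max_features=0.4,
    min_samples_leaf=8, min_samples_split=8,
    n_estimators=100, subsample=0.5, random_state=16)

car_like_model.fit(X_train, y_train)
test_accuracy = car_like_model.score(X_test, y_test)
print("Test accuracy on the synthetic data:", test_accuracy)
```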