# Centre Universitaire de Mila

## Master 1 (STIC & I2A), Course: Machine Learning

## Lab 4: Parkinson's Disease Detection

## Problem Statement

The Parkinson's disease dataset was created by Max Little of the University of Oxford, in collaboration with the National Center for Voice and Speech, Denver, Colorado, which recorded the speech signals. The original study published the feature extraction methods for general voice disorders.


### Dataset information:

This dataset consists of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column of the table is a particular voice measure, and each row corresponds to one of the 195 voice recordings from these individuals. The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy subjects and 1 for subjects with PD.


### Attribute information:

Matrix column entries (attributes):

- `name`: ASCII subject name and recording number
- `MDVP:Fo(Hz)`: average vocal fundamental frequency
- `MDVP:Fhi(Hz)`: maximum vocal fundamental frequency
- `MDVP:Flo(Hz)`: minimum vocal fundamental frequency
- `MDVP:Jitter(%)`, `MDVP:Jitter(Abs)`, `MDVP:RAP`, `MDVP:PPQ`, `Jitter:DDP`: several measures of variation in fundamental frequency
- `MDVP:Shimmer`, `MDVP:Shimmer(dB)`, `Shimmer:APQ3`, `Shimmer:APQ5`, `MDVP:APQ`, `Shimmer:DDA`: several measures of variation in amplitude
- `NHR`, `HNR`: two measures of the ratio of noise to tonal components in the voice
- `RPDE`, `D2`: two nonlinear dynamical complexity measures
- `DFA`: signal fractal scaling exponent
- `spread1`, `spread2`, `PPE`: three nonlinear measures of fundamental frequency variation
- `status`: health status of the subject (1 = Parkinson's, 0 = healthy)


### Exercise:

1-	Study the `naive_bayes.GaussianNB` class of the Python library scikit-learn.

https://scikit-learn.org/stable/modules/naive_bayes.html
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

2-	Use cross-validation to compute the classification error (e.g. the mean over 10 validation runs, each holding out 10% of the data for validation).



## Solution

### 1) Load libraries:
Let's first load the required libraries.

In [1]:
import pandas as pd
import numpy as np

### 2) Loading Data:
Next, load the Parkinson's dataset into a pandas DataFrame with `read_csv`.

In [2]:
# read the data
data = pd.read_csv('...')
...  # show the first 5 rows

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE,status
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,1
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674,1
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634,1
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975,1
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335,1
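The loading and inspection steps can be sketched as follows. This is only an illustration: the real file path is elided in the cell above, so the sketch reads a hypothetical two-row excerpt (a few columns taken from the rows shown above) from an in-memory string instead of a file. The name `demo` is an assumption introduced for the example.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for the real CSV file,
# whose path is elided in the notebook cell.
csv_text = """name,MDVP:Fo(Hz),HNR,status
phon_R01_S01_1,119.992,21.033,1
phon_R01_S01_2,122.400,19.085,1
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.head())   # first rows (at most 5)
print(demo.shape)    # (number of rows, number of columns)
demo.info()          # column names, non-null counts and dtypes
```

On the full dataset, `head()`, `info()` and `shape` give the kind of output shown in the surrounding cells.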


In [3]:
...
...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 ...

(195, 24)

In [4]:
# inspecting the status column
data['status'].value_counts()
# it will be a binary classification

1    147
0     48
Name: status, dtype: int64

In [5]:
# drop the name column; it is an identifier and is not used for classification
data.drop(['name'], axis=1, inplace=True)

In [6]:
# get all features except "status"
X = ...

# get the "status" values (the target)
y = ...
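One possible way to separate features and target is sketched below on a hypothetical three-row frame (values copied from the first rows shown earlier). The names `demo`, `X_demo` and `y_demo` are assumptions for the example, and this is not necessarily how the blanks above are meant to be filled. The target is kept as a pandas Series here because a later cell calls `y.value_counts()`, which is a Series method.

```python
import pandas as pd

# Hypothetical miniature frame: two feature columns plus the target
demo = pd.DataFrame({
    "MDVP:Fo(Hz)": [119.992, 122.400, 116.682],
    "HNR": [21.033, 19.085, 20.651],
    "status": [1, 1, 1],
})

X_demo = demo.drop(columns=["status"])  # every column except the target
y_demo = demo["status"]                 # target kept as a pandas Series

print(X_demo.shape, y_demo.shape)
```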

In [7]:
# The classes are imbalanced (147 vs. 48), so oversample the minority class with SMOTE
from imblearn.over_sampling import SMOTE

# resample the dataset so that both classes have the same number of rows
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)

In [8]:
y.value_counts()

1    147
0    147
Name: status, dtype: int64
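To bring the minority class from 48 to 147 rows, SMOTE has to create 99 synthetic samples. The idea behind it can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. The function below is a simplified illustration under that assumption, not the imbalanced-learn implementation; `smote_like` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))             # random minority sample
        neigh = idx[base, rng.integers(1, k + 1)]   # one of its k neighbours
        gap = rng.random()                          # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

# Toy minority class: 5 points in 2-D (made-up values)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like(X_min, n_new=4)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already covered by the minority class.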

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# standardize the features:
# fit_transform() learns each column's mean and standard deviation, then rescales the data
X = scaler.fit_transform(X)
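As a small check of what `StandardScaler` computes, the sketch below compares it with the manual formula (x - mean) / std on made-up values. The array `X_demo` is an assumption for the example and has nothing to do with the real measurements.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
X_demo = np.array([[100.0, 0.01], [120.0, 0.02], [140.0, 0.03]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Equivalent manual computation (population standard deviation, ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1.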

In [21]:
# import train_test_split from sklearn
from sklearn.model_selection import train_test_split

# split the dataset into training and test sets, holding out 10% for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=...)
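A 10% hold-out on the 294 balanced rows gives 264 training rows and 30 test rows, which matches the lengths of the prediction arrays printed further down. The sketch below reproduces those sizes on synthetic data. The `stratify` and `random_state` arguments are additions that do not appear in the original cell: the first keeps the class ratio identical in both parts, the second makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the balanced dataset (294 rows, 22 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(294, 22))
y_demo = np.array([0] * 147 + [1] * 147)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.1,      # hold out 10% for testing
    stratify=y_demo,    # assumption: keep the class ratio in both parts
    random_state=0,     # assumption: fixed seed for reproducibility
)
print(x_tr.shape, x_te.shape)
```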

In [22]:
# Load sklearn Gaussian Naive Bayes and fit
from sklearn.naive_bayes import GaussianNB

gnb = ...() 
gnb.fit(..., ...)

GaussianNB()
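To see what `GaussianNB` learns during `fit`, here is a minimal sketch on made-up one-dimensional data. The model estimates a prior for each class and a per-class mean (and variance) for each feature, then predicts the class with the highest posterior. The variable names and data are assumptions for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up 1-D data: class 0 centred on 0, class 1 centred on 5
X_demo = np.array([[-1.0], [0.0], [1.0], [4.0], [5.0], [6.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X_demo, y_demo)

print(model.classes_)       # class labels seen during fit
print(model.class_prior_)   # estimated prior probability of each class
print(model.theta_)         # per-class mean of each feature
print(model.predict([[0.5], [4.5]]))   # points close to each class centre
print(model.predict_proba([[2.5]]))    # point halfway between the two centres
```

The point at 2.5 is equidistant from both class means, and both classes have the same prior and the same variance, so the two posterior probabilities are equal.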

In [23]:
# Prediction on train data
predict_train = ...
print('Prediction on train data:', predict_train) 

Prediction on train data: [0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 1 0 1 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0
 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0
 0 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1
 0 1 0 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0
 0 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0
 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 0
 1 1 0 0 0]


In [24]:
# Accuracy score on train data
from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(..., ...)
print('Accuracy score on train data:', accuracy_train)

Accuracy score on train data: 0.8068181818181818
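The score above corresponds to 213 correct predictions out of 264 training rows. As a sketch of how accuracy relates to the classification error requested in the exercise, the example below uses made-up labels and predictions. `confusion_matrix` is an addition that the original cells do not use; it shows which class the errors fall in.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1])

acc = accuracy_score(y_true, y_pred)
acc_manual = np.mean(y_true == y_pred)   # same quantity computed by hand
error = 1 - acc                          # classification error

print(acc, acc_manual, error)
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
```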


In [25]:
# Prediction on test data
predict_test = ...
print('Prediction on test data:', predict_test)

Prediction on test data: [0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1]


In [26]:
# Accuracy score on test data
accuracy_test = accuracy_score(..., ...)
print('Accuracy score on test data:', accuracy_test)

Accuracy score on test data: 0.7666666666666667
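The cells above evaluate a single 90/10 split, whereas the exercise asks for the mean over 10 validation runs. A minimal sketch of 10-fold cross-validation is given below. It runs on synthetic data from `make_classification`, used as a stand-in with the same shape as the original 195 recordings, so the printed numbers say nothing about the Parkinson's data. Placing the scaler inside a pipeline is a design choice made for this sketch: the scaler is then refit on the training part of each fold and the validation rows do not influence it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Parkinson's features (195 rows, 22 columns)
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each training fold
model = make_pipeline(StandardScaler(), GaussianNB())

# 10 folds: each fold holds out about 10% of the rows for validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("mean accuracy:", scores.mean())
print("mean classification error:", 1 - scores.mean())
```

If oversampling is also wanted inside each fold, imbalanced-learn provides its own `Pipeline` class that accepts samplers such as `SMOTE`, so that synthetic rows are generated from the training folds only.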
