{ "cells": [ { "cell_type": "markdown", "id": "3960a901", "metadata": {}, "source": [ "# Centre Universitaire de Mila" ] },
{ "cell_type": "markdown", "id": "c4568d4a", "metadata": {}, "source": [ "## Master 1 (STIC & I2A), Course: Machine Learning" ] },
{ "cell_type": "markdown", "id": "a8bedc4a", "metadata": {}, "source": [ "## Lab 4: Detecting Parkinson's Disease" ] },
{ "cell_type": "markdown", "id": "f57e8c63", "metadata": {}, "source": [ "## Problem Statement" ] },
{ "cell_type": "markdown", "id": "751221fd", "metadata": {}, "source": [ "The Parkinson's disease dataset was created by Max Little of the University of Oxford, in collaboration with the National Center for Voice and Speech, Denver, Colorado, which recorded the speech signals. The original study published the feature extraction methods for general voice disorders.\n" ] },
{ "cell_type": "markdown", "id": "e69cb11f", "metadata": {}, "source": [ "### Dataset information:\n", "\n", "This dataset consists of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of the 195 voice recordings from these individuals. The main aim of the data is to discriminate healthy people from those with Parkinson's disease, according to the \"status\" column, which is set to 0 for healthy people and 1 for those with PD.\n" ] },
{ "cell_type": "markdown", "id": "bca5e82a", "metadata": {}, "source": [ "### Attribute information:\n", "\n", "Matrix column entries (attributes):\n", "name - ASCII subject name and recording number\n", "MDVP:Fo(Hz) - Average vocal fundamental frequency\n", "MDVP:Fhi(Hz) - Maximum vocal fundamental frequency\n", "MDVP:Flo(Hz) - Minimum vocal fundamental frequency\n", "MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency\n", "MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude\n", "NHR,HNR - Two measures of ratio of noise to tonal components in the voice\n", "RPDE,D2 - Two nonlinear dynamical complexity measures\n", "DFA - Signal fractal scaling exponent\n", "spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation\n", "\n", "status - Health status of the subject (one) - Parkinson's, (zero) - healthy\n" ] },
{ "cell_type": "markdown", "id": "0cbfedb0", "metadata": {}, "source": [ "### Exercise:\n", "\n", "1-\tStudy the naive_bayes.GaussianNB object from the Python library scikit-learn.\n", "\n", "https://scikit-learn.org/stable/modules/naive_bayes.html\n", "https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html\n", "\n", "2-\tUse cross-validation to compute the classification error (e.g. the average over 10 validation runs, each holding out 10% of the data for validation).\n", "\n" ] },
{ "cell_type": "markdown", "id": "11ae8da9", "metadata": {}, "source": [ "## Solution" ] },
{ "cell_type": "markdown", "id": "aa55f351", "metadata": {}, "source": [ "### 1) Load libraries:\n", "Let's first load the required libraries." ] },
{ "cell_type": "code", "execution_count": 1, "id": "31af453b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] },
{ "cell_type": "markdown", "id": "2e700e39", "metadata": {}, "source": [ "### 2) Loading Data:\n", "Let's load the Parkinson's dataset into a pandas DataFrame using the read_csv function." ] },
{ "cell_type": "code", "execution_count": 2, "id": "6b364573", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name MDVP:Fo(Hz) MDVP:Fhi(Hz) MDVP:Flo(Hz) MDVP:Jitter(%) \\\n", "0 phon_R01_S01_1 119.992 157.302 74.997 0.00784 \n", "1 phon_R01_S01_2 122.400 148.650 113.819 0.00968 \n", "2 phon_R01_S01_3 116.682 131.111 111.555 0.01050 \n", "3 phon_R01_S01_4 116.676 137.871 111.366 0.00997 \n", "4 phon_R01_S01_5 116.014 141.781 110.655 0.01284 \n", "\n", " MDVP:Jitter(Abs) MDVP:RAP MDVP:PPQ Jitter:DDP MDVP:Shimmer ... \\\n", "0 0.00007 0.00370 0.00554 0.01109 0.04374 ... \n", "1 0.00008 0.00465 0.00696 0.01394 0.06134 ... \n", "2 0.00009 0.00544 0.00781 0.01633 0.05233 ... \n", "3 0.00009 0.00502 0.00698 0.01505 0.05492 ... \n", "4 0.00011 0.00655 0.00908 0.01966 0.06425 ... \n", "\n", " Shimmer:DDA NHR HNR RPDE DFA spread1 spread2 \\\n", "0 0.06545 0.02211 21.033 0.414783 0.815285 -4.813031 0.266482 \n", "1 0.09403 0.01929 19.085 0.458359 0.819521 -4.075192 0.335590 \n", "2 0.08270 0.01309 20.651 0.429895 0.825288 -4.443179 0.311173 \n", "3 0.08771 0.01353 20.644 0.434969 0.819235 -4.117501 0.334147 \n", "4 0.10470 0.01767 19.649 0.417356 0.823484 -3.747787 0.234513 \n", "\n", " D2 PPE status \n", "0 2.301442 0.284654 1 \n", "1 2.486855 0.368674 1 \n", "2 2.342259 0.332634 1 \n", "3 2.405554 0.368975 1 \n", "4 2.332180 0.410335 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read data\n", "data = pd.read_csv('...')\n", "... 
#showing the first 5 rows" ] }, { "cell_type": "code", "execution_count": 3, "id": "0ce69cbf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 195 entries, 0 to 194\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 name 195 non-null object \n", " 1 MDVP:Fo(Hz) 195 non-null float64\n", " 2 MDVP:Fhi(Hz) 195 non-null float64\n", " 3 MDVP:Flo(Hz) 195 non-null float64\n", " 4 MDVP:Jitter(%) 195 non-null float64\n", " 5 MDVP:Jitter(Abs) 195 non-null float64\n", " 6 MDVP:RAP 195 non-null float64\n", " 7 MDVP:PPQ 195 non-null float64\n", " 8 Jitter:DDP 195 non-null float64\n", " 9 MDVP:Shimmer 195 non-null float64\n", " 10 MDVP:Shimmer(dB) 195 non-null float64\n", " 11 Shimmer:APQ3 195 non-null float64\n", " 12 Shimmer:APQ5 195 non-null float64\n", " 13 MDVP:APQ 195 non-null float64\n", " 14 Shimmer:DDA 195 non-null float64\n", " 15 NHR 195 non-null float64\n", " 16 HNR 195 non-null float64\n", " 17 RPDE 195 non-null float64\n", " 18 DFA 195 non-null float64\n", " 19 spread1 195 non-null float64\n", " 20 spread2 195 non-null float64\n", " 21 D2 195 non-null float64\n", " 22 PPE 195 non-null float64\n", " 23 status 195 non-null int64 \n", "dtypes: float64(22), int64(1), object(1)\n", "memory usage: 36.7+ KB\n" ] }, { "data": { "text/plain": [ "(195, 24)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "...\n", "..." 
] }, { "cell_type": "code", "execution_count": 4, "id": "410828c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 147\n", "0 48\n", "Name: status, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# inspect the status column\n", "data['status'].value_counts()\n", "# two classes, so this is a binary classification problem" ] },
{ "cell_type": "code", "execution_count": 5, "id": "ac3fa557", "metadata": {}, "outputs": [], "source": [ "# drop the name column; it is not used in the classification\n", "data.drop(['name'], axis=1, inplace=True)" ] },
{ "cell_type": "code", "execution_count": 6, "id": "be975c92", "metadata": {}, "outputs": [], "source": [ "# get all features except \"status\"\n", "X = ... # use .values for array format\n", "\n", "# get the status column (kept as a pandas Series so that y.value_counts() works below)\n", "y = ..." ] },
{ "cell_type": "code", "execution_count": 7, "id": "f8e57352", "metadata": {}, "outputs": [], "source": [ "# The classes are imbalanced (147 vs. 48), so oversample the minority class with SMOTE\n", "from imblearn.over_sampling import SMOTE\n", "\n", "# Caution: resampling (and scaling) before the train/test split lets information\n", "# from the test samples leak into training; ideally fit both on the training data only\n", "oversample = SMOTE()\n", "X, y = oversample.fit_resample(X, y)" ] },
{ "cell_type": "code", "execution_count": 8, "id": "fe61470a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 147\n", "0 147\n", "Name: status, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.value_counts()" ] },
{ "cell_type": "code", "execution_count": 9, "id": "5c26296f", "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "# standardize the features: fit_transform() fits the scaler to the data and then transforms it\n", "X = scaler.fit_transform(X)" ] },
{ "cell_type": "code", "execution_count": 21, "id": "10b13aab", "metadata": {}, "outputs": [], "source": [ "# import train_test_split from scikit-learn\n", "from sklearn.model_selection import train_test_split\n", "\n", "# split the dataset into training and test sets, holding out 10% for testing\n", "x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=...)" ] },
{ "cell_type": "code", "execution_count": 22, "id": "78c9ed3b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GaussianNB()" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load scikit-learn's Gaussian Naive Bayes classifier and fit it\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "gnb = ...()\n", "gnb.fit(..., ...)" ] },
{ "cell_type": "code", "execution_count": 23, "id": "804cb43b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction on train data: [0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1\n", " 1 0 1 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 0\n", " 1 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0\n", " 0 0 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1\n", " 0 1 0 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0\n", " 0 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0\n", " 1 1 0 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 1 0 0\n", " 1 1 0 0 0]\n" ] } ], "source": [ "# Prediction on train data\n", "predict_train = ...\n", "print('Prediction on train data:', predict_train)" ] },
{ "cell_type": "code", "execution_count": 24, "id": "760255b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score on train data: 0.8068181818181818\n" ] } ], "source": [ "# Accuracy score on train data\n", "from sklearn.metrics import accuracy_score\n", "accuracy_train = accuracy_score(..., ...)\n", "print('Accuracy score on train data:', accuracy_train)" ] },
{ "cell_type": "code", "execution_count": 25, "id": "63049284", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction on test data: [0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1]\n" ] } ], "source": [ "# Prediction on test data\n", "predict_test = ...\n", "print('Prediction on test data:', predict_test)" ] },
{ "cell_type": "code", "execution_count": 26, "id": "514386b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score on test data: 0.7666666666666667\n" ] } ], "source": [ "# Accuracy score on test data\n", "accuracy_test = accuracy_score(..., ...)\n", "print('Accuracy score on test data:', accuracy_test)" ] },
{ "cell_type": "code", "execution_count": null, "id": "89d589de", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }