
Principal Component Analysis (PCA) Made Easy (2)

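The previous post covered the concepts behind principal component analysis. (See the previous post.)

In this post, we will work through a PCA example in Python.

PCA reduces dimensionality by keeping the eigenvectors that have the largest eigenvalues. First, let's use pandas to build a small data set to test with.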

(Reference: https://github.com/minsuk-heo/python_tutorial/blob/master/data_science/pca/PCA.ipynb)

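import pandas as pd

# Create an empty DataFrame with five feature columns and one label column
df = pd.DataFrame(columns=['calory', 'breakfast', 'lunch', 'dinner', 'exercise', 'body_shape'])

# Add one row per person: daily calories, meal counts, exercise count, body shape
df.loc[0] = [1200, 1, 0, 0, 2, 'Skinny']
df.loc[1] = [2800, 1, 1, 1, 1, 'Normal']
df.loc[2] = [3500, 2, 2, 1, 0, 'Fat']
df.loc[3] = [1400, 0, 1, 0, 3, 'Skinny']
df.loc[4] = [5000, 2, 2, 2, 0, 'Fat']
df.loc[5] = [1300, 0, 0, 1, 2, 'Skinny']
df.loc[6] = [3000, 1, 0, 1, 1, 'Normal']
df.loc[7] = [4000, 2, 2, 2, 0, 'Fat']
df.loc[8] = [2600, 0, 2, 0, 0, 'Normal']
df.loc[9] = [3000, 1, 2, 1, 1, 'Fat']
df.head(10)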

The output above shows the DataFrame we will use for testing.
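As an alternative sketch, the same table can be built in a single call from a list of rows. The name `df_alt` is only an illustration and is not used elsewhere in this post. Filling an empty DataFrame row by row can leave the columns with the `object` dtype, whereas building it in one call lets pandas infer integer columns for the features.

```python
import pandas as pd

columns = ['calory', 'breakfast', 'lunch', 'dinner', 'exercise', 'body_shape']

# Same ten rows as in the post, passed to the constructor all at once
rows = [
    [1200, 1, 0, 0, 2, 'Skinny'],
    [2800, 1, 1, 1, 1, 'Normal'],
    [3500, 2, 2, 1, 0, 'Fat'],
    [1400, 0, 1, 0, 3, 'Skinny'],
    [5000, 2, 2, 2, 0, 'Fat'],
    [1300, 0, 0, 1, 2, 'Skinny'],
    [3000, 1, 0, 1, 1, 'Normal'],
    [4000, 2, 2, 2, 0, 'Fat'],
    [2600, 0, 2, 0, 0, 'Normal'],
    [3000, 1, 2, 1, 1, 'Fat'],
]

df_alt = pd.DataFrame(rows, columns=columns)  # df_alt is a hypothetical name

# Inspect the inferred column types
print(df_alt.dtypes)
```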

Next, let's split the data into X and Y.

# X holds the features, Y holds the label

X = df[['calory', 'breakfast', 'lunch', 'dinner', 'exercise']]
Y = df[['body_shape']]

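Looking at the current X data, the features have very different ranges: calory is in the thousands, while the other columns are small counts.

A rescaling step such as normalization or standardization is therefore needed.

We will use StandardScaler from scikit-learn.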

from sklearn.preprocessing import StandardScaler
x_std = StandardScaler().fit_transform(X)

x_std

The output shows the feature values after StandardScaler has been applied.
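To see what the scaler does, here is a minimal sketch of the same standardization written by hand with NumPy. It assumes the feature matrix is copied into a plain array (the name `X_arr` is made up for this example). StandardScaler divides by the population standard deviation, which is also the default of `np.std` (`ddof=0`).

```python
import numpy as np

# Feature values copied from the post's data set (label column left out)
X_arr = np.array([
    [1200, 1, 0, 0, 2],
    [2800, 1, 1, 1, 1],
    [3500, 2, 2, 1, 0],
    [1400, 0, 1, 0, 3],
    [5000, 2, 2, 2, 0],
    [1300, 0, 0, 1, 2],
    [3000, 1, 0, 1, 1],
    [4000, 2, 2, 2, 0],
    [2600, 0, 2, 0, 0],
    [3000, 1, 2, 1, 1],
], dtype=float)

# Standardize each column: subtract its mean, divide by its standard deviation
manual_std = (X_arr - X_arr.mean(axis=0)) / X_arr.std(axis=0)

# After standardization every column has mean 0 and standard deviation 1
print(manual_std.mean(axis=0).round(6))
print(manual_std.std(axis=0).round(6))
```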

Now, for this five-dimensional feature data, let's find the eigenvector with the largest eigenvalue.

import numpy as np

features = x_std.T  # np.cov expects one variable per row
covariance_matrix = np.cov(features)
eig_vals, eig_vecs = np.linalg.eig(covariance_matrix)

print('\nEigenvalues \n%s' % eig_vals)

The output shows that the eigenvalue at index 0 is the largest.

eig_vals[0] / sum(eig_vals)

As shown above, if we reduce the data to one dimension using only the eigenvector at index 0, we retain about 73% of the information (variance) in the data.
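`np.linalg.eig` does not promise any particular ordering of the eigenvalues, so the largest one happening to sit at index 0 is specific to this run. The sketch below is one way to sort the eigenpairs explicitly and compute the share of variance for every component. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix; the variable names are assumptions for this example, and the data is standardized by hand so that the block runs on its own.

```python
import numpy as np

# Feature values copied from the post's data set
X_arr = np.array([
    [1200, 1, 0, 0, 2],
    [2800, 1, 1, 1, 1],
    [3500, 2, 2, 1, 0],
    [1400, 0, 1, 0, 3],
    [5000, 2, 2, 2, 0],
    [1300, 0, 0, 1, 2],
    [3000, 1, 0, 1, 1],
    [4000, 2, 2, 2, 0],
    [2600, 0, 2, 0, 0],
    [3000, 1, 2, 1, 1],
], dtype=float)

# Standardize, then build the covariance matrix (variables in rows)
x_scaled = (X_arr - X_arr.mean(axis=0)) / X_arr.std(axis=0)
cov = np.cov(x_scaled.T)

# eigh returns eigenvalues in ascending order; reorder to descending
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals = vals[order]
vecs = vecs[:, order]  # column i is the eigenvector for vals[i]

# Share of total variance per component, and the running total
ratios = vals / vals.sum()
print(ratios.round(3))
print(np.cumsum(ratios).round(3))
```

The first entry of `ratios` corresponds to the roughly 73% figure above, and the cumulative sum shows how much would be retained if more components were kept.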

Now let's use the matplotlib and seaborn libraries to plot the dimension-reduced data.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

projected_X = x_std.dot(eig_vecs.T[0])
result = pd.DataFrame(projected_X, columns=['PC1'])
result['y-axis'] = 0.0
result['label'] = Y

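# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.lmplot(x='PC1', y='y-axis', data=result, fit_reg=False,  # x-axis, y-axis, data, no line
           scatter_kws={"s": 50},  # marker size
           hue="label")  # color

plt.title('PCA result')  # title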

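The PCA result shows that the three body-shape groups separate from one another along a single line.

scikit-learn also provides its own PCA implementation, so the same result can be obtained with much shorter code, as shown below.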

from sklearn import decomposition
pca = decomposition.PCA(n_components=1)
sklearn_pca_x = pca.fit_transform(x_std)

sklearn_result = pd.DataFrame(sklearn_pca_x, columns=['PC1'])
sklearn_result['y-axis'] = 0.0
sklearn_result['label'] = Y

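# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.lmplot(x='PC1', y='y-axis', data=sklearn_result, fit_reg=False,  # x-axis, y-axis, data, no line
           scatter_kws={"s": 50},  # marker size
           hue="label")  # color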
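As a quick sanity check, the sketch below compares the hand-computed projection with the scikit-learn result. The two should agree up to sign, because an eigenvector multiplied by -1 is still an eigenvector, so the plot may come out mirrored. The names `manual_pc1` and `sk_pc1` are made up for this example, and the data is rebuilt here so that the block runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Feature values copied from the post's data set
X_arr = np.array([
    [1200, 1, 0, 0, 2],
    [2800, 1, 1, 1, 1],
    [3500, 2, 2, 1, 0],
    [1400, 0, 1, 0, 3],
    [5000, 2, 2, 2, 0],
    [1300, 0, 0, 1, 2],
    [3000, 1, 0, 1, 1],
    [4000, 2, 2, 2, 0],
    [2600, 0, 2, 0, 0],
    [3000, 1, 2, 1, 1],
], dtype=float)

x_scaled = StandardScaler().fit_transform(X_arr)

# Manual route: project onto the eigenvector with the largest eigenvalue
vals, vecs = np.linalg.eigh(np.cov(x_scaled.T))
top_vec = vecs[:, np.argmax(vals)]
manual_pc1 = x_scaled @ top_vec

# scikit-learn route
pca = PCA(n_components=1)
sk_pc1 = pca.fit_transform(x_scaled)[:, 0]

# The projections should match, possibly with the sign flipped
same_up_to_sign = np.allclose(manual_pc1, sk_pc1) or np.allclose(manual_pc1, -sk_pc1)
print(same_up_to_sign)

# Fraction of variance captured by the first component
print(pca.explained_variance_ratio_)
```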
