
<Table of Contents>

    Red Wine Quality - Data Description and Analysis

    Previous post: [데이터사이언스/R] 데이터 분석해보기 6 - Red Wine Quality (programmer-ririhan.tistory.com)

    1. Checking the Data

    1-1. Data Cleaning

    To keep the dendrogram readable, cut the data down to about 200 rows. 

    - The full dataset has 1,599 rows, so randomly sample 200 of them. 

    library(dplyr)
    
    # note: calling set.seed() before sampling would make this 200-row sample
    # (and every result below) reproducible; the original run did not fix a seed
    hc<-sample_n(wine, 200)

     

    The quality and rating variables serve as the class labels, so they must be removed from the data the model is fit to. 

    hc_data<-subset(hc, select=-c(quality, rating))

     

    str(hc_data)
    
    'data.frame':   200 obs. of  11 variables:
     $ fixed.acidity       : num  8 6.4 7 10.8 8.9 7.1 7.8 12.2 8.8 7.7 ...
     $ volatile.acidity    : num  0.45 0.885 0.975 0.26 0.75 0.61 0.56 0.34 0.45 0.26 ...
     $ citric.acid         : num  0.23 0 0.04 0.45 0.14 0.02 0.12 0.5 0.43 0.3 ...
     $ residual.sugar      : num  2.2 2.3 2 3.3 2.5 2.5 2 2.4 1.4 1.7 ...
     $ chlorides           : num  0.094 0.166 0.087 0.06 0.086 0.081 0.082 0.066 0.076 0.059 ...
     $ free.sulfur.dioxide : num  16 6 12 20 9 17 7 10 12 20 ...
     $ total.sulfur.dioxide: num  29 12 67 49 30 87 28 21 21 38 ...
     $ density             : num  0.996 0.996 0.996 0.997 0.998 ...
     $ pH                  : num  3.21 3.56 3.35 3.13 3.34 3.48 3.37 3.12 3.21 3.29 ...
     $ sulphates           : num  0.49 0.51 0.6 0.54 0.64 0.6 0.5 1.18 0.75 0.47 ...
     $ alcohol             : num  10.2 10.8 9.4 9.6 10.5 9.7 9.4 9.2 10.2 10.8 ...

     

    1-2. Computing the Distance Matrix

    Hierarchical clustering is distance-based, so standardize the variables and then compute the distances between every pair of points (the distance matrix). 

    hc_scale<-scale(hc_data)
    hc_dist<-dist(hc_scale)

     

    The distances from rows 1, 2, and 3 to the other observations can be inspected like this (output truncated): 

    as.matrix(hc_dist)[1:3,]
             1        2        3        4        5        6        7        8        9       10       11
    1 0.000000 2.200616 3.378307 4.058526 3.730904 6.198812 5.242537 4.058480 5.995339 4.310446 7.280152
    2 2.200616 0.000000 3.492344 4.338353 2.967422 6.412352 4.515130 3.961033 5.637104 4.432912 6.891150
    3 3.378307 3.492344 0.000000 2.197622 3.973361 4.107527 4.563170 1.427273 4.780674 3.114454 6.532054
            12       13       14       15       16       17       18       19       20       21       22
    1 5.998511 3.831569 6.089432 1.896099 7.171976 5.345357 4.818021 2.530091 5.012705 2.240985 2.572035
    2 5.715539 3.929231 6.045429 3.132730 6.698603 4.833751 4.391930 3.599300 4.978711 2.047614 2.942882
    3 4.639037 2.146202 6.306019 3.379558 5.694200 4.879030 4.638929 2.018307 2.194946 2.815712 2.610826
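As a quick sanity check on what dist() returns, here is a minimal base-R sketch on a tiny synthetic matrix (not the wine sample): each entry is simply the Euclidean distance between a pair of scaled rows.

```r
# toy stand-in for hc_data: 3 observations, 2 variables
m <- matrix(c(1, 2, 4,
              1, 3, 9), ncol = 2)
m_scale <- scale(m)            # center each column and divide by its sd
d <- dist(m_scale)             # Euclidean distance, the default

# the (1,2) entry is the plain Euclidean distance between rows 1 and 2
manual <- sqrt(sum((m_scale[1, ] - m_scale[2, ])^2))
all.equal(as.matrix(d)[1, 2], manual)   # TRUE
```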

     

    2. Applying the HC Model (Hierarchical Clustering)

    The hclust() function builds a hierarchical clustering model.

    See the hclust() documentation for details.

    - The linkage criterion is chosen with the method argument. 

     

    2-1. Single Linkage (single)

    hc_model<-hclust(hc_dist, method="single")
    hc_model
    
    Call:
    hclust(d = hc_dist, method = "single")
    
    Cluster method   : single 
    Distance         : euclidean 
    Number of objects: 200
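To see what single linkage actually does, a tiny base-R example on synthetic 1-D points (unrelated to the wine data): the height at which two clusters merge is the distance between their closest members.

```r
# three points on a line: 0, 1, 5
p <- matrix(c(0, 1, 5), ncol = 1)
hm <- hclust(dist(p), method = "single")

# first merge: {0, 1} at height 1; second merge joins {0,1} with {5}
# at height 4, because single linkage uses the nearest member (5 - 1 = 4)
hm$height   # 1 4
```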

     

    Use the NbClust() function to search for the optimal number of clusters.

    library(NbClust)
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="single")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 1 proposed 4 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 12 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  3 
     
     
    *******************************************************************

    Setting the number of clusters to 3 received 10 of the index votes, so it is judged the most suitable choice. 

     

    With 3 clusters, check how many observations fall into each one. 

    hc_cutree<-cutree(hc_model, k=3)
    table(hc_cutree)
    
    hc_cutree
      1   2   3 
    198   1   1

     

    Visualize the 3-cluster result as a dendrogram. 

    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)

     

    Visualize the 3-cluster result as a scatter plot. 

    - x : volatile.acidity, y : alcohol → the two variables most strongly correlated with quality

    df<-as.data.frame(hc_scale)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

     

    2-2. Complete Linkage (complete)

    hc_model<-hclust(hc_dist, method="complete")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="complete")
    
    Error in if ((resCritical[ncB - min_nc + 1, 3] >= alphaBeale) && (!foundBeale)) { : 
      missing value where TRUE/FALSE needed
    In addition: Warning message:
    In pf(beale, pp, df2) : NaNs produced

    → An error occurs: the Beale index produces NaN on this sample, which aborts the whole NbClust run. 
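One hedged workaround (a sketch, not part of the original post): NbClust's index argument runs a single index at a time, so the indices that crash (here the Beale index) can be skipped with tryCatch and a majority vote taken over the rest. The index names below are a subset of those listed in the NbClust documentation; hc_scale is the scaled sample from above.

```r
library(NbClust)

idx <- c("kl", "ch", "hartigan", "cindex", "db", "silhouette",
         "ratkowsky", "ball", "ptbiserial", "dunn", "sdindex", "sdbw")

votes <- sapply(idx, function(ix) {
  tryCatch(
    NbClust(hc_scale, distance = "euclidean", min.nc = 2, max.nc = 15,
            method = "complete", index = ix)$Best.nc["Number_clusters"],
    error = function(e) NA)          # skip any index that errors out
})

table(votes[!is.na(votes)])          # majority vote over surviving indices
```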

     

    2-3. Average Linkage (average)

    hc_model<-hclust(hc_dist, method="average")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="average")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 5 proposed 3 as the best number of clusters 
    * 1 proposed 7 as the best number of clusters 
    * 1 proposed 11 as the best number of clusters 
    * 8 proposed 12 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  12 
     
     
    *******************************************************************

    hc_cutree<-cutree(hc_model, k=12)
    table(hc_cutree)
    
    hc_cutree
      1   2   3   4   5   6   7   8   9  10  11  12 
    135  41   2   5   2   3   1   5   1   2   1   2
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=12)

    df<-as.data.frame(hc_scale)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

     

    2-4. Centroid Linkage (centroid)

    (Note: the hclust() documentation recommends squared Euclidean distances, i.e. dist(hc_scale)^2, for the centroid method; plain distances are used here as in the original run.)

    hc_model<-hclust(hc_dist, method="centroid")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="centroid")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 2 proposed 9 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  3 
     
     
    *******************************************************************

    hc_cutree<-cutree(hc_model, k=3)
    table(hc_cutree)
    
    hc_cutree
      1   2   3 
    198   1   1
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)

    df<-as.data.frame(hc_scale)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

     

    2-5. Ward Linkage (ward.D, ward.D2)

    - ward.D2 squares the dissimilarities during clustering, which matches Ward's criterion when given Euclidean distances; ward.D uses them as-is.

    ward.D

    hc_model<-hclust(hc_dist, method="ward.D")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 4 proposed 2 as the best number of clusters 
    * 9 proposed 3 as the best number of clusters 
    * 1 proposed 4 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 2 proposed 7 as the best number of clusters 
    * 1 proposed 8 as the best number of clusters 
    * 1 proposed 10 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 2 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  3 
     
     
    *******************************************************************

    hc_cutree<-cutree(hc_model, k=3)
    table(hc_cutree)
    
    hc_cutree
     1  2  3 
    88 60 52
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)

    df<-as.data.frame(hc_scale)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

     

    ward.D2

    hc_model<-hclust(hc_dist, method="ward.D2")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D2")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 5 proposed 2 as the best number of clusters 
    * 3 proposed 3 as the best number of clusters 
    * 3 proposed 4 as the best number of clusters 
    * 1 proposed 5 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 5 proposed 7 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  2 
     
     
    *******************************************************************

    hc_cutree<-cutree(hc_model, k=2)
    table(hc_cutree)
    
    hc_cutree
      1   2 
    129  71
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=2)

    df<-as.data.frame(hc_scale)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

    → Every linkage method up to this point produced heavily lopsided clusters. 

     Only Ward linkage yields something that looks like a genuine partition of the data. 
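The 198 / 1 / 1 splits above are the classic "chaining" behaviour of single (and centroid) linkage: dense data merges into one long chain and only lone outliers get their own clusters. A self-contained illustration on synthetic data:

```r
set.seed(42)
# one dense blob of 100 points plus two far-away outliers
x <- rbind(matrix(rnorm(100 * 2), ncol = 2),
           c(10, 10),
           c(-10, -10))

# at k = 3, single linkage keeps the blob whole and peels off the outliers
sizes <- table(cutree(hclust(dist(x), method = "single"), k = 3))
sort(as.vector(sizes), decreasing = TRUE)   # 100 1 1
```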

     

    3. Applying the K-means Model

    library(graphics)   # attached by default in R; the explicit call is harmless but not required

     

    3-1. Choosing the Number of Clusters

    Searching for the optimal number of clusters again concludes that 3 clusters is the best choice. 

    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="kmeans")
    
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
     
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
     
    ******************************************************************* 
    * Among all indices:                                                
    * 5 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 2 proposed 7 as the best number of clusters 
    * 1 proposed 11 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 1 proposed 14 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
    
                       ***** Conclusion *****                            
     
    * According to the majority rule, the best number of clusters is  3

     

    3-2. Building the K-means Model

    Use the kmeans() function to build a k-means clustering model with 3 clusters. 

    - k-means also relies on distances between observations, so it is fit to the standardized data. 

    km_model<-kmeans(hc_scale, 3)
    km_model
    
    K-means clustering with 3 clusters of sizes 36, 46, 118
    
    Cluster means:
      fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides free.sulfur.dioxide
    1     1.4327908       -0.6735567   1.1245064      0.2357906  0.7606881        -0.004900576
    2     0.2631471       -0.7940938   0.7821667      0.1055560 -0.2284503        -0.183012545
    3    -0.5397054        0.5150539  -0.6479822     -0.1130851 -0.1430175         0.072838965
      total.sulfur.dioxide    density           pH  sulphates    alcohol
    1           0.02835748  1.1946828 -1.159909668  0.8535948 -0.5849508
    2          -0.24242479 -0.1020296  0.001636419  0.2129711  0.7586628
    3           0.08585315 -0.3247052  0.353232820 -0.3434414 -0.1172903
    
    Clustering vector:
      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25 
      3   3   3   1   3   3   3   1   2   2   3   3   2   3   3   3   1   3   1   3   2   3   3   3   1 
     26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
      3   3   2   2   2   2   1   1   3   2   3   3   3   3   1   2   3   2   2   2   3   3   2   1   3 
     51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75 
      2   2   3   3   3   3   3   1   2   1   3   3   3   2   1   3   2   2   1   3   2   3   3   3   3 
     76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
      3   2   3   3   2   3   1   3   3   1   3   1   1   3   3   2   1   2   3   3   3   3   1   3   3 
    101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 
      2   3   2   3   3   3   1   3   3   2   3   3   1   1   3   3   1   2   3   1   1   3   2   1   3 
    126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
      2   3   3   2   1   1   3   3   3   2   2   3   3   3   2   2   3   3   3   3   3   3   3   3   3 
    151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 
      3   2   3   3   3   2   3   3   1   3   2   1   3   2   3   1   3   3   2   1   3   3   3   3   3 
    176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
      3   3   3   3   2   3   3   3   3   2   3   3   2   2   3   3   3   1   3   1   3   1   1   2   3 
    
    Within cluster sum of squares by cluster:
    [1] 492.8321 285.6163 853.6879
     (between_SS / total_SS =  25.4 %)
    
    Available components:
    
    [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
    [7] "size"         "iter"         "ifault"

    → The 3 clusters contain 36, 46, and 118 observations respectively. 

    → The Clustering vector lists, for every observation, which cluster it belongs to. 

     

    3-3. Checking $cluster

    km_model$cluster holds the cluster assignments (the Clustering vector above). 

    - It tells you which cluster each observation was placed in. 

    plot(hc_scale, col=km_model$cluster)

    → Note that plot() on a matrix draws only the first two columns (here fixed.acidity vs. volatile.acidity); in that projection the three clusters look fairly well separated. 

     

    3-4. Checking $betweenss, $withinss, $totss

    between_SS : the variance between clusters 

    → The larger it is, the more distinctly the clusters are separated from one another. 

    km_model$betweenss
    
    [1] 556.8638

     

    within_SS : the variance of the data within each cluster 

    → The smaller it is, the more tightly the data in each cluster are packed, and the more clearly the clusters separate. 

    km_model$withinss
    
    [1] 492.8321 285.6163 853.6879

    → The data in the second cluster are more tightly packed than those in the other clusters.

     

    total_SS = between_SS + tot.withinss (the per-cluster withinss values summed)

    → total_SS depends only on the data, not on the clustering; what marks a well-separated clustering is a large between_SS / total_SS ratio. 

    km_model$totss
    
    [1] 2189

    → Here between_SS / total_SS is only 25.4 %, so the 3-cluster solution explains a fairly small share of the total variance....
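This decomposition can be checked directly in base R (on synthetic stand-in data, since the wine sample is random): totss never changes with k, and for column-scaled data it equals (n - 1) * p, which is exactly the 2189 printed above for 200 rows and 11 variables.

```r
set.seed(1)
x <- scale(matrix(rnorm(200 * 11), ncol = 11))  # stand-in for hc_scale

km2 <- kmeans(x, 2, iter.max = 50)
km5 <- kmeans(x, 5, iter.max = 50)

# totss is fixed by the data; only the between/within split depends on k
stopifnot(all.equal(km2$totss, km5$totss))
stopifnot(all.equal(km2$totss, km2$betweenss + km2$tot.withinss))

# each scaled column contributes n - 1 to the total sum of squares
all.equal(km2$totss, (200 - 1) * 11)   # TRUE: 2189
```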

     

    3-5. Checking Relationships Between Variables

    plot(hc_data, col=km_model$cluster)

    → In this pairs plot, the more strongly two variables are correlated, the more clearly the clusters separate in their panel. 

     

    3-6. How withinss Changes with K

    Run kmeans() with the centers argument set to each value from 1 to 10. 

    within <- c()
    
    for(i in 1:10){
    	within[i] <- sum(kmeans(hc_scale, centers = i)$withinss)
    }
    
    plot(1:10, within, type="b", xlab = "Number of Clusters", ylab = "Within group sum of squares")

    → As K increases, the total withinss decreases. 

    → As K increases, the variance inside each cluster shrinks, i.e. each cluster becomes denser.

     

    → From about K = 6 the drop flattens out noticeably, so 6 could be a reasonable choice for the number of clusters.

     

    3-7. How betweenss Changes with K

    between <- c()
    
    for(i in 1:10){
         between[i] <- sum(kmeans(hc_scale, centers = i)$betweenss)
    }
    
    plot(1:10, between, type="b", xlab = "Number of Clusters", ylab = "between group sum of squares")

    → As K increases, betweenss grows.

    → As K increases, the variance between clusters grows.

     

    3-8. How the "Accuracy" Changes with K

    Compute betweenss / totss * 100 while varying K from 1 to 10 and watch how it changes. 

     

    → Strictly speaking this is not an accuracy; it is an internal evaluation measure for the clustering. 

     

    bet_ss <- c()
    
    for(i in 1:10){
         kms <- kmeans(hc_scale , i)
         bet_ss[i] <- round(kms$betweenss / kms$totss * 100,1)
    }
    
    y_name <- paste("between_ss", "\n", "/", "\n", "total_ss", collapse = '')
    
    par(oma=c(0,1,0,0))        # outer plot margins (bottom, left, top, right)
    
    par(mgp=c(1,0.1,0))       # axis margins (title, labels, axis line)
    
    plot(1:10, bet_ss, type="b", 
            xlab = "Number of Clusters",
            ylab = y_name, ylim=c(0,100), las = 1)

    → As K increases, the betweenss / totss ratio rises steadily. 

     

    → TODO: add a gap statistic test (see reference)
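A sketch of that gap statistic test using clusGap() from the cluster package (which ships with R). It is shown on synthetic stand-in data, since hc_scale depends on the random sample; on the real data, pass hc_scale as the first argument instead.

```r
library(cluster)

set.seed(1)
x <- scale(matrix(rnorm(200 * 11), ncol = 11))  # stand-in for hc_scale

# B bootstrap reference sets; a larger B is slower but more stable
gap <- clusGap(x, FUN = kmeans, nstart = 10, K.max = 10, B = 50)

# pick k with the "firstSEmax" rule on the gap curve
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```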

     

     


    ref.

    - R, SAS, MS-SQL을 활용한 데이터마이닝, 이정진

    - [R 분석] 계층적 군집 분석(hierarchical clustering) (tistory.com)

    - R을 사용한 K-means 군집분석 (rstudio-pubs-static.s3.amazonaws.com)

    - [R 분석] 비계층적 군집 분석(k-means clustering) (tistory.com)

     
