[Data Science/R] Red Wine Quality - Cluster Analysis: Hierarchical Clustering and K-means Clustering
ellie.strong 2021. 7. 14. 16:33
For the dataset introduction and structure, see the earlier post: [데이터사이언스/R] 데이터 분석해보기 6 - Red Wine Quality (programmer-ririhan.tistory.com)
1. Checking the data
1-1. Data preparation
To keep the dendrogram readable, the data is cut down to about 200 observations.
- The full dataset has 1,599 observations, so 200 of them are sampled at random.
library(dplyr)
# sample_n() draws rows at random, so each run gives a different subset
# unless a seed is fixed with set.seed(); `wine` comes from the earlier post
hc<-sample_n(wine, 200)
The quality and rating variables serve as class labels, so they must be removed from the data the model is applied to.
hc_data<-subset(hc, select=-c(quality, rating))
str(hc_data)
'data.frame': 200 obs. of 11 variables:
$ fixed.acidity : num 8 6.4 7 10.8 8.9 7.1 7.8 12.2 8.8 7.7 ...
$ volatile.acidity : num 0.45 0.885 0.975 0.26 0.75 0.61 0.56 0.34 0.45 0.26 ...
$ citric.acid : num 0.23 0 0.04 0.45 0.14 0.02 0.12 0.5 0.43 0.3 ...
$ residual.sugar : num 2.2 2.3 2 3.3 2.5 2.5 2 2.4 1.4 1.7 ...
$ chlorides : num 0.094 0.166 0.087 0.06 0.086 0.081 0.082 0.066 0.076 0.059 ...
$ free.sulfur.dioxide : num 16 6 12 20 9 17 7 10 12 20 ...
$ total.sulfur.dioxide: num 29 12 67 49 30 87 28 21 21 38 ...
$ density : num 0.996 0.996 0.996 0.997 0.998 ...
$ pH : num 3.21 3.56 3.35 3.13 3.34 3.48 3.37 3.12 3.21 3.29 ...
$ sulphates : num 0.49 0.51 0.6 0.54 0.64 0.6 0.5 1.18 0.75 0.47 ...
$ alcohol : num 10.2 10.8 9.4 9.6 10.5 9.7 9.4 9.2 10.2 10.8 ...
1-2. Computing the distance matrix
Since the methods are distance-based, standardize the variables and compute the distances between all pairs of points (the distance matrix).
hc_scale<-scale(hc_data)   # standardize each column to mean 0, sd 1
hc_dist<-dist(hc_scale)    # pairwise distances; Euclidean by default
The computed distances between observations 1, 2, 3 and the remaining observations can be inspected:
as.matrix(hc_dist)[1:3,]
1 2 3 4 5 6 7 8 9 10 11
1 0.000000 2.200616 3.378307 4.058526 3.730904 6.198812 5.242537 4.058480 5.995339 4.310446 7.280152
2 2.200616 0.000000 3.492344 4.338353 2.967422 6.412352 4.515130 3.961033 5.637104 4.432912 6.891150
3 3.378307 3.492344 0.000000 2.197622 3.973361 4.107527 4.563170 1.427273 4.780674 3.114454 6.532054
12 13 14 15 16 17 18 19 20 21 22
1 5.998511 3.831569 6.089432 1.896099 7.171976 5.345357 4.818021 2.530091 5.012705 2.240985 2.572035
2 5.715539 3.929231 6.045429 3.132730 6.698603 4.833751 4.391930 3.599300 4.978711 2.047614 2.942882
3 4.639037 2.146202 6.306019 3.379558 5.694200 4.879030 4.638929 2.018307 2.194946 2.815712 2.610826
2. Fitting the HC model (hierarchical clustering)
The hclust() function builds a hierarchical clustering model.
- The method argument selects the linkage criterion.
2-1. Single linkage (single)
hc_model<-hclust(hc_dist, method="single")
hc_model
Call:
hclust(d = hc_dist, method = "single")
Cluster method : single
Distance : euclidean
Number of objects: 200
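The printed summary is terse, but the fitted object keeps the full merge history. A small sketch of peeking at it (merge and height are standard components of an hclust object):
# Each row of $merge records one fusion step: negative entries are single
# observations, positive entries refer to clusters formed at earlier steps.
# $height is the distance at which each fusion happened.
head(cbind(hc_model$merge, height = hc_model$height))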
Use the NbClust() function to find the optimal number of clusters.
library(NbClust)
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="single")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 10 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
→ Three clusters received 10 of the index votes, so k = 3 is judged the most suitable choice. The vote distribution can also be visualized, as sketched below.
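A small sketch of plotting the per-index votes (this assumes the Best.nc component that NbClust returns; the two purely graphical indices show up as 0):
# First row of nc$Best.nc holds each index's proposed number of clusters
barplot(table(nc$Best.nc[1, ]),
        xlab = "Number of clusters", ylab = "Number of criteria")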
Cutting the tree into three clusters, the number of observations in each cluster is checked:
hc_cutree<-cutree(hc_model, k=3)
table(hc_cutree)
hc_cutree
1 2 3
198 1 1
The three-cluster result is visualized as a dendrogram. With 198 observations in one cluster and two singletons, this shows the chaining effect typical of single linkage.
plot(hc_model, hang=-1)
rect.hclust(hc_model, k=3)
The three-cluster result is also visualized as a scatter plot.
- x : volatile.acidity, y : alcohol → the two variables most strongly correlated with quality
df<-as.data.frame(hc_scale)
plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)
2-2. Complete linkage (complete)
hc_model<-hclust(hc_dist, method="complete")
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="complete")
Error in if ((resCritical[ncB - min_nc + 1, 3] >= alphaBeale) && (!foundBeale)) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In pf(beale, pp, df2) : NaNs produced
→ NbClust fails for complete linkage: the Beale index produces NaN here, which crashes the whole run.
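One possible workaround (a sketch, not part of the original run; it assumes each single-index call reports its proposal in the Best.nc component) is to query the indices one at a time and simply skip "beale":
# Query a few indices individually; each run returns its own best k
for (idx in c("ch", "silhouette", "db", "dunn")) {
  res <- NbClust(hc_scale, distance = "euclidean", min.nc = 2, max.nc = 15,
                 method = "complete", index = idx)
  cat(idx, "->", res$Best.nc["Number_clusters"], "clusters\n")
}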
2-3. Average linkage (average)
hc_model<-hclust(hc_dist, method="average")
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="average")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 8 proposed 12 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 12
*******************************************************************
hc_cutree<-cutree(hc_model, k=12)
table(hc_cutree)
hc_cutree
1 2 3 4 5 6 7 8 9 10 11 12
135 41 2 5 2 3 1 5 1 2 1 2
plot(hc_model, hang=-1)
rect.hclust(hc_model, k=12)
df<-as.data.frame(hc_scale)
plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)
2-4. Centroid linkage (centroid)
hc_model<-hclust(hc_dist, method="centroid")
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="centroid")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 10 proposed 3 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
hc_cutree<-cutree(hc_model, k=3)
table(hc_cutree)
hc_cutree
1 2 3
198 1 1
plot(hc_model, hang=-1)
rect.hclust(hc_model, k=3)
df<-as.data.frame(hc_scale)
plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)
2-5. Ward linkage (ward.D, ward.D2)
ward.D
hc_model<-hclust(hc_dist, method="ward.D")
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 4 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 2 proposed 7 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 1 proposed 10 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 2 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
hc_cutree<-cutree(hc_model, k=3)
table(hc_cutree)
hc_cutree
1 2 3
88 60 52
plot(hc_model, hang=-1)
rect.hclust(hc_model, k=3)
df<-as.data.frame(hc_scale)
plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)
ward.D2
hc_model<-hclust(hc_dist, method="ward.D2")
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D2")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 3 proposed 3 as the best number of clusters
* 3 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 5 proposed 7 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
hc_cutree<-cutree(hc_model, k=2)
table(hc_cutree)
hc_cutree
1 2
129 71
plot(hc_model, hang=-1)
rect.hclust(hc_model, k=2)
df<-as.data.frame(hc_scale)
plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)
→ With all of the earlier linkage methods, the clustering came out extremely lopsided (one giant cluster plus a few stray points).
→ Only Ward linkage produces a reasonably balanced split, as the cross-tab sketched below makes concrete.
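A small sketch cross-tabulating the single-linkage and ward.D assignments of the same 200 observations (reusing hc_dist from above):
# Rows: single-linkage clusters, columns: ward.D clusters (k = 3 each)
single_cut <- cutree(hclust(hc_dist, method = "single"), k = 3)
ward_cut   <- cutree(hclust(hc_dist, method = "ward.D"), k = 3)
table(single = single_cut, ward = ward_cut)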
3. Fitting the K-means model
library(graphics)
3-1. Choosing the number of clusters
Checking the optimal number of clusters again, three comes out as the most appropriate:
nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 5 proposed 2 as the best number of clusters
* 10 proposed 3 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 2 proposed 7 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
3-2. Building the K-means model
Use the kmeans() function to build a k-means clustering model with three clusters.
- K-means also works with distances between observations, so the standardized data is used.
km_model<-kmeans(hc_scale, 3)
km_model
K-means clustering with 3 clusters of sizes 36, 46, 118
Cluster means:
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
1 1.4327908 -0.6735567 1.1245064 0.2357906 0.7606881 -0.004900576
2 0.2631471 -0.7940938 0.7821667 0.1055560 -0.2284503 -0.183012545
3 -0.5397054 0.5150539 -0.6479822 -0.1130851 -0.1430175 0.072838965
total.sulfur.dioxide density pH sulphates alcohol
1 0.02835748 1.1946828 -1.159909668 0.8535948 -0.5849508
2 -0.24242479 -0.1020296 0.001636419 0.2129711 0.7586628
3 0.08585315 -0.3247052 0.353232820 -0.3434414 -0.1172903
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
3 3 3 1 3 3 3 1 2 2 3 3 2 3 3 3 1 3 1 3 2 3 3 3 1
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
3 3 2 2 2 2 1 1 3 2 3 3 3 3 1 2 3 2 2 2 3 3 2 1 3
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
2 2 3 3 3 3 3 1 2 1 3 3 3 2 1 3 2 2 1 3 2 3 3 3 3
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
3 2 3 3 2 3 1 3 3 1 3 1 1 3 3 2 1 2 3 3 3 3 1 3 3
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
2 3 2 3 3 3 1 3 3 2 3 3 1 1 3 3 1 2 3 1 1 3 2 1 3
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
2 3 3 2 1 1 3 3 3 2 2 3 3 3 2 2 3 3 3 3 3 3 3 3 3
151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
3 2 3 3 3 2 3 3 1 3 2 1 3 2 3 1 3 3 2 1 3 3 3 3 3
176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
3 3 3 3 2 3 3 3 3 2 3 3 2 2 3 3 3 1 3 1 3 1 1 2 3
Within cluster sum of squares by cluster:
[1] 492.8321 285.6163 853.6879
(between_SS / total_SS = 25.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
→ The three clusters contain 36, 46, and 118 observations respectively.
→ The output also lists the cluster each observation was assigned to.
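Note that kmeans() starts from randomly chosen centers, so the sizes above can change from run to run. A common way to stabilize the result (a sketch, not the call used above) is to fix the seed and keep the best of several restarts:
set.seed(123)                                            # reproducible initialization
km_model <- kmeans(hc_scale, centers = 3, nstart = 25)   # keep the best of 25 starts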
3-3. Checking $cluster
km_model$cluster is the clustering vector.
- It tells which cluster each observation was assigned to.
plot(hc_scale, col=km_model$cluster)
→ The three clusters look fairly well separated.
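Since quality was held out of the model, it can also serve as an external check on the clusters (a sketch; the hc data frame sampled earlier still contains the quality column):
# How the three k-means clusters line up with the held-out quality score
table(cluster = km_model$cluster, quality = hc$quality)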
3-4. Checking the SS components ($betweenss, $withinss, $totss)
between_SS : the variation between clusters
→ The larger it is, the more distinctly the clusters are separated.
km_model$betweenss
[1] 556.8638
within_SS : the variation of the data within each cluster
→ The smaller it is, the more tightly the observations in a cluster sit together, again meaning clearer separation.
km_model$withinss
[1] 492.8321 285.6163 853.6879
→ The observations in the second cluster are packed more tightly than those in the other clusters (note that withinss is not adjusted for cluster size).
total_SS = between_SS + (total) within_SS (a quick check of this identity is sketched below)
→ total_SS is determined by the data alone, so its size says nothing about the clustering itself; what matters is the ratio between_SS / total_SS, and the closer it is to 1, the better separated the clusters are.
km_model$totss
[1] 2189
→ Here the ratio is 556.9 / 2189 ≈ 25.4 %, which is on the low side, so these three clusters explain only a modest share of the total variation....
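A quick sanity check of the identity with the numbers printed above:
# betweenss + tot.withinss reproduces totss:
# 556.8638 + (492.8321 + 285.6163 + 853.6879) = 2189
km_model$betweenss + km_model$tot.withinss   # equals km_model$totss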
3-5. Checking the relationships between variables
plot(hc_data, col=km_model$cluster)   # pairs plot of all 11 variables, colored by cluster
→ In panels where the two variables are strongly related, the clusters separate more distinctly.
3-6. How withinss changes with K
Run kmeans() with the centers argument set to each value from 1 to 10:
set.seed(123)   # k-means initialization is random; fixing the seed keeps the curve reproducible
within <- c()
for(i in 1:10){
  within[i] <- sum(kmeans(hc_scale, centers = i)$withinss)   # total within-cluster SS for k = i
}
plot(1:10, within, type="b", xlab = "Number of Clusters", ylab = "Within group sum of squares")
→ As K increases, withinss decreases.
→ In other words, the within-cluster variation shrinks, i.e. each cluster becomes denser.
→ The decrease flattens out from around K = 6, so by this elbow criterion about six clusters would also be a reasonable choice.
3-7. How betweenss changes with K
set.seed(123)   # fix the seed so the curve is reproducible
between <- c()
for(i in 1:10){
  between[i] <- sum(kmeans(hc_scale, centers = i)$betweenss)   # betweenss is already a single number
}
plot(1:10, between, type="b", xlab = "Number of Clusters", ylab = "between group sum of squares")
→ As K increases, betweenss increases.
→ In other words, the variation between clusters grows.
3-8. How the "accuracy" changes with K
Compute betweenss / totss * 100 and watch how it changes as K goes from 1 to 10.
→ Strictly speaking this is not an accuracy but an evaluation measure of the clustering (the share of total variation explained).
set.seed(123)   # fix the seed for reproducible ratios
bet_ss <- c()
for(i in 1:10){
  kms <- kmeans(hc_scale , i)
  bet_ss[i] <- round(kms$betweenss / kms$totss * 100,1)   # % of total variation explained
}
y_name = paste("between_ss","\n", "/", "\n", "total_ss", collapse = '')
par(oma=c(0,1,0,0)) # outer plot margins (bottom, left, top, right)
par(mgp=c(1,0.1,0)) # axis spacing (title, tick labels, axis line)
plot(1:10, bet_ss, type="b",
xlab = "Number of Clusters",
ylab = y_name, ylim=c(0,100), las = 1)
→ The ratio keeps rising as K increases.
→ TODO: add a gap statistic test (for reference); a sketch follows below.
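A minimal sketch of the gap statistic mentioned above, assuming the cluster package is available (clusGap() passes extra arguments such as nstart on to kmeans; B is the number of bootstrap reference samples):
library(cluster)
set.seed(123)                         # reproducible bootstrap and k-means runs
gap <- clusGap(hc_scale, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
plot(gap, main = "Gap statistic for k = 1..10")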
ref.
- R, SAS, MS-SQL을 활용한 데이터마이닝, 이정진
- [R 분석] 계층적 군집 분석(hierarchical clustering) (tistory.com)
- R을 사용한 K-means 군집분석 (rstudio-pubs-static.s3.amazonaws.com)
- [R 분석] 비계층적 군집 분석(k-means clustering) (tistory.com)
'Data > R' 카테고리의 다른 글
[데이터사이언스/R] Red Wine Quality -아다부스팅 앙상블 ( AdaBoosting Ensemble) (0) | 2021.07.14 |
---|---|
[데이터사이언스/R] Red Wine Quality - 서포트 벡터 머신 (Support Vector Machine, 지지 벡터 머신) (0) | 2021.07.13 |
[데이터사이언스/R] 데이터 분석해보기 6 - Red Wine Quality (0) | 2021.07.08 |
[데이터 분석해보기] Bayes Classification (베이즈 분류) (feat. Mushroom) (1) | 2021.07.07 |
[빅데이터] 분류분석 (Classification Analysis) (0) | 2021.07.07 |