    Red Wine Quality - Data Description and Analysis


    [Data Science/R] Trying Out Data Analysis 6 - Red Wine Quality




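    1. Checking the data

    1-1. Data cleaning

    To inspect the result as a dendrogram, the data is reduced to about 200 observations.

    - The full dataset has 1599 rows, so 200 of them are drawn at random.

    library(dplyr)  # provides sample_n()
    hc <- sample_n(wine, 200)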
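
    Because the sample is random, every run produces a different set of 200 rows and therefore different clustering results. The following is a minimal sketch of a reproducible draw in base R. The data frame wine_demo and the seed value are assumptions made for illustration; they are not part of the original analysis.

```r
# Stand-in for the wine data: 1599 rows, as in the original dataset.
set.seed(42)                       # assumed seed, only for reproducibility
wine_demo <- data.frame(id = 1:1599, alcohol = runif(1599, 8, 15))

# Draw 200 distinct row indices and keep those rows
# (a base R counterpart of sample_n(wine, 200)).
idx <- sample(nrow(wine_demo), 200)
hc_demo <- wine_demo[idx, ]

dim(hc_demo)   # 200 rows, 2 columns
```

    Calling set.seed() before sample_n() should have the same effect in the dplyr version.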


    The quality and rating variables, which serve as class labels, must be removed from the data the model is applied to.

    hc_data <- subset(hc, select = -c(quality, rating))
    str(hc_data)
    'data.frame':   200 obs. of  11 variables:
     $ fixed.acidity       : num  8 6.4 7 10.8 8.9 7.1 7.8 12.2 8.8 7.7 ...
     $ volatile.acidity    : num  0.45 0.885 0.975 0.26 0.75 0.61 0.56 0.34 0.45 0.26 ...
     $ citric.acid         : num  0.23 0 0.04 0.45 0.14 0.02 0.12 0.5 0.43 0.3 ...
     $ residual.sugar      : num  2.2 2.3 2 3.3 2.5 2.5 2 2.4 1.4 1.7 ...
     $ chlorides           : num  0.094 0.166 0.087 0.06 0.086 0.081 0.082 0.066 0.076 0.059 ...
     $ free.sulfur.dioxide : num  16 6 12 20 9 17 7 10 12 20 ...
     $ total.sulfur.dioxide: num  29 12 67 49 30 87 28 21 21 38 ...
     $ density             : num  0.996 0.996 0.996 0.997 0.998 ...
     $ pH                  : num  3.21 3.56 3.35 3.13 3.34 3.48 3.37 3.12 3.21 3.29 ...
     $ sulphates           : num  0.49 0.51 0.6 0.54 0.64 0.6 0.5 1.18 0.75 0.47 ...
     $ alcohol             : num  10.2 10.8 9.4 9.6 10.5 9.7 9.4 9.2 10.2 10.8 ...


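    1-2. Computing the distance matrix

    Because the method is distance-based, the variables are standardized and then the distance between every pair of points (the distance matrix) is computed.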
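
    The later code uses objects named hc_scale and hc_dist, but the step that creates them is not shown. The sketch below shows how standardization and a Euclidean distance matrix are typically computed with scale() and dist(). The data frame demo and its columns are hypothetical stand-ins for hc_data.

```r
set.seed(1)
# Hypothetical stand-in for hc_data: 200 rows, 3 numeric variables on different scales.
demo <- data.frame(
  acidity = rnorm(200, mean = 8, sd = 1.7),
  sulfur  = rnorm(200, mean = 46, sd = 30),
  alcohol = rnorm(200, mean = 10.4, sd = 1)
)

demo_scale <- scale(demo)                             # z-scores: mean 0, sd 1 per column
demo_dist  <- dist(demo_scale, method = "euclidean")  # pairwise distances between rows

# View the distances from rows 1-3 to the first five observations.
round(as.matrix(demo_dist)[1:3, 1:5], 3)
```

    Without scaling, the variable with the largest range (here sulfur) would dominate the distances.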



    The computed distances between rows 1, 2, and 3 and the other observations can be inspected below.

             1        2        3        4        5        6        7        8        9       10       11
    1 0.000000 2.200616 3.378307 4.058526 3.730904 6.198812 5.242537 4.058480 5.995339 4.310446 7.280152
    2 2.200616 0.000000 3.492344 4.338353 2.967422 6.412352 4.515130 3.961033 5.637104 4.432912 6.891150
    3 3.378307 3.492344 0.000000 2.197622 3.973361 4.107527 4.563170 1.427273 4.780674 3.114454 6.532054
            12       13       14       15       16       17       18       19       20       21       22
    1 5.998511 3.831569 6.089432 1.896099 7.171976 5.345357 4.818021 2.530091 5.012705 2.240985 2.572035
    2 5.715539 3.929231 6.045429 3.132730 6.698603 4.833751 4.391930 3.599300 4.978711 2.047614 2.942882
    3 4.639037 2.146202 6.306019 3.379558 5.694200 4.879030 4.638929 2.018307 2.194946 2.815712 2.610826


    2. Applying the HC model (hierarchical clustering)

    hclust() function: builds the HC model.

    See the hclust() documentation.

    - The method option selects the linkage criterion.


    2-1. Single linkage (single)

    hc_model<-hclust(hc_dist, method="single")
    hclust(d = hc_dist, method = "single")
    Cluster method   : single 
    Distance         : euclidean 
    Number of objects: 200


    The NbClust() function is used to find the optimal number of clusters.

    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="single")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 1 proposed 4 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 12 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  3 

    Three clusters received the most votes (10 indices), so three is judged the most suitable number of clusters.
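
    The majority rule in the conclusion is a simple vote count. As a small illustration, the tally can be reproduced by hand from the summary above; the vector votes is typed in from that output, not extracted from the NbClust() result object.

```r
# Votes copied from the NbClust() summary above: one entry per index.
votes <- c(rep(2, 7), rep(3, 10), 4, rep(11, 2), rep(12, 2), 15)

tally <- table(votes)                                   # count votes per proposed k
best_k <- as.integer(names(tally)[which.max(tally)])    # k with the most votes
best_k   # 3
```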


    The number of observations in each cluster was checked for the three-cluster solution.

    hc_cutree <- cutree(hc_model, k = 3)
    table(hc_cutree)
      1   2   3 
    198   1   1


    The three-cluster result was visualized as a dendrogram.

    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)


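    The three-cluster result was visualized as a scatter plot.

    - x: volatile.acidity, y: alcohol → the two variables with the highest correlation coefficients are used

    # df is assumed to hold the 200 sampled rows (for example hc_data)
    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)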


    2-2. Complete linkage (complete)

    hc_model<-hclust(hc_dist, method="complete")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="complete")
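    Error in if ((resCritical[ncB - min_nc + 1, 3] >= alphaBeale) && (!foundBeale)) { : 
      missing value where TRUE/FALSE needed
    In addition: Warning message:
    In pf(beale, pp, df2) : NaNs produced

    → An error occurs: the Beale test produces NaN, so NbClust() stops before reporting a cluster count.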
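
    One way to keep a script running when a single linkage method fails like this is to wrap the call in tryCatch(). The sketch below uses a hypothetical helper, safe_best_k(), and a stand-in function that fails in the same way (an if() on a missing value). In the real session, the wrapped call would be the NbClust() call.

```r
# Hypothetical helper: run a cluster-count search and return NA instead of stopping.
safe_best_k <- function(search_fun) {
  tryCatch(
    search_fun(),
    error = function(e) {
      message("search failed: ", conditionMessage(e))
      NA_integer_
    }
  )
}

# Stand-in that fails like the call above: if() receives a missing value.
failing_search <- function() {
  p_value <- NaN
  if (p_value >= 0.1) 3L else 2L
}

safe_best_k(failing_search)    # NA, with the error text shown as a message
safe_best_k(function() 3L)     # 3
```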


    2-3. Average linkage (average)

    hc_model<-hclust(hc_dist, method="average")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="average")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 5 proposed 3 as the best number of clusters 
    * 1 proposed 7 as the best number of clusters 
    * 1 proposed 11 as the best number of clusters 
    * 8 proposed 12 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  12 

    hc_cutree <- cutree(hc_model, k = 12)
    table(hc_cutree)
      1   2   3   4   5   6   7   8   9  10  11  12 
    135  41   2   5   2   3   1   5   1   2   1   2
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=12)

    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)


    2-4. Centroid linkage (centroid)

    hc_model<-hclust(hc_dist, method="centroid")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="centroid")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 7 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 2 proposed 9 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  3 

    hc_cutree <- cutree(hc_model, k = 3)
    table(hc_cutree)
      1   2   3 
    198   1   1
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)

    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)


    2-5. Ward linkage (ward.D, ward.D2)


    hc_model<-hclust(hc_dist, method="ward.D")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 4 proposed 2 as the best number of clusters 
    * 9 proposed 3 as the best number of clusters 
    * 1 proposed 4 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 2 proposed 7 as the best number of clusters 
    * 1 proposed 8 as the best number of clusters 
    * 1 proposed 10 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 2 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  3 

    hc_cutree <- cutree(hc_model, k = 3)
    table(hc_cutree)
     1  2  3 
    88 60 52
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=3)

    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)



    hc_model<-hclust(hc_dist, method="ward.D2")
    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="ward.D2")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 5 proposed 2 as the best number of clusters 
    * 3 proposed 3 as the best number of clusters 
    * 3 proposed 4 as the best number of clusters 
    * 1 proposed 5 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 5 proposed 7 as the best number of clusters 
    * 2 proposed 11 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  2 

    hc_cutree <- cutree(hc_model, k = 2)
    table(hc_cutree)
      1   2 
    129  71
    plot(hc_model, hang=-1)
    rect.hclust(hc_model, k=2)

    plot(df$volatile.acidity, df$alcohol, col=hc_cutree, pch=hc_cutree)

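    → With the single, average, and centroid linkages, the result was heavily skewed: almost all observations fell into one cluster.

    → Only the Ward linkages (ward.D, ward.D2) produced reasonably balanced clusters.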
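
    The comparison across linkage methods can be condensed into one loop that reports sorted cluster sizes. This is a minimal sketch on hypothetical data (two groups plus two outliers), not the wine sample, so the sizes it prints will differ from those above.

```r
set.seed(7)
# Hypothetical data: two groups of 50 points plus two far-away outliers.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  c(15, 15), c(-12, 14)
)
d <- dist(scale(x))

methods <- c("single", "complete", "average", "centroid", "ward.D", "ward.D2")
sizes <- lapply(methods, function(m) {
  # Sort sizes so the largest cluster comes first, whatever its label.
  sort(as.vector(table(cutree(hclust(d, method = m), k = 3))), decreasing = TRUE)
})
names(sizes) <- methods
sizes
```

    A pattern such as 198/1/1 usually means the cut isolates outliers as singleton clusters rather than splitting the main body of the data.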


    3. Applying the K-means model



    3-1. Determining the number of clusters

    Checking for the optimal number of clusters showed that three clusters is the most appropriate.

    nc <- NbClust(hc_scale, distance="euclidean", min.nc=2, max.nc=15, method="kmeans")
    *** : The Hubert index is a graphical method of determining the number of clusters.
                    In the plot of Hubert index, we seek a significant knee that corresponds to a 
                    significant increase of the value of the measure i.e the significant peak in Hubert
                    index second differences plot. 
    *** : The D index is a graphical method of determining the number of clusters. 
                    In the plot of D index, we seek a significant knee (the significant peak in Dindex
                    second differences plot) that corresponds to a significant increase of the value of
                    the measure. 
    * Among all indices:                                                
    * 5 proposed 2 as the best number of clusters 
    * 10 proposed 3 as the best number of clusters 
    * 1 proposed 6 as the best number of clusters 
    * 2 proposed 7 as the best number of clusters 
    * 1 proposed 11 as the best number of clusters 
    * 2 proposed 13 as the best number of clusters 
    * 1 proposed 14 as the best number of clusters 
    * 1 proposed 15 as the best number of clusters 
                       ***** Conclusion *****                            
    * According to the majority rule, the best number of clusters is  3


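    3-2. Building the K-means model

    The kmeans() function is used to build a k-means clustering model with three clusters.

    - k-means clustering also relies on distances between observations, so the standardized data is used.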

    km_model<-kmeans(hc_scale, 3)
    K-means clustering with 3 clusters of sizes 36, 46, 118
    Cluster means:
      fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides free.sulfur.dioxide
    1     1.4327908       -0.6735567   1.1245064      0.2357906  0.7606881        -0.004900576
    2     0.2631471       -0.7940938   0.7821667      0.1055560 -0.2284503        -0.183012545
    3    -0.5397054        0.5150539  -0.6479822     -0.1130851 -0.1430175         0.072838965
      total.sulfur.dioxide    density           pH  sulphates    alcohol
    1           0.02835748  1.1946828 -1.159909668  0.8535948 -0.5849508
    2          -0.24242479 -0.1020296  0.001636419  0.2129711  0.7586628
    3           0.08585315 -0.3247052  0.353232820 -0.3434414 -0.1172903
    Clustering vector:
      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25 
      3   3   3   1   3   3   3   1   2   2   3   3   2   3   3   3   1   3   1   3   2   3   3   3   1 
     26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
      3   3   2   2   2   2   1   1   3   2   3   3   3   3   1   2   3   2   2   2   3   3   2   1   3 
     51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75 
      2   2   3   3   3   3   3   1   2   1   3   3   3   2   1   3   2   2   1   3   2   3   3   3   3 
     76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
      3   2   3   3   2   3   1   3   3   1   3   1   1   3   3   2   1   2   3   3   3   3   1   3   3 
    101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 
      2   3   2   3   3   3   1   3   3   2   3   3   1   1   3   3   1   2   3   1   1   3   2   1   3 
    126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
      2   3   3   2   1   1   3   3   3   2   2   3   3   3   2   2   3   3   3   3   3   3   3   3   3 
    151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 
      3   2   3   3   3   2   3   3   1   3   2   1   3   2   3   1   3   3   2   1   3   3   3   3   3 
    176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
      3   3   3   3   2   3   3   3   3   2   3   3   2   2   3   3   3   1   3   1   3   1   1   2   3 
    Within cluster sum of squares by cluster:
    [1] 492.8321 285.6163 853.6879
     (between_SS / total_SS =  25.4 %)
    Available components:
    [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
    [7] "size"         "iter"         "ifault"

    → The three clusters contain 36, 46, and 118 observations, respectively.

    → The clustering vector shows which cluster each observation belongs to.


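    3-3. Checking $cluster

    km_model$cluster holds the cluster numbers (the Clustering vector).

    - It tells which cluster each observation belongs to.

    plot(hc_scale, col=km_model$cluster)

    → The three clusters appear fairly well separated. Note that hc_scale is a matrix, so plot() draws only its first two columns (fixed.acidity and volatile.acidity).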


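    3-4. Checking $totss

    between_SS: the dispersion between clusters

    → The larger it is, the more clearly the clusters are separated from one another.

    km_model$betweenss
    [1] 556.8638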


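    within_SS: the dispersion of the data inside each cluster

    → The smaller it is, the more tightly the data in a cluster are packed, which makes the clusters more distinct.

    km_model$withinss
    [1] 492.8321 285.6163 853.6879

    → The second cluster is the most compact. This also holds per observation: about 6.2 (285.6 / 46) versus 13.7 (492.8 / 36) and 7.2 (853.7 / 118).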


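    total_SS = between_SS + sum(within_SS)

    → total_SS depends only on the data, not on how it is clustered. For standardized data it equals (n - 1) × p = 199 × 11 = 2189.

    km_model$totss
    [1] 2189

    → What indicates cluster quality is the ratio between_SS / total_SS. It is only 25.4% here, so the three clusters are not strongly separated.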
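
    The relationship between the three sums of squares can be checked directly on a kmeans() result. The sketch below uses a hypothetical matrix x with the same shape as hc_scale (200 standardized observations, 11 variables); the component names totss, betweenss, withinss, and tot.withinss are the ones listed under "Available components" above.

```r
set.seed(3)
# Hypothetical stand-in for hc_scale: 200 standardized observations of 11 variables.
x <- scale(matrix(rnorm(200 * 11), nrow = 200))
km <- kmeans(x, centers = 3, nstart = 10)

km$totss                                  # fixed by the data: (n - 1) * p = 199 * 11 = 2189
km$betweenss + km$tot.withinss            # equals totss for any clustering
sum(km$withinss)                          # same value as tot.withinss
round(km$betweenss / km$totss * 100, 1)   # the percentage that kmeans() prints
```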


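    3-5. Checking relationships between variables

    plot(hc_data, col=km_model$cluster)

    → In the panels where two variables are more strongly correlated, the clusters appear more clearly separated.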


    3-6. Change in withinss as K varies

    kmeans() is run with the centers option set to each value from 1 to 10.

    within <- c()
    for(i in 1:10){
      # total within-cluster sum of squares for K = i
      within[i] <- sum(kmeans(hc_scale, centers = i)$withinss)
    }
    plot(1:10, within, type="b", xlab = "Number of Clusters", ylab = "Within group sum of squares")

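    → As K increases, withinss decreases.

    → As K increases, the dispersion within clusters shrinks, that is, the clusters become denser.


    → From about 6 clusters onward the decrease is no longer large, so 6 could be chosen as the number of clusters.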
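
    kmeans() starts from random centers, so a single run per K can make the elbow curve bumpy and hard to read. A common remedy is the nstart argument, which keeps the best of several random starts. The following is a sketch on a hypothetical matrix x; the seed and nstart = 25 are arbitrary choices.

```r
set.seed(11)
# Hypothetical stand-in for hc_scale: 200 standardized observations of 4 variables.
x <- scale(matrix(rnorm(200 * 4), nrow = 200))

ks <- 1:10
# nstart = 25 keeps the best of 25 random starts for each K.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within group sum of squares")
```

    With K = 1 the value equals totss, which gives a useful reference point for the rest of the curve.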


    3-7. Change in betweenss as K varies

    between <- c()
    for(i in 1:10){
      # between-cluster sum of squares for K = i
      between[i] <- kmeans(hc_scale, centers = i)$betweenss
    }
    plot(1:10, between, type="b", xlab = "Number of Clusters", ylab = "between group sum of squares")

    → As K increases, betweenss increases.

    → As K increases, the dispersion between clusters grows.


    3-8. Change in accuracy as K varies

    The "accuracy" is computed as betweenss / totss * 100, and K is varied from 1 to 10 to observe how it changes.


    → Strictly speaking, this is not accuracy but an evaluation measure for the clustering.


    bet_ss <- c()
    for(i in 1:10){
         kms <- kmeans(hc_scale , i)
         bet_ss[i] <- round(kms$betweenss / kms$totss * 100,1)
    }
    y_name = paste("between_ss","\n", "/", "\n", "total_ss", collapse = '')
    par(oma=c(0,1,0,0))        # outer margins (bottom, left, top, right)
    par(mgp=c(1,0.1,0))       # axis spacing (axis title, tick labels, axis line)
    plot(1:10, bet_ss, type="b", 
            xlab = "Number of Clusters",
            ylab = y_name, ylim=c(0,100), las = 1)

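    → As K increases, this ratio rises. That is expected, since betweenss grows as clusters are added, so a high ratio alone does not justify a large K.


    → To do: add a gap statistic test (reference)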
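
    As a starting point for that to-do item, here is a minimal sketch of the gap statistic using clusGap() and maxSE() from the cluster package. The matrix x is hypothetical data with three shifted groups, not the wine sample, and B = 50 and nstart = 10 are arbitrary choices made to keep the run short.

```r
library(cluster)   # recommended package shipped with standard R installations

set.seed(5)
# Hypothetical stand-in for hc_scale: three shifted groups in two dimensions.
x <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 5))
))

# B reference data sets are simulated; a larger B gives a steadier estimate.
gap <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 10)

# Choose K with the standard-error rule of Tibshirani et al. (2001).
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
best_k
plot(gap, main = "Gap statistic")
```

    For the wine sample, x would be replaced by hc_scale; the suggested K can then be compared with the NbClust() vote and the elbow plot.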




    - 이정진, R, SAS, MS-SQL을 활용한 데이터마이닝 (Data Mining Using R, SAS, and MS-SQL)

    - [R Analysis] Hierarchical clustering (tistory.com)

    - K-means Cluster Analysis Using R (rstudio-pubs-static.s3.amazonaws.com)

    - [R Analysis] Non-hierarchical clustering (k-means clustering) (tistory.com)


