The data is drawn from the Data Science Dojo dataset repository; to access the data-set, follow this URL: https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/User%20Knowledge%20Modeling

Setting the working directory and loading the required libraries

#setting a working directory
setwd("E:/Datasets")

#libraries 
library(tidyverse) 
library(Amelia) #checking for missing values 
library(skimr) #generates descriptive statistics 
library(caTools) #splitting the data
library(class) #knn package
library(mclust) #model-based clustering; BIC for choosing k 
library(rio) #importing data
install_formats()
## [1] TRUE

Preparing the data-sets

#read data from Microsoft Excel into R 
data_1<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
                   sheet="Training_Data")
data_2<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
                   sheet="Test_Data")

#combine the two data-sets so that we can randomly split the data using the caTools package
raw_data<-data_1 %>% 
  rbind(data_2)

#subsetting the data: keep only the first six columns
raw_data<-raw_data[,1:6]

#reading UNS as character 
raw_data$UNS<-as.character(raw_data$UNS)
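
As a quick added check (not part of the original script), we can tabulate the class labels to confirm how the UNS knowledge levels are coded:

#tabulate the UNS class labels
table(raw_data$UNS)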

Descriptive Analysis

#structure of raw_data
str(raw_data)
## 'data.frame':    403 obs. of  6 variables:
##  $ STG: num  0 0.08 0.06 0.1 0.08 0.09 0.1 0.15 0.2 0 ...
##  $ SCG: num  0 0.08 0.06 0.1 0.08 0.15 0.1 0.02 0.14 0 ...
##  $ STR: num  0 0.1 0.05 0.15 0.08 0.4 0.43 0.34 0.35 0.5 ...
##  $ LPR: num  0 0.24 0.25 0.65 0.98 0.1 0.29 0.4 0.72 0.2 ...
##  $ PEG: num  0 0.9 0.33 0.3 0.24 0.66 0.56 0.01 0.25 0.85 ...
##  $ UNS: chr  "very_low" "High" "Low" "Middle" ...
#plot a map of missing values
missmap(raw_data,main = "Missing Values Plot",
        col=c("red","sky blue"),
        legend = T,
        las=F)
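
As a complementary check to the plot (an added line, not in the original script), the missing values per column can also be counted directly in base R:

#count missing values per column
colSums(is.na(raw_data))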

#Exploring the data-set
skim(raw_data)
Data summary
Name raw_data
Number of rows 403
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
UNS 0 1 3 8 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
STG 0 1 0.35 0.21 0 0.20 0.30 0.48 0.99 ▅▇▃▂▁
SCG 0 1 0.36 0.22 0 0.20 0.30 0.51 0.90 ▅▇▃▃▂
STR 0 1 0.46 0.25 0 0.26 0.44 0.68 0.95 ▆▇▆▇▅
LPR 0 1 0.43 0.26 0 0.25 0.33 0.65 0.99 ▅▇▃▅▂
PEG 0 1 0.46 0.27 0 0.25 0.40 0.66 0.99 ▅▇▃▅▃

The data-set is well normalized, as most values range between 0 and 1; therefore we can proceed with running the algorithm.
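
To back up this claim, a quick added range check shows the minimum and maximum of each numeric attribute:

#verify that every numeric attribute lies between 0 and 1
sapply(raw_data[1:5], range)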

Classification using KNN Algorithm

  1. Elbow Method
#determining the optimal k value
# function to compute total within-cluster sum of squares 
wss <- function(k) {
  kmeans(raw_data[1:5], k, nstart = 10 )$tot.withinss
}


# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# compute wss for 1 to 15 clusters
wss_values <- map_dbl(k.values, wss)

#elbow graph 
plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

With the elbow method it is difficult to determine the optimal k value, as the curve shows no well-defined bend; therefore we go a step further and use the Bayesian Information Criterion.

  2. Bayesian Information Criterion for k-Means
#fit Gaussian mixture models with 1 to 15 components and compare them by BIC
d_clust <- Mclust(as.matrix(raw_data[1:5]), G=1:15, 
                  modelNames = mclust.options("emModelNames"))

#best k-values
plot(d_clust$BIC,
     las=1,
     cex=0.4,
     ylab = "Bayesian Inference Criterion(BIC)")

From the graph, the best k value is 8. Therefore, we will use k = 8 neighbours in the KNN model.
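
Rather than reading the value off the graph alone, the Mclust object also reports the selected number of components directly; the short added check below assumes the d_clust fit from above:

#number of mixture components selected by BIC
d_clust$G

#top models ranked by BIC
summary(d_clust$BIC)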

  3. KNN Model
#randomization
set.seed(1234)

#splitting the data
sample<-sample.split(raw_data$UNS,SplitRatio = 0.7)
training_data<-subset(raw_data[1:5],sample==T)
testing_data<-subset(raw_data[1:5],sample==F)
training_labels<-subset(raw_data[,6],sample==T)
testing_labels<-subset(raw_data[,6],sample==F)


#KNN model
predicted.rank<-knn(train=training_data,test=testing_data,cl=training_labels,k=8)

#Accuracy of the model 
misclassification.error<-mean(testing_labels!=predicted.rank)
Accuracy<-round((1-misclassification.error)*100,2)

The accuracy of the KNN model in predicting the knowledge level of the users is 79.51%.
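
Beyond the single accuracy figure, a confusion matrix (an added sketch using base R) shows which knowledge levels the model tends to confuse:

#confusion matrix of predicted vs actual knowledge levels
table(Predicted = predicted.rank, Actual = testing_labels)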