The data is drawn from the Data Science Dojo dataset repository; to access the data-set, follow the URL https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/User%20Knowledge%20Modeling
Setting the working directory and loading the required libraries
#setting a working directory
setwd("E:/Datasets")
#libraries
library(tidyverse)
library(Amelia)   #visualizing missing values
library(skimr)    #generating descriptive statistics
library(caTools)  #splitting the data
library(class)    #knn implementation
library(mclust)   #model-based clustering with BIC model selection
library(rio)      #importing data
install_formats()
## [1] TRUE
Preparing the data-sets
#read data from Microsoft Excel into R
data_1<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
               sheet="Training_Data")
data_2<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
               sheet="Test_Data")
#combine the two data-sets so that we can randomly re-split them with the caTools package
raw_data<-data_1 %>%
  rbind(data_2)
#subsetting the data: keep the five numeric features and the UNS label
raw_data<-raw_data[,1:6]
#reading UNS as character
raw_data$UNS<-as.character(raw_data$UNS)
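Before profiling the data it is worth tabulating the class labels. This quick check is an optional addition to the original workflow: because the labels come from two different Excel sheets, any inconsistent spelling of a level would show up here as an extra category.
#optional sanity check: distribution of the class labels
table(raw_data$UNS)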
Descriptive Analysis
#structure of raw_data
str(raw_data)
## 'data.frame': 403 obs. of 6 variables:
## $ STG: num 0 0.08 0.06 0.1 0.08 0.09 0.1 0.15 0.2 0 ...
## $ SCG: num 0 0.08 0.06 0.1 0.08 0.15 0.1 0.02 0.14 0 ...
## $ STR: num 0 0.1 0.05 0.15 0.08 0.4 0.43 0.34 0.35 0.5 ...
## $ LPR: num 0 0.24 0.25 0.65 0.98 0.1 0.29 0.4 0.72 0.2 ...
## $ PEG: num 0 0.9 0.33 0.3 0.24 0.66 0.56 0.01 0.25 0.85 ...
## $ UNS: chr "very_low" "High" "Low" "Middle" ...
missmap(raw_data,main = "Missing Values Plot",
        col=c("red","sky blue"),
        legend = TRUE)
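The map gives a visual answer; for a numeric confirmation, this optional one-liner counts the missing values per column:
#optional: count missing values in each column
colSums(is.na(raw_data))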
#Exploring the data-set
skim(raw_data)
Data summary | |
---|---|
Name | raw_data |
Number of rows | 403 |
Number of columns | 6 |
Column type frequency: character | 1 |
Column type frequency: numeric | 5 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
UNS | 0 | 1 | 3 | 8 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
STG | 0 | 1 | 0.35 | 0.21 | 0 | 0.20 | 0.30 | 0.48 | 0.99 | ▅▇▃▂▁ |
SCG | 0 | 1 | 0.36 | 0.22 | 0 | 0.20 | 0.30 | 0.51 | 0.90 | ▅▇▃▃▂ |
STR | 0 | 1 | 0.46 | 0.25 | 0 | 0.26 | 0.44 | 0.68 | 0.95 | ▆▇▆▇▅ |
LPR | 0 | 1 | 0.43 | 0.26 | 0 | 0.25 | 0.33 | 0.65 | 0.99 | ▅▇▃▅▂ |
PEG | 0 | 1 | 0.46 | 0.27 | 0 | 0.25 | 0.40 | 0.66 | 0.99 | ▅▇▃▅▃ |
The data-set is already normalized: every numeric variable ranges between 0 and 1 and there are no missing values. Therefore we can proceed with running the algorithm.
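To verify this directly rather than reading it off the skim table, an optional one-line check returns each feature's minimum and maximum:
#optional: min and max of each numeric feature
sapply(raw_data[1:5], range)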
Classification using the KNN Algorithm
#determining the optimal k value
#function to compute the total within-cluster sum of squares
wss <- function(k) {
  kmeans(raw_data[1:5], k, nstart = 10)$tot.withinss
}
#compute wss for k = 1 to k = 15
k.values <- 1:15
#extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)
#elbow graph
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
With the elbow method it is difficult to determine the optimal k value here because the bend in the curve is not well defined. We therefore go a step further and use the Bayesian Information Criterion (BIC).
d_clust <- Mclust(as.matrix(raw_data[1:5]), G=1:15,
modelNames = mclust.options("emModelNames"))
#best k-values
plot(d_clust$BIC,
     las=1,
     cex=0.4,
     ylab = "Bayesian Information Criterion (BIC)")
From the graph, the best k value is 8. Therefore, we will use k = 8 in the KNN model.
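Rather than reading the winner off the plot, the selected number of components can also be pulled straight from the fitted object; Mclust stores it in the G component (this check is an optional addition):
#optional: number of mixture components selected by BIC
d_clust$G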
#set a seed so the random split is reproducible
set.seed(1234)
#splitting the data 70/30; sample.split preserves the class proportions
sample<-sample.split(raw_data$UNS,SplitRatio = 0.7)
training_data<-subset(raw_data[1:5],sample==TRUE)
testing_data<-subset(raw_data[1:5],sample==FALSE)
training_labels<-subset(raw_data[,6],sample==TRUE)
testing_labels<-subset(raw_data[,6],sample==FALSE)
#KNN model
predicted.rank<-knn(train=training_data,test=testing_data,cl=training_labels,k=8)
#accuracy of the model
misclassification.error<-mean(testing_labels!=predicted.rank)
Accuracy<-round((1-misclassification.error)*100,2)
The accuracy of the KNN model in predicting the knowledge level of the users is 79.51%.
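A single accuracy figure can hide class-level weaknesses; an optional confusion matrix shows which knowledge levels the model confuses with one another:
#optional: confusion matrix of predicted vs actual labels
table(Predicted = predicted.rank, Actual = testing_labels)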