The data is drawn from the Data Science Dojo dataset repository; to access the data-set, follow the URL https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/User%20Knowledge%20Modeling
Setting the working directory and loading the required libraries
#setting a working directory
setwd("E:/Datasets")
#libraries
library(tidyverse)
library(Amelia)   #visualizing missing values
library(skimr)    #generating descriptive statistics
library(caTools)  #splitting the data
library(class)    #knn implementation
library(mclust)   #model-based clustering with BIC model selection
library(rio)      #importing data
install_formats()
## [1] TRUE
Preparing the data-sets
#read data from Microsoft Excel into R
data_1<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
               sheet="Training_Data")
data_2<-import("Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls",
               sheet="Test_Data")
#combine the two data-sets so that we can randomly re-split them with the caTools package
raw_data<-data_1 %>%
  rbind(data_2)
#subsetting the data: keep the five numeric features and the UNS label
raw_data<-raw_data[,1:6]
#reading UNS as character
raw_data$UNS<-as.character(raw_data$UNS)
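Before profiling the data it is worth tabulating the class labels. This quick check is an optional addition to the original workflow: because the labels come from two different Excel sheets, any inconsistent spelling of a level would show up here as an extra category.
#optional sanity check: distribution of the class labels
table(raw_data$UNS)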
Descriptive Analysis
#structure of raw_data
str(raw_data)
## 'data.frame': 403 obs. of 6 variables:
## $ STG: num 0 0.08 0.06 0.1 0.08 0.09 0.1 0.15 0.2 0 ...
## $ SCG: num 0 0.08 0.06 0.1 0.08 0.15 0.1 0.02 0.14 0 ...
## $ STR: num 0 0.1 0.05 0.15 0.08 0.4 0.43 0.34 0.35 0.5 ...
## $ LPR: num 0 0.24 0.25 0.65 0.98 0.1 0.29 0.4 0.72 0.2 ...
## $ PEG: num 0 0.9 0.33 0.3 0.24 0.66 0.56 0.01 0.25 0.85 ...
## $ UNS: chr "very_low" "High" "Low" "Middle" ...
missmap(raw_data,main = "Missing Values Plot",
        col=c("red","sky blue"),
        legend = TRUE)
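The map gives a visual answer; for a numeric confirmation, this optional one-liner counts the missing values per column:
#optional: count missing values in each column
colSums(is.na(raw_data))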
#Exploring the data-set
skim(raw_data)
Data summary | |
---|---|
Name | raw_data |
Number of rows | 403 |
Number of columns | 6 |
Column type frequency: character | 1 |
Column type frequency: numeric | 5 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
UNS | 0 | 1 | 3 | 8 | 0 | 5 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
STG | 0 | 1 | 0.35 | 0.21 | 0 | 0.20 | 0.30 | 0.48 | 0.99 | ▅▇▃▂▁ |
SCG | 0 | 1 | 0.36 | 0.22 | 0 | 0.20 | 0.30 | 0.51 | 0.90 | ▅▇▃▃▂ |
STR | 0 | 1 | 0.46 | 0.25 | 0 | 0.26 | 0.44 | 0.68 | 0.95 | ▆▇▆▇▅ |
LPR | 0 | 1 | 0.43 | 0.26 | 0 | 0.25 | 0.33 | 0.65 | 0.99 | ▅▇▃▅▂ |
PEG | 0 | 1 | 0.46 | 0.27 | 0 | 0.25 | 0.40 | 0.66 | 0.99 | ▅▇▃▅▃ |
The data-set is already normalized: every numeric variable ranges between 0 and 1 and there are no missing values. Therefore we can proceed with running the algorithm.
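To verify this directly rather than reading it off the skim table, an optional one-line check returns each feature's minimum and maximum:
#optional: min and max of each numeric feature
sapply(raw_data[1:5], range)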
Classification using the KNN Algorithm
#determining the optimal k value
#function to compute the total within-cluster sum of squares
wss <- function(k) {
  kmeans(raw_data[1:5], k, nstart = 10)$tot.withinss
}
#compute wss for k = 1 to k = 15
k.values <- 1:15
#extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)
#elbow graph
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
With the elbow method it is difficult to determine the optimal k value here because the bend in the curve is not well defined. We therefore go a step further and use the Bayesian Information Criterion (BIC).
d_clust <- Mclust(as.matrix(raw_data[1:5]), G=1:15,
modelNames = mclust.options("emModelNames"))
#best k-values
plot(d_clust$BIC,
     las=1,
     cex=0.4,
     ylab = "Bayesian Information Criterion (BIC)")
From the graph, the best k value is 8. Therefore, we will use k = 8 in the KNN model.
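Rather than reading the winner off the plot, the selected number of components can also be pulled straight from the fitted object; Mclust stores it in the G component (this check is an optional addition):
#optional: number of mixture components selected by BIC
d_clust$G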
#set a seed so the random split is reproducible
set.seed(1234)
#splitting the data 70/30; sample.split preserves the class proportions
sample<-sample.split(raw_data$UNS,SplitRatio = 0.7)
training_data<-subset(raw_data[1:5],sample==TRUE)
testing_data<-subset(raw_data[1:5],sample==FALSE)
training_labels<-subset(raw_data[,6],sample==TRUE)
testing_labels<-subset(raw_data[,6],sample==FALSE)
#KNN model
predicted.rank<-knn(train=training_data,test=testing_data,cl=training_labels,k=8)
#accuracy of the model
misclassification.error<-mean(testing_labels!=predicted.rank)
Accuracy<-round((1-misclassification.error)*100,2)
The accuracy of the KNN model in predicting the knowledge level of the users is 79.51%.
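A single accuracy figure can hide class-level weaknesses; an optional confusion matrix shows which knowledge levels the model confuses with one another:
#optional: confusion matrix of predicted vs actual labels
table(Predicted = predicted.rank, Actual = testing_labels)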