anandhakrishnanh/Naive-Bayes-Spam-Filter

Contributors Forks Stargazers Issues MIT License


Naive Bayes Spam Filter

A simple implementation of the Naive Bayes classifier, written entirely in C with no libraries beyond the standard library

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This is a simple implementation of the classic Naive Bayes classifier. It decides whether a given text is spam or not, based on the training data it has been given.

Right now the program requires you to train and test everything one by one (yes, I know it sucks), but I don't work in C anymore, so I'm not sure whether I'm going to update it to be properly usable.

Built With

This is built entirely on the standard C headers stdio.h and string.h

Usage

To train and test the classifier, compile and run spam.c. The first part of the program trains the filter, and the second part tests it on an unknown text.

How it Works

  1. First, during the training phase, the filter learns how many times each word has been used. Once training is done, the filter calculates the probability that a sequence of words is spam or not, using the set of words it has already seen during training.
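/* Store the word that was just read and count it as spam or ham. */
strcpy(learn[struct_control].words, temp);
if (spamorham == 1)
    learn[struct_control].spam_word = learn[struct_control].spam_word + 1;
else
    learn[struct_control].ham_word = learn[struct_control].ham_word + 1;
temp_control = 0;                 /* reset the write position in temp */
struct_control = struct_control + 1;
memset(temp, 0, sizeof temp);     /* clear the buffer; the fill value is an int, not NULL */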

Here the words are stored in an array, and whenever the same word is seen again its frequency is updated instead of the word being stored a second time.
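The lookup-then-update step described above can be sketched as follows. This is a minimal illustration, not the code from spam.c: the struct layout, the capacity constants `MAX_WORDS` and `MAX_LEN`, and the helper names `find_word` and `learn_word` are all assumptions made for the example.

```c
#include <string.h>

#define MAX_WORDS 1000   /* assumed table capacity, not taken from spam.c */
#define MAX_LEN   50     /* assumed maximum word length, including the terminator */

/* Hypothetical record mirroring the fields used in the snippet above. */
struct word_count {
    char words[MAX_LEN];
    int spam_word;       /* times the word was seen in spam training text */
    int ham_word;        /* times the word was seen in ham training text */
};

/* Return the index of `word` in the table, or -1 if it is not stored yet. */
static int find_word(const struct word_count *table, int count, const char *word)
{
    for (int i = 0; i < count; i++) {
        if (strcmp(table[i].words, word) == 0)
            return i;
    }
    return -1;
}

/* Record one occurrence of `word`; returns the new number of stored words. */
static int learn_word(struct word_count *table, int count,
                      const char *word, int is_spam)
{
    int i = find_word(table, count, word);
    if (i < 0) {
        if (count >= MAX_WORDS)
            return count;                     /* table full: drop the word */
        i = count++;
        strncpy(table[i].words, word, MAX_LEN - 1);
        table[i].words[MAX_LEN - 1] = '\0';   /* strncpy may not terminate */
        table[i].spam_word = 0;
        table[i].ham_word = 0;
    }
    if (is_spam)
        table[i].spam_word++;
    else
        table[i].ham_word++;
    return count;
}
```

A linear search keeps the sketch within string.h; for a large vocabulary a hash table would be the usual replacement.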

  2. During the testing phase, we calculate the probability that the test sentence is HAM (not spam) or SPAM using

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$

Here,

  • Pr(S|W) is the probability that a message is spam, knowing that the word W is in it
  • Pr(S) is the overall probability that any given message is spam
  • Pr(W|S) is the probability that the word appears in spam messages
  • Pr(H) is the overall probability that any given message is not SPAM (is HAM)
  • Pr(W|H) is the probability that the word appears in ham messages

So, according to this, if we know how often the word in question occurs in spam messages and in ham messages, we can calculate this probability for that particular word.
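As a rough sketch of how the terms defined above combine, the function below computes Pr(S|W) from raw counts using Bayes' rule. The function name `spam_given_word`, its parameters, and the fallback for unseen words are assumptions made for this example and do not come from spam.c.

```c
/* Pr(S|W) = Pr(W|S)*Pr(S) / (Pr(W|S)*Pr(S) + Pr(W|H)*Pr(H))
 *
 * word_in_spam / total_spam estimates Pr(W|S), and likewise for ham.
 * pr_spam is the prior Pr(S); Pr(H) is taken to be 1 - Pr(S).
 * Assumption: when the word was never seen, the prior is returned
 * unchanged so that the function never evaluates 0/0. */
static double spam_given_word(int word_in_spam, int total_spam,
                              int word_in_ham, int total_ham,
                              double pr_spam)
{
    double pr_w_s = total_spam > 0 ? (double)word_in_spam / total_spam : 0.0;
    double pr_w_h = total_ham > 0 ? (double)word_in_ham / total_ham : 0.0;
    double num = pr_w_s * pr_spam;
    double den = num + pr_w_h * (1.0 - pr_spam);

    if (den == 0.0)
        return pr_spam;          /* no evidence either way */
    return num / den;
}
```

For example, a word seen 3 times in 10 spam words and once in 10 ham words, with Pr(S) = 0.5, gives 0.15 / (0.15 + 0.05) = 0.75. With equal priors the priors cancel, leaving Pr(W|S) / (Pr(W|S) + Pr(W|H)).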

  3. The frequency of every word in the test case is counted
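/* Store the word that was just read from the test text and count it. */
strcpy(test[test_control].words, temp);
test[test_control].freq = test[test_control].freq + 1;
temp_control = 0;                 /* reset the write position in temp */
test_control = test_control + 1;
memset(temp, 0, sizeof temp);     /* clear the buffer; the fill value is an int, not NULL */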
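For illustration, the counting step could be written as one function that splits a sentence into words and tallies each distinct word. This is a sketch under several assumptions: the struct, the constant `MAX_WORD_LEN`, and the name `count_words` are hypothetical, words are split on any non-alphanumeric character, and they are lowercased using ctype.h, which the original project does not use.

```c
#include <ctype.h>
#include <string.h>

#define MAX_WORD_LEN 50   /* assumed maximum word length, including the terminator */

/* Hypothetical record mirroring the fields used in the snippet above. */
struct test_word {
    char words[MAX_WORD_LEN];
    int freq;             /* occurrences of the word in the test text */
};

/* Split `text` on non-alphanumeric characters and count each distinct word.
 * Returns the number of distinct words written to `out` (at most max_out). */
static int count_words(const char *text, struct test_word *out, int max_out)
{
    char temp[MAX_WORD_LEN];
    int len = 0;
    int count = 0;

    for (const char *p = text; ; p++) {
        if (*p != '\0' && isalnum((unsigned char)*p)) {
            if (len < MAX_WORD_LEN - 1)        /* overlong words are truncated */
                temp[len++] = (char)tolower((unsigned char)*p);
            continue;
        }
        if (len > 0) {                         /* a word just ended */
            int i;
            temp[len] = '\0';
            for (i = 0; i < count; i++) {
                if (strcmp(out[i].words, temp) == 0)
                    break;
            }
            if (i < count) {
                out[i].freq++;                 /* seen before: bump the count */
            } else if (count < max_out) {
                strcpy(out[count].words, temp);
                out[count].freq = 1;
                count++;
            }
            len = 0;
        }
        if (*p == '\0')
            break;
    }
    return count;
}
```

Called on "Win cash now, win big", this yields four distinct words, with "win" counted twice.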

Now, using these frequencies, we calculate the probability for each word:

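/* Estimates of Pr(W|S) and Pr(W|H) from the spam and ham counts. */
wbys = (float)test[j].spamfreq / (float)total_spam;
wbyh = (float)test[j].hamfreq / (float)total_ham;
/* Pr(S|W) with equal priors; this is 0/0 when both counts are zero. */
sbyw = wbys / (wbys + wbyh);
/* Blend sbyw with a neutral 0.5, using a weight of 3 and the word's frequency. */
prob[j] = (3.0f * 0.5f + (float)test[j].freq * sbyw) / (3.0f * (float)test[j].freq);
printf("\n%d/%d=%f\n", test[j].spamfreq, total_spam, wbys);
printf("\n%d/%d=%f\n", test[j].hamfreq, total_ham, wbyh);
printf("\nw/s=%f  w/h=%f  prob=%f\n", wbys, wbyh, prob[j]);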
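The constants 3 and 0.5 in the snippet above resemble the strength and neutral prior of a commonly cited correction for rarely seen words, (s * prior + n * p) / (s + n). That form divides by s + n, whereas the snippet divides by 3 * freq; whether the difference is intentional is not stated here. For comparison only, a sketch of the commonly cited form follows. The name `corrected_prob` and its parameters are hypothetical.

```c
/* Commonly cited rare-word correction:
 *
 *     (s * prior + n * p) / (s + n)
 *
 * s     - strength given to the prior (must be > 0)
 * prior - probability assumed when nothing is known about the word
 * n     - number of times the word has been seen
 * p     - uncorrected Pr(S|W) for the word
 *
 * With n == 0 the result is the prior; as n grows it approaches p. */
static double corrected_prob(double s, double prior, int n, double p)
{
    return (s * prior + n * p) / (s + n);
}
```

With s = 3 and prior = 0.5, a word seen three times with p = 1.0 is pulled back to (1.5 + 3) / 6 = 0.75, so a few sightings cannot push the estimate all the way to certainty.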

Contact

Anandha Krishnan H - [email protected]

Project Link: https:/anandhakrishnanh/Naive-Bayes-Spam-Filter
