GitHub - anandhakrishnanh/Naive-Bayes-Spam-Filter: This is to make a Naive Bayes spam filter

Naive Bayes Spam Filter

A simple implementation of the Naive Bayes classifier, written entirely in C without any external libraries

Table of Contents

About The Project
- Built With
Usage
How it Works
Contact

About The Project

This is a simple implementation of the classic Naive Bayes classifier. It decides whether a given text is spam or not, based on the data it was trained on.

Right now the program requires you to train and test everything one by one (yes, I know it sucks), but I don't work in C anymore, so I'm not sure whether I'll update it to be more usable.

Built With

This is built entirely on the standard C headers stdio.h and string.h.

Usage

To train and test the classifier, compile and run spam.c. The first part of the program handles training and the second part tests an unknown text.

How it Works

First, during the training phase, the filter counts how many times each word is used in spam and in ham. Once training is done, the filter calculates the probability that a sequence of words is spam or not, using the set of words it saw during training.

/* Store the word and count it under the current label. */
strcpy(learn[struct_control].words, temp);
if (spamorham == 1)
    learn[struct_control].spam_word += 1;   /* seen in spam */
else
    learn[struct_control].ham_word += 1;    /* seen in ham */
temp_control = 0;                /* restart the word buffer index */
struct_control += 1;             /* move to the next slot in the array */
memset(temp, 0, sizeof temp);    /* clear the word buffer */

Here the words are stored in an array. Whenever the same word is seen again, its frequency is updated instead of the word being stored a second time.
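The update-instead-of-store behaviour described above can be sketched as a linear search followed by either an increment or an append. This is a minimal sketch, not the code in spam.c: the field names (words, spam_word, ham_word) follow the snippet, while MAX_WORDS, MAX_LEN, the struct name, and the helper count_word are assumptions made for illustration.

```c
#include <string.h>

#define MAX_WORDS 1000  /* assumed table capacity */
#define MAX_LEN   32    /* assumed maximum word length, including '\0' */

struct word_entry {
    char words[MAX_LEN];
    int spam_word;  /* occurrences seen in spam training text */
    int ham_word;   /* occurrences seen in ham training text */
};

/* Hypothetical helper: record one occurrence of `word`.
 * spamorham == 1 counts it as spam, anything else as ham.
 * Returns the index of the entry, or -1 if the table is full. */
int count_word(struct word_entry *learn, int *struct_control,
               const char *word, int spamorham)
{
    char key[MAX_LEN];
    int i;

    /* Truncate over-long words first so lookups and stored keys agree. */
    strncpy(key, word, MAX_LEN - 1);
    key[MAX_LEN - 1] = '\0';

    /* Look for an existing entry for this word. */
    for (i = 0; i < *struct_control; i++) {
        if (strcmp(learn[i].words, key) == 0)
            break;
    }

    /* Not found: append a new entry with zeroed counters. */
    if (i == *struct_control) {
        if (*struct_control >= MAX_WORDS)
            return -1;
        strcpy(learn[i].words, key);
        learn[i].spam_word = 0;
        learn[i].ham_word = 0;
        (*struct_control)++;
    }

    if (spamorham == 1)
        learn[i].spam_word++;
    else
        learn[i].ham_word++;
    return i;
}
```

Calling count_word(table, &n, "free", 1) twice leaves a single entry for "free" with spam_word equal to 2. A linear search keeps the sketch short; a hash table would be the usual choice for a large vocabulary.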

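During the testing phase, we calculate the probability of the test sentence being SPAM or HAM (not spam) using

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$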

Here,

Pr(S|W) is the probability that a message is spam, knowing that the word W is in it
Pr(S) is the overall probability that any given message is spam
Pr(W|S) is the probability that the word appears in spam messages
Pr(H) is the overall probability that any given message is not SPAM (is HAM)
Pr(W|H) is the probability that the word appears in ham messages

So, if we know how many times the word in question occurs in spam messages and in ham messages, we can calculate this probability for that particular word.
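As a rough illustration, the function below computes Pr(S|W) from raw counts using Bayes' theorem with the terms defined above. It is a sketch under assumptions: the function name, the parameter names, and the neutral 0.5 returned for a word never seen in training are not taken from spam.c.

```c
/* Hypothetical helper: Pr(S|W) by Bayes' theorem.
 *   spam_count, ham_count : times the word was seen in spam / ham training text
 *   total_spam, total_ham : totals the counts are divided by
 *   p_spam                : prior Pr(S); Pr(H) is taken as 1 - p_spam */
double spam_given_word(int spam_count, int total_spam,
                       int ham_count, int total_ham, double p_spam)
{
    double p_w_s, p_w_h, num, den;

    /* Assumption: with no training data at all, stay neutral. */
    if (total_spam <= 0 || total_ham <= 0)
        return 0.5;

    p_w_s = (double)spam_count / total_spam;   /* Pr(W|S) */
    p_w_h = (double)ham_count / total_ham;     /* Pr(W|H) */

    num = p_w_s * p_spam;                      /* Pr(W|S) * Pr(S) */
    den = num + p_w_h * (1.0 - p_spam);        /* ... + Pr(W|H) * Pr(H) */

    /* Assumption: a word never seen in training gives no evidence. */
    if (den == 0.0)
        return 0.5;
    return num / den;
}
```

For example, a word seen 3 times out of 10 in spam and 1 time out of 10 in ham, with p_spam = 0.5, gives 0.15 / (0.15 + 0.05), roughly 0.75. When p_spam is 0.5 the priors cancel and the result reduces to Pr(W|S) / (Pr(W|S) + Pr(W|H)).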

The frequency of every word in the test case is then counted:

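/* Store the word from the test text and count it. */
strcpy(test[test_control].words, temp);
test[test_control].freq += 1;
temp_control = 0;                /* restart the word buffer index */
test_control += 1;               /* move to the next slot in the array */
memset(temp, 0, sizeof temp);    /* clear the word buffer */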
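The snippet above shows only the bookkeeping for a single word. A hedged sketch of the surrounding loop might look like the following: it walks the test text one character at a time, collects characters in temp, and counts each finished word. The names temp, temp_control, test_control, words, and freq follow the snippet; the helper names, the size limits, splitting on non-alphanumeric characters, and lower-casing are assumptions and may differ from spam.c.

```c
#include <ctype.h>
#include <string.h>

#define MAX_TEST_WORDS 256  /* assumed capacity */
#define MAX_LEN        32   /* assumed maximum word length, including '\0' */

struct test_entry {
    char words[MAX_LEN];
    int freq;  /* occurrences in the test text */
};

/* Hypothetical helper: bump the count for `word`, adding it if it is new. */
static void add_test_word(struct test_entry *test, int *test_control,
                          const char *word)
{
    int i;
    for (i = 0; i < *test_control; i++) {
        if (strcmp(test[i].words, word) == 0) {
            test[i].freq++;
            return;
        }
    }
    if (*test_control < MAX_TEST_WORDS) {
        strcpy(test[i].words, word);  /* word holds at most MAX_LEN - 1 chars */
        test[i].freq = 1;
        (*test_control)++;
    }
}

/* Split `text` on non-alphanumeric characters, lower-case each word
 * (an assumption), and count it. Returns the number of distinct words. */
int count_test_words(const char *text, struct test_entry *test)
{
    char temp[MAX_LEN];
    int temp_control = 0;
    int test_control = 0;
    const char *p;

    for (p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            if (temp_control < MAX_LEN - 1)  /* extra characters are dropped */
                temp[temp_control++] = (char)tolower(c);
        } else {
            if (temp_control > 0) {          /* a word just ended */
                temp[temp_control] = '\0';
                add_test_word(test, &test_control, temp);
                temp_control = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return test_control;
}
```

With the input "Win cash now, WIN big!" this yields four distinct words, with "win" counted twice.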

Now, using these frequencies, we calculate the probability for each word:

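/* Pr(W|S) and Pr(W|H), estimated from the training counts */
wbys = (float)test[j].spamfreq / (float)total_spam;
wbyh = (float)test[j].hamfreq / (float)total_ham;
/* Pr(S|W), taking Pr(S) = Pr(H) so that the priors cancel */
sbyw = wbys / (wbys + wbyh);
/* Combine a neutral 0.5 (weight 3) with sbyw weighted by the word's frequency */
prob[j] = (3.0f * 0.5f + (float)test[j].freq * sbyw) / (3.0f * (float)test[j].freq);
printf("\n%d/%d=%f\n", test[j].spamfreq, total_spam, wbys);
printf("\n%d/%d=%f\n", test[j].hamfreq, total_ham, wbyh);
printf("\nw/s=%f  w/h=%f  prob=%f\n", wbys, wbyh, prob[j]);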
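For comparison, a widely used way to damp the estimate for rarely seen words is the weighted average (s * x + n * p) / (s + n), where s is the weight given to a background guess x, n is how often the word was seen, and p is the raw Pr(S|W). The sketch below implements that textbook form, not the expression in spam.c, which divides by 3 * freq rather than 3 + freq. The function name and parameter names are assumptions; the values s = 3 and x = 0.5 mentioned afterwards are simply the constants that appear in the snippet.

```c
/* Textbook rare-word correction (NOT the expression used in spam.c):
 *   (s * x + n * p) / (s + n)
 *   s : strength given to the background guess
 *   x : background guess for an unseen word, e.g. 0.5
 *   n : number of times the word was seen
 *   p : raw Pr(S|W) estimate for the word */
double smoothed_spam_prob(double s, double x, int n, double p)
{
    double weight = s + n;

    /* With no weight on either side, fall back to the background guess. */
    if (weight <= 0.0)
        return x;
    return (s * x + n * p) / weight;
}
```

With s = 3 and x = 0.5, a word seen once with p = 1.0 is pulled back to 0.625, while a word never seen stays at 0.5. As n grows, the result approaches p.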

Contact

Anandha Krishnan H - [email protected]

Project Link: https:/anandhakrishnanh/Naive-Bayes-Spam-Filter
