
Naive bayes #68

Merged: 10 commits merged into ddbourgin:master on Jun 23, 2021

Conversation

@sfsf9797 (Contributor) commented May 15, 2021:

  • What I did

Implemented a Gaussian naive Bayes class and basic documentation.

  • How I did it

Referred to the formula for Gaussian naive Bayes from the notes.

  • How to verify it

Compare the performance with GaussianNB from scikit-learn (see the sketch below).
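For reference, a minimal version of that check might look like the sketch below. The GaussianNBClassifier name and its usage are assumptions for illustration, not necessarily this repo's actual API.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# toy dataset for the comparison
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# reference accuracy from scikit-learn
ref = GaussianNB().fit(X, y)
print("sklearn accuracy:", np.mean(ref.predict(X) == y))

# hypothetical usage of the class added in this PR; the real import
# path and method names may differ
# mine = GaussianNBClassifier().fit(X, y)
# print("PR accuracy:", np.mean(mine.predict(X) == y))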

@ddbourgin (Owner) commented:

Amazing! At first glance this looks great, @sfsf9797! I'm pretty slammed with work right now, but am going to have a look at this this weekend.

@ddbourgin self-assigned this May 25, 2021
Comment on lines 112 to 118
prob = -self.n_features / 2 * np.log(2 * np.pi) - 0.5 * np.sum(
    np.log(sigma)
)
prob -= 0.5 * np.sum(np.power(X - mean, 2) / sigma, 1)

joint_log_likelihood = prior + prob
return joint_log_likelihood
@ddbourgin (Owner):

I think there are a few errors here.

  • The log Gaussian likelihood calc in prob isn't quite right (see later commit for details)
  • In the joint_log_likelihood = prior + prob line, you're adding a log-transformed vector (prob) to raw probabilities (prior), which doesn't make sense. I think what you want is np.log(prior) + prob (see the sketch below and my later commit for details).
  • This isn't a joint likelihood, right? You're computing the joint class likelihood in prob, but once you add in the prior, the result will be proportional to the log class posterior.
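As a tiny illustration of the second bullet, with toy stand-ins for the PR's prior and prob (a hedged sketch, not the repo's code):

import numpy as np

# toy stand-ins: `prior` is a raw class probability P(y = c); `prob`
# holds per-sample Gaussian log likelihoods log P(x_i | y = c)
prior = 0.3
prob = np.array([-4.2, -1.7])

# combine in log space; the result is proportional to the log class
# posterior, not a joint log likelihood
log_posterior = np.log(prior) + prob
print(log_posterior)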


def prob(self, X, mean, sigma, prior):
    """
    compute the joint log likelihood of data based on gaussian distribution
@ddbourgin (Owner) commented May 30, 2021:

Nomenclature: I'm not sure it's right to call this the joint log likelihood? That is, this function computes the unnormalized quantity P(y = c | X, mean_c, sigma_c), which I'd think is the class posterior.
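For reference, by Bayes' rule P(y = c | X) ∝ P(X | y = c) * P(y = c), so in log space the quantity computed here is log P(X | y = c) + log P(y = c), which equals the log class posterior up to an additive normalizing constant.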

prior = P["prior"][class_idx]
sigsq = P["sigma"][class_idx]

# log likelihood = log X | N(mu, sigsq)
@ddbourgin (Owner):

@sfsf9797 this is what I believe the proper log likelihood + log class posterior calc should be. Let me know if you agree!
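Based on the snippet above, the full per-class calculation is presumably something like the following sketch. It assumes P stores per-class "prior", "mean", and "sigma" (per-feature variance) arrays; the function name is made up for illustration.

import numpy as np

def log_class_posterior(X, P, class_idx):
    """Unnormalized log P(y = c | X) for a single class c."""
    mu = P["mean"][class_idx]
    prior = P["prior"][class_idx]
    sigsq = P["sigma"][class_idx]

    # log likelihood = log P(X | mu, sigsq), summed over the
    # conditionally independent feature dimensions
    log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * sigsq))
    log_likelihood -= 0.5 * np.sum(((X - mu) ** 2) / sigsq, axis=1)

    # adding the log prior gives the unnormalized log class posterior
    return log_likelihood + np.log(prior)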

@ddbourgin (Owner) commented:

Hey @sfsf9797 - Thanks for the PR! I just had a more thorough look and committed a few changes. Brief summary:

  1. I moved this under the linear_models module rather than keeping it as a single hanging model.
  2. I think there might have been a few bugs in the original log likelihood calc. I've committed what I believe is the correct version, but please go through it and make sure you agree.
  3. I expanded the unit test you included -- comparing model accuracies is a good start (thanks!), but I think it actually masked some problems with the implementation. In particular, testing on multiple random cases revealed mismatches in accuracy between sklearn and the current implementation, and comparing the actual class probabilities (rather than just the predictions) revealed a bug in the log posterior calculation (see the sketch after this list).
  4. I expanded the documentation to provide a better overview of the model.

Please feel free to make adjustments or ask questions. Once we both agree on the implementation and are happy with the model performance, I'm happy to merge.
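For point 3, the stricter style of test could look like the sketch below; GaussianNBClassifier and its predict_proba are assumed names, not necessarily the repo's actual API.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# repeat over several random datasets so a single lucky seed can't
# mask a mismatch
for seed in range(5):
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    ref_proba = GaussianNB().fit(X, y).predict_proba(X)

    # hypothetical comparison against this PR's implementation:
    # mine_proba = GaussianNBClassifier().fit(X, y).predict_proba(X)
    # assert np.allclose(ref_proba, mine_proba, atol=1e-6)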

@sfsf9797 (Contributor, Author) commented:

Hi, thank you so much for all the feedback. I will go through it all and get back to you this weekend.

@sfsf9797 (Contributor, Author) commented:

Sorry, I am kind of busy these few weeks, but I will get back to you within two weeks at the latest.

@sfsf9797 (Contributor, Author) left a comment:

Yeah, true, naive Bayes definitely belongs under linear models when the likelihood factors p(x|c) come from exponential families.

@sfsf9797 (Contributor, Author) commented:

Hi @ddbourgin, thanks for correcting my implementation. I have learned a lot from you. I am pretty satisfied with the model after a few rounds of checking. Lastly, thank you for all the comments!

@ddbourgin merged commit 1e10697 into ddbourgin:master on Jun 23, 2021
@ddbourgin (Owner) commented:

Awesome, merged! Thanks for the PR @sfsf9797 :)

@ddbourgin mentioned this pull request on Sep 23, 2021