
How to use with gridsearch? #1

Open
hurcy opened this issue Feb 23, 2017 · 4 comments

Comments

hurcy commented Feb 23, 2017

I got this error while fitting with GridSearchCV.

If no scoring is specified, the estimator passed should have a 'score' method. The estimator LdaTransformer(alpha='symmetric', chunksize=2000, decay=0.5,
distributed=False, eta=None, eval_every=10, gamma_threshold=0.001,
iterations=50, n_latent_topics=100, passes=1, update_every=1) does not.

So I read the manual (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator).
It says certain methods need to be implemented for an estimator to work with GridSearchCV.

How did you do it?

StevenReitsma (Owner) commented Feb 23, 2017

It's been a while since I made this. Back then, I don't think having a fit_transform() and a score() method was required. Adding fit_transform() should be trivial:

def fit_transform(self, X, y=None):
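    # Fit on X (and optional y), then transform the same X in one call.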
    return self.fit(X, y).transform(X)

For your error with the score() method: indeed, the last step in a pipeline or GridSearchCV needs a score() method. However, you usually don't have LDA or LSI as the last step in your pipeline, since it's often a preprocessing step for a classifier. In theory you could add a score() method to the LsiTransformer and LdaTransformer classes, but that wouldn't necessarily make sense: it's quite hard to determine the goodness of fit of LDA/LSI, since they just create topic embeddings, which aren't inherently good or bad. I would consider adding a classifier to your pipeline and using that, plus your document labels, to determine the goodness of fit of your LDA preprocessing (keeping the classifier parameters constant).
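
A minimal sketch of that setup, assuming X holds your documents in whatever format LdaTransformer's fit() expects and y holds the document labels; LogisticRegression is an arbitrary stand-in for the fixed classifier, and the exact import path for LdaTransformer is omitted since it depends on the package layout:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# LdaTransformer is assumed to be imported from this package.

pipeline = Pipeline([
    ('lda', LdaTransformer(n_latent_topics=100)),  # topic embeddings
    ('clf', LogisticRegression()),                 # supplies score()
])

# Tune only the LDA parameters; the classifier stays fixed.
param_grid = {'lda__n_latent_topics': [25, 50, 100, 200]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)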

Feel free to comment if you have further questions!

hurcy (Author) commented Feb 23, 2017

@StevenReitsma
Thanks for your answer. Now I understand why you named it LdaTransformer.

I think perplexity and topic coherence can be quantitative metrics for the goodness of fit. Since we need to choose the number of topics for LDA, I think a score() function could help pick the best number of topics.
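
For instance, gensim's CoherenceModel can compare topic counts directly, even outside GridSearchCV. A self-contained toy sketch (the three-document corpus below is made-up placeholder data):

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy placeholder data; replace with your tokenized documents.
texts = [["human", "interface", "computer"],
         ["graph", "trees", "computer"],
         ["graph", "minors", "trees"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train one model per candidate topic count and compare coherence.
for num_topics in (2, 3, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence='c_v')
    print(num_topics, cm.get_coherence())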

StevenReitsma (Owner) commented Feb 23, 2017

Thanks for those links. It looks like you can definitely use those metrics to get an approximation of the goodness of fit, and that should be fine if your ultimate goal is good topic coherence or good perplexity. However, in a real-world use case your goal is usually not good topic coherence or perplexity but good classification or regression performance. Hence my suggestion to add a classifier to your pipeline so you measure performance on your actual problem.

But again, if you're instead working on a research problem where the goal is good topic coherence, perplexity, or another such metric, then using those in a grid search should be a perfect solution! Adding that as a score() method to the classes shouldn't be too hard, since the underlying gensim models expose perplexity and topic coherence.
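
A rough sketch of what such a score() could look like, assuming the transformer keeps its fitted gensim model in an attribute, hypothetically named gensim_model here, and that X is a gensim bag-of-words corpus:

def score(self, X, y=None):
    # log_perplexity() returns gensim's per-word likelihood bound;
    # higher is better, which matches GridSearchCV's convention of
    # maximizing the score. `gensim_model` is a hypothetical attribute
    # name for the fitted gensim LdaModel.
    return self.gensim_model.log_perplexity(X)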

hurcy (Author) commented Feb 23, 2017

@StevenReitsma
Thanks again. I'll take your suggestion into account!
