
Bound for memory usage #81

Closed · raspooti opened this issue Oct 21, 2014 · 2 comments

@raspooti

First, thanks a lot for the great tool. I've been trying it out, and it seems like magic (except for some corner cases and websites it doesn't work on), but really cool :)

However, I tried it in a setting with scarce resources (1 GB of RAM), and I have the impression that memory keeps growing build after build until ... memory error. I deactivated article memoization, tried emptying the articles and dereferencing the sources, but it looks like a bunch of other things are also memoized and kept in memory, with no way to deactivate them. What is the best way to handle this? How does newspaper handle the increase in memory usage build after build? Is there a limit?

Thanks again for the magic tool :)
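For reference, a rough sketch of the steps described above, assuming newspaper's memoize_articles flag on build() and a placeholder source URL; as noted, this alone did not keep memory bounded:

```python
# Sketch only: disable memoization, empty the article list, drop the reference.
import newspaper

paper = newspaper.build("http://cnn.com", memoize_articles=False)  # placeholder URL
# ... download/parse paper.articles here, persisting results to disk ...
paper.articles = []   # "empty the articles"
paper = None          # "dereference the source"
```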

@raspooti
Author

Hi, it's me again :)
If you wrap your Python script in a shell script and run the scraping loop from the shell, the Python newspaper script is started fresh each time and memory usage stays bounded. But it's not practical; I wish there were a way to do it all in Python.

Or there's something I'm missing :) (Actually, there's a post on Stack Overflow about this very same issue...)
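A minimal sketch of the same idea kept entirely in Python: each build runs in a short-lived child process, so its memory is returned to the OS when the child exits. The site list and the per-article handling are placeholders, not anything confirmed in this thread.

```python
# Sketch only: run each newspaper build in a fresh interpreter so the parent
# process never accumulates memory across builds.
import subprocess
import sys

SITES = ["http://cnn.com", "http://slate.com"]  # placeholder source URLs

WORKER = """
import sys, newspaper
paper = newspaper.build(sys.argv[1], memoize_articles=False)
for article in paper.articles:
    article.download()
    article.parse()
    # persist article.title / article.text here instead of keeping them in memory
"""

for site in SITES:
    # Same effect as the shell-script workaround: all memory used by the build
    # is released when the child process exits.
    subprocess.check_call([sys.executable, "-c", WORKER, site])
```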

@codelucas
Owner

@raspooti there is no way around this at the moment.

I tried it in a setting with scarce resources (1 GB of RAM), and I have the impression that memory keeps growing build after build until ... memory error.

Yeah, things can be improved, but this tool downloads tons of articles along with their related data, so it's bound to consume a lot of memory. That is especially true if you keep everything in Python and keep growing the memory used.

Until a better solution comes up, you can wrap a new shell script for every 1000 articles or so and run them on cron (not in parallel).
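As a rough illustration of that workaround (file names, the batch size, and the cron schedule are placeholders, not part of newspaper's API), each cron run processes one batch of URLs and then exits, so memory is released between batches:

```python
# scrape_batch.py -- sketch only. Run from cron, e.g.:
#   */30 * * * * /usr/bin/python /path/to/scrape_batch.py
# Each invocation handles one batch of ~1000 URLs and then exits,
# releasing all of its memory before the next run.
import newspaper

BATCH_SIZE = 1000

with open("remaining_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

batch, rest = urls[:BATCH_SIZE], urls[BATCH_SIZE:]

with open("articles.tsv", "a") as out:
    for url in batch:
        article = newspaper.Article(url)
        article.download()
        article.parse()
        out.write(article.title.replace("\t", " ") + "\t" + url + "\n")

# Leave the remaining URLs for the next cron run.
with open("remaining_urls.txt", "w") as f:
    f.write("\n".join(rest))
```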
