An internet crawler that digs the resources of the PAN Biblioteka Gdańska

vevurka/pbc

PAN Kreator bot

PAN Kreator bot is an internet crawler that digs through the resources of PAN Biblioteka Gdańska and posts interesting results on Twitter and Facebook.

The bot uses the OAI-PMH API to connect to pbc.gda.pl and perform a query. A matching record is downloaded, unzipped and converted from DjVu to JPG. Finally, the image is posted on Twitter.
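As a rough illustration of the query step, the sketch below builds an OAI-PMH `ListRecords` request URL and pulls identifiers and titles out of a response. It is not the bot's actual code: the endpoint path in `BASE_URL`, the function names and the sample XML are all assumptions made for this example; only the host pbc.gda.pl and the use of OAI-PMH come from the description above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint path; the README only names the host pbc.gda.pl.
BASE_URL = "https://pbc.gda.pl/dlibra/oai-pmh-repository.xml"

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return the URL of a ListRecords request (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        records.append((identifier, title))
    return records


# Made-up response used only to exercise the parser without network access.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url(BASE_URL))
    print(parse_records(SAMPLE_RESPONSE))
```

Downloading, unzipping and converting the matched publication are left out here, since the description does not say which tools the bot uses for those steps.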

But this is only part of what the bot can do. He uses a machine learning algorithm (a Support Vector Machine) to get an idea of the content of the downloaded book. He can tell the difference between text, a blank page and an image (preferably a figure). The bot goes through all the pages of a book and picks only those that are worth posting from his point of view. When he reaches the end of the book, he chooses the page that seems to contain the highest percentage of images.
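The final selection step could look roughly like the sketch below. It assumes the classifier has already produced, for every page, a label and some score of how image-like the page is; the `PagePrediction` type, the `image_score` field and the function name are hypothetical and are not taken from the bot's source.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PagePrediction:
    page_number: int
    label: str          # one of "text", "blank", "image"
    image_score: float  # hypothetical score, higher means more image content


def pick_page_to_post(pages: Sequence[PagePrediction]) -> Optional[PagePrediction]:
    """Return the image page with the highest score, or None if there is none."""
    candidates = [page for page in pages if page.label == "image"]
    if not candidates:
        return None
    return max(candidates, key=lambda page: page.image_score)


if __name__ == "__main__":
    book = [
        PagePrediction(1, "blank", 0.02),
        PagePrediction(2, "text", 0.10),
        PagePrediction(3, "image", 0.85),
        PagePrediction(4, "image", 0.60),
    ]
    print(pick_page_to_post(book))
```

Returning `None` for a book with no image pages lets the caller skip the post and move on to the next record.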

How does he know what to look for?

The bot was initially taught by a human to distinguish three categories of pages. We used a training set of 368 images that contained different kinds of content.

For example, this was marked as text (which we don't want to publish on Twitter):

this as a blank page (also not very interesting):

but this as an image, because it contains something different and possibly worth showing:

The effectiveness of the image recognition is quite hard to predict, but it makes the results of the bot's work interesting.

To check what PAN Kreator has found recently, please visit his Twitter or Facebook page.

https://twitter.com/PAN_Kreator

https://www.facebook.com/pankreatorbot/

Please follow him if you like this!
