
Developer Website Tutorial


Tutorial for adding a Website Scraper

This tutorial will give a basic introduction on how to add a new website scraper for HakuNeko. It will use the website beta.mangazuki.co as reference. This website may change in the future, so be aware that the outcome of this tutorial may no longer work; however, the concepts will still apply.

1. Requirements

To follow this tutorial, the following requirements shall be met:

  • HakuNeko is ready for local development

  • Advanced knowledge of web-development (HTTP protocol, HTML, JavaScript, JSON, CSS selectors, …​)

  • Experience with browser integrated developer tools (Inspector, Console, Network Monitor, …​)

  • Good understanding of object oriented programming (OOP)

2. Determine an Identifier for the new Website Scraper

The first thing when adding a new website is to select an ID. The ID is a string that will be used in various places to reference this scraper (e.g. in bookmarks). Since the ID cannot be changed at a later time (it would break functionality), select the ID carefully.

Furthermore the ID shall obey the following rules:

  • The ID must be unique (no other website scraper must use the same ID)

  • The ID must only contain lowercase letters, numbers and the characters _ and -

  • The ID must always start with a lowercase letter (this restriction will be lifted with HakuNeko 5.0)

In this tutorial we will use the ID mangazuki-archive [1]
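
The rules above can be captured in a single regular expression. As an illustration only (this check is not part of HakuNeko's code base), a candidate ID could be validated like this:

// Illustration only: validate a candidate ID against the rules above
const isValidID = id => /^[a-z][a-z0-9_-]*$/.test( id );
isValidID( 'mangazuki-archive' ); // true
isValidID( 'Mangazuki' );         // false (uppercase letters are not allowed)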

3. Preparation of the required Files

To add a new website scraper we need to create two new files and modify one existing file.

3.1. Business Logic for the Website Scraper

The first thing to do is to create a new file that will contain all the scraping logic. Technically the location of the file doesn’t matter, but for consistency all website scrapers shall be placed in the /web/lib/hakuneko/engine/base/connectors directory of the repository. The filename doesn’t matter either, but to make it easier for developers it is recommended to use the ID as filename, appended with the .html extension.

Now let’s create the file /web/lib/hakuneko/engine/base/connectors/mangazuki-archive.html with a bare minimum of boilerplate code, which will be discussed in more detail within the next chapters:

mangazuki-archive.html
<link rel="import" href="../connector.html">

<script>
    class MangaZukiArchive extends Connector {
        constructor() {
            super();
            super.id    = 'mangazuki-archive';
            super.label = 'Mangazuki Archive';
            this.tags   = [ 'webtoon', 'high-quality', 'multi-lingual', 'scanlation' ];
            this.url    = 'https://beta.mangazuki.co';
        }

        _getMangaList( callback ) {
            callback( new Error( 'Not implemented!' ), undefined );
        }

        _getChapterList( manga, callback ) {
            callback( new Error( 'Not implemented!' ), undefined );
        }

        _getPageList( manga, chapter, callback ) {
            callback( new Error( 'Not implemented!' ), undefined );
        }
    }
</script>

3.2. Logo for the Website Scraper

Most websites have an icon / logo for better recognition. You may add this icon / logo, so it will also be shown in HakuNeko. First head over to the website in the browser and search for the icon. There are various approaches to find and extract the icon / logo:

  • Saving it directly from the website using the browser

  • Searching the source code of the website for certain images such as favicon, touch icon, …​ (see the console snippet below)

  • Observing the network traffic for images

  • Trying to append the default favicon name to the website domain, e.g. https://beta.mangazuki.co/favicon.ico
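
When searching the source code, the browser’s developer console can help to list the icon candidates declared in the page head. A minimal sketch, to be run in the console on the opened website:

// List all icon candidates declared in the <head> of the current page
Array.from( document.querySelectorAll( 'link[rel*="icon"]' ) ).map( link => link.href );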

Now download the icon / logo to your machine. The icon / logo should be small (1~16 KB @ 64x64 pixel) to prevent performance loss. If necessary, use an image editing software (e.g. GIMP) to modify or resize it. Then store the icon / logo in the /web/img/connectors directory and use the ID as filename, without any file extension.

For our tutorial the file /web/img/connectors/mangazuki-archive has been added.

Figure 1. Final PNG logo @ 64x64

3.3. Register the Website Scraper to the Website List

The last part is to make the new website scraper available by including it in the list of supported websites. Open the file /web/lib/hakuneko/engine/base/connectors.html. At the top of the file there is already a long list of includes for other website scrapers; just add the one we created. If possible, keep the includes in alphabetical order.

In this tutorial the line <link rel="import" href="connectors/mangazuki-archive.html"> has been inserted.

connectors.html
<link rel="import" href="connectors/mangayosh.html">
<link rel="import" href="connectors/mangazuki-archive.html">
<link rel="import" href="connectors/mangazuki-info.html">

We just included the file, but we still need to create an object instance. This is done in the constructor(): we simply create a new instance and register it with the list of supported websites. As with the includes, it would be nice to keep the register calls in alphabetical order.

connectors.html
this.register( new MangaWindow() );
this.register( new MangaZukiArchive() );
this.register( new ManhuaBox() );

4. The Website Scraper Structure

Congratulations, at this point the website scraper is part of HakuNeko. You may run the npm start command to fire up HakuNeko and ensure the availability of the new website scraper. If something went wrong, an error will be shown in HakuNeko’s developer console exposing more details. Still, the scraper is not yet functional; trying to interact with it would just show an error. This chapter will provide more details on how the website scraper is supposed to be implemented.

💡
After starting HakuNeko leave the application running so you can reload it by pressing the F5 key (while the developer console is focused). This will make it easier to update the application after making some changes to one of the source code files.

4.1. The Base Class

First let’s take a closer look into the created file /web/lib/hakuneko/engine/base/connectors/mangazuki-archive.html. All website scrapers are based on the Connector class, which contains many predefined fields and methods that can be re-used or overwritten. In short, a website scraper is nothing more than a specialization of the connector. To inherit from the Connector, it must first be included in the website scraper file. After that a new class is created that extends the Connector class.

mangazuki-archive.html
<link rel="import" href="../connector.html">

<script>
    class MangaZukiArchive extends Connector {
        /* non-relevant code */
    }
</script>

TODO: Quick Reference for significant methods of the base class

4.2. Initialization

Now let’s take a closer look into the constructor. This is the place where initial fields will be overwritten. First we call super(), which invokes the constructor() of the base class. Then we can start overwriting fields. The following fields are mandatory:

  • The id field is the unique identifier which has already been discussed

  • The label field is the name of the website that will be shown in the UI

  • The tags field is a list of words describing the website (used for filtering); it should at least contain the type (e.g. manga, webtoon, hentai, …​) and the language (e.g. english, japanese, korean, …​)

  • The url field is the URL of the website which is used as source, this is also the link that will be used when opening the website from within HakuNeko (e.g. for manual website interaction)

💡
Try to re-use tags from other website scrapers to avoid increasing the number of tags. If you introduce a new tag that is not yet used anywhere else, it will be automatically added to the selection filter in the UI.
mangazuki-archive.html
constructor() {
    super();
    super.id    = 'mangazuki-archive';
    super.label = 'Mangazuki Archive';
    this.tags   = [ 'webtoon', 'high-quality', 'multi-lingual', 'scanlation' ];
    this.url    = 'https://beta.mangazuki.co';
}

4.3. Get the Manga List from the Website

Let’s discuss the first method we are overwriting from the base class: _getMangaList( callback ). This method is invoked whenever HakuNeko requests the manga list, e.g. when the user clicks on the synchronize button. It will receive a callback function as parameter, which will relay our scraping result back to HakuNeko when our operation is complete.

The callback( error, mangaList ) function requires two parameters. The first parameter error must be set to an object of type Error when we failed to get the manga list, or to null on success. The second parameter is the list of mangas that we scraped from the website. The list is a collection of simple manga objects, where each object has an id and a title property, both of type string. The id must be set to something that we can use later to determine the chapters belonging to the manga; usually this would be the path of the URL for the manga page on the website (e.g. /series/perfect-half). The title is the name of the manga that will be shown in the UI and is also the name of the folder when downloading chapters.

ℹ️
Keep in mind that the implementation shown in this tutorial is just an example. It is completely up to the individual developer how to implement the body of the _getMangaList( callback ) method. In the end all that matters is that the callback function is called with the error and mangaList parameters.

Let’s start with a simple fake list to see how this works. First we create a dummy manga list and then we relay it to HakuNeko. Change the corresponding method of our website scraper file to the following code:

mangazuki-archive.html
_getMangaList( callback ) {
    let mangaList = [
        {
            id: 'dragonball',
            title: 'Tutorial: Dragonball'
        },
        {
            id: 'onepiece',
            title: 'Tutorial: One Piece'
        }
    ];
    callback( null, mangaList );
}

After changing the source code, start HakuNeko (or reload the application if already running), select our added website scraper and click the synchronize button. The result should look like this:

Figure 2. List with fake entries

So far so good, but now it’s time for the real deal. Head over to your favorite browser and open the website. On the website, let’s search for a page that looks like it lists mangas. For our tutorial https://beta.mangazuki.co/series/list looks very promising. Now inspect the page and find a way to extract the manga titles and IDs that we need for our manga list. This is where your web-development skills and browser-tools experience are required. In this case a simple CSS selector such as div#root div.container table.table tbody tr td:nth-of-type(2) a seems very appropriate to get the elements that we are after.

Figure 3. Website investigation

It will provide the URL for each manga, which can be used as id, and also the name, which can be used as title. Fortunately the base class provides the fetchDOM( request, query, retries ) convenience method to fetch a web page, build the DOM, run a CSS query and return a Promise with the list of nodes when resolved, or an error when rejected. This method requires an object of type Request as first parameter. To create a request we need two things: a URL and some options to configure the request. The URL is known; it can be assembled by this.url + '/series/list' (remember the initialization of the url field in the constructor). The base class also provides some default requestOptions which can be used. The only thing left is to convert the node list into a list of manga objects, which can be done by simple mapping. Since we don’t know what type of link we get in the anchor element (FQN, absolute, relative, protocol independent, …​), we could use another convenience method from the base class. The method getRootRelativeOrAbsoluteLink( reference, base ) will return a root relative path if the link points to the hostname of the given base URL, otherwise the fully qualified name will be returned. Putting all this together and turning it into code looks like this:

mangazuki-archive.html
_getMangaList( callback ) {
    let request = new Request( this.url + '/series/list', this.requestOptions );
    this.fetchDOM( request, 'div#root div.container table.table tbody tr td:nth-of-type(2) a' )
    .then( data => {
        let mangaList = data.map( element => {
            return {
                id: this.getRootRelativeOrAbsoluteLink( element, request.url ),
                title: element.text.trim()
            };
        } );
        callback( null, mangaList );
    } )
    .catch( error => {
        console.error( error, this );
        callback( error, undefined );
    } );
}
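
To make the behavior of getRootRelativeOrAbsoluteLink( reference, base ) more tangible, here is an illustration with hypothetical links, assuming the behavior described above:

// Illustration with hypothetical links:
this.getRootRelativeOrAbsoluteLink( element, request.url );
// anchor pointing to 'https://beta.mangazuki.co/series/perfect-half'
//   => '/series/perfect-half' (same hostname as the base URL, root relative path)
// anchor pointing to 'https://cdn.example.org/series/perfect-half'
//   => 'https://cdn.example.org/series/perfect-half' (different hostname, fully qualified name)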

Now start or reload HakuNeko again and synchronize the manga list. Instead of the fake entries, the mangas from the website are now shown.

4.3.1. Get paginated Manga List

The previous example showed the happy path, where all mangas are listed on a single page. In reality this is a rare case. Most websites use pagination, and some websites do not even provide a list at all. In this section the method will be improved to load a manga list from multiple pages. Since fate has decided that our website for the tutorial also has more than a single page that provides a manga list (what a coincidence), we will just go on with this website. There is actually another manga list for RAWs, so overall we have two pages with mangas, /series/list and /raws/list, both of them having the same DOM structure so we can use the same CSS query.

The idea is quite simple: instead of getting the manga list from a single page, we get all manga lists from multiple pages and join them together. For this purpose we will introduce a helper method that recurses through each entry of a given list of manga page links and merges the result of each page. The helper method must not make parallel requests, or the server may be overloaded, which may lead to a temporary server crash or getting IP banned. The final helper method will look very similar to the previously implemented method to get the manga list, except that we now added a list of page links and an abort condition for the recursion. Take your time to understand the following implementation; the comments shall help to grasp what’s going on.

mangazuki-archive.html
_getMangaListFromPages( mangaPageLinks, index ) {
    // the index to determine which link from the given list shall be processed
    index = index || 0;
    let request = new Request( this.url + mangaPageLinks[ index ], this.requestOptions );
    return this.fetchDOM( request, 'div#root div.container table.table tbody tr td:nth-of-type(2) a', 5 )
    .then( data => {
        // get the manga list for this certain page
        let mangaList = data.map( element => {
            return {
                id: this.getRootRelativeOrAbsoluteLink( element, request.url ),
                // append 'RAW' text to title to distinguish between translated title with same name
                title: element.text.trim() + ( request.url.includes( '/raw' ) ? ' (RAW)' : '' )
            };
        } );
        // abort condition
        if( index < mangaPageLinks.length - 1 ) {
            // the end of the list is not reached,
            // so we merge the manga list with the one from the next page
            return this._getMangaListFromPages( mangaPageLinks, index + 1 )
            .then( mangas => mangaList.concat( mangas ) );
        } else {
            // the end of the list is reached, just return the manga list
            return Promise.resolve( mangaList );
        }
    } );
}

Finally let’s adjust the method that overwrites the method from the base class. Since the logic has now been moved to the new helper method, we just need to launch the helper method with a list of all manga pages and patiently wait for the result to relay it to the callback. In this example the manga page links are known, but many websites use pagination with hundreds of page links. For those we would first try to extract the number of pages and then dynamically generate the page links which are relayed to our helper method; a sketch of this is shown after the following code.

mangazuki-archive.html
_getMangaList( callback ) {
    this._getMangaListFromPages( [ '/series/list', '/raws/list' ] )
    .then( data => {
        callback( null, data );
    } )
    .catch( error => {
        console.error( error, this );
        callback( error, undefined );
    } );
}
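
Such a dynamic page link generation could look like the following sketch. Note that the pager CSS query and the ?page= URL scheme are made up for illustration and do not belong to the tutorial website:

_getMangaList( callback ) {
    // Sketch with a made up URL scheme: first determine the number of pages,
    // then generate the page links and relay them to the helper method
    let request = new Request( this.url + '/series/list', this.requestOptions );
    this.fetchDOM( request, 'ul.pagination li:last-of-type a' )
    .then( data => {
        // hypothetical pager: the last pager link holds the total page count
        let pageCount = parseInt( data[ 0 ].text.trim() );
        let pageLinks = [ ...new Array( pageCount ).keys() ].map( page => '/series/list?page=' + ( page + 1 ) );
        return this._getMangaListFromPages( pageLinks );
    } )
    .then( data => callback( null, data ) )
    .catch( error => {
        console.error( error, this );
        callback( error, undefined );
    } );
}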

Time to check if our hard work paid off. Start or reload HakuNeko and synchronize the manga list for our website again. The list should now show more entries than before, and many of them should end with (RAW).

4.4. Get the Chapter List from the Website

At this point HakuNeko can get the manga list from the website, but when selecting a manga, chapter loading fails (as expected). So the next step is to overwrite our second method from the base class: _getChapterList( manga, callback ). This method is invoked whenever HakuNeko requests the chapter list, e.g. when the user selects a manga from the manga list. It will receive the manga object and a callback function as parameters; the callback will relay our scraping result back to HakuNeko when our operation is complete.

The manga object is one of those from the list that we created in our first method, so we know it has an id and a title property. Based on the id of the manga object we now need to find the corresponding chapters (and that’s why it is so important to provide a useful id in the first method, because we are now the consumer of this property).

The callback( error, chapterList ) function requires two parameters. The first parameter error must be set to an object of type Error when we failed to get the chapter list, or to null on success. The second parameter is the list of chapters that we scraped from the website. The list is a collection of simple chapter objects, where each object has an id, a title and a language property, all of type string. The id must be set to something that we can use later to determine the images belonging to the chapter; usually this would be the path of the URL for the chapter page on the website (e.g. /series/perfect-half/75). The title is the number and/or name of the chapter that will be shown in the UI and is also the name of the folder/archive when downloading images.

ℹ️
Keep in mind that the implementation shown in this tutorial is just an example. It is completely up to the individual developer how to implement the body of the _getChapterList( manga, callback ) method. In the end all that matters is that the callback function is called with the error and chapterList parameters.

Let’s start again with a simple fake implementation to see how it works. First we create a dummy chapter list and then we relay it to HakuNeko. This time we add some spice by including manga-specific information such as the title. Change the corresponding method of our website scraper file to the following code:

mangazuki-archive.html
_getChapterList( manga, callback ) {
    let chapterList = [
        {
            id: manga.id + '/ch1',
            title: manga.title + ' - Chapter 001'
        },
        {
            id: manga.id + '/ch2',
            title: manga.title + ' - Chapter 002'
        }
    ];
    callback( null, chapterList );
}

After changing the source code, start HakuNeko (or reload the application if already running), select our added website scraper and select any of the manga from the list. The result should look like this:

Figure 4. List with fake entries

I guess you know the drill already. Time to implement this method for real, so let’s open and investigate any chapter page on the website. After a brief analysis I came up with the div#root div.column table.table tbody tr td:nth-of-type(2) a CSS query to find the chapter links on the page. We already discussed all the related stuff in Get the Manga List from the Website, except the language property, which was not present in the manga object. This property is exclusively used for filtering the chapter list by language in the UI; usually you can just assign an empty string to it. It would only be helpful for websites that have chapters in different languages for the same manga (e.g. MangaDex). In this case you also need to extract the language for each chapter from the website. It’s time to update our method to get the chapter list.

mangazuki-archive.html
_getChapterList( manga, callback ) {
    let request = new Request( this.url + manga.id, this.requestOptions );
    this.fetchDOM( request, 'div#root div.column table.table tbody tr td:nth-of-type(2) a' )
    .then( data => {
        let chapterList = data.map( element => {
            return {
                id: this.getRootRelativeOrAbsoluteLink( element, request.url ),
                title: element.text.replace( manga.title, '' ).trim(),
                language: ''
            };
        } );
        callback( null, chapterList );
    } )
    .catch( error => {
        console.error( error, manga );
        callback( error, undefined );
    } );
}

To verify if the chapters are correctly loaded, start or reload HakuNeko and select our added website scraper and select any of the manga from the list. Instead of the fake entries, the chapters from the manga are now shown.

4.4.1. Get paginated Chapter List

This would be analogous to Get paginated Manga List, by introducing a recursive helper method such as _getChapterListFromPages( manga, chapterPageLinks, index ). Since the website for our tutorial does not have paginated chapters and I’m convinced you are smart enough to figure out how this would be done, I will take the liberty to skip the detailed walkthrough; a rough sketch of such a helper is shown below.
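
For reference, such a helper could look like the following sketch, directly mirroring _getMangaListFromPages (the CSS query is the one used for the chapter list above; the paginated chapter page links themselves are hypothetical for this website):

_getChapterListFromPages( manga, chapterPageLinks, index ) {
    // the index to determine which link from the given list shall be processed
    index = index || 0;
    let request = new Request( this.url + chapterPageLinks[ index ], this.requestOptions );
    return this.fetchDOM( request, 'div#root div.column table.table tbody tr td:nth-of-type(2) a', 5 )
    .then( data => {
        // get the chapter list for this certain page
        let chapterList = data.map( element => {
            return {
                id: this.getRootRelativeOrAbsoluteLink( element, request.url ),
                title: element.text.replace( manga.title, '' ).trim(),
                language: ''
            };
        } );
        if( index < chapterPageLinks.length - 1 ) {
            // merge the chapter list with the one from the next page (sequential requests)
            return this._getChapterListFromPages( manga, chapterPageLinks, index + 1 )
            .then( chapters => chapterList.concat( chapters ) );
        } else {
            return Promise.resolve( chapterList );
        }
    } );
}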

4.5. Get the Page List from the Website

Almost there, but we still need to add the functionality to download the images for a chapter. For this, the method _getPageList( manga, chapter, callback ) needs to be filled with something useful. This method is invoked whenever HakuNeko requests the page list, e.g. when the user wants to preview or download a chapter from the chapter list. It will receive the manga object, the chapter object and a callback function as parameters; the callback will relay our scraping result back to HakuNeko when our operation is complete.

The manga object is one of those from the list that we created in our first method, and the chapter object is one of those from the list that we created in our second method, so we know both of them have the id and title property. Based on the id of the chapter and manga object we now need to find the corresponding images (and that’s why it is so important to provide useful ids in the first and second method, because we are now the consumer of these properties).

The callback( error, pageList ) function requires two parameters. The first parameter error must be set to an object of type Error when we failed to get the page list, or to null on success. The second parameter is the list of pages that we scraped from the website. The list is a collection of simple image links of type string.

ℹ️
Keep in mind that the implementation shown in this tutorial is just an example. It is completely up to the individual developer how to implement the body of the _getPageList( manga, chapter, callback ) method. In the end all that matters is that the callback function is called with the error and pageList parameters.
⚠️
When using a CSS query to find all image elements in the DOM, you need to use the tag source instead of img, because the fetchDOM method internally replaces img tags to improve performance.
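
For illustration, assuming a page where the images would intuitively be matched with a div.reader img selector (hypothetical), the query passed to fetchDOM would need to be:

this.fetchDOM( request, 'div.reader source' ); // 'source' instead of 'img'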

I spare you the fake implementation, so let’s jump directly to the browser and open any chapter of the website to investigate how to get those precious images. Unfortunately this time there are no image elements which could be found with a CSS query; instead the images are loaded by XHR and then drawn on a canvas, probably to avoid being scraped (you can discover this in the network monitor of your browser’s developer console). Peeking into the source reveals that the image links can be extracted from some JavaScript that is embedded into the page. Without going into details, here is the JavaScript that can be injected into the page (e.g. through the developer console of the browser):

Object.keys(__APOLLO_STATE__)
    .filter(property => property.startsWith('File:'))
    .map(property => new URL('/file/' + __APOLLO_STATE__[property]._id, location.origin).href)

Since we cannot use fetchDOM this time, it is the perfect opportunity to introduce the Engine.Request.fetchUI( request, script ) method, which is especially designed to cope with websites using complex JavaScript frameworks or trying hard to block manga downloaders such as HakuNeko. Basically this method will fire up a browser, navigate to the requested URL and then allow us to interact with the page. In our case we will use this method to inject the JavaScript from above to extract our image links directly from the page. There are not many differences compared to fetchDOM, except that we now inject JavaScript instead of a CSS query. Furthermore we will wrap the JavaScript into a Promise for two reasons: internally the fetchUI method works with promises as result, and the promise also provides better error handling for unexpected exceptions.

mangazuki-archive.html
_getPageList( manga, chapter, callback ) {
    let script = `
        new Promise( resolve => {
            let makeLink = property => {
                let path = '/file/' + __APOLLO_STATE__[property]._id;
                return new URL( path, location.origin ).href;
            };
            let result = Object.keys( __APOLLO_STATE__ );
            result = result.filter( property => property.startsWith( 'File:' ) );
            result = result.map( property => makeLink( property ) );
            resolve( result );
        } );
    `;
    let request = new Request( this.url + chapter.id, this.requestOptions );
    Engine.Request.fetchUI( request, script )
    .then( data => {
        callback( null, data );
    } )
    .catch( error => {
        console.error( error, chapter );
        callback( error, undefined );
    } );
}

And again, start or reload HakuNeko, select our added website scraper and pick any manga from the list. Now click on the preview button of any chapter in the chapter list and see if the images are loaded correctly. Click the download button of any other chapter to verify that the download is also working correctly. Congratulations, another website scraper has found its way into HakuNeko.

This chapter explains the use case when the image links cannot be used directly, because the images are encrypted or require a certain setup of the HTTP protocol (e.g. header fields such as referer).

COMING SOON

4.6. Realize Copy & Paste Support

COMING SOON


1. Despite the domain name beta.mangazuki.co, it contains only older releases which are not available on the current website. This might be an artifact of an attempt to update the website. Therefore the identifier mangazuki-archive was chosen.