wikiwalk

Shortest path algorithm between pages on Wikipedia. Let's say that you want to get from the page Exploding animal to Cucumber. This thing will find the set of shortest paths between them and return each path as a list of page titles.

Prior Art

  • Six Degrees of Wikipedia by Jacob Wenger - The bi-directional BFS algorithm is based on this project. There are a lot of differences between the implementations: SDOW is in Python and uses SQLite, while this is in Rust and uses a bespoke memory-mapped graph database that I made. SDOW's import process uses shell scripts rather than a parser combinator library, and as a result it is currently broken. Nonetheless, it's pretty brilliant.
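
For orientation, here is a minimal sketch of the bi-directional BFS idea in Rust. This is illustrative only, not wikiwalk's implementation: the names and the toy HashMap adjacency lists are invented stand-ins for the graph database described below, and it returns a single shortest path rather than the full set.

use std::collections::HashMap;

// Toy adjacency lists standing in for the graph database described below.
type Adjacency = HashMap<u32, Vec<u32>>;

// Expand one full BFS level from `frontier`. Returns a vertex reached by
// both searches as soon as one is found.
fn expand(
    edges: &Adjacency,
    frontier: &mut Vec<u32>,
    parents: &mut HashMap<u32, u32>,
    other_parents: &HashMap<u32, u32>,
) -> Option<u32> {
    let mut next = Vec::new();
    for &v in frontier.iter() {
        for &n in edges.get(&v).map(|ids| ids.as_slice()).unwrap_or(&[]) {
            if parents.contains_key(&n) {
                continue; // already reached from this side
            }
            parents.insert(n, v);
            if other_parents.contains_key(&n) {
                return Some(n); // the two searches have met
            }
            next.push(n);
        }
    }
    *frontier = next;
    None
}

// Returns one shortest path of page IDs from `source` to `target`, or None
// if the pages are not connected.
fn shortest_path(
    outgoing: &Adjacency,
    incoming: &Adjacency,
    source: u32,
    target: u32,
) -> Option<Vec<u32>> {
    if source == target {
        return Some(vec![source]);
    }
    let mut parent_fwd = HashMap::from([(source, source)]);
    let mut parent_bwd = HashMap::from([(target, target)]);
    let (mut frontier_fwd, mut frontier_bwd) = (vec![source], vec![target]);

    while !frontier_fwd.is_empty() && !frontier_bwd.is_empty() {
        // Search forward over outgoing links, then backward over incoming links.
        let mut meeting = expand(outgoing, &mut frontier_fwd, &mut parent_fwd, &parent_bwd);
        if meeting.is_none() {
            meeting = expand(incoming, &mut frontier_bwd, &mut parent_bwd, &parent_fwd);
        }
        if let Some(m) = meeting {
            // Walk from the meeting vertex back to the source, then on to the target.
            let mut path = vec![m];
            let mut v = m;
            while v != source {
                v = parent_fwd[&v];
                path.push(v);
            }
            path.reverse();
            v = m;
            while v != target {
                v = parent_bwd[&v];
                path.push(v);
            }
            return Some(path);
        }
    }
    None
}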

Graph Database

The adjacency list database of edges consists of two files:

vertex_al

Vertex Adjacency List

An array of records of edges in this format:

null        ⇒ 0 as u32
vertex      ⇒ u32
vertexes    ⇒ vertex+
edges       ⇒ outgoing_vertexes, null, incoming_vertexes, null

To determine which vertex each record belongs to, see below (vertex_al_ix)
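
For example (hypothetical page IDs, purely for illustration), a vertex with outgoing links to pages 2 and 3 and incoming links from pages 5 and 7 is stored as six u32 values:

// Hypothetical vertex_al record: outgoing links to pages 2 and 3,
// incoming links from pages 5 and 7.
const RECORD: [u32; 6] = [
    2, 3, // outgoing_vertexes
    0,    // null
    5, 7, // incoming_vertexes
    0,    // null
];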

vertex_al_ix

Vertex Adjacency List Index

An array of u64 indexed by vertex ID. Each u64 is the offset into vertex_al at which that vertex's edge data is located. So to load the record for the page with ID 1337:

  • Read the eight bytes at offset 1337 * 8 in vertex_al_ix; call them offset_bytes
  • Decode offset_bytes into a u64, offset
  • If offset == 0, the page either does not exist, or has no edges attached to it.
  • Otherwise, read the record at offset in vertex_al (see above)
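
A minimal sketch of that lookup in Rust, for illustration only: the names are invented, and little-endian byte order plus plain seek-and-read I/O are assumptions here rather than a description of wikiwalk's internals.

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Outgoing and incoming page IDs for one vertex.
struct Edges {
    outgoing: Vec<u32>,
    incoming: Vec<u32>,
}

// Follow the lookup steps above for `page_id`. Returns None when the index
// entry is 0 (page missing or no edges).
fn load_edges(ix: &mut File, al: &mut File, page_id: u32) -> std::io::Result<Option<Edges>> {
    // Read the eight-byte offset at page_id * 8 in vertex_al_ix.
    let mut offset_bytes = [0u8; 8];
    ix.seek(SeekFrom::Start(page_id as u64 * 8))?;
    ix.read_exact(&mut offset_bytes)?;
    let offset = u64::from_le_bytes(offset_bytes);
    if offset == 0 {
        return Ok(None);
    }

    // Read u32 page IDs until a 0 (null) terminator.
    fn read_list(f: &mut File) -> std::io::Result<Vec<u32>> {
        let mut ids = Vec::new();
        loop {
            let mut buf = [0u8; 4];
            f.read_exact(&mut buf)?;
            match u32::from_le_bytes(buf) {
                0 => return Ok(ids),
                id => ids.push(id),
            }
        }
    }

    // The record at `offset` in vertex_al: outgoing IDs, 0, incoming IDs, 0.
    al.seek(SeekFrom::Start(offset))?;
    let outgoing = read_list(al)?;
    let incoming = read_list(al)?;
    Ok(Some(Edges { outgoing, incoming }))
}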

Importing MySQL data for testing

If you want to spot-check some data from MySQL, it's faster to import the dumps without indexes. First, define the tables thusly:

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT 0,
  `pl_from_namespace` int(11) NOT NULL DEFAULT 0,
  `pl_target_id` bigint(20) unsigned NOT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=binary;

CREATE TABLE `redirect` (
  `rd_from` int(8) unsigned NOT NULL DEFAULT 0,
  `rd_namespace` int(11) NOT NULL DEFAULT 0,
  `rd_title` varbinary(255) NOT NULL DEFAULT '',
  `rd_interwiki` varbinary(32) DEFAULT NULL,
  `rd_fragment` varbinary(255) DEFAULT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED;

CREATE TABLE `page` (
  `page_id` int(8) unsigned NOT NULL,
  `page_namespace` int(11) NOT NULL DEFAULT 0,
  `page_title` varbinary(255) NOT NULL DEFAULT '',
  `page_is_redirect` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_is_new` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `page_random` double unsigned NOT NULL DEFAULT 0,
  `page_touched` binary(14) NOT NULL,
  `page_links_updated` varbinary(14) DEFAULT NULL,
  `page_latest` int(8) unsigned NOT NULL DEFAULT 0,
  `page_len` int(8) unsigned NOT NULL DEFAULT 0,
  `page_content_model` varbinary(32) DEFAULT NULL,
  `page_lang` varbinary(35) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;

CREATE TABLE `linktarget` (
  `lt_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `lt_namespace` int(11) NOT NULL,
  `lt_title` varbinary(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;

Then read in the dumps, skipping the DDL:

pv enwiki-*-pagelinks.sql | tail +34 | mysql wiki
pv enwiki-*-page.sql | tail +45 | mysql wiki
pv enwiki-*-redirect.sql | tail +38 | mysql wiki
pv enwiki-*-linktarget.sql | tail +34 | mysql wiki

Then build indexes:

create index page_title_ix on page (page_title);
create index page_id_ix on page (page_id);
create index page_namespace_ix on page (page_namespace);

To export for later:

select * from page into outfile '/tmp/wiki-page-dump';
select * from pagelinks into outfile '/tmp/wiki-pagelinks-dump';
select * from redirect into outfile '/tmp/wiki-redirect-dump';
select * from linktarget into outfile '/tmp/wiki-linktarget-dump';

Then compress and such:

sudo mv /tmp/wiki-*-dump ~/data
sudo chown $(id -u):$(id -g) ~/data/wiki-*-dump
zstd -T0 ~/data/wiki-*-dump

Then to import (assuming wiki-page-dump is on the server at some location):

LOAD DATA INFILE 'wiki-page-dump' INTO TABLE page;
LOAD DATA INFILE 'wiki-pagelinks-dump' INTO TABLE pagelinks;
LOAD DATA INFILE 'wiki-redirect-dump' INTO TABLE redirect;
LOAD DATA INFILE 'wiki-linktarget-dump' INTO TABLE linktarget;

SQLite Databases

There are two SQLite databases:

  • master.db - this is the persistent database with an unlimited lifetime. It is used to log performance and usage.
  • graph.db - this is generated by the build routine, and also contains a cache. The cache is here because it becomes useless when the rest of the graph is updated.

Queries

To find long paths:

SELECT
  sv.title AS 'source_title',
  tv.title AS 'target_title',
  coalesce(json_array_length(json_extract(path_data, '$[0]')),0) AS degrees
  FROM path p
  INNER JOIN vertexes sv ON p.source_page_id=sv.id
  INNER JOIN vertexes tv ON p.target_page_id=tv.id
  WHERE degrees >= 5
  ORDER BY degrees DESC
  LIMIT 20;

Database build process

The build process is a bit involved, so here's a high-level overview of how it's done in code.

  • "tool" downloads the latest Wikipedia dump from here. Specifically, the files are:

    • enwiki-[date]-pagelinks.sql.gz
    • enwiki-[date]-page_props.sql.gz (not currently used)
    • enwiki-[date]-page.sql.gz
    • enwiki-[date]-redirect.sql.gz
    • enwiki-[date]-linktarget.sql.gz
  • Read the page table, decompressing it and parsing on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a page ID and a page title. In batches, write these records to SQLite. They go in the vertexes table.

  • Read pagelinks, decompressing it and parsing on the fly. Filter out irrelevant namespaces (anything except 0). Each record contains a source page ID and a "link target" ID. Send each record down a channel.

  • Because pagelinks identifies the target only by a link target ID (while giving the source as a page ID), we need to resolve each target to a page ID. We do this by iterating over the channel from the step above, combining the lookups into chunks to reduce the number of queries, then resolving the IDs in SQLite. We then write the source and destination IDs to a very simple flat-file adjacency list implemented in "edge_db.rs".

  • This edge list is then duplicated: one copy for the incoming edges and one for the outgoing edges. This is because the algorithm is bi-directional, and the vertex_al file (see "Graph Database" above and the last step below) contains both incoming and outgoing edges.

  • Both of these edge data files are then sorted by their respective page IDs.

  • We then read the sorted edge data files, and write them to the vertex_al file. We also write an index file, vertex_al_ix, which is an array of offsets into the vertex_al file based on the page ID. This is used to quickly look up the edges for a given page ID.
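
A rough sketch of that last step, under simplifying assumptions: the per-page edge lists fit in memory as maps (the real build streams the sorted files instead), integers are little-endian, and a 4-byte pad keeps offset 0 reserved for "no record"; all names are invented for illustration.

use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufWriter, Write};

// Write vertex_al and vertex_al_ix from per-page edge lists.
// `outgoing` and `incoming` map page ID -> linked page IDs.
fn write_graph(
    outgoing: &BTreeMap<u32, Vec<u32>>,
    incoming: &BTreeMap<u32, Vec<u32>>,
    max_page_id: u32,
) -> std::io::Result<()> {
    let mut al = BufWriter::new(File::create("vertex_al")?);
    let mut ix = BufWriter::new(File::create("vertex_al_ix")?);

    // Pad the start of vertex_al so no real record begins at offset 0
    // (an assumption; offset 0 means "no record" per the index format).
    al.write_all(&0u32.to_le_bytes())?;
    let mut offset: u64 = 4;

    for page_id in 0..=max_page_id {
        let out = outgoing.get(&page_id);
        let inc = incoming.get(&page_id);
        if out.is_none() && inc.is_none() {
            // No edges: the index entry stays 0.
            ix.write_all(&0u64.to_le_bytes())?;
            continue;
        }
        // Index entry: where this page's record starts in vertex_al.
        ix.write_all(&offset.to_le_bytes())?;
        // Record: outgoing IDs, null, incoming IDs, null.
        for list in [out, inc] {
            for &id in list.map(|ids| ids.as_slice()).unwrap_or(&[]) {
                al.write_all(&id.to_le_bytes())?;
                offset += 4;
            }
            al.write_all(&0u32.to_le_bytes())?;
            offset += 4;
        }
    }
    al.flush()?;
    ix.flush()
}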

Prior to July 1, 2024, linktarget was unnecessary, and the way to generate the list of links (from Page ID to Page ID) was:

  • Read the page table exactly as above, writing page IDs and titles to the vertexes table in SQLite.
  • Read pagelinks, decompressing it and parsing on the fly, and filter out irrelevant namespaces (anything except 0). Each record contained a source page ID and a target page title. Send each record down a channel.
  • Because pagelinks specified the target by title rather than by link target ID, resolve the titles to page IDs directly: iterate over the channel, combine the titles into chunks to reduce the number of queries, and look up the IDs in SQLite. Write the source and destination IDs to the flat-file adjacency list in "edge_db.rs".
  • The remaining steps (duplicating the edge list into incoming and outgoing files, sorting them, and writing vertex_al and vertex_al_ix) were the same as described above.
