We map various wikis, maps, book catalogues and APIs into knowledge graphs of fictional worlds — so you can find out all about your favourite settings, no matter the medium.
The problem
Whether your setting is a tabletop RPG, a novel series, a comic-book universe, an MMO continent or a board-game world, it lives or dies on its metadata. Readers want to look up a creature, jump to a location, follow a character across books. Editors want to add a place once and have it surface everywhere it's referenced — novels, comics, sourcebooks, maps, events, organisations. None of that works without a clean, cross-linked catalogue underneath.
The complication is that the canonical sources for that catalogue live somewhere else. Tolkien lore lives on Tolkien Gateway and the LotR Fandom. Wheel of Time lore lives on the WoT Fandom. Sanderson's Cosmere, Pratchett's Discworld, Martin's Westeros — each has its own wiki, often two. Marvel and DC characters live on Comic Vine. Magic cards live on Scryfall, Pokémon on the Pokémon TCG API, video-game worlds on Fandom and IGN. Rulebooks live in book catalogues like Calibre, RPG Geek and BoardGameGeek. Maps with hand-placed markers live in interactive-map widgets on those same wikis. Each source has its own API (or none), its own quirks, its own rate limits, sometimes a bot challenge, and crucially its own identity for each entity.
Source families we have to handle
Each family poses a different ingestion problem, so the platform treats them as different shapes — but funnels their output into the same provenance and dedup machinery.
- MediaWiki-style wikis — paginated action=query APIs, predictable category trees, parser-output HTML with anchor-rich prose. Easy to walk, easy to extract cross-links from.
- Wikis behind anti-bot challenges — Cloudflare-gated, often the canonical source for a setting. Routed through an in-cluster headless-browser sidecar (Flaresolverr) so the rest of the pipeline doesn't have to care.
- Interactive maps — Fandom's map widget, standalone Leaflet exports, scanned atlas pages. Each marker becomes a location point; coordinates and the host map identity have to be preserved.
- Book catalogues — Calibre libraries, RPG Geek, BoardGameGeek, ProjectAON. A rulebook is just a book, but it carries cover art, series metadata, and external IDs that the editor wants searchable.
- Structured-data dumps — 5e.tools JSON mirrors, Scryfall bulk exports, Pokémon TCG API. Offline, indexable, refresh-by-rerun.
- REST APIs with proper auth — Comic Vine, World Anvil, Sphero Edu. Token-gated, paginated, JSON; the simplest shape.
- Static encyclopedia sites — pure HTML, no API. Hand-rolled indexers walk the category pages, extract entries, and stage them for fuzzy matching against the catalogue.
The platform has a small handler abstraction (one Python module per source), so adding a new family is a day's work, not a refactor.
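For the MediaWiki family, walking a category mostly comes down to following the API's continuation tokens. The sketch below is a minimal illustration, assuming list=categorymembers is the listing in use and that the caller supplies the HTTP layer. The names walk_category and Fetch are hypothetical, not the platform's own.

```python
from typing import Callable, Dict, Iterator

# Hypothetical fetcher type: takes query parameters, returns the decoded JSON body.
Fetch = Callable[[Dict[str, str]], dict]


def walk_category(fetch: Fetch, category: str, limit: int = 500) -> Iterator[str]:
    """Yield page titles in a category, following MediaWiki `continue` tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": str(limit),
        "format": "json",
    }
    cont: Dict[str, str] = {}
    while True:
        # The continuation values from the previous response are merged
        # back into the next request unchanged.
        data = fetch({**params, **cont})
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        cont = data.get("continue", {})
        if not cont:
            break
```

Keeping the HTTP call behind a callable means the same walker can run against a live wiki, the headless-browser sidecar, or a saved response on disk.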
The three-table linkage layer
Sitting next to the catalogue is a small, source-agnostic provenance store:
- MetadataCatalog — one row per external source. Slug, base URL, handler name. Seeded once.
- MetadataEntry — one row per (target snippet, source). Generic FK to the catalogue object, stores external_id, canonical_title, editor confirmation status.
- MetadataScrape — inline child of Entry. One row per fetch attempt: status (ok / 404 / error / rate-limited), the raw payload JSON, the handler version.
Three tables, one design rule: Entry is the join, Scrape is the history. The same character can have an entry against one wiki and against another, each with its own scrape series. The handler decides which fields to write back to the snippet on apply(), but the scrape payload is always preserved verbatim so we can replay it offline, diff against a future re-fetch, or recover when a source goes dark.
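The relationship between the three tables can be sketched with plain dataclasses. This is an illustration only, not the actual Django models: the target string stands in for the generic FK, and confirmed, record and latest_ok are assumed names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataCatalog:
    slug: str
    base_url: str
    handler_name: str


@dataclass
class MetadataScrape:
    status: str               # "ok", "404", "error" or "rate-limited"
    payload: Optional[dict]   # raw payload, kept verbatim
    handler_version: str


@dataclass
class MetadataEntry:
    catalog: MetadataCatalog
    target: str               # stand-in for the generic FK to a snippet
    external_id: str
    canonical_title: str
    confirmed: bool = False   # assumed name for editor confirmation status
    scrapes: List[MetadataScrape] = field(default_factory=list)

    def record(self, scrape: MetadataScrape) -> None:
        """Append a fetch attempt; history is never overwritten."""
        self.scrapes.append(scrape)

    def latest_ok(self) -> Optional[MetadataScrape]:
        """Return the most recent successful scrape, if any."""
        for scrape in reversed(self.scrapes):
            if scrape.status == "ok":
                return scrape
        return None
```

Because failed attempts are rows too, a rate-limited fetch never hides the last good payload.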
This pays for itself almost immediately. The first time a cross-extract goes wrong, the answer is usually in the saved payloads — not in re-hitting the network at the slow source's pace.
The metagraph of core objects
Eight first-class snippet types carry the catalogue's content — all Wagtail-revisioned, all translatable, all dedup-scoped to a setting. The setting page is the gravitational centre: every snippet pivots through one, every dedup decision is scoped by one. Inline orderables wire the cross-links (character → location, character → creature). Foreign keys wire the single-target relations (event → location, event → organisation). A polymorphic Generic-FK ties everything to MetadataEntry rows.
The ingest path
A new source becomes a sync run in three steps:
- Catalog seed. Add a row in MetadataCatalog (slug, base URL, handler name). Done once.
- Handler. Drop a Python module under game_core/sync/ — implement fetch(entry) → payload and apply(entry, payload) against the relevant snippet types. One wiki, one API, one offline mirror, one HTML scraper — they all wear the same interface.
- Importer command. For bulk ingestion, a category-walker or paginated walker threads each candidate through a shared engine that handles the load-bearing invariants — setting-scoped dedup, cross-setting collision suffixing, freshness skipping, MetadataEntry + Scrape stamping, and a cross-link pass against existing in-setting locations.
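The handler step can be pictured as a two-method protocol. The sketch below is hypothetical: only fetch and apply come from the description above. Entries are modelled as plain dictionaries, and OfflineMirrorHandler and sync are illustrative names.

```python
from typing import Any, Dict, Protocol

Entry = Dict[str, Any]
Payload = Dict[str, Any]


class Handler(Protocol):
    def fetch(self, entry: Entry) -> Payload: ...
    def apply(self, entry: Entry, payload: Payload) -> None: ...


class OfflineMirrorHandler:
    """Hypothetical handler reading from an in-memory mirror keyed by external_id."""

    def __init__(self, mirror: Dict[str, Payload]) -> None:
        self.mirror = mirror

    def fetch(self, entry: Entry) -> Payload:
        return self.mirror[entry["external_id"]]

    def apply(self, entry: Entry, payload: Payload) -> None:
        # The handler decides which fields are written back to the snippet.
        entry["snippet"]["title"] = payload["name"]


def sync(handler: Handler, entry: Entry) -> bool:
    """Fetch, stamp a scrape row, then apply. Returns True on success."""
    scrapes = entry.setdefault("scrapes", [])
    try:
        payload = handler.fetch(entry)
    except KeyError:
        scrapes.append({"status": "404", "payload": None})
        return False
    # The payload is stored before apply() so it survives a failed write-back.
    scrapes.append({"status": "ok", "payload": payload})
    handler.apply(entry, payload)
    return True
```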
The shared engine is the leverage. Anything multi-source plugs into it; anything single-source gets the same provenance, dedup, and cross-link behaviour for free.
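Three of those invariants (setting-scoped dedup, collision suffixing and freshness skipping) fit in a few lines. The sketch below is an assumption-heavy illustration: ImportEngine, slugify, the seven-day freshness window and the suffix-with-setting rule are all hypothetical choices, not the platform's actual behaviour.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Set, Tuple


def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.casefold()).strip("-")


class ImportEngine:
    def __init__(self, max_age: timedelta = timedelta(days=7)) -> None:
        # Records are keyed by (setting, title), so dedup never crosses settings.
        self.records: Dict[Tuple[str, str], dict] = {}
        self.slugs: Set[str] = set()
        self.max_age = max_age  # assumed freshness window

    def ingest(self, setting: str, title: str, now: datetime) -> Tuple[str, str]:
        key = (slugify(setting), slugify(title))
        record = self.records.get(key)
        if record is not None:
            if now - record["fetched_at"] < self.max_age:
                return "skipped", record["slug"]      # still fresh
            record["fetched_at"] = now
            return "refreshed", record["slug"]
        slug = key[1]
        if slug in self.slugs:
            # Same name already taken by another setting: suffix, don't merge.
            slug = f"{slug}-{key[0]}"
        self.slugs.add(slug)
        self.records[key] = {"slug": slug, "fetched_at": now}
        return "created", slug
```

With this shape, a "Dale" in one setting and a "Dale" in another stay distinct records, while a second source offering the same in-setting "Dale" is recognised as a duplicate.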
What we learned
Setting-scoping is non-negotiable. Sources copy each other's category trees, so the same entity often turns up in more than one source, even within a single wiki family. Without setting-scoped collision detection, an importer will either merge distinct entities silently or refuse the secondary source entirely. Neither is what you want.
Provenance pays for itself in days. Cross-link extraction failed the first time it ran against a new handler. The answer was in the saved scrape payloads — the handler had a payload-shape inconsistency. Re-running cross-link extraction against the existing payloads (no network) gave us the missing edges in under a minute. Without the saved payloads we'd have re-hit the slow source for hours.
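An offline replay of that kind might look like the sketch below. It assumes payloads hold MediaWiki action=parse output and that internal links use a /wiki/ prefix. As an example of a shape inconsistency, it tolerates both the formatversion=1 form (text as a dict with a "*" key) and the formatversion=2 form (text as a plain string). Whether that was the inconsistency in question is not stated, and all function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Set, Tuple
from urllib.parse import unquote


class AnchorCollector(HTMLParser):
    """Collect page titles from internal /wiki/ links."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/wiki/"):
            page = href[len("/wiki/"):].split("#", 1)[0]  # drop any fragment
            self.titles.append(unquote(page).replace("_", " "))


def payload_html(payload: dict) -> str:
    """Return the parsed HTML from either payload shape."""
    text = payload.get("parse", {}).get("text", "")
    if isinstance(text, dict):
        text = text.get("*", "")
    return text


def replay_cross_links(
    scrapes: Iterable[dict], known_locations: Set[str]
) -> Set[Tuple[str, str]]:
    """Rebuild (source title, location) edges from saved payloads, no network."""
    edges: Set[Tuple[str, str]] = set()
    for scrape in scrapes:
        if scrape["status"] != "ok":
            continue
        collector = AnchorCollector()
        collector.feed(payload_html(scrape["payload"]))
        for title in collector.titles:
            if title in known_locations:
                edges.add((scrape["source_title"], title))
    return edges
```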
Modelcluster's reverse manager doesn't persist on revisioned snippets. Wagtail's ClusterableModel + DraftStateMixin parents keep their ParentalKey children in memory until .save() is called on the parent, so naive parent.children.create(…) silently writes nothing. Going through the through-model's default manager (ThroughModel.objects.create(parent=…, …)) commits immediately. Subtle bug, costly if missed in a batch.
The headless-browser tier is the slow leg. Anti-bot challenges are real and not going away; routing those requests through a single shared sidecar means the rest of the pipeline keeps its simple HTTP shape, but you pay 5–15 seconds per fetch. Rate-limit gently; let the sync run overnight.
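A gentle rate limit in front of the sidecar can be as small as the sketch below. It assumes FlareSolverr's documented /v1 endpoint with the request.get command on its default port. Throttle, build_request and fetch_via_sidecar are hypothetical names, and the timeout values are placeholders rather than tuned settings.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

FLARESOLVERR_URL = "http://localhost:8191/v1"  # assumption: default port


class Throttle:
    """Enforce a minimum interval between calls; clock and sleep are injectable."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def build_request(url: str, max_timeout_ms: int = 60000) -> dict:
    """Body for a FlareSolverr GET through the headless browser."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def fetch_via_sidecar(url: str, throttle: Throttle, endpoint: str = FLARESOLVERR_URL) -> str:
    """Fetch one page through the sidecar and return the solved HTML."""
    throttle.wait()
    body = json.dumps(build_request(url)).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=90) as response:
        data = json.load(response)
    if data.get("status") != "ok":
        raise RuntimeError(data.get("message", "sidecar error"))
    return data["solution"]["response"]
```

Sharing one Throttle instance across all handlers that use the sidecar keeps the overall request rate bounded, whichever source is being synced.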
What's next
- More sources — every new community catalogue is a one-day handler. The interface is small enough that contributors can add their own.
- Richer cross-links — character-to-character (mentor / parent / slayer), location-to-organisation, book-to-event. The schema already supports the shape; the cross-link engine just needs to learn each anchor type.
- Editor confirmation UI — fuzzy matches against a wiki index land as unconfirmed MetadataEntries; the Wagtail admin shows candidates inline so editors can confirm or reject without leaving the snippet page.
- API consumers — every snippet is exposed via GraphQL; every MetadataScrape.payload is readable JSON. If you want to build a campaign-planner, a fan-wiki cross-reference, or an AI assistant grounded in canon, the schema and the data are sitting right there.
The platform is open — Wagtail snippets, public GraphQL endpoint, raw payloads exportable. Every source's content stays under its original licence (CC-BY-SA on Fandom, CC-BY-NC-SA on Tolkien Gateway, source-specific elsewhere); the editorial prose layered on top is under the site's overall content terms.