YAML2Graph: Turn YAML into a Knowledge Graph in minutes

By Ali Shamsaddinlou

Tags: knowledge-graph, registry, neo4j, yaml

Introduction

Modern enterprises depend on accurate, discoverable, and governed data to power analytics and AI initiatives. But setting up and maintaining a robust data catalog often requires stitching together multiple tools, writing custom APIs, and managing separate governance systems.

YAML2Graph changes that equation. It is a graph-native, YAML-driven application that generates a full-featured knowledge graph from configuration, complete with REST APIs, CLI tooling, customizable taxonomies, and federated governance out of the box.


YAML-First Philosophy

At its core, YAML2Graph relies on a simple YAML configuration to describe assets, schemas, lineage, and governance rules. From this configuration, which can live in a single file or be split across included files, the system spins up a fully operational knowledge graph.

Example YAML snippet:

assets:
  - name: sales.orders
    type: table
    description: "Customer orders dataset"
    schema:
      - name: order_id
        type: integer
      - name: customer_id
        type: integer
      - name: amount
        type: decimal
    taxonomy:
      domain: Finance
      sensitivity: Confidential
    lineage:
      upstream:
        - source: sales.raw_orders
      downstream:
        - layer: analytics.orders_summary
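
To make the mapping from YAML to graph concrete, here is a minimal sketch of how one asset definition could be turned into nodes and edges. The dictionary mirrors what a YAML parser would return for the snippet above; the function name `asset_to_graph`, the node shape, and the edge labels `HAS_COLUMN` and `FEEDS` are assumptions made for illustration, not YAML2Graph's actual internals.

```python
# Hypothetical sketch: this dict is what a YAML parser would produce for the
# snippet above. Node and edge shapes are assumptions, not the real model.
asset = {
    "name": "sales.orders",
    "type": "table",
    "schema": [
        {"name": "order_id", "type": "integer"},
        {"name": "customer_id", "type": "integer"},
        {"name": "amount", "type": "decimal"},
    ],
    "taxonomy": {"domain": "Finance", "sensitivity": "Confidential"},
    "lineage": {
        "upstream": [{"source": "sales.raw_orders"}],
        "downstream": [{"layer": "analytics.orders_summary"}],
    },
}


def asset_to_graph(asset):
    """Return (nodes, edges) for one asset definition."""
    name = asset["name"]
    # The asset node carries its taxonomy values as properties.
    nodes = [{"id": name, "label": asset["type"], **asset.get("taxonomy", {})}]
    edges = []
    # One node per column, linked back to the asset.
    for column in asset.get("schema", []):
        column_id = f"{name}.{column['name']}"
        nodes.append({"id": column_id, "label": "column", "type": column["type"]})
        edges.append((name, "HAS_COLUMN", column_id))
    lineage = asset.get("lineage", {})
    # Lineage edges point in the direction data flows.
    for ref in lineage.get("upstream", []):
        for target in ref.values():
            edges.append((target, "FEEDS", name))
    for ref in lineage.get("downstream", []):
        for target in ref.values():
            edges.append((name, "FEEDS", target))
    return nodes, edges


nodes, edges = asset_to_graph(asset)
```

The sketch yields one node for the table, one per column, and a lineage edge for each upstream and downstream reference.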

Graph-Native Foundation

Unlike traditional relational metadata stores, YAML2Graph is built around a graph model.

  • Lineage as a graph: Upstream and downstream dependencies are first-class citizens.
  • Extensibility: New node types (e.g., ML models, dashboards, pipelines) can be added without re-architecting.
  • Flexible queries: Graph traversal makes it easy to answer questions like "Which downstream assets are impacted if this table changes?" or "Show all datasets linked to Customer PII."

This graph foundation makes the catalog highly customizable and inherently suited for federated governance.
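
The impact question above ("Which downstream assets are impacted if this table changes?") reduces to a reachability query over lineage edges. The following sketch shows the idea with a plain breadth-first traversal; the edge list and the function name `downstream_of` are hypothetical, and a real deployment would presumably run an equivalent query in the graph database instead.

```python
from collections import deque

# Hypothetical lineage edges as (upstream, downstream) pairs.
EDGES = [
    ("sales.raw_orders", "sales.orders"),
    ("sales.orders", "analytics.orders_summary"),
    ("analytics.orders_summary", "dashboards.revenue"),
    ("crm.customers", "analytics.customer_360"),
]


def downstream_of(asset, edges):
    """Breadth-first traversal returning every asset reachable from `asset`."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([asset])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return seen


impacted = downstream_of("sales.orders", EDGES)
```

Here `impacted` contains the summary table and the dashboard built on it, while the unrelated `crm.customers` branch is left out.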


System Architecture & Flow

YAML2Graph runs in two phases: a bootstrap phase that builds a writer class from configuration, and a runtime phase in which callers use the generated methods. The sections below walk through each step.

Bootstrap Phase: RegistryFactory Initialization

The system starts by loading YAML configuration, validating it, and generating the necessary classes and methods dynamically.

Bootstrap Phase Diagram

Shows how RegistryFactory loads config, validates it, generates functions, and creates the final writer class
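
As a rough illustration of the validate-then-generate order described above, the sketch below checks a configuration and only then builds a class from it. The config layout (`entities` with `key_fields`) and the names `validate_config` and `bootstrap` are assumptions for this example; the article does not show RegistryFactory's real schema or API.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    entities = config.get("entities")
    if not isinstance(entities, dict) or not entities:
        return ["config must define at least one entity under 'entities'"]
    errors = []
    for name, spec in entities.items():
        # Entity names become part of method names, so they must be identifiers.
        if not name.isidentifier():
            errors.append(f"entity name {name!r} is not a valid identifier")
        if not (spec or {}).get("key_fields"):
            errors.append(f"entity {name!r} needs at least one key field")
    return errors


def bootstrap(config):
    """Validate first, then build a writer class that records its entities."""
    errors = validate_config(config)
    if errors:
        raise ValueError("; ".join(errors))
    # type(name, bases, namespace) creates a class at run time.
    return type("GeneratedWriter", (), {"entities": tuple(config["entities"])})


Writer = bootstrap({"entities": {"dataset": {"key_fields": ["name"]}}})
```

Failing fast at bootstrap means a malformed configuration is reported once, before any generated method can be called.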

Runtime Phase: Using Generated Methods

Once initialized, the generated methods handle URN generation, database operations, and aspect processing at runtime.

Runtime Phase Diagram

Shows the flow when calling upsert_dataset() and add_aspect() methods
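
The runtime flow can be pictured with a small stand-in. In this sketch a dictionary replaces the graph database, and the URN layout `urn:yaml2graph:<entity>:<name>` is invented for the example; only the method names `upsert_dataset()` and `add_aspect()` come from the description above, and their real signatures may differ.

```python
class InMemoryWriter:
    """Stand-in for a generated writer; a dict replaces the graph database."""

    def __init__(self):
        self.nodes = {}

    @staticmethod
    def make_urn(entity, name):
        # Hypothetical URN layout, chosen only for this example.
        return f"urn:yaml2graph:{entity}:{name}"

    def upsert_dataset(self, name, **properties):
        """Create the node if missing, otherwise merge in new properties."""
        urn = self.make_urn("dataset", name)
        node = self.nodes.setdefault(urn, {"properties": {}, "aspects": {}})
        node["properties"].update(properties)
        return urn

    def add_aspect(self, urn, aspect_name, payload):
        """Attach or overwrite a named aspect on an existing node."""
        if urn not in self.nodes:
            raise KeyError(f"unknown URN: {urn}")
        self.nodes[urn]["aspects"][aspect_name] = payload


writer = InMemoryWriter()
urn = writer.upsert_dataset("sales.orders", description="Customer orders dataset")
writer.upsert_dataset("sales.orders", owner="finance-team")  # same node, merged
writer.add_aspect(urn, "taxonomy", {"domain": "Finance"})
```

Because the URN is derived from the entity type and name, calling the upsert twice updates one node rather than creating a duplicate.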

Configuration Loading Flow

The system merges multiple YAML files into one configuration using an include mechanism and a deep merge.

Configuration Loading Diagram

Shows how RegistryLoader merges multiple YAML files into one configuration
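
A deep merge with includes can be sketched in a few lines. The version below works on already-parsed dictionaries keyed by file name, so it needs no YAML library; the `include` key, the function names, and the rule that a file's own keys override included ones are assumptions about how RegistryLoader might behave, not a description of its actual code.

```python
def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`, recursing into dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not combined
    return merged


def load_with_includes(name, files):
    """Resolve `include` entries depth-first, then apply the file's own keys."""
    document = dict(files[name])  # shallow copy so the input is not modified
    result = {}
    for included in document.pop("include", []):
        result = deep_merge(result, load_with_includes(included, files))
    return deep_merge(result, document)


# Hypothetical parsed files; on disk these would be separate YAML documents.
files = {
    "base.yaml": {"entities": {"dataset": {"key_fields": ["name"]}}},
    "main.yaml": {
        "include": ["base.yaml"],
        "entities": {"dataset": {"label": "Dataset"}, "dashboard": {"key_fields": ["id"]}},
    },
}
config = load_with_includes("main.yaml", files)
```

This sketch does not detect circular includes; a production loader would need to track visited files.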

Method Generation Flow

Dynamic method creation from configuration allows the system to generate upsert_*, get_*, and delete_* methods for each entity automatically.

Method Generation Diagram

Shows how Neo4jWriterGenerator creates upsert_*, get_*, and delete_* methods dynamically
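
One common way to generate per-entity methods in Python is to build closures and assemble them into a class with `type()`. The sketch below follows that pattern with an in-memory dictionary as the store. It is an illustration of the technique only: `build_writer_class` is a hypothetical name, and the real Neo4jWriterGenerator presumably issues database queries rather than dictionary operations.

```python
def build_writer_class(entity_names):
    """Create a class with upsert_*, get_*, and delete_* for each entity."""

    def make_methods(entity):
        # A factory function binds `entity` per call, which avoids the
        # late-binding pitfall of defining closures directly in a loop.
        def upsert(self, key, **properties):
            self.store.setdefault((entity, key), {}).update(properties)

        def get(self, key):
            return self.store.get((entity, key))

        def delete(self, key):
            return self.store.pop((entity, key), None) is not None

        return {
            f"upsert_{entity}": upsert,
            f"get_{entity}": get,
            f"delete_{entity}": delete,
        }

    def init(self):
        self.store = {}  # stand-in for the graph database

    namespace = {"__init__": init}
    for entity in entity_names:
        namespace.update(make_methods(entity))
    return type("GeneratedWriter", (), namespace)


Writer = build_writer_class(["dataset", "dashboard"])
writer = Writer()
writer.upsert_dataset("sales.orders", domain="Finance")
```

Adding a new entity to the list is enough to gain three new methods, which is the property the generation step relies on.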

Overall System Architecture

The complete system architecture shows how all components interact to provide a unified data catalog experience.

System Architecture Diagram

High-level view of the complete registry system architecture

Data Flow Overview

The data flow runs from configuration files, through the generated methods, to database operations.

Data Flow Diagram

Simplified view of how data flows through the system


Batteries Included

YAML2Graph comes with everything needed to be productive immediately:

REST API

  • GET /assets – list all assets
  • GET /assets/{name} – get details and schema
  • GET /lineage/{name} – fetch full lineage graph
  • Auto-generated OpenAPI/Swagger docs
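
To show how the three GET routes relate to the catalog, here is a framework-free sketch of route handling. The response shapes, the `handle_get` function, and the in-memory `CATALOG` are all assumptions; the article does not specify the real payloads, and this version returns only direct lineage neighbors rather than the full lineage graph.

```python
# Hypothetical in-memory catalog standing in for the graph backend.
CATALOG = {
    "sales.orders": {
        "type": "table",
        "schema": [{"name": "order_id", "type": "integer"}],
        "upstream": ["sales.raw_orders"],
        "downstream": ["analytics.orders_summary"],
    },
}


def handle_get(path):
    """Map a GET path to a (status, body) pair for the three routes above."""
    parts = path.strip("/").split("/")
    if parts == ["assets"]:
        return 200, sorted(CATALOG)
    if len(parts) == 2 and parts[0] in ("assets", "lineage"):
        asset = CATALOG.get(parts[1])
        if asset is None:
            return 404, {"error": f"unknown asset: {parts[1]}"}
        if parts[0] == "assets":
            return 200, {"name": parts[1], "type": asset["type"], "schema": asset["schema"]}
        # Direct neighbors only; a full lineage graph would traverse further.
        return 200, {"upstream": asset["upstream"], "downstream": asset["downstream"]}
    return 404, {"error": "no such route"}


status, body = handle_get("/lineage/sales.orders")
```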

CLI Tooling

  • yaml2graph list – show all registered assets
  • yaml2graph describe sales.orders – view schema, lineage, and taxonomy
  • yaml2graph lineage sales.orders – visualize dependencies
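
The command layout above maps naturally onto subcommands. The sketch below reproduces that layout with Python's standard `argparse` module; it is a guess at the shape of the interface, not the tool's actual implementation, and the argument name `asset` is hypothetical.

```python
import argparse


def build_parser():
    """Parser with the three subcommands listed above."""
    parser = argparse.ArgumentParser(prog="yaml2graph")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("list", help="show all registered assets")
    for name, text in [
        ("describe", "view schema, lineage, and taxonomy"),
        ("lineage", "visualize dependencies"),
    ]:
        sub = commands.add_parser(name, help=text)
        sub.add_argument("asset", help="asset name, e.g. sales.orders")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["describe", "sales.orders"])
```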

Storage & Indexing

  • Graph database backend for lineage and taxonomy
  • Embedded search for fast discovery

Custom Taxonomy & Data Model

Every organization has its own way of classifying and governing data. YAML2Graph supports:

  • Custom taxonomies (domains, sensitivity levels, lifecycle states, etc.)
  • Flexible data models: Add your own asset types (datasets, APIs, ML features, reports)
  • Schema evolution: Versioning and change history are built in

This ensures the catalog adapts to your enterprise vocabulary instead of forcing you into a rigid model.
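
A custom taxonomy is only useful if asset definitions are checked against it. The following sketch validates an asset's `taxonomy` block against an organization-defined vocabulary; the vocabulary values and the `check_taxonomy` function are hypothetical examples, since the article does not show how YAML2Graph declares or enforces taxonomies.

```python
# Hypothetical vocabulary an organization might define for itself.
TAXONOMY = {
    "domain": {"Finance", "Marketing", "Operations"},
    "sensitivity": {"Public", "Internal", "Confidential"},
}


def check_taxonomy(asset, taxonomy=TAXONOMY):
    """Return the problems found in an asset's taxonomy block."""
    problems = []
    for dimension, value in asset.get("taxonomy", {}).items():
        allowed = taxonomy.get(dimension)
        if allowed is None:
            problems.append(f"unknown taxonomy dimension: {dimension}")
        elif value not in allowed:
            problems.append(f"{dimension}={value!r} is not one of {sorted(allowed)}")
    return problems


ok = check_taxonomy({"taxonomy": {"domain": "Finance", "sensitivity": "Confidential"}})
bad = check_taxonomy({"taxonomy": {"domain": "Sales"}})
```

Adding a new dimension such as a lifecycle state means adding one entry to the vocabulary, with no change to the checking code.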


Federated Governance

Because the underlying model is graph-based and generic, governance can be federated across teams and domains:

  • Decentralized ownership – Teams manage their own YAML definitions, version-controlled in Git.
  • Central visibility – The graph model unifies these definitions into one enterprise-wide catalog.
  • Policy hooks – Governance rules (access control, classification, retention) can be defined once and enforced everywhere.

This aligns with data mesh principles, giving both autonomy and oversight.
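
The "defined once and enforced everywhere" idea can be sketched as a central list of policy functions applied to every team's assets. Everything here is illustrative: the `owner` field, the rule itself, and the `evaluate` function are assumptions, and the article does not describe the real policy hook interface.

```python
def require_owner_for_confidential(asset):
    """Example rule: confidential assets must name an owner."""
    confidential = asset.get("taxonomy", {}).get("sensitivity") == "Confidential"
    if confidential and not asset.get("owner"):
        return "confidential asset has no owner"
    return None


POLICIES = [require_owner_for_confidential]  # defined once, centrally


def evaluate(team_assets, policies=POLICIES):
    """Apply every policy to every team's assets; return the violations found."""
    violations = []
    for team, assets in team_assets.items():
        for asset in assets:
            for policy in policies:
                message = policy(asset)
                if message:
                    violations.append((team, asset["name"], message))
    return violations


# Hypothetical definitions owned by two separate teams.
team_assets = {
    "finance": [{"name": "sales.orders", "taxonomy": {"sensitivity": "Confidential"}}],
    "marketing": [{"name": "web.visits", "taxonomy": {"sensitivity": "Public"}}],
}
violations = evaluate(team_assets)
```

Teams keep ownership of their own definitions, while the same rule set runs across all of them.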


Agent Development Status

The following table shows the current development status of various lineage agents:

| Agent Name | Done | Under Development | In Backlog | Comment |
|------------|------|-------------------|------------|---------|
| python-lineage_agent | ✓ | | | |
| airflow_lineage_agent | ✓ | | | |
| java_lineage_agent | ✓ | | | |
| spark_lineage_agent | ✓ | | | |
| sql_lineage_agent | ✓ | | | |
| flink_lineage_agent | | | ✓ | |
| beam_lineage_agent | | | ✓ | |
| shell_lineage_agent | | | ✓ | |
| scala_lineage_agent | | | ✓ | |
| dbt_lineage_agent | | | ✓ | |


Why YAML2Graph Matters

  • Declarative & version-controlled – YAML as the single source of truth
  • Graph-native – naturally models relationships, lineage, and governance
  • All-in-one – REST API, CLI, and storage provided automatically
  • Extensible – customize taxonomy, add new data models, and extend governance easily
  • Federated by design – supports decentralized ownership with centralized policies

Conclusion

YAML2Graph is a YAML-first knowledge graph platform that turns simple YAML definitions into a fully operational, customizable, and governed data ecosystem. With REST APIs, CLI tooling, federated governance, and extensibility built in, it delivers a "batteries included" experience for modern data teams.

By uniting declarative configuration, graph flexibility, and governance controls, YAML2Graph helps organizations scale their data culture without scaling complexity.