YAML2Graph

Knowledge graph builder library that converts YAML definitions into a fully operational and customizable knowledge graph.

Python, FastAPI, Neo4j, Docker, Elasticsearch

One key use case is building a data catalog, but the framework is generic and extensible, so you can define entities, aspects, and relationships for any domain.

With automatic REST API and CLI tooling generation, YAML2Graph delivers a "batteries included" experience for quickly turning YAML into a production-ready knowledge graph.

Features

Generic Metadata Model Generator: Define flexible, extensible models with entities, aspects, and relationships
REST API Generator: Automatically expose FastAPI endpoints from your YAML registry
CLI Generator: Instantly get CLI commands derived from the registry
Type-Safe Code: Ensure correctness and reliability in data handling

Quick Start

RESTful API Generator

To auto-generate RESTful APIs instead of writing each endpoint by hand, run:

make generate-and-run-api

The examples folder contains sample curl calls against the API. To run them:

./examples/api_calls_examples.sh

CLI Generator

You can also auto-generate CLI tooling. Just run:

make generate-cli

The examples folder also contains sample CLI calls. To run them:

./examples/cli_calls_examples.sh

How YAML2Graph Works

YAML2Graph is a dynamic code generation system that creates knowledge graphs from YAML configuration files.

1. Registry Module

The core of YAML2Graph is the Registry module, which follows the registry design pattern. Think of it as a code factory that reads configuration (in this case, YAML files) and builds classes automatically.

2. API Generator

This module generates RESTful APIs. It is built on the FastAPI framework and wraps the methods generated by the Registry module in API endpoints.

3. CLI Generator

This module generates a command-line interface. It is built on the Click framework and wraps the methods generated by the Registry module in CLI commands.
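Both generators follow the same idea: inspect the methods of the generated writer and wrap each one. The sketch below shows that idea for the CLI side only. It is not the project's generator: it uses the standard-library `argparse` module in place of Click so that it runs without dependencies, and `DemoWriter`, `build_parser`, and `run` are hypothetical names introduced for illustration.

```python
import argparse
import inspect


class DemoWriter:
    """Hypothetical stand-in for the class the Registry module generates."""

    def upsert_dataset(self, platform: str, name: str, env: str) -> str:
        return f"urn:li:dataset:({platform},{name},{env})"

    def delete_dataset(self, urn: str) -> str:
        return f"deleted {urn}"


def build_parser(writer: object) -> argparse.ArgumentParser:
    """Create one subcommand per public method, one option per parameter."""
    parser = argparse.ArgumentParser(prog="demo-cli")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for method_name, method in inspect.getmembers(writer, inspect.ismethod):
        if method_name.startswith("_"):
            continue
        sub = subcommands.add_parser(method_name.replace("_", "-"))
        # Bound methods expose their parameters without `self`.
        for param in inspect.signature(method).parameters:
            sub.add_argument(f"--{param}", required=True)
        sub.set_defaults(handler=method)
    return parser


def run(writer: object, argv: list[str]) -> str:
    """Parse argv and dispatch to the matching writer method."""
    args = vars(build_parser(writer).parse_args(argv))
    handler = args.pop("handler")
    args.pop("command")
    return handler(**args)


print(run(DemoWriter(), ["upsert-dataset", "--platform", "mysql",
                         "--name", "users", "--env", "PROD"]))
# urn:li:dataset:(mysql,users,PROD)
```

Because the commands are derived by introspection, adding a method to the writer adds a command without any change to the CLI code.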

Why This Architecture is Powerful

This system essentially turns YAML configuration into a working knowledge graph backend at runtime! It provides:

Flexibility: Change data models without code changes
Consistency: All entities follow the same patterns
Maintainability: Business logic is separated from implementation
Extensibility: Easy to add new entity types and relationships
Type Safety: Generated code ensures proper data handling

Step-by-Step Process

Configuration Files (yaml2graph/config/ folder)

The system starts with YAML configuration files that define the data model:

main_registry.yaml: The main entry point that includes all other config files
entities.yaml: Defines what types of data objects exist (Dataset, DataFlow, CorpUser, etc.)
urn_patterns.yaml: Defines how to create unique identifiers (URNs) for each entity
aspects.yaml: Defines properties and metadata for entities (e.g., datasetProperties, dataflowProperties, etc.)
relationships.yaml: Defines how entities connect to each other (e.g., dataset -> dataflow)
utilities.yaml: Defines helper functions for data processing (e.g., data cleaning, data transformation, etc.)

Registry Loading (yaml2graph/registry/loaders.py)

Reads the main registry file
Merges all included YAML files into one big configuration
Handles file dependencies and deep merging
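The deep-merge step can be pictured with a small sketch. It works on plain dictionaries in place of parsed YAML files, and the `deep_merge` helper and the sample keys are illustrative assumptions rather than the loader's actual code.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which nested mappings are merged key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: recurse instead of overwriting.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of two included files, already parsed into dicts.
first_file = {"entities": {"dataset": {"properties": ["platform", "name"]}}}
second_file = {
    "entities": {"dataflow": {"properties": ["orchestrator"]}},
    "aspects": {"datasetProperties": {"type": "versioned"}},
}

registry = deep_merge(first_file, second_file)
print(sorted(registry["entities"]))
# ['dataflow', 'dataset']
```

A shallow `dict.update` would have replaced the whole `entities` mapping with the second file's version; merging recursively keeps entries from both files.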

Validation (yaml2graph/registry/validators.py)

Checks that all required sections exist
Validates configuration structure
Ensures everything is properly configured
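A minimal sketch of such checks is shown below. The list of required sections and the rule that every entity needs a URN pattern are assumptions inferred from the configuration files listed above, not the validator's actual rules, and `validate_registry` is a hypothetical name.

```python
# Assumed section names, taken from the config file names listed above.
REQUIRED_SECTIONS = ("entities", "urn_patterns", "aspects", "relationships")


def validate_registry(config: dict) -> list[str]:
    """Collect all problems rather than stopping at the first one."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            errors.append(f"missing required section: {section}")
        elif not isinstance(config[section], dict):
            errors.append(f"section {section!r} must be a mapping")

    # Assumed cross-check: every entity needs a matching URN pattern.
    entities = config.get("entities")
    patterns = config.get("urn_patterns")
    if isinstance(entities, dict) and isinstance(patterns, dict):
        for name in entities:
            if name not in patterns:
                errors.append(f"entity {name!r} has no URN pattern")
    return errors


problems = validate_registry(
    {"entities": {"dataset": {}}, "urn_patterns": {}, "aspects": {}}
)
for problem in problems:
    print(problem)
# missing required section: relationships
# entity 'dataset' has no URN pattern
```

Returning a list of messages lets a caller report every configuration problem in one pass before any code is generated.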

Code Generation (yaml2graph/registry/generators.py)

URNGenerator: Creates functions that generate unique identifiers
AspectProcessor: Creates functions that process entity metadata
UtilityFunctionBuilder: Creates helper functions for data cleaning/processing
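To make "creates functions" concrete, here is a sketch of a function factory for URNs. The `make_urn_builder` helper and the `{field}` template syntax are assumptions made for illustration; the real URNGenerator may represent patterns differently.

```python
from typing import Callable


def make_urn_builder(template: str, required: tuple[str, ...]) -> Callable[..., str]:
    """Return a function that fills `template` from keyword arguments."""

    def build_urn(**fields: str) -> str:
        missing = [name for name in required if name not in fields]
        if missing:
            raise ValueError(f"missing URN fields: {', '.join(missing)}")
        # Extra keyword arguments are ignored by str.format.
        return template.format(**fields)

    return build_urn


# Hypothetical pattern, as it might be read from urn_patterns.yaml.
dataset_urn = make_urn_builder(
    "urn:li:dataset:({platform},{name},{env})",
    required=("platform", "name", "env"),
)

print(dataset_urn(platform="mysql", name="users", env="PROD"))
# urn:li:dataset:(mysql,users,PROD)
```

Because the template lives in data, changing the URN format means editing the pattern string, not the builder.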

Class Generation (yaml2graph/registry/writers.py)

Takes all the generated functions and configuration
Dynamically creates a Python class called Neo4jMetadataWriter
This class has methods like:
- upsert_dataset(), get_dataset(), delete_dataset()
- upsert_dataflow(), get_dataflow(), delete_dataflow()
- And so on for each entity type
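The sketch below shows one way a class with per-entity methods can be assembled at runtime using Python's built-in `type()`. It is an illustration under stated assumptions: an in-memory dictionary stands in for Neo4j, and `build_writer_class`, `make_methods`, and `InMemoryMetadataWriter` are hypothetical names, not the project's code.

```python
def make_methods(entity: str, key_fields: tuple[str, ...]) -> dict:
    """Build upsert/get/delete functions for one entity type."""

    def key_of(props: dict) -> tuple:
        return tuple(props[field] for field in key_fields)

    def upsert(self, **props):
        self._store.setdefault(entity, {})[key_of(props)] = dict(props)
        return key_of(props)

    def get(self, **props):
        return self._store.get(entity, {}).get(key_of(props))

    def delete(self, **props):
        # True if a record was removed, False if nothing matched.
        return self._store.get(entity, {}).pop(key_of(props), None) is not None

    return {
        f"upsert_{entity}": upsert,
        f"get_{entity}": get,
        f"delete_{entity}": delete,
    }


def build_writer_class(entities: dict[str, tuple[str, ...]]) -> type:
    """Assemble a class whose methods are derived from the entity config."""

    def __init__(self):
        self._store = {}  # in-memory stand-in for the graph database

    namespace = {"__init__": __init__}
    for entity, key_fields in entities.items():
        namespace.update(make_methods(entity, key_fields))
    return type("InMemoryMetadataWriter", (), namespace)


# Hypothetical entity config: entity name -> fields that identify a record.
Writer = build_writer_class({
    "dataset": ("platform", "name", "env"),
    "dataflow": ("orchestrator", "flow_id"),
})

writer = Writer()
writer.upsert_dataset(platform="mysql", name="users", env="PROD")
print(writer.get_dataset(platform="mysql", name="users", env="PROD"))
# {'platform': 'mysql', 'name': 'users', 'env': 'PROD'}
```

Adding a third entry to the config dictionary would give the class three more methods with no other code changes, which is the property the generated writer relies on.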

Factory (yaml2graph/registry/factory.py)

Orchestrates the entire process
Creates the final writer class
Provides a simple interface to use the generated code

Example: How a Dataset Gets Created

Config says: "Dataset entities need platform, name, env, versionId properties"
URN Pattern says: "Dataset URNs should look like: urn:li:dataset:(platform,name,env)"
Generator creates: A function that builds URNs from the input data
Writer gets: A method upsert_dataset(platform="mysql", name="users", env="PROD")
Result: Creates a dataset node in Neo4j with the URN urn:li:dataset:(mysql,users,PROD)

Key Benefits

No hardcoded entity/aspect/relationship types: Add new entities/aspects/relationships by just editing YAML
Flexible URN patterns: Change how IDs are generated without touching code
Dynamic methods: New entity types automatically get upsert/get/delete methods
Configuration-driven: Business logic is in config files, not code
Maintainable: Changes to data model only require config updates

Getting Started

Visit the GitHub repository to get started with YAML2Graph.