YAML2Graph
YAML2Graph is a knowledge graph builder library that converts YAML definitions into a fully operational and customizable knowledge graph. While one key use case is building a data catalog, the framework is generic and extensible—making it easy to define entities, aspects, and relationships for any domain.
With automatic REST API and CLI tooling generation, YAML2Graph delivers a "batteries included" experience for quickly turning YAML into a production-ready knowledge graph.
Features
- Generic Metadata Model Generator: Define flexible, extensible models with entities, aspects, and relationships
- REST API Generator: Automatically expose FastAPI endpoints from your YAML registry
- CLI Generator: Instantly get CLI commands derived from the registry
- Type-Safe Code: Ensure correctness and reliability in data handling
Quick Start
RESTful API Generator
To auto-generate RESTful APIs instead of writing each endpoint by hand, run:
make generate-and-run-api
The examples folder contains sample curl calls against the API. To run them:
./examples/api_calls_examples.sh
CLI Generator
You can also auto-generate CLI tooling. Just run:
make generate-cli
Example CLI calls are provided in the examples folder. To run them:
./examples/cli_calls_examples.sh
How YAML2Graph Works
YAML2Graph is a dynamic code generation system that creates knowledge graphs from YAML configuration files.
1. Registry Module
The core of YAML2Graph is the Registry module, which follows the registry design pattern. Think of it as a code factory that reads configuration (here, YAML files) and builds classes automatically.
2. API Generator
This module is responsible for generating RESTful APIs. It is developed based on the FastAPI framework and leverages methods generated by the Registry module to build APIs around them.
3. CLI Generator
This module generates the command-line interface. It is built on the Click framework and uses the methods generated by the Registry module to build CLI commands.
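The idea of deriving one command per entity from the registry can be illustrated with a minimal sketch. It uses the standard library's argparse as a stand-in for Click so that it runs without extra dependencies; it is not the project's actual generator. The REGISTRY dictionary, the command names, and the dataflow fields are hypothetical.

```python
import argparse

# Hypothetical registry excerpt: entity name -> fields its command accepts.
# The dataset fields come from the example URN pattern; the dataflow fields are invented.
REGISTRY = {
    "dataset": ["platform", "name", "env"],
    "dataflow": ["orchestrator", "flow_id"],
}


def build_parser(registry):
    """Create one 'upsert-<entity>' subcommand per registry entry."""
    parser = argparse.ArgumentParser(prog="yaml2graph-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for entity, fields in registry.items():
        sub = subparsers.add_parser(f"upsert-{entity}")
        for field in fields:
            # Every field becomes a required --option on the subcommand.
            sub.add_argument(f"--{field}", required=True)
    return parser


parser = build_parser(REGISTRY)
args = parser.parse_args(
    ["upsert-dataset", "--platform", "mysql", "--name", "users", "--env", "PROD"]
)
print(args.command, args.platform, args.name, args.env)
```

Adding an entry to the registry dictionary is enough to get a new subcommand, which is the property the generated CLI relies on.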
Why This Architecture is Powerful
This system essentially turns YAML configuration into a working knowledge graph backend at runtime! It provides:
- Flexibility: Change data models without code changes
- Consistency: All entities follow the same patterns
- Maintainability: Business logic is separated from implementation
- Extensibility: Easy to add new entity types and relationships
- Type Safety: Generated code ensures proper data handling
Step-by-Step Process
Configuration Files (yaml2graph/config/ folder)
The system starts with YAML configuration files that define the data model:
- main_registry.yaml: The main entry point that includes all other config files
- entities.yaml: Defines what types of data objects exist (Dataset, DataFlow, CorpUser, etc.)
- urn_patterns.yaml: Defines how to create unique identifiers (URNs) for each entity
- aspects.yaml: Defines properties and metadata for entities (e.g., datasetProperties, dataflowProperties, etc.)
- relationships.yaml: Defines how entities connect to each other (e.g., dataset -> dataflow)
- utilities.yaml: Defines helper functions for data processing (e.g., data cleaning, data transformation, etc.)
Registry Loading (yaml2graph/registry/loaders.py)
- Reads the main registry file
- Merges all included YAML files into one big configuration
- Handles file dependencies and deep merging
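Deep merging, as opposed to a plain dictionary update, can be sketched as follows. This is an illustrative helper under assumed inputs, not the code in loaders.py; the function name deep_merge and the sample fragments (including the "type" key) are hypothetical, and the fragments stand in for YAML files that have already been parsed into dictionaries.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which nested mappings are merged recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the later file wins.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical fragments standing in for two included config files.
entities_cfg = {"entities": {"Dataset": {"properties": ["platform", "name", "env"]}}}
aspects_cfg = {
    "entities": {"Dataset": {"aspects": ["datasetProperties"]}},
    "aspects": {"datasetProperties": {"type": "versioned"}},
}

merged = deep_merge(entities_cfg, aspects_cfg)
print(sorted(merged["entities"]["Dataset"]))
```

With a shallow update, the second file's "entities" section would have replaced the first one entirely; the recursive version keeps both "properties" and "aspects" under Dataset.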
Validation (yaml2graph/registry/validators.py)
- Checks that all required sections exist
- Validates configuration structure
- Ensures everything is properly configured
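A structural check of this kind might look like the sketch below. The section names are assumed to mirror the config file names listed earlier, and validate_registry is a hypothetical helper; the real rules in validators.py may be stricter or organized differently.

```python
from typing import List

# Assumed section names, taken from the config file names (not confirmed by the source).
REQUIRED_SECTIONS = ("entities", "urn_patterns", "aspects", "relationships", "utilities")


def validate_registry(config: dict) -> List[str]:
    """Collect structural problems instead of stopping at the first one."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in config:
            errors.append(f"missing required section: {section}")
        elif not isinstance(config[section], dict):
            errors.append(f"section '{section}' must be a mapping")
    return errors


# A deliberately incomplete configuration to exercise both checks.
config = {"entities": {}, "urn_patterns": {}, "aspects": []}
for problem in validate_registry(config):
    print(problem)
```

Returning a list of problems rather than raising on the first failure lets a user fix all configuration issues in one pass.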
Code Generation (yaml2graph/registry/generators.py)
- URNGenerator: Creates functions that generate unique identifiers
- AspectProcessor: Creates functions that process entity metadata
- UtilityFunctionBuilder: Creates helper functions for data cleaning/processing
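To illustrate what "creating a function" from configuration means, here is a minimal sketch of a URN builder produced by a closure. The factory name make_urn_builder and its parameters are hypothetical; only the urn:li:dataset:(platform,name,env) shape is taken from the example in this document.

```python
from typing import Callable, Sequence


def make_urn_builder(entity_type: str, fields: Sequence[str]) -> Callable[..., str]:
    """Return a function that formats a URN from keyword arguments."""

    def build_urn(**values: str) -> str:
        missing = [field for field in fields if field not in values]
        if missing:
            raise ValueError(f"missing URN fields: {missing}")
        # Keep the field order fixed by the pattern, not by the caller.
        parts = ",".join(str(values[field]) for field in fields)
        return f"urn:li:{entity_type}:({parts})"

    return build_urn


dataset_urn = make_urn_builder("dataset", ["platform", "name", "env"])
print(dataset_urn(platform="mysql", name="users", env="PROD"))
# prints urn:li:dataset:(mysql,users,PROD)
```

Because the field list is data, changing the URN pattern in configuration changes the generated function without editing any code.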
Class Generation (yaml2graph/registry/writers.py)
- Takes all the generated functions and configuration
- Dynamically creates a Python class called Neo4jMetadataWriter
- This class has methods like:
  - upsert_dataset(), get_dataset(), delete_dataset()
  - upsert_dataflow(), get_dataflow(), delete_dataflow()
  - And so on for each entity type
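Dynamic class creation of this kind can be sketched with Python's built-in type(). The sketch below is an assumption-laden simplification: it stores records in an in-memory dictionary rather than Neo4j, takes the URN as an explicit argument, and uses the hypothetical names build_writer_class and InMemoryMetadataWriter. Only the upsert_/get_/delete_ method naming comes from the description above.

```python
def make_entity_methods(entity: str) -> dict:
    """Build upsert/get/delete methods bound to one entity type."""

    def upsert(self, urn: str, **props):
        self.store.setdefault(entity, {})[urn] = dict(props)
        return urn

    def get(self, urn: str):
        return self.store.get(entity, {}).get(urn)

    def delete(self, urn: str) -> bool:
        return self.store.get(entity, {}).pop(urn, None) is not None

    return {
        f"upsert_{entity}": upsert,
        f"get_{entity}": get,
        f"delete_{entity}": delete,
    }


def build_writer_class(entity_names):
    """Assemble a class whose methods are derived from the entity list."""

    def init(self):
        self.store = {}  # in-memory stand-in for the graph database

    namespace = {"__init__": init}
    for name in entity_names:
        namespace.update(make_entity_methods(name))
    return type("InMemoryMetadataWriter", (object,), namespace)


Writer = build_writer_class(["dataset", "dataflow"])
writer = Writer()
urn = writer.upsert_dataset("urn:li:dataset:(mysql,users,PROD)", platform="mysql")
print(writer.get_dataset(urn))
```

Each entity name contributes three methods to the class namespace, so adding an entity to the list is enough to make the corresponding methods appear on the generated class.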
Factory (yaml2graph/registry/factory.py)
- Orchestrates the entire process
- Creates the final writer class
- Provides a simple interface to use the generated code
Example: How a Dataset Gets Created
- Config says: "Dataset entities need platform, name, env, versionId properties"
- URN Pattern says: "Dataset URNs should look like: urn:li:dataset:(platform,name,env)"
- Generator creates: A function that builds URNs from the input data
- Writer gets: A method upsert_dataset(platform="mysql", name="users", env="PROD")
- Result: Creates a dataset node in Neo4j with the URN urn:li:dataset:(mysql,users,PROD)
Key Benefits
- No hardcoded entity/aspect/relationship types: Add new entities/aspects/relationships by just editing YAML
- Flexible URN patterns: Change how IDs are generated without touching code
- Dynamic methods: New entity types automatically get upsert/get/delete methods
- Configuration-driven: Business logic is in config files, not code
- Maintainable: Changes to data model only require config updates
Getting Started
Visit the GitHub repository to get started with YAML2Graph.