Nanodecoder

Foundation for experimenting and extending Transformer architectures.

LLMGenerativeAIPyTorch

Nanodecoder

Nanodecoder is meant as a foundation for experimenting and extending Transformer architectures. Modern large language models (LLMs) predominantly adopt a decoder-only transformer architecture, omitting separate encoder modules and cross-attention mechanisms typical of encoder-decoder designs. These decoder-only models rely solely on causal self-attention within stacked transformer blocks, enabling efficient and scalable autoregressive text generation. Such a structure is ideal for tasks that require prediction of the next token in a sequence, making it the backbone of systems like GPT-4 and Llama-2.

The model could surpass GPT-2 performance with a cluster of 8xA100 GPUs at a cost of around $35 for training.

GPT Variants

Nanodecoder currently implements two different types of LLM architectures:

Dense-GPT

Description: All parameters are engaged for every token (e.g., GPT-2, GPT-3, GPT-4).

Implementation Note: The dense-GPT implementation is designed to be beginner-friendly, making it ideal for learning and understanding the core GPT architecture.

MoE-GPT

Description: Follows the GPT style, but replaces certain dense feedforward layers with Mixture-of-Experts (MoE) layers (e.g., GPT-OSS, DeepSeek-V3, Switch Transformer).

Implementation Note: The moe-gpt folder provides production-ready code with comprehensive feature support.

Features

Feature	CPU	GPU	Mixed + TF32	Multi-GPU + Compile
CPU training	✅	✅	✅	✅
GPU training	❌	✅	✅	✅
Mixed precision	❌	❌	✅	✅
Multi-GPU DDP	❌	❌	❌	✅
Model compilation	❌	❌	❌	✅
Wandb logging	✅	✅	✅	✅

Hardware Requirements

If you are using GPU, you need to consider the following hardware requirements:

Working GPUs: Ampere architecture (2020+)

RTX 30xx series (RTX 3090, 3080, 3070, etc.)
A100, A40, A30
RTX 4090, 4080, 4070

Not supported:

RTX 20xx series
GTX series
Older GPUs

Architecture Details

MoE-GPT Architecture

A GPT-style model that replaces certain dense feedforward layers with Mixture-of-Experts (MoE) layers. Instead of activating all parameters for every token, only a subset of specialized experts are used, making the model more efficient while retaining high capacity. Examples include GPT-OSS, DeepSeek-V3, and Switch Transformer.

For Production and MoE-GPT Internals, check out: Production + MoE Internals Guide

Dense-GPT Architecture

In this architecture, all parameters of the model are active for every token. This design ensures the full network is always utilized during training and inference, making it straightforward to implement and analyze. Well-known examples include GPT-2, GPT-3, and GPT-4, which follow this dense architecture to deliver consistent performance across a wide range of tasks.

For detailed guide on DENSE-GPT: GPT Internals Guide

Getting Started

Visit the GitHub repository to start building your own LLM from scratch.