# Database from Scratch
> *What I cannot create, I do not understand*
> <span style="opacity:0.6">— Richard Feynman</span>
This quote has stayed with me since my undergraduate days as an aspiring physicist, and it still resonates deeply today. To truly understand something, you should be able to build it from first principles. Taking that to the extreme would be unrealistic, but there is an underlying truth in it that I find alluring, and this project is my attempt to apply that ideology to databases.
My recent fascination with databases started at work. One day, I realized that despite using PostgreSQL daily, I did not actually know what a database is. I knew how to write SQL, spin up a CRUD application, and debug queries, but those skills stopped at the interface.
- Is a database just a program?
- Where is the data actually stored?
- What does "data" even mean at the lowest level?
- Why are databases so reliable and so fast?
In my mental model, a database was just something I queried, and rows would magically appear on my screen.
This realization bothered me.
If I wanted to build systems at scale, I needed a more grounded understanding of what sits beneath SQL. So I started looking for resources and found an excellent one: [Build Your Own Database From Scratch in Go](https://build-your-own.org/database/). Using that guide, along with some help from AI, I built a toy database called [tinyDB](https://github.com/Abhik0605/tinyDB).
This blog is a record of what I learned along the way.
## What is a Database?
At its simplest, a database is a program that accepts queries and retrieves data from storage.
Most relational databases accept SQL, parse it, and execute it against data stored on disk. At a high level, my database follows this architecture:
```
User Input → SQL Parser → AST → Executor → Table → KV Store → B-Tree → Disk
```
The book I followed starts from the bottom of this stack. It introduces disk pages and B+trees first, then slowly builds upward. While this approach is rigorous, I found it difficult to internalize early on because it starts so far from what users actually interact with.
What finally made things click for me was the table layer, since tables are the mental model most people already have. So that is where I will start.
## Tables: The User Interface of a Database
Consider a simple users table:
| id (INT) | name (TEXT) |
| -------- | ----------- |
| 1 | Alice |
| 2 | Bob |
This is how we think about data. Rows, columns, and types.
Internally, my database does not store tables directly. Everything is stored as key-value pairs.
Each row becomes a single key-value entry:
- **Key**: contains identifying information, primarily the primary key
- **Value**: contains the remaining column data
Most relational databases, i.e. databases that store tabular data, structure their key-value pairs this way.
### Internal Representation
Key format
```
[table_prefix | primary_key_data]
```
Value format
```
[non_key_column_data]
```
Consider this row in the `users` table:
| id | name |
| --- | ----- |
| 1 | Alice |
This row becomes a single key-value entry.
- **Key**: `[3][1]`
`3` identifies the `users` table, and `1` is the primary key value.
- **Value**: `[5]["Alice"]`
`5` is the length of the string, followed by the raw bytes of `"Alice"`.
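The encoding above can be sketched in a few lines of Go. This is a minimal illustration, not tinyDB's exact format: I am assuming a fixed-width 4-byte big-endian encoding for the table prefix, primary key, and string length, so that keys sort correctly as raw bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeKey builds [table_prefix | primary_key_data] as bytes.
// Big-endian fixed-width integers keep keys sortable as raw bytes,
// which the ordered KV store relies on. (Widths are an assumption.)
func encodeKey(tablePrefix, pk uint32) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint32(buf[0:4], tablePrefix)
	binary.BigEndian.PutUint32(buf[4:8], pk)
	return buf
}

// encodeValue builds [length | raw bytes] for a TEXT column.
func encodeValue(s string) []byte {
	buf := make([]byte, 4+len(s))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(s)))
	copy(buf[4:], s)
	return buf
}

func main() {
	key := encodeKey(3, 1)      // table 3, primary key 1
	val := encodeValue("Alice") // length-prefixed string bytes
	fmt.Printf("key=%v value=%v\n", key, val)
}
```

Because the table prefix comes first, all rows of one table are contiguous in key order, which is what makes table scans and prefix lookups cheap later on.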
As the database grows, the number of these key-value pairs grows as well. At that point, the database is no longer managing tables. It is managing a large collection of ordered keys and associated byte arrays.
What the database needs to do is perform the following operations on the key-value pairs:
- Inserting new key-value pairs efficiently
- Retrieving them quickly
- Updating them correctly
- Deleting them safely
In other words, INSERT, SELECT, UPDATE, and DELETE are just operations on key-value entries. (A database does much more than this, but these operations are the most fundamental.)
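That contract can be captured in a small interface. The names here are illustrative, not tinyDB's actual API, and the map-backed implementation is a toy stand-in for the on-disk store:

```go
package main

import "fmt"

// KV captures the operations the table layer needs from storage.
type KV interface {
	Get(key []byte) (val []byte, ok bool) // SELECT
	Set(key, val []byte)                  // INSERT / UPDATE
	Del(key []byte) bool                  // DELETE
}

// memKV is a toy in-memory implementation; the real store
// is an ordered structure persisted on disk.
type memKV struct{ m map[string][]byte }

func newMemKV() *memKV { return &memKV{m: map[string][]byte{}} }

func (kv *memKV) Get(k []byte) ([]byte, bool) { v, ok := kv.m[string(k)]; return v, ok }
func (kv *memKV) Set(k, v []byte)             { kv.m[string(k)] = v }
func (kv *memKV) Del(k []byte) bool {
	_, ok := kv.m[string(k)]
	delete(kv.m, string(k))
	return ok
}

func main() {
	var kv KV = newMemKV()
	kv.Set([]byte("\x03\x01"), []byte("Alice")) // INSERT row 1 into table 3
	v, _ := kv.Get([]byte("\x03\x01"))          // SELECT
	fmt.Println(string(v))                      // Alice
	kv.Del([]byte("\x03\x01"))                  // DELETE
}
```

Note what is missing from a plain map: ordering. An ordered key space is what enables range scans and prefix lookups, and it is the reason the real store needs something like a B+tree rather than a hash table.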
This raises an important question. How do we store these key-value pairs on disk in a way that allows fast access?
## Journey to Disk: Why B+Trees Exist
### The Naive Approach
A natural first idea is to simply append records to a file:
```
[Record1][Record2][Record3]...
```
This approach works and is simple, but it does not scale well for reads.
To find a single record, the system may need to scan the entire file. Even with multiprocessing or partitioning, the core algorithm remains linear. As datasets grow, performance degrades quickly.
One of the most fascinating aspects of databases is how consistently fast they remain even with millions or billions of rows. That performance does not happen accidentally.
## Enter B+Trees: Making Search Fast
So how do we make it fast? The idea is simple: if we want a value from a million-row table, we do not scan a million rows. We find sneaky ways to work with less data.
At the core, that is all B+trees are doing. They are just many levels of indexing and pointers. A B+tree defines _how_ we index nodes, _where_ keys live, and _how_ those nodes point to each other on disk.
Honestly, if it helps, do not think about the word **tree** at all.
A B+tree is an organization of many lists (byte arrays, in our case) that point to other byte arrays in a clever and predictable way. Each list contains a sorted set of keys and pointers to either more lists or actual data.
Imagine searching for a book in a library.
If every book were placed on a single massive shelf, even if sorted, you would still spend a lot of time walking back and forth. Now imagine the library is split into multiple shelves, each labeled alphabetically. You first pick the correct shelf, then narrow it down further, and finally find the book. You never look at most of the books at all.
That is exactly what a B+tree is doing.
The introduction to this topic is often intimidating. You will see phrases like “balanced n-ary tree” and diagrams full of arrows and boxes that make very little sense when you first encounter them. To demystify such concepts, we should start with the intuition and then add the rigour.
## B+Tree Visualization
![[btree.png]]
In a B+tree, keys guide you through the structure. Suppose we store the keys:
`2, 4, 5, 14, 15, 17, 27, 32`
To find `27`, the database starts at the root node, which contains the key `17`. Since `27` is greater than `17`, the entire left side of the tree can be ignored and the search continues to the right.
The next node contains `27`, which again splits the remaining keys into smaller ranges. Because the target value is `27`, the search follows the right pointer and reaches the leaf node containing `27, 32`.
At no point does the database inspect unrelated keys like 2, 4, or 14. Each comparison narrows the search space, allowing the database to find the correct value by traversing only a few nodes instead of scanning every entry.
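The descent described above is only a few lines of code. Here is a minimal in-memory sketch: real nodes are fixed-size disk pages with serialized keys and child pointers, but the search logic is the same. The node layout and convention (keys equal to a separator go right) are my assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified in-memory B+tree node. On disk, each node
// is a fixed-size page; this sketch only shows the search logic.
type node struct {
	keys []int
	kids []*node // nil for leaf nodes
}

// search descends from the root, choosing one child per level.
func search(n *node, key int) bool {
	for n.kids != nil {
		// Index of the first separator key greater than the target;
		// the child at that index covers the target's range.
		i := sort.SearchInts(n.keys, key+1)
		n = n.kids[i]
	}
	for _, k := range n.keys { // linear scan within one small leaf
		if k == key {
			return true
		}
	}
	return false
}

func main() {
	// The tree from the example: root separator 17, leaves hold the keys.
	root := &node{
		keys: []int{17},
		kids: []*node{
			{keys: []int{5}, kids: []*node{
				{keys: []int{2, 4}},
				{keys: []int{5, 14, 15}},
			}},
			{keys: []int{27}, kids: []*node{
				{keys: []int{17}},
				{keys: []int{27, 32}},
			}},
		},
	}
	fmt.Println(search(root, 27), search(root, 3)) // true false
}
```

Searching for 27 touches exactly three nodes: the root, one internal node, and one leaf, which matches the walkthrough above.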
In practice, databases tune the size of each node to match the disk page size, typically 4096 bytes. Since operating systems read and write data in fixed-size blocks, aligning nodes with page sizes minimizes disk I/O.
With reasonable key and value sizes, a B+tree with a large branching factor can store millions of records with a height of just 3 or 4. This means:
- Each lookup requires only a handful of disk reads
- Time complexity is O(log n) with a very small constant factor
While this is sometimes described as constant time in practice, it is more accurate to say it is effectively constant due to the small tree height.
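The arithmetic behind that small height is worth seeing once. Assuming a hypothetical branching factor of 500 keys per 4KB page (the real number depends on key size), capacity multiplies by 500 per level:

```go
package main

import "fmt"

func main() {
	// Assumed branching factor: ~500 keys per 4KB page.
	fanout := 500
	capacity := 1
	for height := 1; height <= 4; height++ {
		capacity *= fanout
		fmt.Printf("height %d: ~%d records\n", height, capacity)
	}
	// A height of just 3 already covers 125,000,000 records.
}
```

This is why lookups feel constant-time in practice: the height barely grows even as the data grows by orders of magnitude.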
I would highly recommend playing around with the structure [here](https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html) to get a better understanding.
## Interacting with Files
At the lowest level, everything eventually becomes bytes written to disk. All the byte arrays we build while implementing the B+tree are stored inside files. Instead of manually reading from and writing to those files, the database uses `mmap` to map the file directly into memory and treat it like a large byte array.
The file is divided into fixed-size 4KB pages. Each page corresponds to a single node in the B+tree. When the database needs to read a node, it simply accesses the corresponding page in memory. This keeps reads fast and avoids unnecessary system calls.
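Here is a minimal sketch of that idea using Go's `syscall.Mmap` (Unix-only; a production database also has to grow and remap the file as it expands). Once the file is mapped, reading a page is just slicing a byte array:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const pageSize = 4096

// pageSlice returns the byte range of page n in the mapped file.
// No read() syscall happens here: it is plain pointer arithmetic.
func pageSlice(data []byte, n int) []byte {
	return data[n*pageSize : (n+1)*pageSize]
}

func main() {
	// A small temp file with two 4KB pages, standing in for the DB file.
	f, err := os.CreateTemp("", "pages")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if err := f.Truncate(2 * pageSize); err != nil {
		panic(err)
	}

	// Map the whole file into memory (Unix-only).
	data, err := syscall.Mmap(int(f.Fd()), 0, 2*pageSize,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	page := pageSlice(data, 1) // "read" page 1 of the file
	fmt.Println(len(page))     // 4096
}
```

The operating system's page cache handles the actual disk I/O behind the scenes, which is exactly what makes this approach both simple and fast.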
Here is the file layout in which the B+tree nodes are stored (diagram generated using AI):
![[db_architecture.png]]
### Writing Changes to Disk
When a transaction commits, the database performs writes in a carefully defined order.
First, all modified pages are written to disk using `pwrite()`. Next, `fsync()` is called to ensure those writes are fully persisted. Finally, the root metadata page is updated atomically to point to the new version of the tree.
This two-phase process ensures that the database is always in a consistent state. If a crash happens at any point, the database can either see the old version or the new version, but never a corrupted mix of both.
## SQL Parser: From Text to Execution
SQL is just text. Before it can do anything, the database must translate that text into actions.
```
SQL Text
↓
Tokenizer
↓
AST Builder
↓
Abstract Syntax Tree
↓
Executor
↓
Table / KV Operations
```
### Tokenization
The first step is breaking the raw SQL string into small, recognizable pieces called tokens. Instead of looking at characters one by one, the database groups them into meaningful units like keywords, names, numbers, and symbols.
For example, the query:
`SELECT name FROM users WHERE id = 1`
is broken down into something like:
`SELECT | name | FROM | users | WHERE | id | = | 1`
Each of these pieces is then labeled.
`SELECT` and `FROM` are keywords, `name` and `users` are identifiers, and `1` is a numeric literal.
At this stage, the database does not care what the query _means_. It is only answering a much simpler question: where do the words and symbols begin and end?
### AST Generation
Once the query has been tokenized, the database needs to understand how those pieces fit together. This is where the Abstract Syntax Tree, or AST, comes in.
The AST turns the flat list of tokens into a structured representation of the query. Instead of a sequence of words, the database now sees relationships. It knows which columns belong to the `SELECT` clause, which table is being queried, and which condition applies to the `WHERE` clause.
This is also where grammar rules are enforced. Invalid queries are rejected, and ambiguous cases are resolved. A `SELECT` with a `WHERE` clause stops being text and becomes a tree that describes _what data is requested_ and _under what conditions_.
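Concretely, an AST node for a single-table `SELECT` might look like the structs below. These type names are illustrative rather than tinyDB's actual definitions; real grammars need many more node kinds:

```go
package main

import "fmt"

// SelectStmt is the AST node for a single-table SELECT query.
type SelectStmt struct {
	Columns []string
	Table   string
	Where   *Condition // nil when there is no WHERE clause
}

// Condition is a simple "column op value" predicate.
type Condition struct {
	Column string
	Op     string
	Value  string
}

func main() {
	// The tree for: SELECT name FROM users WHERE id = 1
	stmt := SelectStmt{
		Columns: []string{"name"},
		Table:   "users",
		Where:   &Condition{Column: "id", Op: "=", Value: "1"},
	}
	fmt.Println(stmt.Table, stmt.Columns, stmt.Where.Column, stmt.Where.Value)
}
```

The flat token list `SELECT | name | FROM | users | WHERE | id | = | 1` has become structure: the parser has decided which tokens are columns, which is the table, and which form the predicate.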
### Execution
With a valid AST in hand, the database can finally do real work.
The executor walks the tree and translates each part into concrete actions on tables and the underlying key value store. Filters turn into key lookups, projections decide which columns to return, and updates become modifications to stored values.
At this point, SQL stops being special. It is just a structured plan that drives reads and writes against persistent storage.
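To make that concrete, here is a toy executor for the narrowest case: a `SELECT` whose `WHERE` clause is an equality on the primary key, which collapses into a single KV lookup. The map stands in for the B+tree-backed store, and the `"table/pk"` string key mirrors the `[table_prefix | primary_key_data]` layout from earlier:

```go
package main

import "fmt"

// planNode is a simplified plan: SELECT ... FROM table WHERE pk = value.
type planNode struct {
	table string
	pk    string
}

// execute translates the plan into a single KV get.
func execute(p planNode, kv map[string]string) (string, bool) {
	key := p.table + "/" + p.pk // simplified key encoding
	v, ok := kv[key]
	return v, ok
}

func main() {
	kv := map[string]string{"users/1": "Alice"}
	row, ok := execute(planNode{table: "users", pk: "1"}, kv)
	fmt.Println(row, ok) // Alice true
}
```

Conditions that do not hit the primary key cannot take this shortcut; without a secondary index they fall back to scanning the table's key range.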
## Conclusion
Building this database taught me that databases are not mysterious systems hidden behind SQL, but carefully crafted programs that read and write data on disk in a scalable manner. That said, my database still has quite a few limitations:
- No concurrency support
- Only primary key indexing
- Limited SQL support
- No query optimizer
- Memory mapping increases virtual memory usage
- Leaf nodes are not linked, so range scans are less efficient than in a true B+tree
Each of these omissions represents a real problem that production databases have to solve. I plan to tackle many of them in future iterations, not to build a production-ready system, but to continue pushing my understanding deeper.
## Resources
- tinyDB source code: [https://github.com/Abhik0605/tinyDB](https://github.com/Abhik0605/tinyDB)
- Build Your Own Database From Scratch: [https://build-your-own.org/database/](https://build-your-own.org/database/)
- SQLite architecture overview: [https://www.sqlite.org/arch.html](https://www.sqlite.org/arch.html)