ClickHouse Internals

Tech · Created Feb 20, 2026 · Updated Feb 20, 2026 · 2052 words · 13 minutes read

Foreword

What if we need to build a relational database like PostgreSQL, where you store and query data with SQL, but the use case is for analytical queries instead of transactional updates and point queries?

Usage-wise, this means we do NOT need to be concerned with:

  1. ACID transactions. Aka not too worried about atomic updates, consistency, isolation between concurrent transactions, or durability
  2. Point querying. Aka not too worried about SELECT * FROM users where id = 123
  3. Updates. Aka not too worried about UPDATE users SET username = 'joe' WHERE id = 123

Instead, our usage will mostly be for analytical queries, such as analyzing click logs, user events, etc.

So the SQL we need to worry about is mostly:

  1. Batch insert. Example: INSERT INTO click_events VALUES (10000 rows here)
  2. Big analytical queries instead of point querying or updates. Example:
SELECT user_id, MONTH(timestamp), AVG(HOUR(timestamp)) FROM click_events
JOIN users ON users.id = click_events.user_id
GROUP BY user_id, MONTH(timestamp)
ORDER BY COUNT(*)

And that's what ClickHouse is built for. ClickHouse is engineered for fast analytical querying, and its architecture achieves this performance by:

  1. Not needing MVCC, Write Ahead Log, etc.
  2. Organizing the underlying data in columnar fashion, with compression, skip indexes, etc.

ClickHouse's architecture is a lot cleaner to understand than PostgreSQL's. I have a lot less experience with ClickHouse than with PostgreSQL, but the big ideas in the following post should be accurate.

This reading assumes the reader is familiar with SQL and PostgreSQL, which is used as an occasional comparison for illustrative purposes. Let's dive in.

Data storage

Columnar storage layout

When a batch of rows is inserted, ClickHouse builds them into a "part". 1000 new rows become a part. Another 5000 rows become another part. The rows within a part are ordered by whatever primary key the user specifies. And over time, ClickHouse merges parts together.

Within a part, ClickHouse stores each column's values together as a file. For example, all the "user id" values would be stored together in one file, and all the "purchased product id" values in another.

This is different from transactional databases like PostgreSQL. PostgreSQL ensures a row stays together, but ClickHouse would break up a row into its columns and store all the values for a column together instead.

To illustrate the idea, ClickHouse would store the data like:

user_ids = [1, 2, 34, 451]  # This gets stored in a file, ordered by primary key
purchased_product_ids = [2342, 923234, 34, 13423]  # This gets stored in another file, ordered by primary key

While PostgreSQL would:

rows = [
    {'user_id': 1, 'purchased_product_id': 2342},
    {'user_id': 2, 'purchased_product_id': 923234}
    # etc.
]

Why does ClickHouse store data in columnar fashion?

This is because analytical tables frequently have dozens to a hundred columns, while analytical SQL queries typically scan a lot of rows but reference only, say, 10 of those columns.

With ClickHouse's columnar storage layout, if our SQL query mentions 10 out of 100 columns, executing the query touches only 1/10th of the total data.

Had we stored rows together instead of columns, we would have touched ALL the data, because we would have to scan every full row and discard the 90 unused columns in each, which is very inefficient.

The drawback is that columnar storage suffers on point queries. Suppose we run SELECT * FROM users WHERE id = 100 on a 100-column table: ClickHouse would need to touch 100 files just to reassemble all the column values for that one row.

But that's totally fine, because ClickHouse's use case is analytical queries that touch a lot of rows but few columns. Columnar storage wins here.

For transactional databases that emphasize point queries and updates, storing rows together is a must, which is what PostgreSQL does.
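The I/O saving is simple arithmetic. A minimal sketch with made-up but representative numbers (1 million rows, 100 columns, 8 bytes per value, a query touching 10 columns):

```python
# Hypothetical sizing: 1M rows, 100 columns, 8 bytes per value,
# and a query that references only 10 of the 100 columns.
rows, total_cols, queried_cols, bytes_per_value = 1_000_000, 100, 10, 8

# Row storage: every row is read in full, unused columns included.
row_store_bytes = rows * total_cols * bytes_per_value

# Columnar storage: only the 10 referenced column files are read.
column_store_bytes = rows * queried_cols * bytes_per_value

print(row_store_bytes // column_store_bytes)  # -> 10
```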

Compression

When we store columns together instead of rows, compression becomes super easy and effective. This is because every value in a column file has the same data type, and due to data semantics, a column's values (repeated categories, sorted runs, narrow numeric ranges) usually leave huge room for compression.

Compression brings 2 benefits:

  1. Less disk usage, so that saves cost
  2. Less IO needed. Compression trades data size against CPU. If you can compress 10 MB to 5 MB, you do half the disk IO at the cost of some CPU overhead for decompression. But that's a win, because CPU is super fast compared to disk IO, so overall performance increases
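To see how well a sorted, low-cardinality column compresses, here is a sketch using zlib from the Python standard library as a stand-in. (ClickHouse itself compresses column files block by block, with LZ4 by default and ZSTD as an option; the data below is hypothetical.)

```python
import struct
import zlib

# Hypothetical "country_id" column: low cardinality and sorted by primary
# key, so the raw bytes contain long runs of repeated values.
column = [i // 10_000 for i in range(100_000)]        # only 10 distinct values
raw = b"".join(struct.pack("<i", v) for v in column)  # 400,000 bytes raw

compressed = zlib.compress(raw)
print(len(raw), len(compressed))  # compressed size is a tiny fraction of raw
```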

Primary key and granule

Unlike PostgreSQL, primary keys in ClickHouse do NOT ensure uniqueness. Instead, ClickHouse primary keys:

  1. Dictate the order the column values are stored in. Aka the rows are first sorted by primary key and then stored per column
  2. Dictate the creation of granules, which dictates the creation of sparse primary indexes

Granules are purely logical. By default, ClickHouse logically considers every 8192 rows as one "granule". Since all the rows are sorted by primary key, the primary key of the first row in each granule is non-decreasing across granules.

ClickHouse would grab each granule's first primary key and store them separately, forming a "sparse index" of primary keys.

As a concrete example, if we have these rows inserted, and the primary key is (user_id, timestamp):

user_id  timestamp   product_id  action_type
123      2026-01-01  1           click
123      2026-01-02  1           buy
...      (8190 more rows here)
456      2026-01-01  2           click

Then the first 8192 rows are logically granule 1, the next 8192 rows granule 2, and so on.

And the sparse index, stored as a file, would have these primary keys:

[(123, 2026-01-01), (456, 2026-01-01)]

And as said before, the columns would be stored separately, so there would be a user id file, timestamp file, product id file, and action type file.

As a result, when we query ClickHouse and we filter on primary key, ClickHouse is able to use the sparse index to locate the relevant granules and avoid scanning irrelevant granules.
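This granule lookup is essentially a binary search over the sparse index. A minimal sketch with illustrative data (the function name and row values are made up; the granule size is ClickHouse's default):

```python
import bisect

GRANULE_SIZE = 8192  # ClickHouse's default rows per granule

# Rows already sorted by the primary key (user_id, ts); values are made up.
rows = [(user_id, ts) for user_id in (123, 456, 789) for ts in range(GRANULE_SIZE)]

# Sparse index: the first primary key of every granule.
sparse_index = [rows[i] for i in range(0, len(rows), GRANULE_SIZE)]  # [(123, 0), (456, 0), (789, 0)]

def candidate_granule(key):
    """Binary-search the sparse index for the granule that may contain `key`.

    A range filter would keep a span of granules instead; this sketch
    handles a single lookup key for simplicity.
    """
    return max(bisect.bisect_right(sparse_index, key) - 1, 0)

print(candidate_granule((456, 42)))  # -> 1: only granule 1 needs scanning
```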

Partitions

When a table is defined in ClickHouse, we have the option to instruct it to "partition" the data by time range. For example, all the July 2025 data goes in 1 partition, August 2025 in another, February 2026 in another, and so on. Each partition is manifested on disk as a folder.

Then, when a batch of rows is inserted, the rows are split by the partition rule, and the resulting parts land in the correct partition folders. For example, all the newly inserted rows with July 2025 timestamps form a part under one partition, February 2026 under another, and so on.

So roughly, the underlying data storage would look like this folder structure:

July 2025
   Part 1
        user id column.file
        purchased product id column.file
   Part 2
        user id column.file
        purchased product id column.file
January 2026
   Part 1
        user id column.file
        purchased product id column.file
   Part 2
        user id column.file
        purchased product id column.file
   Part 3
        user id column.file
        purchased product id column.file

For analytical databases like ClickHouse, timestamp is a first-class concept, because analytical databases usually ingest rows that are semantically "events" of some sort, and time is what defines an event.

Thus, when we write analytical queries, we frequently filter by timestamp. With the partition mechanism, as the data is already partitioned by timestamps, we are able to identify what partitions are relevant and avoid scanning partitions that don't match the timestamp filter.

We also commonly have the use case of removing data that is too old. With the partition mechanism, this becomes easy, because ClickHouse just needs to periodically delete the entire partition folder once they become too old.
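Both behaviors, pruning by time range and deleting expired data, fall out of the same folder-per-partition layout. A minimal sketch, assuming hypothetical YYYYMM partition keys (the function names are illustrative):

```python
# Hypothetical partition folders, keyed by YYYYMM, each holding parts.
partitions = {
    "202507": ["part_1", "part_2"],
    "202601": ["part_1", "part_2", "part_3"],
    "202602": ["part_1"],
}

def relevant_partitions(lo, hi):
    """Partition pruning: keep only partitions inside the filter's time range."""
    return [p for p in sorted(partitions) if lo <= p <= hi]

def drop_older_than(cutoff):
    """Retention: deleting old data is just removing whole partition folders."""
    for p in [p for p in partitions if p < cutoff]:
        del partitions[p]

print(relevant_partitions("202512", "202602"))  # -> ['202601', '202602']
```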

Skip Indexes

Skip indexes are another mechanism for ClickHouse to avoid scanning data that does not match the filter condition.

Partitions let us skip, at the coarsest level, entire partitions that fall outside the time range (or other partition key ranges).

The sparse primary key index lets us skip data at the granule level, based on primary key filters.

ClickHouse gives us one more option to skip data at the granule level: skip indexes. As said before, the sparse primary key index skips granules based on the primary key; skip indexes skip granules based on any column we configure.

Consider this query:

SELECT
    country,
    SUM(money)
FROM purchase_events
WHERE
    timestamp BETWEEN '2025-12-01' AND '2026-02-05'
    AND money > 100
GROUP BY country

Partitioning by timestamp lets us skip all the partitions outside the December 2025 to February 2026 range. If we also have a skip index defined on the "money" column, then ClickHouse can skip granules within the remaining partitions.

Here's how it works. For every granule, aka every 8192 logical rows, ClickHouse finds the min and max "money" value within that granule. And the skip index is created by storing all the min and max across all granules in a file, like:

money_skip_index = [(123.99, 234.99), (56.99, 88.99)]

Then, at query time, if our query filters by money > 100, we know from the money_skip_index that we need to check the first granule, but not the second.

By the way, skip indexes can also be configured on expressions over one or more columns, and not just as min/max, but also as a membership set to support, for example, country IN ('US', 'China') style queries. That set can be stored in full or as a bloom filter.
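A sketch of both flavors of skip index, reusing the min/max values from above plus a hypothetical per-granule country set (function names are illustrative):

```python
# Min/max skip index on "money": one (min, max) pair per granule.
money_skip_index = [(123.99, 234.99), (56.99, 88.99)]

# Set skip index on "country": the distinct values seen in each granule.
# (ClickHouse can also store this as a bloom filter to save space.)
country_skip_index = [{"US", "UK"}, {"China", "Japan"}]

def granules_where_money_gt(threshold):
    """Keep granules whose max money value can exceed the threshold."""
    return [g for g, (_lo, hi) in enumerate(money_skip_index) if hi > threshold]

def granules_where_country_in(countries):
    """Keep granules whose value set intersects the IN (...) list."""
    return [g for g, seen in enumerate(country_skip_index) if seen & set(countries)]

print(granules_where_money_gt(100))                # -> [0]
print(granules_where_country_in({"US", "China"}))  # -> [0, 1]
```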

As a quick recap and comparison: transactional databases like PostgreSQL use B-tree or hash indexes to find rows efficiently and avoid scanning unnecessary data for a given filter condition. ClickHouse, since its data is stored in columnar fashion, does not have B-tree indexes, and instead relies on partitions, the sparse primary key index, and skip indexes to avoid scanning unnecessary data.

Query execution

Okay cool. We have all of our data there, stored with these mechanisms in place:

  1. Partitions
  2. Parts
  3. Columnar storage, compressed
  4. Sparse primary key index
  5. Skip indexes

Now, how do we actually execute an SQL query, such as,

SELECT user_id, MONTH(timestamp), AVG(HOUR(timestamp)) FROM click_events
JOIN users ON users.id = click_events.user_id
GROUP BY user_id, MONTH(timestamp)
ORDER BY COUNT(*)

As with any relational database system, PostgreSQL included, the big steps boil down to:

  1. Parse the raw SQL query string into a structured query tree
  2. Come up with an optimal query execution plan
  3. Execute that execution plan

Actually, ClickHouse's architecture doesn't support JOINs very well. As an analytical database, ClickHouse expects its data to be denormalized, so JOINs are frowned upon. This means query execution is usually concerned with just one table.

So, to come up with a query execution plan, we just need to resolve:

  1. Which partitions match the query's filters
  2. For every part in those partitions, which granules' primary keys match the query's filters
  3. If skip indexes exist, which ones can and should be used to filter out even more granules

Once the relevant partitions and granules within the parts are resolved, ClickHouse executes the query by reading the targeted portions of the column files, applying the remaining filters, and performing the aggregations, transformations, and other operations defined in the SQL.
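The pruning steps above can be sketched as a toy planner. All metadata here is illustrative, not ClickHouse's real on-disk format; the sparse primary key index check would slot into the same per-granule loop:

```python
# Toy metadata catalog: partition -> part -> per-granule (min, max) of "money".
catalog = {
    "202512": {"part_1": [(10.0, 90.0), (120.0, 300.0)]},
    "202601": {"part_1": [(5.0, 50.0), (101.0, 150.0)]},
    "202603": {"part_1": [(200.0, 900.0)]},
}

def plan(lo_month, hi_month, min_money):
    """Return (partition, part, granule) triples worth scanning."""
    targets = []
    for month, parts in catalog.items():
        if not (lo_month <= month <= hi_month):
            continue  # level 1: partition pruning
        for part, minmax in parts.items():
            for g, (_lo, hi) in enumerate(minmax):
                # Level 2: skip-index pruning per granule (the sparse
                # primary key index prunes the same way, by key range).
                if hi > min_money:
                    targets.append((month, part, g))
    return targets

print(plan("202512", "202602", 100))
# -> [('202512', 'part_1', 1), ('202601', 'part_1', 1)]
```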

If aggregation is involved, ClickHouse leverages the CPU's SIMD instructions to speed up query processing.

Finally, ClickHouse uses multiple threads to speed up aggregations and transformations as needed.

Conclusion

As an analytical relational database, ClickHouse organizes its data in columnar fashion with compression, partitions, parts, a sparse primary key index, and skip indexes. Upon query execution, ClickHouse aggressively prunes away irrelevant granules and executes the query over the relevant granules with SIMD, multithreading, and other techniques.

Unlike PostgreSQL, ClickHouse does not need to worry about ACID transactions, updates, or even point queries. It's built and optimized for bulk analytical queries.

Further reading

  1. The ClickHouse Official Documentation
  2. The ClickHouse Github Repo