The way we access and analyze data has changed dramatically in recent years. Row-oriented storage, long the default for databases, struggles to keep up with the demands of modern data analysis. In this article, I will introduce column-oriented storage and show how it helps analytical queries run faster.

OLAP

In my previous post, we discussed the differences between Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). As a reminder, analytical (OLAP) queries typically:

  • Consume a large number of records
  • Focus on only a specific subset of columns from each record
  • Aggregate data to calculate statistics (e.g., averages, sums)

Row-Oriented Storage

Row-oriented storage, a type of storage engine optimized for OLTP, stores all values belonging to a single row next to each other. The entire row is essentially stored as a sequence of bytes and is usually indexed for quick retrieval. When you provide a key, the database efficiently locates the physical position of the row on disk, loads that sequence of bytes into memory, and parses it to extract the specific values you need. Think of it like a CSV file: each row is stored as a string of characters. If you want to access the 10th row, you have to scan the file for the ninth line break, then read every character until you reach the next one. You then parse the result by splitting it on commas to get the fields you actually want.
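Here is a minimal sketch of that CSV analogy in Python (the file name and column position are made up for illustration): to fetch one row we scan line breaks, and we still end up parsing every field even when we only need a single value.

```python
def read_nth_row(path: str, n: int) -> list[str]:
    """Scan the file line by line and split the n-th row (1-indexed) on commas."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if i == n:
                return line.rstrip("\n").split(",")
    raise IndexError(f"file has fewer than {n} rows")

# Hypothetical usage: every field of the row is parsed just to read one value.
# row = read_nth_row("sales.csv", 10)
# quantity = int(row[3])
```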

While row-oriented storage is great for reading and writing individual records, it quickly becomes less suitable when faced with the demands of OLAP:

  • Indexes, the data structures that let most row-oriented storage engines locate rows quickly, are of little help for analytical queries. Analytical queries rarely access data through a specific key or ID. Instead, they filter on multiple conditions, such as a creation date within a given year or a product category in a given set. Any column in the table can appear in the WHERE clause, and we can't realistically create a separate row-based index for every column.
  • Reading a single row in row-oriented storage requires loading the entire sequence of bytes from disk into memory, even when the query only needs a few columns. Reading a huge number of rows with hundreds of columns (which is typical in OLAP) therefore quickly becomes inefficient, as the sketch after this list illustrates.
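To make the second point concrete, here is a toy continuation of the CSV analogy (the rows are hard-coded rather than read from disk): even though the query only needs QUANTITY, every field of every row gets loaded and parsed.

```python
# Rows as a row store would hand them to us: full lines, one per record.
rows = [
    "2023/12/28,2,13,3,0.00,card",
    "2024/01/11,2,49,5,0.00,bank",
    "2024/01/16,8,49,101,15.00,card",
]

total_quantity = 0
for raw in rows:                      # the whole row is read from storage
    fields = raw.split(",")           # every column is parsed...
    total_quantity += int(fields[3])  # ...but only QUANTITY is actually used

print(total_quantity)  # 109
```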

Column-Oriented Storage

Column-oriented storage is based on a simple idea: instead of storing all the values from one row together, just store all the values from each column together. Because the data is organized by column, a query only needs to access and process the columns that are relevant to its needs. This significantly reduces the amount of data that needs to be transferred and parsed, resulting in dramatic performance gains.

Let’s look at the example below. A sales table stored in a row-oriented format looks like this.

DATE         PRODUCT_KEY   CUSTOMER_KEY   QUANTITY   DISCOUNT   PAYMENT_METHOD
2023/12/28   2             13             3          0.00       card
2024/01/11   2             49             5          0.00       bank
2024/01/16   8             49             101        15.00      card
2024/01/21   6             55             5          5.00       card
2024/02/02   5             26             2          0.00       bank

A column-oriented store serializes all the values of a column and stores them together (as a sequence of bytes). For our example table, the data would be stored this way:

Column           Row 1        Row 2        Row 3        Row 4        Row 5
DATE             2023/12/28   2024/01/11   2024/01/16   2024/01/21   2024/02/02
PRODUCT_KEY      2            2            8            6            5
CUSTOMER_KEY     13           49           49           55           26
QUANTITY         3            5            101          5            2
DISCOUNT         0.00         0.00         15.00        5.00         0.00
PAYMENT_METHOD   card         bank         card         card         bank
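A minimal way to picture this layout in Python is a dictionary of column arrays (a simplification of how real column stores lay out bytes on disk). A query such as "total quantity paid by card" then touches only two of the six columns.

```python
columns = {
    "DATE":           ["2023/12/28", "2024/01/11", "2024/01/16", "2024/01/21", "2024/02/02"],
    "PRODUCT_KEY":    [2, 2, 8, 6, 5],
    "CUSTOMER_KEY":   [13, 49, 49, 55, 26],
    "QUANTITY":       [3, 5, 101, 5, 2],
    "DISCOUNT":       [0.00, 0.00, 15.00, 5.00, 0.00],
    "PAYMENT_METHOD": ["card", "bank", "card", "card", "bank"],
}

# Only QUANTITY and PAYMENT_METHOD are scanned; the other four columns are never read.
total = sum(
    quantity
    for quantity, method in zip(columns["QUANTITY"], columns["PAYMENT_METHOD"])
    if method == "card"
)
print(total)  # 3 + 101 + 5 = 109
```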

Advantages of Column-Oriented Storage

Reading data from column-oriented storage provides several key advantages over traditional row-oriented storage, especially for analytical workloads:

  • Column compression: Because modern data warehouses are typically denormalized, values within a column tend to repeat. Many popular compression algorithms, such as LZW or run-length encoding, exploit this similarity between adjacent values to shrink the data. Look at the PAYMENT_METHOD column in our example. What if, instead of storing a full 4-byte string, we only needed 1 bit per value: 0 for card and 1 for bank? The whole column becomes one long bitmap where each row consumes only 1 bit on disk (see the sketch after this list).
  • Access time: Disk access is a real bottleneck. When working with data on disk, we need data structures and algorithms designed to minimize access time (B-trees, for example). By reading only the columns a query needs and compressing them, we can scan more rows in a single read. That means fewer reads to scan an entire table with trillions of rows, and therefore less disk access time.
  • Throughput: Fetching only the necessary columns and better data compression also lead to better throughput, or the amount of data processed in a given time. Throughput is extremely important when compute and storage are not in the same place and data must be transferred over the network.
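Here is a rough sketch of the two compression ideas from the PAYMENT_METHOD example: dictionary-encoding the column down to one bit per value, and run-length encoding adjacent repeats. Real column stores implement these far more compactly; this only illustrates the principle.

```python
payment_method = ["card", "bank", "card", "card", "bank"]

# Dictionary/bitmap encoding: map each distinct value to a small code (here 1 bit).
codes = {"card": 0, "bank": 1}
bits = [codes[value] for value in payment_method]  # [0, 1, 0, 0, 1]

# Run-length encoding: store (value, run length) pairs for adjacent repeats.
def run_length_encode(values):
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]

print(bits)
print(run_length_encode(payment_method))  # [('card', 1), ('bank', 1), ('card', 2), ('bank', 1)]
```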

Conclusion

Each database implementation varies in its specific optimizations. However, the fundamental principle - storing and processing data by column rather than by row - remains the same, leading to significant performance gains for analytical queries. Understanding how your database works behind the scenes is beneficial for you as an engineer. Knowing what your tool does also means knowing what you do.