By definition, a columnar database stores data by columns rather than by rows, which makes it well suited to analytical query processing, and thus to data warehouses.
Columnar databases have been called the future of business intelligence (BI). They're often used in data warehouses, the structured data repositories that businesses use to support corporate decision-making. Businesses extract data from multiple sources, including cloud-based applications and in-house repositories, and pipe it in batches to these data warehouses, where it serves as the basis for BI tools. Data warehouses benefit from the higher performance they can gain from a database that stores data by column rather than by row.
Why are columnar databases faster for data warehouses? Storage systems have to pull data from physical disk drives, which store information magnetically on spinning platters using read/write heads that move around to find the data that users request. The less the heads have to move, the faster the drive performs. If data is kept closer together, minimizing seek time, systems can deliver that data faster.
What’s “faster”? Nowadays a typical hard drive seek operation may take only 4 milliseconds (ms) – but with the amount of big data stored in today’s enterprises, seek times can add up quickly. Solid-state drives (SSDs) offer seek times of less than 0.1 ms, but they cost several times as much as hard drives per gigabyte. In-memory databases offer seek times of just tens of nanoseconds, but they’re several hundred times more expensive than hard drives per unit of storage. Unless you have an unlimited budget to throw at the problem, arranging data efficiently on the physical disk will pay off every time you need to access the data. And you can gain further performance benefits by employing compression on the columnar data, as we'll see in a moment.
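To see how seek time adds up, here's a rough back-of-the-envelope sketch in Python. The row count and the one-seek-per-row model are simplifying assumptions for illustration, not measurements:

```python
# Hypothetical comparison: fetching 1,000,000 scattered rows at one 4 ms
# seek each, vs. reading the same rows from contiguous blocks after a
# single seek (sequential read time ignored in both cases).
HDD_SEEK_MS = 4.0

scattered_ms = 1_000_000 * HDD_SEEK_MS   # one seek per row
contiguous_ms = HDD_SEEK_MS              # one seek, then sequential reads

print(f"scattered:  {scattered_ms / 1000 / 60:.1f} minutes of pure seek time")
print(f"contiguous: {contiguous_ms:.1f} ms of seek time")
```

Even ignoring transfer time entirely, scattered access spends over an hour just moving the read/write heads.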
Storing data efficiently
Row-oriented databases store each record in one or more contiguous blocks on disk. Column-oriented databases store each column in one or more contiguous blocks. Each scheme is better-suited to different use cases, as the following example illustrates.
Suppose you're a retailer maintaining a web-based storefront. An ecommerce site generates a lot of data. Consider product purchase transactions:
Businesses handle transactions using online transaction-processing (OLTP) software. All the fields in each row are important, so for OLTP it makes sense to store items on disk by row, with each field adjacent to the next in the same block on the hard drive:
512,Seabiscuit,Book,10.95,201712241200,goodreads.com
513,Bowler,Apparel,59.95,201712241200,google.com
514,Cuphead,Game,20.00,201712241201,gamerassaultweekly.com
Transaction data is also characterized by frequent writes of individual rows.
Some – but not all – of the information from transactions is useful to inform business decisions – what's called online analytical processing (OLAP). For instance, a retailer might want see how price affects sales, or to zero in on the referrers that send it the most traffic so it can determine where to advertise. For queries like these, we don't care about row-by-row values, but rather the information in certain columns for all rows.
For OLAP purposes, it's better to store information in a columnar database, where blocks on the disk might look like:
512,513,514
Seabiscuit,Bowler,Cuphead
Book,Apparel,Game
10.95,59.95,20.00
201712241200,201712241200,201712241201
goodreads.com,google.com,gamerassaultweekly.com
With this organization, applications can read the kinds of information you might want to analyze – pricing information, or referrers – together in a single block. You get performance wins both by retrieving information that's grouped together, and by not retrieving information you don't need, such as individual names.
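The two layouts can be sketched in a few lines of Python using the transactions above. The field names (`id`, `name`, `price`, and so on) are illustrative assumptions, since the example data is unlabeled:

```python
transactions = [
    (512, "Seabiscuit", "Book",    10.95, "201712241200", "goodreads.com"),
    (513, "Bowler",     "Apparel", 59.95, "201712241200", "google.com"),
    (514, "Cuphead",    "Game",    20.00, "201712241201", "gamerassaultweekly.com"),
]

# Row-oriented: each record stored contiguously (good for OLTP writes/reads).
row_store = list(transactions)

# Column-oriented: each field stored contiguously (good for OLAP scans).
fields = ["id", "name", "category", "price", "timestamp", "referrer"]
column_store = {name: [row[i] for row in transactions]
                for i, name in enumerate(fields)}

# An analytical query touches only the columns it needs:
avg_price = sum(column_store["price"]) / len(column_store["price"])
print(avg_price)  # (10.95 + 59.95 + 20.00) / 3
```

In a real columnar engine, each of those lists would map to contiguous blocks on disk, so the price query reads only the price blocks.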
Unlike transactional data, which is written frequently, analytical data doesn't change often. It's usually created by infrequent bulk writes – data dumps.
Columnar storage lets you ignore all the data that doesn’t apply to a particular query, because you can retrieve the information from just the columns you want. By contrast, if you were working with a row-oriented database and you wanted to know, say, the average population density in cities with more than a million people, your query would access each record in the database (meaning all of its fields) to get the information from the two columns whose data you needed, which would involve a lot of unnecessary disk seeks – and disk reads, which also impact performance.
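That population-density query can be sketched against a columnar layout as follows; the city figures are made up for illustration:

```python
# Two columns from a hypothetical cities table, stored independently.
population = [8_400_000, 650_000, 2_700_000, 1_300_000]
density    = [11_000,    4_500,   12_000,    9_800]   # people per sq. km

# The query reads only these two columns; every other field in the
# table stays untouched on disk.
selected = [d for p, d in zip(population, density) if p > 1_000_000]
avg_density = sum(selected) / len(selected)
print(avg_density)
```

A row-oriented database answering the same query would read every field of every record, just to discard most of them.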
Speaking of disk reads, columnar databases can boost performance in another way – by reducing the amount of data that needs to be read from disk. Most columnar databases compress similar data to reduce storage. Look back at the way columnar data is stored. In our example, you can imagine a number of products with the same name. Even in columns with many different values, all the values are of the same data type. Programmers have devised clever algorithms for storing repetitive information in less space than it would take if you enumerated each instance. You can't usually do that with row-oriented databases, because all the fields are different.
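Run-length encoding is one of the simplest schemes of this kind: a run of repeated values collapses to the value plus a count. This is a toy illustration, not any particular engine's codec:

```python
from itertools import groupby

def rle_encode(column):
    """Collapse each run of repeated values to a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A sorted category column compresses well because equal values sit together.
category = ["Book", "Book", "Book", "Apparel", "Game", "Game"]
print(rle_encode(category))  # [('Book', 3), ('Apparel', 1), ('Game', 2)]
```

Six values shrink to three pairs; real engines combine tricks like this with dictionary and delta encodings, and the savings grow with the length of the runs.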
For all their advantages, columnar databases aren't suitable for every use case. Up until this point we’ve been talking mostly about database read performance, and not so much about writes. You can insert a new record into a row-oriented database with a single operation. It takes more computing resources to write a record to a columnar database, because you have to write all the fields to the proper columns one at a time. That means row-oriented databases are still the best choice for OLTP applications, while column-oriented databases are generally better for OLAP. Also, the more fields you need to read per record, the less benefit you’ll get from using column-oriented storage.
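The write-cost difference can be modeled in a few lines. This is a toy in-memory sketch of the trade-off, and the record and field names are hypothetical:

```python
row_store = []
column_store = {"id": [], "name": [], "price": []}

new_record = (515, "Chess Set", 34.99)

# Row-oriented: inserting a record is a single append.
row_store.append(new_record)

# Column-oriented: one write per field, each to a different column.
for field, value in zip(column_store, new_record):
    column_store[field].append(value)

print(len(row_store), [len(col) for col in column_store.values()])
```

On disk the gap is wider still: each column lives in its own blocks, so a single-record insert can touch as many distinct disk locations as the table has columns.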
Choosing a columnar database as your data warehouse
You can buy, install, and host a column-oriented database in your own data center, using software such as HP Vertica, Apache Cassandra, and Apache HBase. If you have high-end hardware, you can expect good performance from on-premises databases, as long as the load is relatively constant. If you have variation in your workloads, you could see performance impacts. You'll also need more people in your IT department to help manage the hardware and software.
Many organizations prefer to host their data warehouses in the cloud, using services such as Amazon Redshift, Google BigQuery, and Snowflake. Cloud applications offer several benefits:
- No capital requirements for hardware
- Ability to architect for high availability with built-in fault tolerance
- Flexible capacity and near-infinite scalability to deal with elastic demands
- Always the latest technology
Once you’ve settled on a data warehouse, you’re going to have to populate it with data. You may be tempted to write code that extracts the data from your applications and loads it into your columnar database. There’s a better way to go. Stitch is a simple, powerful ETL service for businesses of all sizes, up to and including the enterprise.
Simple data integration with Stitch
Stitch is a cloud data integration service. Stitch connects to today’s most popular business tools – including Salesforce, Facebook Ads, and more than 80 others – and automatically replicates the raw data to a data warehouse. There's no code to write, and it automatically keeps your data up to date.
Stitch was built to solve data integration. With just a few clicks, Stitch will extract your data from wherever it lives and get it ready to be analyzed, understood, and acted upon.