Having observed many IT organizations’ database architectures, I believe most can cut their database licensing spend by more than 50% by implementing a new database virtualization architecture I refer to as a “hypercluster”. For the purposes of this discussion, I will talk primarily about Microsoft SQL Server, as it is the most deployed commercial database in the market today. However, many of these concepts also apply to Oracle or IBM DB2.
History of Enterprise IT Database Architectures
Most Enterprise database environments today are a sprawl of individual servers and databases cobbled together based on hundreds (or thousands) of individual projects that have been implemented over many years. The problem is usually much worse if an IT organization has been subjected to a lot of Merger and Acquisition (M&A) activity as IT environments are merged together based on different architectures and governance. Unfortunately server and database sprawl is the norm for Enterprise IT.
The reason this situation exists has much to do with how an IT shop budgets for projects. The business (Sales, HR, Marketing, Operations, etc) has a pain of some sort, and budget is assigned to solve this problem. This budget is then allocated to IT to purchase the infrastructure needed to solve the problem for the business. Over time, a server purchased for project A may have excess capacity, but because it isn’t a shared utility, new infrastructure is purchased for project B even though the capacity necessary is available on project A’s server. This situation creates a great deal of “excess capacity” in the environment.
In addition, the aging of the infrastructure becomes an issue over time. Most IT shops are happy to leave a server running with an “if it isn’t broke, don’t fix it” mentality. This is understandable, as disrupting database infrastructure can cause downtime, which can lead to significant and costly business disruption (and cause IT leaders to lose their jobs). However, these risks can be significantly mitigated with careful planning if a business case exists to optimize the infrastructure.
In my experience, 3 things stand in the way of holistic private cloud architectures:
- Lack of knowledge of the true cost of an already implemented environment.
- Business case for optimizing an environment is hidden from senior leaders of IT.
- Inability to pass through consumption costs of a shared utility.
“True cost” of the current environment
Most IT shops I have encountered have a ballpark understanding of “true cost”, but often do not understand the exact cost of what they manage. (For more information on this topic, please read this article). Issues like CAPEX depreciation costs, datacenter space costs, power costs, etc. all make calculating the true cost of operating existing infrastructure a very complex exercise. Without a baseline cost, it is impossible to calculate a business case to migrate, or to pass through a shared utility cost to the business.
Business case for optimization
No new optimization project can be contemplated without understanding the business case for a migration. Once the “true cost” of the existing environment is understood, it must be compared against the cost of a new infrastructure. The equation looks like this:
Savings = Old infrastructure cost – (New infrastructure cost + switching cost)
If the cost of the new environment over a period of time (say 3 years, as this is a typical depreciation timeline for hardware) is significantly lower than the cost of the old one (by more than 50%), the savings can be large enough to justify a migration.
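The savings equation above can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the dollar figures are hypothetical placeholders rather than numbers from a real environment.

```python
def migration_savings(old_cost, new_cost, switching_cost):
    """Savings = old infrastructure cost - (new infrastructure cost + switching cost)."""
    return old_cost - (new_cost + switching_cost)


def savings_ratio(old_cost, new_cost, switching_cost):
    """Savings expressed as a fraction of the old infrastructure cost."""
    return migration_savings(old_cost, new_cost, switching_cost) / old_cost


# Hypothetical 3-year totals, for illustration only.
old_3yr = 10_000_000
new_3yr = 3_500_000
switching = 1_000_000

savings = migration_savings(old_3yr, new_3yr, switching)
ratio = savings_ratio(old_3yr, new_3yr, switching)
print(f"3-year savings: ${savings:,.0f} ({ratio:.0%} of the old cost)")
```

Note that the switching cost is counted against the new environment, so a cheap target architecture with an expensive migration can still fail the test.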
Understanding a new “target environment”
Defining a business case means understanding what the cost of new infrastructure will look like. I find the savings of a new hypercluster architecture comes in the following areas:
- Hardware optimization – Database performance is based primarily on CPU and input/output operations per second (IOPS). By virtualizing databases onto new hardware, dramatic efficiency gains are possible.
- Licensing optimization – I have found most Enterprise IT shops have only cursory understanding of database licensing. Database licensing costs are now the single largest cost component of any database solution. Failure to understand and appreciate how licensing can be optimized can be a very costly mistake.
Defining these potential savings means pulling valuable IT resources off of projects, and giving them time in a lab (either internal labs or vendor labs) to cost out a new architecture. Below, I will hopefully provide information that would compel any CIO to give his/her teams the opportunity to explore this scenario.
Building a Utility model
Ultimately, a hypercluster is fundamentally a shared utility model. The cost of a database project is no longer made up of hardware and software licenses that are CAPEX’d. Therefore, the cost of the utility must be distributed to a business unit based on an OPEX cost of consuming the shared utility. Fortunately, the public cloud providers provide a fantastic model to follow if an IT organization decides to implement a private cloud utility. Public cloud providers provide their services in the following manner:
- Virtual Machine (VM) – VM pricing is based on the speed of the CPU and attached storage (SSD or regular spinning disk).
- Storage – Costs are based on how many GB of storage is consumed.
- Network – Network cost for cloud providers is usually based on GB egress (how many GB of data are transferred out of the cloud – ingress and data transferred between public cloud datacenters is usually free). Public cloud providers are now also charging more for SLAs associated with IOPS…which can be especially useful for transaction-heavy OLTP apps and for batch ETL processing in data warehouses.
Fortunately, costing out a private cloud can be as simple as running consumption reports on a monthly basis, and multiplying by the cost of the resources consumed. Budget transfers can then be automated with just a little bit of development time. For an Enterprise of any significant size, this is an investment worth making.
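As a rough sketch of what such a monthly chargeback calculation might look like, the Python below multiplies each business unit's consumption by a unit rate. The rate names, the rates themselves, and the consumption report are all hypothetical; a real implementation would pull usage from whatever monitoring or hypervisor reporting the IT shop already has.

```python
# Hypothetical internal unit rates, loosely modeled on public cloud pricing.
RATES = {
    "vm_hours": 0.12,    # $ per VM hour
    "storage_gb": 0.05,  # $ per GB stored per month
    "egress_gb": 0.02,   # $ per GB transferred out
}


def monthly_chargeback(usage_by_unit, rates=RATES):
    """Return {business_unit: charge} for one monthly consumption report."""
    charges = {}
    for unit, usage in usage_by_unit.items():
        total = sum(rates[resource] * qty for resource, qty in usage.items())
        charges[unit] = round(total, 2)
    return charges


# Hypothetical consumption report for one month.
report = {
    "Sales": {"vm_hours": 1440, "storage_gb": 500, "egress_gb": 200},
    "HR": {"vm_hours": 720, "storage_gb": 100, "egress_gb": 10},
}

for unit, charge in monthly_chargeback(report).items():
    print(f"{unit}: ${charge:,.2f}")
```

The resulting dictionary is the piece that would feed an automated budget transfer.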
Another point worth making: an existing virtualization farm IS NOT a good place to put a database. Databases have very different scaling characteristics from applications, and require their own dedicated architecture to be successful.
Now it is time to begin discussing what the actual architecture of a Hypercluster looks like.
Traditional database architectures
Today, traditional database architecture tends to fall into 2 camps:
- Bare metal server – one or more database instances run on a single server. In the case of SQL Server, these servers run the same version. This server is connected to a SAN using a Fibre Channel interconnect.
- Virtual Server – A single hypervisor with multiple VMs, also connected to a SAN using a Fibre Channel interconnect.
In both of these cases, a shared SAN/Disk architecture is utilized. In instances where failover is needed (which is true of almost all production databases), a passive node is put into place to handle a failover event when it occurs. In a much smaller number of cases, the data is replicated to another data center for disaster recovery. This architecture is shown below:
Note that there is no difference between the hardware architecture of a bare metal box and that of a hypervisor. The only difference is that the hardware hosts VMware or Hyper-V, and the database is installed onto a virtual OS.
Challenges of today’s database architecture
There are a large number of problems I have found in the implementations of today’s database architectures:
No or little virtualization
It is surprising to me how many IT shops do little or no virtualization in their database environments. Hardware utilization is a critical reason to virtualize, because most physical database servers run at 10-15% utilization on average. Collapsing database VMs onto a shared hypervisor has the following benefits:
- Hypervisor runs at 60-70% utilization – The hardware can run at much higher utilization %’s because individual database performance peaks and valleys happen at different times. Therefore, the hardware is much more efficiently utilized.
- Latest hardware – Most hypervisors implemented will be based on newer hardware, allowing for much faster database execution.
I have only come across one IT organization that has virtualized its entire database environment. Virtualization saved them millions of dollars on their database servers. Organizations with little to no virtualization represent massive opportunities for cost savings.
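To get a feel for what those utilization numbers imply, here is a back-of-the-envelope sketch in Python. It assumes, conservatively, that a new hypervisor host has the same capacity as one old physical server unless `capacity_ratio` says otherwise. The function names and the `capacity_ratio` parameter are my own, and the calculation ignores memory and IOPS limits entirely.

```python
import math


def servers_per_host(avg_util, target_util, capacity_ratio=1.0):
    """How many old servers (at avg_util) fit on one host run at target_util.

    capacity_ratio is the new host's capacity relative to one old server
    (an assumption; 1.0 means identical hardware).
    """
    per_host = math.floor(target_util * capacity_ratio / avg_util)
    if per_host < 1:
        raise ValueError("target host cannot absorb even one server")
    return per_host


def hosts_needed(n_servers, avg_util, target_util, capacity_ratio=1.0):
    """Number of hypervisor hosts needed to absorb n_servers."""
    return math.ceil(n_servers / servers_per_host(avg_util, target_util, capacity_ratio))


# Midpoints of the ranges quoted above: 12.5% average and 65% target utilization.
print(servers_per_host(0.125, 0.65))   # servers consolidated per host
print(hosts_needed(100, 0.125, 0.65))  # hosts needed for 100 servers
```

Newer hardware pushes `capacity_ratio` well above 1.0, which is where the larger consolidation ratios come from.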
Leverage licensing correctly
Enterprise IT customers often complain about database core based licensing. Core based licensing applied to an Enterprise IT database footprint that isn’t virtualized is a recipe for a great deal of financial pain. However, in most cases, the software vendor had no choice but to pursue this course of action. If Microsoft and Oracle had not adopted core based licensing, they would have seen their revenues shrivel…which would have hurt innovation in these fantastic software platforms.
However, in the case of SQL Server, Microsoft gives any organization an incredible gift in the form of “unlimited virtualization”. This allows an organization to run as many virtual databases on a hypervisor as it wants, as long as all cores on the hypervisor are licensed with Software Assurance (SA). This single licensing feature, if leveraged appropriately in the database architecture, can DRAMATICALLY reduce software licensing spend. As licensing is by far the #1 cost component of any database solution, an Enterprise IT shop is HIGHLY INCENTED to incorporate virtualization into its future architectural plans. Virtualization significantly eases the pain of core based licensing.
Introducing the “Hypercluster”
So if virtualization is the answer to massively reducing database spend for an Enterprise IT shop, what should this virtualization architecture look like? The answer to this is a “hypercluster”. What makes a hypercluster different than a regular virtualization environment? First, we must introduce a concept that will allow us to measure this…it is called “core compaction”.
Core compaction is all about creating a new target hardware environment that reduces the number of cores required to run an existing database environment. For example, if I am an Enterprise IT shop currently running 1,000 cores of SQL Server, I would like to reduce this to a much smaller number…say 100 cores. That would give me a 10:1 core compaction ratio. Now, let’s look at the dollars associated with this. Today, Microsoft charges $2,220.24 per year per 2-core pack for SA on SQL Server (level D pricing). If an IT organization has 1,000 Enterprise Edition (EE) cores, it will spend about $1.1 million/year just on maintenance of those cores. If, through consolidation, an organization can get a 10:1 core compaction, SA costs would fall to about $110,000 per year…roughly a $990,000 annual savings.
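The core compaction arithmetic is easy to reproduce. The sketch below uses the SA price quoted above. The function names are my own, and because the prose rounds its figures, the unrounded result differs from it by a few thousand dollars.

```python
SA_PER_2_CORE_PACK = 2220.24  # $/year per 2-core pack, the level D price quoted in the text


def annual_sa_cost(cores, price_per_pack=SA_PER_2_CORE_PACK):
    """Annual Software Assurance cost for a given number of licensed cores."""
    packs = cores / 2
    return round(packs * price_per_pack, 2)


def compaction_savings(current_cores, compaction_ratio):
    """Annual SA savings after consolidating at compaction_ratio : 1."""
    target_cores = current_cores / compaction_ratio
    return round(annual_sa_cost(current_cores) - annual_sa_cost(target_cores), 2)


before = annual_sa_cost(1000)
after = annual_sa_cost(100)
print(f"SA before: ${before:,.2f}")
print(f"SA after:  ${after:,.2f}")
print(f"Savings:   ${compaction_savings(1000, 10):,.2f}")
```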
But SA savings are just the first step. An Enterprise IT shop does not usually have JUST EE cores. There is usually a mix of EE cores, Standard Edition (SE) cores, and Server/Client Access Licenses (CAL). As Server/CAL licenses are limited to a maximum of 20 cores, these licenses are destined to be worthless in a few years as the core density of servers continues to increase due to Moore’s law. By collapsing all databases onto a few hypervisors, SA can be dropped on all Server/CALs. In addition, this consolidation will leave a large number of SE and EE core licenses unused and no longer necessary.
What to do with unnecessary licenses
So how many of these unused core licenses should be kept? It is hard to say exactly, but there are a few guidelines to follow. First, annual SA on SQL Server is equal to 25% of the cost of a License (.25L), so 4 years of SA costs as much as buying the license again. This means that unless you plan on implementing a new server to use those licenses in the next 4 years, it is best to drop SA on unnecessary licenses. Therefore, an IT organization needs to be able to predict its future growth. This will be a combination of looking at historical growth coupled with the IT organization’s 3 year project plan.
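The 4-year rule of thumb falls straight out of the 25% figure. A minimal sketch, assuming that a dropped license would later be replaced by a new one at today's full price (this ignores price changes and any other benefits SA carries), with function names of my own choosing:

```python
SA_RATE = 0.25  # annual SA cost as a fraction of one license (.25L, from the text)


def break_even_years(sa_rate=SA_RATE):
    """Years of SA payments that add up to the price of one new license."""
    return 1.0 / sa_rate


def should_keep_sa(years_until_reuse, sa_rate=SA_RATE):
    """True if paying SA until the license is reused costs less than rebuying it."""
    return years_until_reuse * sa_rate < 1.0


print(break_even_years())   # years until SA payments equal one license
print(should_keep_sa(2))    # license reused in 2 years
print(should_keep_sa(5))    # license reused in 5 years
```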
Another strategy is to use the licenses to upgrade the capability of the IT organization. Ideas for using unnecessary licenses would be:
- Upgrade SE databases to EE – By putting EE licenses on all hypervisors, SE databases can now take advantage of EE capabilities. This usually involves providing Enterprise grade failover and disaster recovery scenarios that were not economically viable before. Plus, it gives IT peace of mind knowing that it will never run into artificial limitations in terms of database functionality (max cores, specific EE database functions, etc.).
- Create new Disaster Recovery (DR) capabilities – SQL Server licensing allows for one and only one passive failover node. This has caused trouble for many IT shops who kept another passive node in another datacenter for disaster recovery, and thought the license was included when it is not. Unused licenses are a great way to make a full DR scenario much more economically viable…especially with retired hardware that is no longer needed.
- Other business case – Let’s face it, database licenses are EXPENSIVE! Since the license has already been purchased, why not consider other business scenarios that were not possible before because new licenses would have been cost prohibitive? Maybe it is a new data warehouse, or perhaps the business would like reporting/database replicas for more in-depth analysis. Better to use the license if it is useful than to allow the CAPEX investment to be lost.
What drives database performance
In order to design a hypercluster, first we must understand what drives Database performance. It is constrained by the following components:
- CPU – The CPU of any database server is a critical component of overall database server performance. However, it is not THE ONLY component of database performance. All database vendors price their licenses based on CPU cores. Therefore, if you want to reduce the cost of the license, the way to do this is to purchase the fastest Intel E7 chip you can with the smallest number of cores. And because CPU’s are not a huge component of cost, purchasing these chips for massive consolidation hyperclusters is a no brainer. Even more, Intel has designed the latest E7 chips to be optimized for virtualization.
- Memory – Virtualization is massively memory intensive, and databases perform significantly faster when functions run “in memory”. SQL Server 2014 has recently introduced in memory database capabilities, so maximizing memory (size and speed) on a hypercluster is a very worthwhile investment. Remember, the more you keep the CPU optimized, the more capacity you can achieve on a single hypervisor. The more capacity you can achieve, the more databases you can run on the same hardware. The more databases you run, the fewer database licenses you have to purchase/pay SA on. And remember, licenses are the most expensive cost component of a database solution.
- IOPS – IOPS is the speed at which I can read and write data to storage media. Historically, storage has been done on expensive and slow spinning disks configured in a shared SAN. HBA cards are very expensive, and internal IT costs for SAN can be astronomical. IOPS on a hypercluster is absolutely critical for any high transaction throughput system. Big OLTP applications, and ETL processing for data warehouses are great examples of IOPS intensive applications that require massive IOPS. Again, anything that can be done to dramatically increase IOPS in an architecture can drive overall database licensing costs down significantly.
CPU and Memory on a Hypercluster
So, if our goal is to maximize CPU, Memory, and IOPS, we need a hypervisor architecture that can impact this and drive massive core compaction for our business case to make sense. CPU is easy…simply maximize the per core performance by purchasing the fastest Intel Xeon E7 chip with the smallest number of cores. Memory is also easy…maximize the speed and size of RAM on the server. Both of these areas will dramatically increase performance, and maximize core compaction.
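To illustrate why per-core speed matters more than core count under core based licensing, here is a simplified sketch. The two chips, their relative per-core performance scores, and the 100-unit workload are all hypothetical. The SA price is the one quoted earlier in this article, and the model ignores memory and IOPS constraints as well as how cores are actually packaged into sockets and hosts.

```python
import math

SA_PER_2_CORE_PACK = 2220.24  # $/year per 2-core pack, as quoted earlier in this article


def cores_needed(workload_units, per_core_perf):
    """Cores required for a workload, rounded up to an even number (2-core packs)."""
    cores = math.ceil(workload_units / per_core_perf)
    return cores + (cores % 2)


def annual_sa(cores):
    """Annual SA cost for an even number of licensed cores."""
    return round(cores // 2 * SA_PER_2_CORE_PACK, 2)


# Hypothetical chips: relative per-core performance (1.0 = baseline).
chips = {"many slow cores": 1.0, "fewer fast cores": 1.4}
workload = 100  # hypothetical workload, in baseline-core units

for name, perf in chips.items():
    cores = cores_needed(workload, perf)
    print(f"{name}: {cores} cores, SA ${annual_sa(cores):,.2f}/year")
```

Under these assumptions, a chip that is 40% faster per core cuts the licensed core count, and therefore the SA bill, by more than a quarter.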
Storage on a Hypercluster
Storage, on the other hand, is a bit more complex than CPU and memory. The first step is to get away from a shared storage solution (SAN). The goal is to implement either a Network Attached Storage (NAS) solution or a Direct Attached Storage (DAS) solution. The architecture will look something like this:
Note that this architecture utilizes SSD storage from a company called Skyera. Skyera is pioneering Enterprise grade storage based on off the shelf consumer memory card technology. They have a 1U “pizza box” that can contain up to 136TB of SSD storage and connects directly through a 10-40GigE switch. The cost of this storage is equal to the cost of spinning disk HDD storage on a per GB basis. Because Windows Server 2012 includes SMB 3.0 features (NIC bonding specifically), an expensive Fibre Channel SAN HBA is no longer required. I am still looking for a storage vendor that implements this type of storage based on InfiniBand, as these cards are dropping dramatically in cost. In any case, the IOPS performance increase for regular database operations is dramatic.
So how do we do failover in these scenarios? For SQL Server 2012-2014, this is done using AlwaysOn functionality by creating availability groups. AlwaysOn does not require shared storage for failover. It replicates data over a high speed network link, and can be set for synchronous (highly available) or asynchronous (designed for replicas where database consistency isn’t critical) replication. This architecture is well documented and easy to set up.
However, what do I do about older versions of SQL Server? For SQL Server 2008/2008 R2, this is done by utilizing Windows Server 2012 “Storage Spaces”. This allows NAS to appear to Windows as if it is local storage. This architecture will allow 2008/2008 R2 databases to fail over just like they do against shared storage. SQL Server 2005/2000 databases need to be migrated, as these versions are no longer officially supported by Microsoft.
Failover on a Hypercluster
There has been a great deal of debate about the best approach for failing over a hypercluster. One approach is to put all active nodes on one hypervisor, and all passive nodes on a dedicated passive node hypervisor. This allows an Enterprise to not have to purchase licenses for the passive node hypervisor. However, most DBA experts I have talked to recommend a mixed environment as shown below:
The benefit of this approach is that loads on each hypervisor are roughly equal. A passive node does not have the load characteristics of an active node (an active node does more work). With a dedicated passive hypervisor, it would therefore be almost impossible to predict what would happen in a failover scenario, with potentially dozens of databases failing over at the same time if a hypervisor were lost. Given how safe an Enterprise database environment needs to be, the mixed architecture strikes a nice balance between safety and cost.
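One way to reason about a mixed placement is to model it. The sketch below is a toy model, not a sizing tool: it places each database's active node on one hypervisor and its passive node on the next one, then shows what each surviving host carries when one hypervisor is lost. The host names, the equal per-database loads, and the assumption that a passive node does 20% of the active node's work (`PASSIVE_LOAD_FACTOR`) are all hypothetical. It needs at least two hosts.

```python
from collections import defaultdict

PASSIVE_LOAD_FACTOR = 0.2  # assumption: a passive node does 20% of the active work


def place(databases, hosts):
    """Round-robin placement: active on host i, passive on the next host."""
    placement = {}
    for i, name in enumerate(databases):
        active = hosts[i % len(hosts)]
        passive = hosts[(i + 1) % len(hosts)]
        placement[name] = (active, passive)
    return placement


def host_loads(placement, loads, failed=None):
    """Load per host. If `failed` is set, its active nodes fail over to their
    passive hosts, and passive nodes it hosted are simply lost."""
    totals = defaultdict(float)
    for name, (active, passive) in placement.items():
        if active == failed:
            totals[passive] += loads[name]  # passive node takes over
        elif passive == failed:
            totals[active] += loads[name]   # active keeps running, no passive
        else:
            totals[active] += loads[name]
            totals[passive] += loads[name] * PASSIVE_LOAD_FACTOR
    return dict(totals)


hosts = ["hv1", "hv2", "hv3"]                 # hypothetical hypervisors
loads = {f"db{i}": 10.0 for i in range(6)}    # hypothetical equal loads
placement = place(list(loads), hosts)

print("normal:   ", host_loads(placement, loads))
print("hv1 lost: ", host_loads(placement, loads, failed="hv1"))
```

In this simple ring layout, every active node from the lost host lands on the same neighbor, so each hypervisor needs enough headroom to absorb a neighbor's full active load. Spreading passive nodes across several hosts would even that out.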
Also note in this diagram the failover to Microsoft Azure for disaster recovery. This is an excellent alternative to failing over to a remote datacenter. First, licenses can be covered as part of the service…making DR an OPEX as opposed to a CAPEX expense. Second, Azure backup can be replicated to multiple Azure datacenters with just a few clicks. From a cost and ease of setup perspective, this is really a no-brainer alternative to traditional on-premises remote DC backup/DR.
Real life savings
So, what does this architecture add up to in terms of overall savings? Bottom line, it can be dramatic. One recent analysis I did showed that migrating a large Enterprise IT organization to a hypercluster would reduce spending over 4 years from $35 million to $8 million. My estimate is that the core compaction ratio would be 12:1. Other benefits include much better predictability from a budgeting perspective, and doing away with audit risks.
Many IT organizations are considering whether to move all of their database functions to the public cloud. Personally, my recommendation would be to wait as I don’t believe the cloud providers will be ready to handle large Enterprise IT Tier 1 workloads for at least the next 3 years. However, moving to a database virtual private cloud is an excellent first step in preparing an IT organization to move to the cloud.
A recent addition to both AWS and Azure is the ability to make a direct network connection to the cloud from an existing datacenter. AT&T, Level 3, and others provide this capability. It would be a worthwhile investment of time to determine if it makes sense to put the database virtual private cloud into a new datacenter that is very few hops away from either an AWS or Azure datacenter. This would allow the data to be controlled by IT, while allowing application VMs to run in the public cloud. Currently, this is truly the best of both worlds.
Hopefully this paper will show the reader that moving to a Hypercluster Virtual Private Cloud architecture is worth exploring and ultimately implementing. The cost savings, ability to be nimble/elastic, and the ability to easily budget and avoid costly audit results just make sense. However, it all starts by taking a few steps of exploring hyperclusters, and determining if it is a good fit for your IT organization.