Database Hypercluster Virtual Private Cloud Architecture

Having observed MANY IT organizations’ database architectures, I believe most can cut their database licensing spending by over 50% by implementing a new database virtualization architecture I refer to as a “hypercluster”. For the purposes of this discussion, I will be talking primarily about Microsoft SQL Server as it is the most deployed commercial database in the market today. However, many of these concepts would also apply to Oracle or IBM DB2.

History of Enterprise IT Database Architectures

Most Enterprise database environments today are a sprawl of individual servers and databases cobbled together from hundreds (or thousands) of individual projects implemented over many years. The problem is usually much worse if an IT organization has been subjected to a lot of Merger and Acquisition (M&A) activity, as IT environments built on different architectures and governance models are merged together. Unfortunately, server and database sprawl is the norm for Enterprise IT.

The reason this situation exists has much to do with how an IT shop budgets for projects. The business (Sales, HR, Marketing, Operations, etc.) has a pain of some sort, and budget is assigned to solve this problem. This budget is then allocated to IT to purchase the infrastructure needed to solve the problem for the business. Over time, a server purchased for project A may have excess capacity, but because it isn’t a shared utility, new infrastructure is purchased for project B even though the necessary capacity is available on project A’s server. This situation creates a great deal of “excess capacity” in the environment.

In addition, the aging of the infrastructure becomes an issue over time. Most IT shops are happy to leave a server running with an “if it isn’t broke, don’t fix it” mentality. This is understandable, as disruption of database infrastructure can cause downtime, which can cause significant and costly business disruption (and cause IT leaders to lose their jobs). However, these risks can be significantly mitigated with careful planning if a business case exists to optimize the infrastructure.

I have found that 3 things stand in the way of holistic private cloud architectures:

  1. Lack of knowledge of the true cost of an already implemented environment.
  2. The business case for optimizing an environment is hidden from senior IT leaders.
  3. Inability to pass through consumption costs of a shared utility.

“True cost” of the current environment

Most IT shops I have encountered have a ballpark understanding of “true cost”, but often do not understand the exact cost of what they manage. (For more information on this topic, please read this article). Issues like CAPEX depreciation costs, datacenter space costs, power costs, etc. all make calculating the true cost of operating existing infrastructure a very complex exercise. Without a baseline cost, it is impossible to calculate a business case to migrate, or to pass through a shared utility cost to the business.

Business case for optimization

No new optimization project can be contemplated without understanding the business case for a migration. Once the “true cost” of the existing environment is understood, it must be compared against the cost of a new infrastructure. The equation looks like this:

Savings = Old infrastructure cost – (New infrastructure cost + switching cost)

If, over a period of time (say 3 years, as this is a typical depreciation timeline for hardware), the new infrastructure plus switching costs come in significantly lower than the old infrastructure cost (say a reduction of more than 50%), the savings can be significant enough to justify a migration.
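To make this concrete, here is a minimal sketch of that calculation in Python. All of the figures below are purely illustrative, not taken from any real engagement; plug in your own “true cost” data.

```python
# Minimal sketch of the savings equation above, with purely illustrative numbers.

def migration_savings(old_annual_cost, new_annual_cost, switching_cost, years=3):
    """Savings = old infrastructure cost - (new infrastructure cost + switching cost)."""
    old_total = old_annual_cost * years
    new_total = new_annual_cost * years + switching_cost
    return old_total - new_total

savings = migration_savings(
    old_annual_cost=4_000_000,   # hypothetical: current servers, SAN, licensing, power
    new_annual_cost=1_500_000,   # hypothetical: hypercluster hardware + licensing
    switching_cost=1_000_000,    # hypothetical: migration labor, lab time, parallel running
    years=3,                     # typical hardware depreciation timeline
)
print(f"3-year savings: ${savings:,.0f}")   # -> 3-year savings: $6,500,000
```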

Understanding a new “target environment”

Defining a business case means understanding what the cost of new infrastructure will look like. I find the savings of a new hypercluster architecture come in the following areas:

  1. Hardware optimization – Database performance is based primarily on CPU and input/output operations per second (IOPS). By virtualizing databases onto new hardware, dramatic efficiency is possible.
  2. Licensing optimization – I have found most Enterprise IT shops have only a cursory understanding of database licensing. Database licensing costs are now the single largest cost component of any database solution. Failure to understand and appreciate how licensing can be optimized can be a very costly mistake.

Defining these potential savings means pulling valuable IT resources off of projects, and giving them time in a lab (either internal labs or vendor labs) to cost out a new architecture. Below, I will hopefully provide information that would compel any CIO to give his/her teams the opportunity to explore this scenario.

Building a Utility model

A hypercluster is fundamentally a shared utility model. The cost of a database project is no longer made up of hardware and software licenses that are CAPEX’d. Therefore, the cost of the utility must be distributed to a business unit based on an OPEX cost of consuming the shared utility. Fortunately, the public cloud providers offer a fantastic model to follow if an IT organization decides to implement a private cloud utility. Public cloud providers price their services in the following manner:

  1. Virtual Machine (VM) – VM pricing is based on the speed of the CPU and attached storage (SSD or regular spinning disk).
  2. Storage – Costs are based on how many GB of storage is consumed.
  3. Network – Network cost for cloud providers is usually based on GB egress (how many GB of data are transferred out of the cloud – ingress and data transferred between public cloud datacenters is usually free). Public cloud providers are now also charging more for SLAs associated with IOPS…which can be especially useful for transaction-heavy OLTP apps, and batch processing for ETL processes in data warehouses.

Fortunately, costing out a private cloud can be as simple as running consumption reports on a monthly basis, and multiplying by the cost of the resources consumed. Budget transfers can then be automated with just a little bit of development time. For an Enterprise of any significant size, this is an investment worth making.
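As a sketch of what that monthly chargeback calculation might look like, the snippet below uses a hypothetical internal rate card and made-up consumption figures; substitute your own metering data and rates.

```python
# A sketch of the monthly chargeback calculation described above. The rate card and the
# consumption-report fields are hypothetical.

# Internal "rate card" mirroring how public cloud providers price (VM hours, GB stored, GB egress).
RATE_CARD = {
    "vm_hour":    0.12,   # $ per VM hour (would normally vary by VM size)
    "storage_gb": 0.05,   # $ per GB-month stored
    "egress_gb":  0.08,   # $ per GB of network egress
}

def monthly_chargeback(consumption):
    """consumption: list of dicts, one per business unit, from the monthly consumption report."""
    bills = {}
    for row in consumption:
        cost = (row["vm_hours"]   * RATE_CARD["vm_hour"]
              + row["storage_gb"] * RATE_CARD["storage_gb"]
              + row["egress_gb"]  * RATE_CARD["egress_gb"])
        bills[row["business_unit"]] = round(cost, 2)
    return bills

# Example usage with made-up numbers for two business units.
report = [
    {"business_unit": "Sales",      "vm_hours": 14_400, "storage_gb": 2_000, "egress_gb": 150},
    {"business_unit": "Operations", "vm_hours": 7_200,  "storage_gb": 9_500, "egress_gb": 40},
]
print(monthly_chargeback(report))   # {'Sales': 1840.0, 'Operations': 1342.2}
```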

Another point worth making: an existing virtualization farm IS NOT a good place to put a database. Databases have very different scaling characteristics from applications, and they do require their own dedicated architecture to be successful.

Hypercluster Architecture

Now it is time to begin discussing what the actual architecture of a Hypercluster looks like.

Traditional database architectures

Today, traditional database architecture tends to fall into 2 camps:

  1. Bare metal server – one or more database instances run on a single server. In the case of SQL Server, these servers run the same version. This server is connected to a SAN using a Fibre Channel interconnect.
  2. Virtual Server – A single hypervisor with multiple VMs, also connected to a SAN using a Fibre Channel interconnect.

In both of these cases, a shared SAN/Disk architecture is utilized. In instances where failover is needed (which is true of almost all production databases), a passive node is put into place to handle a failover event when it occurs. In a much smaller number of cases, the data is replicated to another data center for disaster recovery. This architecture is shown below:

[Figure: Existing database architecture]

Note that there is no difference between the hardware architecture of a bare metal box vs a hypervisor. The only difference is that the hardware hosts VMware or Hyper-V, and the database is installed onto a virtual OS.

Challenges of today’s database architecture

There are a large number of problems I have found in the implementations of today’s database architectures:

No or little virtualization

It is surprising to me how many IT shops do little or no virtualization in their database environments. Hardware utilization is a critical reason to virtualize: most physical database servers run at 10-15% utilization on average. Collapsing database VMs onto a shared hypervisor has the following benefits:

  1. Hypervisor runs at 60-70% utilization – The hardware can run at much higher utilization percentages because individual database performance peaks and valleys happen at different times. Therefore, the hardware is much more efficiently utilized.
  2. Latest hardware – Most hypervisors implemented will be based on newer hardware, allowing for much faster database execution.

I have only come across one IT organization that has virtualized their entire database environment. Virtualization saved them millions of dollars on their database servers. Organizations with little to no virtualization represent massive opportunities for cost savings.

Leverage licensing correctly

Enterprise IT customers often complain about database core-based licensing. Core-based licensing applied to an Enterprise IT database footprint that isn’t virtualized is a recipe for a great deal of financial pain. However, in most cases, the software vendors had no choice but to pursue this course of action. If Microsoft and Oracle had not adopted core-based licensing, they would have seen their revenues shrivel…which would have hurt innovation in these fantastic software platforms.

However, in the case of SQL Server, Microsoft gives any organization an incredible gift in the form of “unlimited virtualization”. This allows an organization to run as many virtual databases on a hypervisor as they want, as long as all cores on the hypervisor are licensed with Software Assurance. This single licensing feature, if leveraged appropriately in the database architecture, can DRAMATICALLY reduce software licensing spend. As licensing is by far the #1 cost component of any database solution, an Enterprise IT shop is HIGHLY INCENTED to incorporate virtualization into their future architectural plans. Virtualization significantly eases the pain of core-based licensing.

Introducing the “Hypercluster”

So if virtualization is the answer to massively reducing database spend for an Enterprise IT shop, what should this virtualization architecture look like? The answer is a “hypercluster”. What makes a hypercluster different from a regular virtualization environment? First, we must introduce a concept that will allow us to measure this…it is called “core compaction”.

Core Compaction

Core compaction is all about creating a new target hardware environment that allows us to reduce the number of cores required to run an existing database environment. For example, if I am an Enterprise IT shop currently running 1,000 cores of SQL Server, I would like to be able to reduce this to a much smaller number…say 100 cores. This would give me a 10:1 core compaction ratio. Now, let’s look at the dollars associated with this. Today, Microsoft charges $2,220.24/year per 2 cores for SA on SQL Server (level D pricing). If an IT organization has 1,000 Enterprise Edition (EE) cores, they will spend $1.1 million/year just on maintenance of those cores. If, through consolidation, an organization can get a 10:1 core compaction, that would lead to SA costs of $110,000 per year…a $990,000 annual savings.
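The arithmetic above is simple enough to script. The sketch below reproduces it using the SA price quoted above; your own agreement pricing will differ.

```python
# Reproduces the core compaction arithmetic above. The per-2-core SA price is the figure
# quoted in the text (level D pricing); your agreement will differ.

SA_PER_2_CORES = 2220.24   # $/year per 2-core pack of SQL Server EE Software Assurance

def annual_sa_cost(cores):
    return (cores / 2) * SA_PER_2_CORES

def compaction_savings(current_cores, compaction_ratio):
    target_cores = current_cores / compaction_ratio
    before = annual_sa_cost(current_cores)
    after = annual_sa_cost(target_cores)
    return before, after, before - after

before, after, saved = compaction_savings(current_cores=1000, compaction_ratio=10)
print(f"SA before: ${before:,.0f}/yr, after: ${after:,.0f}/yr, saved: ${saved:,.0f}/yr")
# -> SA before: $1,110,120/yr, after: $111,012/yr, saved: $999,108/yr (roughly the $990K above)
```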

But SA savings are just the first step. An Enterprise IT shop does not usually have JUST EE cores. There is usually a mix of EE cores, Standard Edition (SE) cores, and Server/Client Access Licenses (CAL). As Server/CAL licenses are limited to a maximum of 20 cores, these licenses are destined to be worthless in a few years as the core density of servers continues to increase due to Moore’s law. By collapsing all databases onto a few hypervisors, SA can be dropped on all Server/CAL licenses. In addition, this consolidation will leave a large number of SE and EE core licenses unused and no longer necessary.

What to do with unnecessary licenses

So how many of these unused core licenses should be kept? It is hard to say exactly, but there are a few guidelines to follow. First, SA on SQL Server is equal to 25% of a License (.25L), which means four years of SA payments roughly equal the price of a new license. Therefore, unless you plan on implementing a new server to use those licenses within the next 4 years, it is best to drop SA on unnecessary licenses. This means an IT organization needs to be able to predict its future growth, which will be a combination of looking at historical growth coupled with the IT organization’s 3-year project plan.
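As a rough planning aid, the keep-vs-drop decision can be sketched out from the .25L rule of thumb. This ignores version upgrade rights, price increases, and true-up nuances, so treat it as illustrative only.

```python
# A sketch of the keep-vs-drop SA decision using the 0.25L rule of thumb above.
# Ignores version upgrade rights, price increases, and true-up nuances.

def cheaper_to_keep_sa(years_until_license_needed, sa_ratio=0.25):
    """Compare paying SA until the license is needed vs dropping SA and buying a new license."""
    cost_keep = years_until_license_needed * sa_ratio   # in units of one license (L)
    cost_drop_and_rebuy = 1.0                            # buy a fresh license later
    return cost_keep < cost_drop_and_rebuy

for years in (2, 3, 4, 5):
    print(years, "years ->", "keep SA" if cheaper_to_keep_sa(years) else "drop SA")
# 2 and 3 years -> keep SA; 4 years is the breakeven; beyond that, dropping SA is cheaper.
```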

Another strategy is to use the licenses to upgrade the capability of the IT organization. Ideas for using unnecessary licenses include:

  1. Upgrade SE databases to EE – By putting EE licenses on all hypervisors, SE databases can now take advantage of EE capabilities. This usually means providing Enterprise-grade failover and disaster recovery scenarios that were not economically viable before. Plus, it provides IT peace of mind knowing that it will never run into artificial limitations in terms of database functionality (max cores, specific EE database functions, etc.).
  2. Create new Disaster Recovery (DR) capabilities – SQL licensing allows for one and only one passive failover node. This has caused trouble for many IT shops that kept another passive node in another datacenter for disaster recovery and thought the license was included when it was not. Unused licenses are a great way to make a full DR scenario much more economically viable…especially when paired with retired hardware that is no longer needed.
  3. Other business cases – let’s face it, database licenses are EXPENSIVE! Since the License has already been purchased, why not consider other business scenarios that were not possible before because new licenses would have been cost prohibitive? Maybe it is a new data warehouse, or perhaps the business would like reporting/database replicas for more in-depth analysis. Better to use the license if it is useful than to allow the CAPEX investment to be lost.

What drives database performance

In order to design a hypercluster, first we must understand what drives Database performance. It is constrained by the following components:

  1. CPU – The CPU of any database server is a critical component of overall database server performance. However, it is not THE ONLY component of database performance. All database vendors price their licenses based on CPU cores. Therefore, if you want to reduce the cost of the license, the way to do this is to purchase the fastest Intel Xeon E7 chip you can with the smallest number of cores. And because CPUs are not a huge component of cost, purchasing these chips for massive consolidation hyperclusters is a no-brainer. What’s more, Intel has designed the latest E7 chips to be optimized for virtualization.
  2. Memory – Virtualization is massively memory intensive, and databases perform significantly faster when functions run “in memory”. SQL Server 2014 has recently introduced in-memory database capabilities, so maximizing memory (size and speed) on a hypercluster is a very worthwhile investment. Remember, the more you keep the CPU optimized, the more capacity you can achieve on a single hypervisor. The more capacity you can achieve, the more databases you can run on the same hardware. The more databases you run, the fewer database licenses you have to purchase or pay SA on. And remember, licenses are the most expensive cost component of a database solution.
  3. IOPS – IOPS is the rate at which data can be read from and written to storage media. Historically, storage has been done on expensive and slow spinning disks configured in a shared SAN. HBA cards are very expensive, and internal IT costs for SAN can be astronomical. IOPS on a hypercluster is absolutely critical for any high transaction throughput system. Large OLTP applications and ETL processing for data warehouses are great examples of IOPS-intensive workloads. Again, anything that can be done to dramatically increase IOPS in an architecture can drive overall database licensing costs down significantly.

CPU and Memory on a Hypercluster

So, if our goal is to maximize CPU, Memory, and IOPS, we need a hypervisor architecture that delivers on all three and drives the massive core compaction our business case requires. CPU is easy…simply maximize the per-core performance by purchasing the fastest Intel Xeon E7 chip with the smallest number of cores. Memory is also easy…maximize the speed and size of RAM on the server. Both of these areas will dramatically increase performance, and maximize core compaction.
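To give a feel for the consolidation planning involved, here is a sketch of the kind of capacity check used when packing database VMs onto a single hypercluster node. The host specs, headroom targets, and workload figures are all illustrative assumptions, not recommendations.

```python
# A sketch of a capacity check for packing database VMs onto a hypercluster node.
# The host specs, headroom targets, and workload figures are illustrative only.

HOST = {"cores": 16, "ram_gb": 1024, "iops": 500_000}                 # hypothetical E7-based host
TARGET_UTILIZATION = {"cores": 0.70, "ram_gb": 0.85, "iops": 0.70}    # leave failover headroom

def fits_on_host(workloads):
    """workloads: list of dicts with peak core, RAM, and IOPS demand per database VM."""
    for resource, limit in HOST.items():
        budget = limit * TARGET_UTILIZATION[resource]
        demand = sum(w[resource] for w in workloads)
        if demand > budget:
            return False, f"{resource} over budget: {demand} vs {budget:.0f}"
    return True, "fits"

dbs = [
    {"cores": 2, "ram_gb": 96,  "iops": 40_000},    # hypothetical OLTP database
    {"cores": 4, "ram_gb": 256, "iops": 120_000},   # hypothetical ETL / warehouse staging
    {"cores": 1, "ram_gb": 32,  "iops": 8_000},     # hypothetical departmental app
]
print(fits_on_host(dbs))   # (True, 'fits')
```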

Storage on a Hypercluster

Storage, on the other hand, is a bit more complex than CPU and memory. The first step is to get away from a shared storage solution (SAN). The goal is to implement either a Network Attached Storage (NAS) solution or a Direct Attached Storage (DAS) solution. The architecture will look something like this:

[Figure: Target database architecture]

Note that this architecture utilizes SSD storage from a company called Skyera. Skyera is pioneering Enterprise-grade storage based on off-the-shelf consumer flash memory technology. They have a 1U “pizza box” containing up to 136TB of SSD storage that can connect directly through a 10-40GigE switch. The cost of this storage is equal to the cost of spinning disk HDD storage on a per-GB basis. Because Windows Server 2012 includes SMB 3.0 features (NIC bonding specifically), an expensive Fibre Channel SAN HBA is no longer required. I am still looking for a storage vendor that implements this type of storage based on InfiniBand, as these cards are dropping dramatically in cost. In any case, the IOPS performance increase for regular database operations is dramatic.

So how do we do failover in these scenarios? For SQL Server 2012-2014, this is done using AlwaysOn functionality by creating availability groups. AlwaysOn does not require shared storage for failover. It replicates data over a high speed network link, and can be set for synchronous commit (highly available) or asynchronous commit (designed for replicas where some data lag on failover is acceptable). This architecture is well documented and easy to set up.
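As an illustration of how an operations team might keep an eye on availability group health across a hypercluster, the sketch below queries the built-in AlwaysOn DMVs. It assumes the pyodbc package and SQL Server 2012 or later, and the server name in the connection string is a placeholder.

```python
# One way an operations team might monitor availability group health across a hypercluster.
# Assumes the pyodbc package and SQL Server 2012 or later; the connection string is a placeholder.

import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=hypercluster-node-01;DATABASE=master;Trusted_Connection=yes;"
)

# Built-in DMVs report replica role and overall synchronization health per availability group.
QUERY = """
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       rs.role_desc,
       rs.synchronization_health_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = ar.replica_id
"""

with pyodbc.connect(CONN_STR) as conn:
    for ag, replica, role, health in conn.execute(QUERY):
        flag = "" if health == "HEALTHY" else "  <-- investigate"
        print(f"{ag:20} {replica:25} {role:10} {health}{flag}")
```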

However, what do I do about older versions of SQL Server? For SQL Server 2008/2008 R2, this is done by utilizing Windows Server 2012 “Storage Spaces”. This allows NAS to appear to Windows as if it is local storage. This architecture will allow 2008/2008 R2 databases to fail over just like they do against shared storage. SQL Server 2005/2000 databases need to be migrated, as these versions are no longer officially supported by Microsoft.

Failover on a Hypercluster

There has been a great deal of debate about the best approach for failing over a hypercluster. One approach is to put all active nodes on one hypervisor, and all passive nodes on a dedicated passive node hypervisor. This allows an Enterprise to not have to purchase licenses for the passive node hypervisor. However, most DBA experts I have talked to recommend a mixed environment as shown below:

[Figure: Target database failover architecture]

The benefit of this approach is that loads on each hypervisor are roughly equal. A passive node does not have the load characteristics of an active node (an active node does more work). Therefore, it would be almost impossible to predict what would happen in a failover scenario with potentially dozens of databases failing over at the same time if a hypervisor were lost. Given how safe an Enterprise database environment needs to be, this architecture strikes a nice balance between safety and cost.

Also note in this diagram the failover to Microsoft Azure for disaster recovery. This is an excellent alternative to failing over to a remote datacenter. First, licenses can be covered as part of the service…making DR an OPEX as opposed to a CAPEX expense. Second, Azure backup can be replicated to multiple Azure datacenters with just a few clicks. From a cost and ease of setup perspective, this is really a no-brainer alternative to traditional on-premise remote DC backup/DR.

Real life savings

So, what does this architecture add up to in terms of overall savings? Bottom line, it can be dramatic. One recent analysis I did showed that migrating to a hypercluster would reduce a large Enterprise IT organization’s spending over 4 years from $35 million down to $8 million. My estimate is that the core compaction rate would be 12:1. Other benefits include much better predictability from a budgeting perspective, and doing away with audit risk.

Other considerations

Many IT organizations are considering whether to move all of their database functions to the public cloud. Personally, my recommendation would be to wait as I don’t believe the cloud providers will be ready to handle large Enterprise IT Tier 1 workloads for at least the next 3 years. However, moving to a database virtual private cloud is an excellent first step in preparing an IT organization to move to the cloud.

A recent addition to both AWS and Azure is the ability to make a direct network connection to the cloud from an existing datacenter. AT&T, Level 3, and others provide this capability. It would be a worthwhile investment of time to determine if it makes sense to put the database virtual private cloud into a new datacenter that is only a few network hops away from either an AWS or Azure datacenter. This would allow the data to be controlled by IT, while allowing application VMs to run in the public cloud. Currently, this is truly the best of both worlds.

Summary

Hopefully this paper has shown the reader that moving to a Hypercluster Virtual Private Cloud architecture is worth exploring and ultimately implementing. The cost savings, the ability to be nimble/elastic, and the ability to easily budget and avoid costly audit results just make sense. However, it all starts by taking a few steps to explore hyperclusters and determine whether they are a good fit for your IT organization.

Technical feasibility of moving databases to the public cloud

Cloud…Cloud…Cloud. The hype around moving IT workloads to the cloud has reached a fever pitch in the marketplace. However, any migration to the cloud starts with the database. Once the data is moved, all of the app servers and other infrastructure can (relatively easily) follow. Just how realistic is it to move database workloads to the cloud today, circa early 2015?

If you are an ISV, fortunately the answer is “relatively easy”. Simply look at your database structures, and architect around the limitations inherent in the cloud offering. However, if you are a large Enterprise, moving the database in any significant way is much…much harder. To understand why, you first have to put yourself in the shoes of a large Enterprise CIO, and evaluate the following:

  1. Is it even technically feasible to move my database infrastructure? Will my data be secure?
  2. If it is technically feasible to move a large % of my database environment, does it make business sense?
  3. What is the timeline for doing this migration, and am I staffed to do it myself, or do I need 3rd party help?

All 3 of these questions are very hard to answer in any simplistic way, and require a significant amount of due diligence. Let’s discuss these challenges and see if there is a path to success.

 

Technical options available

Both Amazon Web Services (AWS) and Microsoft Azure have had basic abilities to host databases in the cloud for some time. This could be done either with IaaS (Infrastructure as a Service – you host the Virtual Machine operating system yourself), or with PaaS (Platform as a Service – you utilize a hosted database service where the provider manages the OS and database software on your behalf).

PaaS

For ISVs, PaaS represents a significant opportunity IF you are able to architect your application around the limitations of the database service. For example, AWS released its SimpleDB service in 2009. To leverage this service, you were required to use a proprietary API that was not SQL based, and had severe limitations in terms of functionality. Even for ISVs, these limitations meant only a very small number of use cases were even possible, and for Enterprise IT, migrating to this service was a pipe dream.

Since 2009, Amazon and Microsoft have introduced significant new database PaaS features. For example, Amazon recently released a new version of RDS (Aurora) targeting complete API compatibility with MySQL, while architecting the back end for massive scale (up to 64TB) and removing the complexities of failover and disaster recovery. Microsoft has released SQL Azure, which allows for hosting databases up to 500GB in size (today) where most traditional on premise SQL Server functions are supported. Both of these services have significantly advanced the ability to host OLTP applications in the cloud. For ISVs, Amazon’s RDS offerings are quickly approaching a point where it really no longer makes sense for an ISV to run their own datacenter. In addition to OLTP-centric offerings, both Microsoft and Amazon provide fantastic PaaS offerings around Hadoop/Big Data, and more traditional data warehousing functions such as Amazon Redshift and Microsoft cube hosting/reporting services.

However, for large Enterprise IT, leveraging PaaS in the cloud is still a LONG way off. Enterprise IT is severely limited by a few critical factors:

  1. ISVs dictate architecture – Although some Enterprise IT shops do internal application development (this obviously varies by industry), the majority simply implement ISV solutions out of the box (buy vs build). Even if an IT organization could theoretically move an ISV application database to the cloud, many ISVs do NOT provide support for virtualized databases…let alone databases hosted in the cloud. This is a risk most IT shops are unwilling to take.
  2. API limitations – Many PaaS offerings are not API compliant with on premise versions. For example, Azure SQL does not currently support CLR in stored procedures. Amazon Aurora, although supposedly MySQL API compliant, may introduce potential incompatibilities based on hosting, replication, and security scenarios (currently Aurora is in preview). Even a database that is FULLY “drag and drop” compatible will still require extensive testing…something an Enterprise IT shop would prefer to leave to an ISV to perform.
  3. Every conceivable database vendor and version – Many Enterprise IT shops run a mix of database vendor products (SQL Server, Oracle, MySQL, etc.). ISVs also often dictate what specific version of database software must be run. It is not uncommon for Enterprise IT to run database versions that are over 10 years old for fear of breaking a mission critical database. This mix of vendors and versions makes it almost impossible for Enterprise IT to use PaaS solutions on any sort of broad scale.

IaaS

Many Enterprise IT shops ask themselves, “If hosted PaaS database offerings aren’t possible, at least I can do IaaS (Infrastructure as a Service)…right?”. Although it is true that hosting your own VM with the correct software vendor/version installed is absolutely doable, there are significant performance problems that get in the way of any sort of broad scale adoption. These performance challenges fall into the following categories:

  1. Network performance and latency – no Enterprise application is an island. Most applications deployed in the enterprise have integration requirements with other applications or databases. The integration may be as simple as connecting to the database for data warehousing purposes, or it may involve more real-time connectivity/application dependencies. Therefore, a low latency/high performance connection is required for other applications to integrate with the database. Up until late 2014, connectivity to the public cloud was limited to VPN connections that are unreliable, slow, and have unpredictable latency. With the release of Amazon’s DirectConnect and Azure’s ExpressRoute capabilities, it is now possible to interconnect the public cloud back to existing Enterprise IT datacenters with speeds up to 10GigE (equivalent to LAN speeds). However, even a high speed connection can still be hampered by latency issues if there are many network hops in between. Care must be taken to ensure ExpressRoute and AWS DirectConnect don’t experience serious latency issues. Today, many Enterprise IT shops have yet to put in high speed interconnects to public cloud providers. This should accelerate in 2015 as large Enterprise IT shops put public cloud integration into their longer term Enterprise Architecture roadmaps.
  2. CPU/high capacity VMs – For an Enterprise IT organization to consider moving to the cloud, there must be VMs capable of handling large scale workloads. It wasn’t until late 2014 that VM instances existed that were even capable of handling medium-sized database workloads in IaaS. For example, AWS EC2 instances are now available optimized for max CPU, memory, or price/performance. However, this is very recent. In addition, SSD-based VMs are also just now coming online.
  3. IOPS (Input/Output Per Second) – IOPS is perhaps the hardest hurdle to overcome in the public cloud when hosting databases. Databases have tremendous sensitivity to IOPS, and it wasn’t until late 2014 that Amazon enabled “provisioned IOPS” where you can dial in the amount of IO necessary for any database solution. Microsoft does not yet have provisioned IOPS, but should have it available in early 2015.

Because of these (and other) limitations, Enterprise IT hosting databases in IaaS simply wasn’t realistic in 2014. However, many of these limitations are being removed in 2015…opening up the ability for Enterprise IT to consider hosting a sizable percentage of Enterprise applications in the cloud. Tier 1 database workloads will likely require more time before they can be moved (likely 2016-2017), but Tier 2-3 workloads should be able to move in 2015.

Security in the public cloud

Security in the public cloud has been a large topic of discussion, but in my experience, it tends to be a tempest in a teapot. Although public cloud providers do represent a very large, juicy target to any would-be hackers, the reality is that these cloud providers are very aware of this threat, and architect around this reality. They hire the best and brightest security experts available today, and have a large business reason to never allow an intrusion to occur. Any breach generates a large news event that can kill future business.

Enterprise IT, on the other hand, is an entirely different story. One only needs to look at major intrusions at Target, Home Depot, Sony, Sands Hotels, and many others to realize that Enterprise IT just isn’t up to the challenge of protecting their infrastructure from debilitating intrusions. Having been intimately involved with many IT organizations, I have observed security infrastructure that is mediocre at best, and downright frightening at its worst. Enterprise IT intrusions/outages seldom make the news unless they are so catastrophic that the entire organization is put at risk (e.g. Sony).

Although there are some legitimate national security cases to be made for avoiding the public cloud, the vast majority of organizations do not have a security use case stopping them from moving. Why, then, do IT organizations squash cloud migrations for security reasons? The answer is always simple…follow the money. IT professionals are smart individuals and do look out for their own self-interest. Nobody wants to have their job outsourced, or to lose their job. IT professionals have a financial interest in throwing up roadblocks. In the sales world, we call this FUD (Fear, Uncertainty, and Doubt). My recommendation is that any CFO/CIO considering cloud bring in an independent 3rd party to separate legitimate security concerns from FUD.

My recommendations

PaaS

I have a very hard time technically recommending a database PaaS solution to Enterprise IT over the next 3 years. The services simply aren’t mature enough yet (nor will they be in the foreseeable future) to handle the complexities of a broad scale Enterprise database infrastructure migration. However, I absolutely believe ISVs should have a public cloud PaaS strategy…it just makes sense from a cost and complexity standpoint. The main argument I hear from ISVs against moving to PaaS is vendor lock-in. But with AWS Aurora running MySQL, I simply do not see vendor lock-in being a significant problem. Vendor lock-in to me has always been a bit of a red herring.

IaaS

IaaS is absolutely coming of age in 2015 for Enterprise IT from a database perspective. However, it is in its technical infancy, and Enterprise IT should start planning for the day when databases can be deployed to the cloud thanks to the availability of provisioned IOPS and high speed network interconnects. That said, I would not recommend moving to the cloud as the first step for most IT organizations. For 2015, it should be a future research project only, with first production deployments scheduled in 2016-2017.

Internal database Virtual Private Cloud (VPC)

The first step in any cloud migration should be creating an internal database virtual private cloud (VPC). In my research, an internal VPC can be less expensive than a public cloud provider given the fact that Enterprise IT already has sunk investment in database licenses. I believe it will be a few years before public cloud providers can make database deployment less expensive than an on-premise VPC. Please see my upcoming post on what I believe a future cloud migration strategy should look like.

 

Enterprise IT infrastructure migration to public cloud

So, you have read all of the IT trade press discussions about moving to the cloud, and are now considering taking that all important next step of investigating the pros and cons of an Enterprise IT infrastructure migration to the cloud. After all, who wouldn’t want to shave 20-30% off of an existing IT budget AND increase IT responsiveness to the business at the same time? Any CIO/CFO is going to start with the business case: does moving to the cloud make sense for their business?

Enterprise IT organizations approach public cloud adoption from different points in IT maturity. IT organizations that have not done ANY virtualization, consolidation, or application rationalization have much further to go to prepare to jump to the public cloud than those who have done these activities. Therefore, the business case for any individual IT organization will depend largely on their IT maturity.

So how do we begin answering the question: “Does it make business sense to move my server infrastructure/applications to the cloud?” I find it helps to start with the following framework:

  1. Budgeting – How do I budget for projects, and how will that differ with a public cloud approach?
  2. Comparing on premise CAPEX vs cloud OPEX costs – Cloud is based on an OPEX expense model, while most Enterprise IT shops today function on a CAPEX model. How do I compare apples to apples?
  3. Switching costs – How do I determine the amount of work/expense I need to spend to get to a cloud model?

Budgeting

Almost all IT shops I have worked with budget based on projects they choose to take on every year. If the business (FLOSHIM – Finance, Legal, Operations, Sales, HR, IT, and Marketing) requires a new IT system be deployed, a project plan and budget are determined for that project. Hardware, software, and people/services costs are assessed, the project is kicked off, and ultimately completed per the plan. A cloud migration creates a real problem for most IT shops, as the original IT project never contemplated a large scale migration to the cloud, or the costs associated with that move. Therefore, a new non-business-aligned project must be spun up and business justified.

A new cloud migration project looks similar to an outsourcing contract. But what makes it different is the variable cost component of cloud. Most IT outsourcing contracts have a fixed spend agreement for managing existing infrastructure components, with variable costs for adding new capacity based on business growth. Although cloud is similar to this, there are some critically important differences.

Business impact of elasticity

Cloud services are priced per hour (in most cases). The short time window means cloud services are “elastic” (services spun up/spun down as needed). There are enormous benefits to elasticity from a business perspective:

  1. Time to solution – a cloud infrastructure can be spun up in hours vs an on premise solution being spun up in weeks or months.
  2. No stranded equipment – if a project is canceled unexpectedly, or if the solution is no longer needed, the service is simply turned off and the billing stops.
  3. New business problems can be solved – It is not uncommon for businesses now to come into the cloud, spin up thousands of VMs in the morning, and turn them off in the afternoon. This creates new business opportunities that would never be economically feasible if done on premise.

These 3 reasons alone are often enough to spur an IT organization to consider a broad scale migration to the public cloud.
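The burst-workload economics behind the third point above are easy to illustrate with per-hour pricing. The hourly rate below is hypothetical; real pricing varies by provider, VM size, and region.

```python
# Illustration of burst-workload economics. The hourly rate is hypothetical.

VM_HOURLY_RATE = 0.50   # hypothetical $/hour for a mid-size VM

def burst_cost(vm_count, hours):
    """Cost of spinning up vm_count VMs for `hours` and then turning them off."""
    return vm_count * hours * VM_HOURLY_RATE

# 2,000 VMs for a 4-hour batch run, once a month:
monthly = burst_cost(2_000, 4)
print(f"Burst run: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
# -> Burst run: $4,000/month, $48,000/year -- capacity that would cost far more to own outright.
```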

Challenges to budgeting with cloud

As great as elasticity is, it does have a downside. From a budgeting perspective, cloud is hard. Cloud is almost infinitely variable. Servers can be turned on in the morning, shut off at night, increased during a major product release, or turned off when a business event forces an IT change. Now imagine trying to predict what spend might look like in any given year. Predicting cloud spend can feel like an exercise in futility and risk.

In addition, companies like Microsoft offer prepaid service spending plans (Azure Monetary Commit) that allow you to buy a large chunk of service credits that are lost if not consumed within a specific time period (usually 1 year). If a project comes along that consumes more of the service credits than were purchased, how is the overspend allocated back to the business? How does the IT finance organization ensure that only the right amount of credit is purchased, while also ensuring it is all consumed by the end of the year so no credits are wasted?
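Answering those questions requires the IT finance organization to track burn rate against the commit. A minimal sketch of that projection, with made-up figures, might look like this:

```python
# A sketch of burn-rate tracking against a prepaid commit (e.g., Azure Monetary Commit).
# All figures are made up for illustration.

def commit_status(commit_total, spend_by_month):
    """Project whether a 12-month prepaid commit will be under- or over-consumed."""
    months_elapsed = len(spend_by_month)
    spent = sum(spend_by_month)
    run_rate = spent / months_elapsed
    projected = run_rate * 12
    return {
        "spent_to_date": spent,
        "projected_year_end": projected,
        "variance_vs_commit": projected - commit_total,   # positive = overspend to charge back
    }

status = commit_status(commit_total=600_000, spend_by_month=[40_000, 45_000, 62_000, 58_000])
print(status)
# {'spent_to_date': 205000, 'projected_year_end': 615000.0, 'variance_vs_commit': 15000.0}
```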

Clearly, any cloud migration is going to require some significant changes to how budgeting and financial management of the cloud service is done. The key is to develop this muscle over time, and to be careful not to overcommit to cloud until the organization is comfortable monitoring these services at scale. I have heard of many IT organizations that made a cloud commitment for an application, only to receive a large monthly bill they didn’t expect, and then decided to bring the solution on-premise after the fact because of the fear of runaway, unpredictable costs. Make no mistake, the success or failure of a cloud migration will rest significantly on the success of the IT finance organization.

CAPEX vs OPEX

As hard as cloud budgeting is for Enterprise IT, the issue of CAPEX vs OPEX is even harder. However, before getting into the complexities of managing this for a typical IT organization, let’s first have a philosophical discussion as to why OPEX ultimately is so much better for an IT organization than CAPEX.

The tyranny of CAPEX

Per the budgeting discussion above, CAPEX is probably one of the most wasteful aspects of how Enterprise IT is run today. I find most IT organizations put MASSIVE effort into defining the best IT solution from the beginning…performing extensive costing/ROI analysis. But what most IT shops are almost universally bad at is tracking the financial usage of those assets after they are deployed. Below are a few real life horror stories I have heard over the years related to this topic:

  1. Purchased equipment, but not used – One IT organization I worked with once had a business project that was approved, and as a result, IT went ahead and purchased over 100 servers with over $1 million in CAPEX spend. Unfortunately, just prior to the project going live, the business pulled the plug. It was too late to send the servers back, and all of the servers had to be “repurposed”. This meant accelerating existing projects, hunting for older servers to replace (that still had useful life left), and even dreaming up new projects that would never have been financially viable in the first place. Nobody ever knew exactly how those servers were repurposed, or what the financial benefit to the organization was for utilizing those servers. In fact, years later, after a software audit led to an overall asset management audit, dozens of brand new servers were found in closets still in the original boxes…representing hundreds of thousands of dollars in waste.
  2. Decommissioned servers that “came back to life” – Another IT organization I worked with had a situation where a software audit led to over 50 servers being “discovered” that the IT organization had thought were decommissioned. The IT staff left the decommissioned servers in the rack in case they could be used again in the future. A power spike in the datacenter caused these “decommissioned servers” to turn back on again. They were then subsequently caught in a software audit scan. Even putting the unlicensed software costs aside, these servers had been running over a year…each one chewing up power. Given a server can consume anywhere from $600-800/year of electricity, the estimated power cost alone to the organization was around $30,000. Of course, this ignores the fact that these decommissioned servers had been forgotten, and not sold, repurposed, or in any way utilized.
  3. 1,500 “Lost” servers – Another IT organization I worked with underwent a software audit where the 3rd party brought in to scan the organization found 1,500 servers that were completely unaccounted for in their asset management systems. Unfortunately, for some inexplicable reason, WMI (Windows Management Instrumentation – a Windows Server management API/utility) was turned off on these servers. As a result, these servers were unmonitored and unmanaged…and I suspected a potential security breach. In addition, there was no understanding of whether these servers were being put to their best financial use.

Of course, these are all horror stories, and I am not saying every IT organization is this poorly managed. But this does illustrate an important concept. Jack Welch once said “you can’t manage what you can’t measure”. Each of these scenarios represented a case where sizable CAPEX financial assets were not managed/measured after they were deployed, and as a result, wasteful misuse of assets happened.

So how does OPEX/cloud fix this problem? Very simply…cloud by its very nature is managed AND measured. Every month, the organization gets a bill for what it consumes. When IT AND the business have visibility into what is being spent, action can be taken by the organization to correct misallocation of resources. In other words, with a switch to cloud, IT and the business are held accountable for the usage of IT assets. Here is another story of this principle being applied.

An IT organization I worked with decided to embark on a big data initiative. This group decided to go with a Hadoop solution they put up on AWS IaaS. Because of a miscalculation of network ingress/egress fees, the expected bill of $30,000 in the first month was instead $300,000. Of course, the IT organization and business didn’t sign up for this level of spending, and immediately decided to suspend the project, and move the solution in house by procuring their own servers.

You may be asking…”how is this a good story for the cloud?” The fact that they received a larger bill than expected meant the organization immediately took steps to rectify the situation because the OPEX spend was VISIBLE. What is tragic in this story is that by moving the solution on premise, the financial performance of the project immediately became unmeasured and unknowable. Maybe the on-premise solution made financial sense, and maybe it didn’t. Nobody will ever know whether the solution deployed made financial sense AFTER it was deployed.

What is tragic here is that the tyranny of CAPEX is a silent tyranny. It is a story that goes untold, and is often hidden in the bottom line of a company. Let’s assume for a minute that public cloud is more expensive than an on-premise private cloud (which it isn’t…it is indeed less). Even if the cost is more for cloud, I would still insist on pushing for public cloud simply because it makes the finances of the solution VISIBLE to all parties involved.

Budgeting switch to OPEX

Today, almost all IT organizations budget according to CAPEX. Very little of the IT budget is OPEX. Some industries like Utilities and Telecommunications LOVE CAPEX. They are very comfortable with the concept of CAPEX, and are masters at managing depreciation cycles. In the case of public utilities, CAPEX investments are simply passed along to rate payers. However, for most organizations, OPEX is preferred…making cloud a slam dunk if this is the case. However, transitioning to a cloud model does require a fundamental rethinking of budgeting for IT, and sponsorship from senior leaders in the finance organization.

CAPEX and OPEX comparisons

So now that the business case has been made for OPEX, how do we determine which is cheaper…cloud or on premise? Traditional IT is CAPEX, and cloud is OPEX. In most cases, the OPEX cost of the target environment in a migration is pretty easy to calculate. If I have 50 servers and/or VMs to migrate, I know I need 50 VMs of a certain capability spun up in the cloud. I simply take the hourly rate, make adjustments, and multiply it out to a year, and I now have my annual OPEX budget for the cloud environment. Prices are public, and every cloud provider has easy to use calculators to come up with these costs.
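Expressed as code, that target-environment calculation is trivial. The fleet mix and hourly rates below are made up; in practice you would pull rates from the provider’s price list or calculator.

```python
# Annual cloud OPEX for a fleet of VMs. The fleet mix and hourly rates are made up.

HOURS_PER_YEAR = 24 * 365   # 8,760 -- adjust downward for VMs that are shut off nightly

# Hypothetical fleet to migrate: (VM size, count, $/hour)
FLEET = [
    ("small",  30, 0.10),
    ("medium", 15, 0.40),
    ("large",   5, 1.20),
]

annual_opex = sum(count * rate * HOURS_PER_YEAR for _, count, rate in FLEET)
print(f"Annual cloud OPEX for {sum(c for _, c, _ in FLEET)} VMs: ${annual_opex:,.0f}")
# -> Annual cloud OPEX for 50 VMs: $131,400
```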

However, it is NOT so simple to calculate existing internal IT costs. In fact, I have found most IT organizations cannot tell you what it costs them to operate a server in any given year. This is because of the following challenges:

  1. Amortization schedules – Most IT equipment is typically depreciated over a 3-5 year cycle. However, most IT organizations will still continue to operate fully depreciated equipment. In some cases, I have seen servers that are well over 10 years old still in production. How do you calculate a cost for 10 year old hardware that was fully depreciated after 3 years? Cloud OPEX continues to bill year after year without ever getting a “free period” after the assets are fully depreciated.
  2. Licensing – This is an area where most IT organizations are not as efficient as they could be. Most server software licensing is done by core (a function of CPU processing). This is where the lack of visibility into CAPEX asset utilization really hurts. Newer hardware has significantly faster and more efficient cores. In the case of databases, performance is actually much more dependent on IOPS than CPUs (SSDs are your friend). Therefore, licensing costs (often the most expensive part of a solution) can be dramatically reduced by simply migrating older hardware to newer hardware. Most cloud providers include licensing as part of their service offering. How do I use my old licenses in a cloud model? Will the software vendor give me credit for my past sunk license investment? It takes a bit of work to figure out these assumptions.
  3. Storage – The most staggering difference when comparing the cost of cloud to on premise is in the area of storage. On premise storage is frighteningly expensive. Cloud storage more and more is being given away for free as part of a larger service. But as many storage experts will tell you, 1 GB of storage in one situation can be a dramatically different cost in another. For example, a 1 TB hard drive at Costco is around $200. 1TB of fully redundant Enterprise storage can run into the thousands of dollars. Cloud storage can be severely limited by IOPS, and egress fees can also change the dynamic, as networking in an IT shop is usually treated as free. Again, careful assumptions must be made to try and get to apples-to-apples comparisons.

Based on the models I have come up with, IaaS servers typically cost the same or slightly less than on-premise for typical application VMs. Storage in the cloud is almost always significantly cheaper than on-premise. However, if licensing is handled correctly, and if the right database architecture is followed, hosting databases in the cloud is typically MUCH more expensive than hosting them on premise.
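For a typical application VM, the apples-to-apples comparison behind these models boils down to something like the sketch below. Every input is an assumption to be replaced with your own “true cost” data; the point is the structure of the comparison, not the numbers.

```python
# A sketch of the apples-to-apples comparison for a typical application VM. Every input is
# an assumption (depreciation, power, space, licensing, labor) to be replaced with real data.

def on_prem_annual_cost(purchase_price, depreciation_years, power, space, licensing, labor):
    """Annualized on-premise cost: amortized CAPEX plus the recurring OPEX items."""
    return purchase_price / depreciation_years + power + space + licensing + labor

def cloud_annual_cost(hourly_rate, hours_per_year, storage_gb, storage_rate, egress_gb, egress_rate):
    """Cloud OPEX: VM hours plus monthly storage and network egress (license included in the rate)."""
    return hourly_rate * hours_per_year + storage_gb * storage_rate * 12 + egress_gb * egress_rate * 12

on_prem = on_prem_annual_cost(purchase_price=12_000, depreciation_years=4,
                              power=700, space=500, licensing=3_000, labor=1_500)
cloud = cloud_annual_cost(hourly_rate=0.85, hours_per_year=8_760,
                          storage_gb=500, storage_rate=0.05, egress_gb=100, egress_rate=0.08)
print(f"on-prem: ${on_prem:,.0f}/yr   cloud: ${cloud:,.0f}/yr")
# -> on-prem: $8,700/yr   cloud: $7,842/yr (under these assumptions, slightly less in the cloud)
```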

Switching costs

Now we get to the fun part of this exercise. You have now determined that the cost of public cloud is lower than on premise, and you understand the value of elasticity and of OPEX vs CAPEX. So what is this switch going to cost your organization? There are many variables that will affect this:

  1. Timeline to switch – Moving Enterprise IT assets to the cloud is not a trivial undertaking. These projects will be measured in years…not months. Therefore, determining a correct cadence for your organization is critical. Obviously the faster the migration, the more cost up front.
  2. Your IT maturity – If you already have a solid Enterprise Architecture team, if you have already completed a great deal of virtualization, and if you have a mature relationship with the business…moving to the cloud will be much easier.
  3. Business imperative to migrate – if you have a business problem you are trying to solve that the migration project can “piggyback” on top of, this will make things much easier. Often this can be an M&A event, a large strategic project, or a mandate to modernize.

Labor involved

Of course, the faster the migration, the more likely it will be that you will need to bring in 3rd party assistance. Most IT organizations have designed their staffing requirements around “keeping the lights on”, and are not staffed for large scale migrations. In addition, migrating to the cloud often represents a skillset that existing IT doesn’t have today. Therefore, bringing in a 3rd party who has done large scale outsourcing in the past will be the logical choice to help in this endeavor.

As for roles needed for various phases of the project, you will need:

  1. Inventory analysis – This analysis will go far beyond the kinds of information currently stored in asset management systems. Cloud migrations require not only an exhaustive analysis of servers (server configurations, performance metrics, etc), but also applications deployed and their dependencies. There are many 3rd parties with specialized tools that are designed to tease out these important details with the goal of determining “low hanging fruit” for migration.
  2. Business and technical Analyst – Once the Inventory analysis is complete, there is now a need for a business and technical analyst to do the following:
    1. Define the business case for migration and present to management
    2. Define the target environment
    3. Map the existing environment to the target environment
    4. Prioritize the infrastructure that should move first.
  3. Migration technical resources – these are the individuals who will actually move the infrastructure over. These include:
    1. Network engineer – Before anything can be moved, the networking must be designed and configured per the technical plan.
    2. Virtualization engineer – This role handles any P to V (Physical to Virtual) work needed to be done to move the VM to the cloud. Often this is a very specialized skill that may need to be externally sourced.
    3. Project management – I can’t overstate how important this role is. Migrating production systems requires extensive collaboration and scheduling with the business.
    4. DBA – having an excellent database administrator involved full time in the migration is critical for success. Whether the database resides in a Virtual Private Cloud hosted on premise (but connected to the public cloud over a WAN infrastructure) or is actually moved to the cloud, this will be a very sensitive move. This must be a senior-level DBA resource, which many companies do not currently employ today.
  4. Cloud Operations Team – these individuals should be staffed for ongoing deployment support. As we have discussed earlier, the skills required to manage a public cloud environment are very different from those required to manage an on premise infrastructure. The cloud operations team must be very well versed in virtualization, and understand the limitations of the cloud services you are consuming. These can be the same individuals who currently support the on-premise environment, but they will require training and perhaps initial oversight by a 3rd party until the transition is complete.

Timelines

An Enterprise IT operation can be incredibly complex, and of course is mission critical to any business. Because of these complexities and the risks involved, many IT organizations are rightfully concerned about considering a move to the cloud. However, seldom has there been an opportunity like a migration to the cloud to shave potentially tens of millions of dollars off of an IT budget and create the ability to transform an entire organization. It is worth the risk. However, how can these risks be mitigated? Quite simply…time.

As with any large scale effort, a cloud migration can and should take years with careful planning. If there is anything agile development has taught us in the software development world…it is the importance of shorter timelines and well-defined, smaller deliverables. A cloud migration should be handled in a similar way. Cloud migration projects should be very small initially to prove out the business case for migration, AND to allow IT organizational muscle to be built up and tested over time.

CAPEX and timelines

As discussed prior, CAPEX can make a business case a real challenge in comparison to OPEX. However, CAPEX can play a huge role in defining a timeline for migration. Most server assets are depreciated over 3-5 years. In almost every single case, a move to cloud cannot be cost justified on assets just starting out in their depreciation cycle. A brand new server that has just been purchased needs to be “used up” first. The best candidates for migration are almost always servers at the end of their useful life. In addition, in almost every case a new project should go to the cloud by default as the cloud will be less expensive over the term of the project.

If an Enterprise IT organization takes this approach, a cloud migration can be accomplished entirely within a 3-5 year period if new projects are put in the cloud, and old depreciated servers are migrated to the cloud. This will cause the least organizational disruption, and allow for IT to develop “organizational muscle” around cloud.

Summary

In summary, there is HUGE value for any Enterprise IT organization in moving their infrastructure to the public cloud because of the cost savings, elasticity, and OPEX vs CAPEX benefits. However, proving this can be difficult and time consuming because of a lack of detailed inventory systems, the complexity of CAPEX vs OPEX comparisons, and the challenge of budgeting for something as inherently unpredictable as elastic cloud consumption. But it can and should be undertaken given the enormous benefits to any Enterprise IT organization. Additionally, a cloud project should span multiple years, and focus first on new projects and retiring old servers instead of a “migrate the whole enchilada” approach.