One problem we recently faced: we run Xen on remote servers.
We found out that our 3-year-old storage hardware had some firmware conflict with Xen (or something like that).
It took 2 weeks to diagnose plus a plane trip out to fix, and even then we weren't sure we had got it.
After this, we moved to Rackspace. There is no way that a sysadmin running one rack of equipment is in the same position as Rackspace to diagnose and fix these types of issues.
I imagine if Rackspace/Amazon had this type of problem, they would:
1. Have 24 hour manned data centers
2. Have the input of a team of 20 engineers working on the problem
3. Have a Dell/IBM/HP engineer at the data center within the hour
4. Have lots of spares - no calling in new hardware
Having been a (part-time) sysadmin for years, I've found the worst problems are those that are hardware related and difficult to diagnose.
(The other advantages of cloud, e.g. scalability, are well documented. But I think my ideal setup would be dedicated, non-virtualized databases with cloud front ends.)
I've been on the other end of this spectrum too. Working with a very high profile company (that I unfortunately can't name) we were paying what I'm going to call an astronomical rate for managed cloud operations. Dedicated data center admins, systems administrators, the works.
We would have a problem with something like IO throughput on database instances. We would open a ticket, wait a little while, get a response. If we claimed it was hardware related (because we couldn't tell from our host's perspective) we got the response "it's not the hardware, everything seems fine." This would go on for days. Then, after we had proved that the numbers were erratic or the instances unresponsive, we would eventually get a more helpful response. Maybe.
It's a very cold splash of water in the face when you realize that your hosting company, cloud provider, whatever, is not in business to hold your hand. If you need more than their minimum level of support or require human interaction, you will be sadly disappointed. These companies maintain their margins by automating hardware provisioning, homogenizing infrastructure, and making it as turn-key as possible. Which is all well and good until you need something their infrastructure doesn't provide for. Like switch bandwidth. Like larger instances. Like all your VMs in a local rack.
You will have hardware problems in the cloud too and they will not be obvious. You will need the same degree of monitoring software you have anywhere else in any other environment.
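For what it's worth, here is a minimal sketch of the kind of check that helps when you have to prove to a provider that the numbers are erratic. It assumes you already collect periodic throughput samples in MB/s from whatever monitoring you run. The function names, the stall threshold, and the variation cutoff are all made up for illustration and would need tuning for a real workload.

```python
from statistics import mean, pstdev


def summarize_throughput(samples_mb_s, stall_threshold_mb_s=1.0):
    """Summarize throughput samples (MB/s) gathered by your own monitoring.

    stall_threshold_mb_s is an assumed cutoff: a sample below it is
    counted as a stall (near-unresponsive storage).
    """
    if not samples_mb_s:
        raise ValueError("need at least one sample")
    avg = mean(samples_mb_s)
    spread = pstdev(samples_mb_s)
    # Coefficient of variation: spread relative to the mean.
    cv = spread / avg if avg > 0 else float("inf")
    stalls = sum(1 for s in samples_mb_s if s < stall_threshold_mb_s)
    return {
        "mean": avg,
        "min": min(samples_mb_s),
        "max": max(samples_mb_s),
        "cv": cv,
        "stalls": stalls,
    }


def looks_erratic(summary, max_cv=0.5, max_stalls=0):
    """Flag a window as erratic; both limits are assumptions, not standards."""
    return summary["cv"] > max_cv or summary["stalls"] > max_stalls


if __name__ == "__main__":
    steady = [100, 102, 98, 101, 99]
    bumpy = [100, 5, 0.2, 95, 3]
    for name, window in (("steady", steady), ("bumpy", bumpy)):
        report = summarize_throughput(window)
        print(name, report, "erratic:", looks_erratic(report))
```

Attaching a summary like this (mean, min, max, variation, stall count per time window) to a ticket is harder to wave away than "the database feels slow."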
"You will have hardware problems in the cloud too and they will not be obvious."
I often hear "in the cloud, one doesn't have to worry about the hardware" (or the network). My usual retort is that one certainly does have to worry, since the same problems exist. The difference is that one can't do anything about them when (or before) they occur, unlike with owned hardware.