How Lanyrd moved from AWS to SoftLayer and MySQL to PostgreSQL with no downtime (lanyrd.com)
154 points by simonw on Nov 13, 2012 | 58 comments



Not exactly no downtime... but all site content stayed available throughout the move thanks to our handy read-only mode. We added this early in the life of the site and it's proved extremely useful for a couple of major moves. It's a pretty simple implementation (based on our site-wide feature flags) - once in read-only mode all login cookies are ignored (so everyone gets the signed-out experience) and the sign-in button is disabled. We also return an error for any POST requests just in case someone has already loaded a page with a form before we turned read-only mode on.
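
Here's a rough sketch of the shape of it as Django middleware (hypothetical names, not our actual code - the real version hangs off our feature flag system):

    from django.http import HttpResponse

    def read_only_mode_enabled():
        # Stand-in for a real feature-flag lookup (ours is runtime-togglable).
        return True

    class ReadOnlyMiddleware(object):
        # List this before SessionMiddleware so the cookie is gone
        # before sessions get loaded.
        def process_request(self, request):
            if not read_only_mode_enabled():
                return None
            # Ignore login cookies: everyone gets the signed-out experience.
            request.COOKIES.pop('sessionid', None)
            # Reject writes, in case a form was loaded before the flag flipped.
            if request.method not in ('GET', 'HEAD', 'OPTIONS'):
                return HttpResponse('Site is in read-only mode', status=503)
            return None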

The moment we've gone into read-only mode we can create a brand new instance of the site from a copy of the (now frozen) database, make any necessary changes to that, then switch traffic over to the new instance once we've tested that everything is working properly. If the new version of the site has a problem we can turn read-only mode off and continue to run on our original database.

If you're running a more content-oriented site it's well worth taking the time to set this kind of thing up - it's not too hard to do, and it gives you an enormous amount of flexibility for maintenance further down the line.


This is excellent advice! I can't tell you the number of times I've worked on projects where I wished just such a "read-only" mode existed. As an added benefit, if you have a read-only version of your site, I know that Akamai at least (and possibly their competitors) has a service where, even if your origin server disappears, they will continue to serve the latest version of your read-only site until the origin servers reappear.


We have Varnish at the front of our stack with a one-minute cache timeout for users without cookies - but thinking about it, there's no reason we couldn't bump that timeout up to something much higher for the duration of read-only mode (though since the servers aren't having to deal with writes, they don't really need an extra performance boost).


Check out Varnish's configurable grace period.

Setting a grace period basically instructs Varnish to continue serving content it considers "stale" (up to a point) until it is able to refresh it.


Going one step further: if you really do have to keep accepting writes, what you do is modify your application code to write to both databases simultaneously instead of one (i.e. "fork" your writes).

This is where you're super glad you did the right thing (did you?) and your DB layer is abstracted and your SQL is standards-compliant - it will save you hours of headaches. In the meantime, you keep reads pointed at the old DB.

Then, while the site is running, you migrate all the pre-fork data to the new database. Finally, after validating that you got it right, you flip the switch again to have reads come from the new DB.

But you're not done yet. Keep the writes forking for a while, until you're sure everything went okay. If not, you can flip back to the pre-migration setup instantly with zero data loss.

Presumably, all this flipping between dbs is done through some kind of flag that can be modified at runtime, so you don't have to do more code deployments, because you're dancing with the devil here already.

In any case, with proper capacity planning and good code, this is also doable, but not super required unless writes are mission critical for your customers 24/7.

But the last thing you want to do is flip the switch, pray, and lose customer data.
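
The DB-layer end of it can be as simple as routing every read and write through two functions gated on those flags - a sketch with made-up names, and flags shown as constants where a real setup would look them up at runtime:

    import MySQLdb, psycopg2

    old_conn = MySQLdb.connect(db='app')        # current primary
    new_conn = psycopg2.connect(dbname='app')   # migration target

    FORK_WRITES = True      # in practice: runtime flags, not constants
    READ_FROM_NEW = False

    def execute_write(sql, params):
        # Both drivers use %s placeholders, so standards-compliant
        # SQL and params work against either database.
        old_conn.cursor().execute(sql, params)
        old_conn.commit()
        if FORK_WRITES:
            new_conn.cursor().execute(sql, params)
            new_conn.commit()

    def execute_read(sql, params):
        cur = (new_conn if READ_FROM_NEW else old_conn).cursor()
        cur.execute(sql, params)
        return cur.fetchall()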


Doing simultaneous writes to two databases at the application level is very fragile and has many points of failure - every place in the application code that performs a write. Distributed transactions probably aren't in place either, since this is a one-off thing, so there's a risk of inconsistent updates where one write goes through and the other fails. Recovery from that would be nasty, since the failure can be on either side.

A better approach would be to architect it so that all the updates are written to a persistent message queue. An updater can then read the update messages from the queue and apply them to one or both databases. The updates are well ordered with respect to time for both databases, the potential failure scope is limited to one place, and it's easier to work through the recovery cases.

A persistent queue has other benefits too, in scalability and in making saves appear fast to users.
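
A toy sketch of the shape, using an append-only file as the durable queue (a real system would use a proper message broker; names are made up):

    import json, os

    log = open('writes.log', 'a')

    def enqueue_write(sql, params):
        # Durably record the update before acknowledging the save.
        log.write(json.dumps({'sql': sql, 'params': params}) + '\n')
        log.flush()
        os.fsync(log.fileno())

    def replay(cursors):
        # The updater applies every update, in order, to each database,
        # so both stay well ordered with respect to time.
        for line in open('writes.log'):
            msg = json.loads(line)
            for cur in cursors:
                cur.execute(msg['sql'], msg['params'])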


Sort of. It's a tradeoff. First off, I was suggesting doing the dual writes in the code that performs the write in your DB layer, not with fixes scattered all over the application code. That reduces the code changes by a ton.

Second, I guess it's true that there are some consistency issues, since you should commit one transaction only if the other succeeds. But that's a much smaller risk, I think, than with your approach.

With your method you're writing far more code just for the updates and adding a whole new layer (both failure modes in their own right), and then just moving the transaction-inconsistency problem to the updater instead (at the point where it writes to both databases). The benefit is that you can "replay" the updates from the queue, but the truth is that not all changes are idempotent, so replaying some queries may fuck things up unless you also store DB state (e.g. inserts, increments), which is a mess in itself.

So, I guess, choose your poison :)
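
For what it's worth, the commit ordering I had in mind looks something like this (a sketch with made-up connections; note there's still a small window where the second commit fails after the first succeeds):

    import MySQLdb, psycopg2

    old_conn = MySQLdb.connect(db='app')        # authoritative
    new_conn = psycopg2.connect(dbname='app')   # migration target

    def dual_write(sql, params):
        old_conn.cursor().execute(sql, params)
        new_conn.cursor().execute(sql, params)
        # Commit the authoritative DB first; only then commit the copy.
        old_conn.commit()
        try:
            new_conn.commit()
        except Exception:
            # The new DB is now behind - record the statement for
            # repair rather than failing the user's request.
            open('divergence.log', 'a').write(repr((sql, params)) + '\n')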


Great tip about read-only mode, Simon! This should help us plan our migrations with zero downtime.

I was wondering about disabling writes at the database level - for the databases critical to the application - to support read-only mode, with application logic to gracefully handle the write failures.


I thought about disabling UPDATE/INSERT statements at the database cursor level (we have our own custom Django cursor wrapper class), but we're confident it's not necessary for our application - we don't have any writes we can't afford to lose that aren't triggered by signed-in users hitting our main application database.
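
If we had gone that route, the wrapper could have looked something like this (a hypothetical sketch, not an actual class from our codebase):

    class DatabaseReadOnly(Exception):
        pass

    class ReadOnlyCursorWrapper(object):
        # Wraps a DB-API cursor and refuses write statements while
        # read-only mode is on.
        WRITE_VERBS = ('INSERT', 'UPDATE', 'DELETE', 'REPLACE')

        def __init__(self, cursor):
            self.cursor = cursor

        def execute(self, sql, params=None):
            if sql.lstrip().upper().startswith(self.WRITE_VERBS):
                raise DatabaseReadOnly(sql)
            return self.cursor.execute(sql, params)

        def __getattr__(self, attr):
            # Delegate everything else to the real cursor.
            return getattr(self.cursor, attr)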


Thanks for the write-up - interesting read.

Out of interest, how do you catch and block all POST requests when the site's in read-only mode without duplicating code? Not sure if you use class-based views (CBVs) at Lanyrd - if so, do you use a common mixin? If not, how?


We have some common Django middleware used for every dynamic page on the site which deals with that (among other things). We also strip cookies at the Varnish layer.


I believe you can do this quite easily with Django request processors, though I've never tried it.

This could work: https://gist.github.com/4066816



Great tip. And kudos on architecture done right.


Did you guys consider Heroku PostgreSQL? It's inside of AWS. If so, can you shed some light on why you opted against it? I don't want to sound like a fanboy of Heroku (I am not) but I am just interested in whether or not you guys considered it.


We didn't - it's a fantastic product (I'm really impressed with how easy they've made it to set up followers and forks) but we were ready to move to dedicated hardware. It's also a pretty expensive option.


Thanks for the response.

I used to agree with you on the price until I realized that with Heroku you're essentially paying for a dedicated server with that much RAM for each plan. Reason being, every single database is "hot" and the entire thing lives in memory to avoid I/O.

On second glance ... you're right. It is kind of pricey.

Speaking of moving to dedicated... I'd love to hear some of your general feedback on how AWS has been at some point - essentially where you'd use it again, and where you'd go directly to dedicated (or a diet option like a VPS) if you were starting on a new project.


I've worked in similar environments and Heroku is crazy expensive and slow compared to dedicated hardware.


The sad thing is, most people don't realize how slow they are. Once they move off to dedicated hardware with SSDs, their site starts running blazing fast (at a fraction of the cost, too) and they ask themselves why they didn't do this sooner.


Do SSDs improve performance with PostgreSQL if most of your data fits in ram?

I guess it depends on the type of workload (reads vs writes).


Yes, it improves performance because PostgreSQL has to flush its write-ahead log to disk before a commit returns (with the default synchronous_commit setting), so write latency is bounded by disk latency even when all your data fits in RAM.


Indeed. Most companies don't have the operational experience to optimize for speed until they are later stage. Also, they will develop on something like Heroku and get addicted to the ease of use.


If you follow the link to the more detailed blog post, Andrew shows how he was able to speed up the data transformation script by a factor of 60 by using ALTER TABLE statements to cast column types after the fact rather than rewriting every single INSERT. A little more dangerous perhaps, but still an impressive improvement.
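
To illustrate the shape of the trick (a hypothetical example, not Andrew's actual script; table and column names are made up): load the data with cheap, permissive column types first, then cast the whole column in one pass once it's in.

    import psycopg2

    conn = psycopg2.connect(dbname='app')
    cur = conn.cursor()
    # e.g. MySQL tinyint(1) booleans arrive as integers; one bulk cast
    # beats transforming the value in every single INSERT.
    cur.execute("""
        ALTER TABLE events
        ALTER COLUMN is_published TYPE boolean
        USING is_published::boolean
    """)
    conn.commit()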


While this is technically impressive, if you had gone down in the middle of the night for a few hours would it really have mattered? Apple.com takes down their entire site for several hours at a time every few months, and they don't seem to be hurting too much for it.


For us, it does matter. We have conferences running all over the world which use us for their schedule - for them, two hours of downtime in the middle of their event is a big problem. Since we're a global site we have to think about events across every timezone so there isn't really an obvious time to do this kind of thing.

We might be able to schedule downtime for a period when no active events are going on (tricky considering the number of conferences happening around the world) but it's much easier for us to be able to make these kinds of changes without worrying about events that are using us to serve critical information. As it is, we still make sure to communicate planned read-only mode periods in advance so conference organisers have a chance to plan around them.


Have you considered using PostgreSQL foreign data wrapper (mysql_fdw) to pull data directly out of MySQL instead of the dump/load process?

It would be great to know more about your new setup, i.e. do you use streaming replication and some resource manager like Pacemaker?


No, we didn't look at mysql_fdw. We're using standard PostgreSQL 9 streaming replication.


I've been using Softlayer for years now and have only good things to say about them and their support team.


Would like to know more about the hardware specs, and what Lanyrd is paying over at SoftLayer.


Sorry, no details - but we're using a variety of different server specs. We use their cheaper dedicated machines for our application servers (running Django, memcached and celery workers) and significantly meatier servers for our backends running PostgreSQL/Solr/Redis. We also have a couple of smaller things running on their cloud servers.


Are you using a connection pooler for postgres? Like pgbouncer or something.


Not at the moment.


The hardware SoftLayer is offering looks ancient - Xeons from 2006, and only a paltry 5GB/mo bandwidth tier as the initial offering. The page looks stuck in time, like what you could get maybe three or four years ago in dedicated hosting.

Even Cari.net, which I think prices a bit higher than others, is offering more for the money. I've had several dozen machines with them since 2006 without issue - top-notch support. I also use really cheap, no-frills folks like Ubservers when I want a bunch of disposable cheap dedicated machines. Their support is absolutely shit, but by god are they cheap, and the bandwidth is real. I've been through probably a dozen and change of these outfits, and that's really all that matters: cheap, solid bandwidth and a vague sense of support.


SoftLayer does have some ancient-seeming hardware still available (which stupidly shows up at the top of some lists), but they have Xeon E5 systems too, which are fine. They actually had E5s available before general availability.

The only criticism I have of SoftLayer is that their RAM pricing is sometimes extortionate, though it's negotiable.


Compare their budget servers against other budget servers and they're going to lose out on CPU, RAM, and bandwidth tier for the price you pay. Not everyone who is cheaper is offering inferior service, either. While I'm sure I could negotiate a nice deal with SoftLayer, I'd rather deal with people who offer better value up front. But perhaps I'm biased, as I tend to provision quite a few cheap machines for ESXi farms, and I neither need nor want anyone's support beyond prompt hardware replacement and their network performing as expected.


I'm a SoftLayer employee, so my perspective on this might help ... A server at SoftLayer is often more expensive than the same server at another provider, but those identical server specs don't necessarily make it an "apples to apples" comparison. When you don't need any of the free value-add functionality that differentiates a SoftLayer server (which I won't include here for fear of being "salesy"), the numbers probably won't add up. SoftLayer isn't positioned in the market to compete on price, and you're right that there are a lot of quality hosters who offer reliable, powerful servers for lower costs.


Who are the other reliable providers offering value for money in the US?


My startup does: https://uptano.com


What virtualisation tech do you guys use? Can I run FreeBSD on top of it?


OpenVZ (because it's closer to bare metal performance), which limits it to Linux distributions currently. We've had requests for FreeBSD though, and we're adding Xen support soon which should allow that.


FreeBSD on top of Xen has limitations (such as only one virtual CPU, AFAIK) and in my testing is really unstable. Would you consider using something like KVM instead, which has fewer issues?


It's definitely possible. We're not wed to any particular container/virtualization system. Networking tends to be a bit of a tricky issue, but I'll definitely check out FreeBSD on KVM. Thanks for the heads up.


There are new players all the time, and a lot of mom-and-pop shops out there that can actually deliver reliable 100Mbps and 1Gbps pipes, though some are certainly shady.

I've named the ones I feel comfortable mentioning, but if you really want to know more about what's out there, there are a number of forums like Web Hosting Talk where the actual companies maintain a presence to run promotional deals - and a lot of public opinion gets aired about how those companies are doing. Some of those opinions will be rather uninformed, but most people can tell you when there's a real problem.


What is the size of Lanyrd's data? I'm curious to see what you think is considered "big data".


He refers to "our tens of gigabytes of data" in the post on his site. [1]

[1] http://www.aeracode.org/2012/11/13/one-change-not-enough/


I remember a quote from some presentation:

"If it fits on an iPod, it's not big data."


We're looking for hosting which has both dedicated and virtual servers on the same network. SoftLayer and Rackspace seem to be the two big players who do this. Does anybody have any experience with them both?


I don't have experience with Rackspace, but, like the other commenter acangiano said, I have only good things to say about SoftLayer's support over the past 4 years. Their customer service reps are quite responsive and can be reached either via tickets or by phone. My issues were mostly addressed within an hour or two, even outside normal office hours.


Rackspace user here. Be prepared to pay $$$$ - their hybrid environment is not cheap, especially for servers with above-normal specs. There are also (industry-standard) restrictions on the network side of things with their VMs, which we weren't prepared for when we started. Lastly, if you're ever the target of DDoS attacks... good luck.


Previous Rackspace user here (we got fed up with Rackspace and finally got our own equipment for a song compared to our monthly hosting costs with them)... Yeah, good luck if you get a very common DDoS attack while on their network. Apparently their data-centers in general are large DDoS targets. We moved and have not had an attack since (and it's a big deal since we do e-commerce).

That said, out of any company I've ever been with their support was really good. Just understand you're paying like crazy for it.


There's also the Storm/Liquid Web combination.

http://www.stormondemand.com/ http://www.liquidweb.com/


My startup is designed around these issues: https://uptano.com

Every server you launch is connected to your own private LAN.


I only have experience with Rackspace (though I've heard consistently good things about SoftLayer), but my experience with them has been fantastic.


I've used both. SoftLayer, in my opinion, has faster and more price-efficient offerings. I've always gone with them and never regretted it.


> There's nothing wrong with AWS - indeed, we still run a staging environment there - but our database benefits greatly from the low latencies of physical disks...

Doesn't AWS have physical disks for RDS? What am I missing?


AWS RDS uses EBS for storage[1]. Depending on the RDS instance type, I think they use multiple RAIDed EBS volumes. EBS relies on physical disks, of course, but they're accessed as SAN devices over the network, which means higher latency than accessing disks that are hardware-attached to the local instance.

EBS has historically also had somewhat variable performance (on top of the extra network latency), but the new provisioned IOPS feature should help with that[2].

[1]: http://aws.amazon.com/message/65648/

[2]: http://aws.amazon.com/about-aws/whats-new/2012/07/31/announc...


No, RDS runs on a virtual RAID array of virtual disks over a network (EBS).


I have used SoftLayer for years and have nothing but great things to say about them. Nothing but speed, stability, and excellent service. Thank you SL!



