Step 1. Measure, measure, measure: response time, page load time, query time
Step 2. Identify bottlenecks: Why isn't your page load snappy? Too much JS? Round-trip to DB is killing you? Your mongrels are overloaded?
Tip: Use YSlow for frontend measurements and take most of its advice
Step 3. Fix bottlenecks: There are lots of techniques for improving frontend performance; start with YSlow above. Architectural and backend issues can be solved with a combination of hardware (load balancers, more web boxes, more DB boxes) and software (better caching strategies, evolving your DB architecture, breaking your application into smaller micro-apps that scale independently)
Step 4: Go to step 1 and keep at it.
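The measure-first loop above starts with instrumentation. A minimal sketch in Python; the `timed` decorator and `render_page` are illustrative names, not from any particular framework:

```python
import time
from functools import wraps

def timed(fn):
    """Record wall-clock time for every call; step 1 is always measuring."""
    timings = []
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings.append(time.perf_counter() - start)
    wrapper.timings = timings
    return wrapper

@timed
def render_page():
    # stand-in for real work (template rendering, DB queries, etc.)
    time.sleep(0.01)
    return "<html>ok</html>"

render_page()
print(f"calls: {len(render_page.timings)}, last: {render_page.timings[-1]:.4f}s")
```

Once every page render and query is timed like this, steps 2-4 have real numbers to work against instead of guesses.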
I'd have to say that if you're not doing Step 1, you're going to fail, now or eventually. As one of my first mentors taught me: "You cannot improve what you don't measure."
After the easy stuff: move your database to its own server, add a load balancer in front of additional web servers, implement memcached, start splitting your database across multiple servers, continue to add web servers, and serve your static files from a CDN.
I'd suggest implementing memcached before adding a second DB server; it can often help you scale an incredible amount. We had an application where, as soon as we implemented memcached (we really should have from the start), the load was cut by more than half. Of course this was an extreme example, since the server was getting pounded with tons of reads from our DB every second.
Before adding the load balancer, make sure you stress test your app to know where the bottleneck is (CPU, memory, disk, network, etc.). See if you can't double your throughput with some cheap upgrades: e.g. if you are CPU bound on a single core, you'll buy yourself some breathing space by upgrading to a dual- or quad-core processor. If you are CPU bound but have loads of free memory, consider caching expensive queries in RAM for a few seconds.
I only point this out, as load balancers and additional servers can be expensive.
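A crude version of that stress test can be written in a few lines; the `handle_request` function is a hypothetical stand-in for one request's worth of work, and a real test would use ab or JMeter against the running app (note also that CPython threads won't spread CPU-bound work across cores because of the GIL, so treat this as a probe, not a benchmark):

```python
import threading
import time

def handle_request():
    # stand-in for one request's worth of work
    sum(i * i for i in range(10_000))

def measure_throughput(workers=4, seconds=1.0):
    """Hammer the handler from several threads and count completed requests.
    Watch CPU, memory, and disk while it runs to see what saturates first."""
    done = []
    stop = time.time() + seconds
    def worker():
        count = 0
        while time.time() < stop:
            handle_request()
            count += 1
        done.append(count)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(done)

print(measure_throughput(workers=2, seconds=0.2), "requests in 0.2s")
```

If doubling `workers` doesn't move the number, you're saturated on some resource, and that resource is the thing to upgrade before buying a load balancer.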
Split static content serving from dynamic. You can start off on the same server, but the requirements for each are so different that it pays to have a tuned process serving each.
Build a customized, svelte Apache, or use nginx.
Get a separate DB server, get lots of memory for it, and tune the DB to use that memory. Few, if any, are set up out of the box to use 4-8GB+ well.
Get more memory for your HTTP box(es), and use memcached or similar to take advantage of it.
I usually find my apps are CPU bound rather than memory intensive, so it pays to test this out. An easy way to find out is to run ApacheBench locally and watch the server's performance (use perfmon on Windows; I'm not sure what the Linux/Unix alternative is). It's a crude performance metric but will point you in the right direction as to what to upgrade first; you can validate your findings with a more complex tool like Webload/JMeter/Wcat if you want to make sure.
Do the easy stuff first:
1. Enable caching & get the HTTP headers right, with ETags and such.
2. Get a DB schema with proper indexes & tune database settings.
3. Buy better hardware where there are bottlenecks.
4. Fix the software ;)
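Item 1 above can be sketched in Python. The `respond` helper is hypothetical, standing in for whatever your framework's response path is; the point is that a matching `If-None-Match` lets you return 304 with no body:

```python
import hashlib

def make_etag(body):
    # Content-derived ETag; any stable hash of the body works.
    return '"' + hashlib.sha1(body).hexdigest()[:16] + '"'

def respond(body, if_none_match):
    """Return (status, headers, body). When the client's ETag still matches,
    send 304 and an empty body so it reuses its cached copy."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Cache-Control": "max-age=60"}, body

page = b"<html>hello</html>"
status, headers, _ = respond(page, None)            # first request: full 200
status2, _, body2 = respond(page, headers["ETag"])  # revalidation: 304
print(status, status2, len(body2))  # 200 304 0
```

Combined with `Cache-Control`, this means repeat visitors transfer almost nothing for unchanged pages, which is the cheapest scaling win there is.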
A point on 4: Find or, in very extreme circumstances, write your own very lightweight profiler you can run on production servers. This will be a life-saver.
The bottlenecks are very rarely where you think they are. Even when they are, there is typically lots of low-hanging fruit around that you don't know about until you profile.
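A lightweight always-on profiler of the kind described can be as small as a decorator that aggregates call counts and total time per function. This is a sketch of the idea, not a drop-in library:

```python
import time
from collections import defaultdict
from functools import wraps

# Aggregated per-function call counts and total time; cheap enough to leave
# enabled in production, unlike a full tracing profiler.
stats = defaultdict(lambda: [0, 0.0])  # name -> [calls, total_seconds]

def profiled(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            entry = stats[fn.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

@profiled
def slow_query():
    time.sleep(0.005)  # stand-in for a real DB call

for _ in range(3):
    slow_query()

calls, total = stats["slow_query"]
print(f"slow_query: {calls} calls, {total * 1000:.1f} ms total")
```

Dump `stats` on a hidden admin page or log it periodically, and the real bottleneck usually shows up within a day of production traffic.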
Also, you should have some sort of monitoring software running, aggregating things like server load, connection stats, number of DB requests, memcached hits/misses, etc. I find Cacti quite good for this.
The idea that everyone needs to scale horizontally has become much less true than it was. Only Very Large sites have to go very horizontal at all these days.
Given 8 x 3GHz cores / 32GB RAM / 5TB disk machines for cheap, you're packing as much power as 15+ servers of a few years ago. Even quite large web sites can be run off a few beefy (but cheap!) servers. That means money saved on network infrastructure, power, space, management, etc.
Probably 80% of the Alexa top 1000 could each be run off less than 20 modern beefy machines.
Not all caches have the cache-invalidation problem that Karlton is referring to (memcached doesn't). He was talking about invalidating multiple copies of an object in a cache.
Measure and optimize. I write pretty careless code, as far as optimization goes, so the first round is usually adding indexes to the database. I know what you want to say: I should do this from the start, right? But then I'd end up with a bunch of useless indexes that eat up update time and space.
The second step is usually caching: a local, in-application cache. Writing a cache function is pretty much auto-pilot by now.
Caching helps, but not as much as you'd think. When things get serious it's still useful, though, because its shape makes it easy to go to step three: mix the cache with a bit of refactoring. This can mean simple pre-fetching: there is a huge difference between doing 20 "select * from order_items where id=xxx" queries and just one "select * from order_items where order=yyy". And with the cache in place it usually means only a few extra lines of code. (This example is for building reports, not for displaying orders. I'm not enough of an idiot to fetch each order_item separately.)
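The difference between those two query shapes can be demonstrated with sqlite3 as an in-memory stand-in for the real database (the `order_items` schema here is invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT)"
)
conn.executemany(
    "INSERT INTO order_items (order_id, sku) VALUES (?, ?)",
    [(1, f"sku-{i}") for i in range(20)],
)

# N+1 style: one query per item id -- 20 round-trips to the database.
ids = [row[0] for row in conn.execute(
    "SELECT id FROM order_items WHERE order_id = 1")]
items_slow = [
    conn.execute("SELECT * FROM order_items WHERE id = ?", (i,)).fetchone()
    for i in ids
]

# Pre-fetch: one query for the whole order -- a single round-trip.
items_fast = conn.execute(
    "SELECT * FROM order_items WHERE order_id = ?", (1,)).fetchall()

print(len(items_slow), len(items_fast))  # same rows, 20 queries vs 1
```

On a local sqlite file the difference is invisible; over a network to a loaded DB server, those 19 extra round-trips are exactly the kind of thing profiling surfaces.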
If things get heavy refactoring can go as far as breaking the abstractions in place. Last month I squeezed an extra second by sending to the browser a big table in batches, while I was computing it. Not something I want to do every day, but occasionally it's fun.
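That batching trick can be sketched as a generator that yields HTML chunks as they are computed, instead of buffering the whole table; the names are hypothetical:

```python
def compute_rows(n):
    # stand-in for an expensive per-row computation
    for i in range(n):
        yield f"<tr><td>{i}</td></tr>"

def stream_table(n, batch_size=100):
    """Yield HTML in batches while the computation is still running, so the
    browser starts rendering before the server has finished."""
    batch = []
    for row in compute_rows(n):
        batch.append(row)
        if len(batch) == batch_size:
            yield "".join(batch)
            batch = []
    if batch:
        yield "".join(batch)

chunks = list(stream_table(250, batch_size=100))
print(len(chunks))  # 3 chunks: 100 + 100 + 50 rows
```

It does break the clean "compute everything, then render" abstraction, which is why it's a sometimes tool rather than a default.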
This is also the step when I usually realize I made a stupid mistake along the way which caused the delay.
The only problem I've met that was mostly impervious to this was a design decision. In one app I decided (as I usually do) to compute things like balances and debts on the fly. That usually (well, always) is a good idea, because any intermediate result is a source of bugs. But this particular app grew rather unexpectedly large, with the result that computing an overall balance takes 3-10 seconds. It's not a huge problem for the client, but my pride suffers. I'm still looking into this one; I think some refactoring may fix it.
Adding lots of application servers isn't the hard part, generally (from what I've read... I have to admit I've never had to do more than the 'easy stuff'). The database is. How does Mnesia scale up? Where does it work well, and where does it not work so well?
And how will CouchDB solve your scalability problems? I'm really asking, because their website contains nothing concrete about scalability.
I also tried to read about it in the first chapter of the upcoming CouchDB book, but it only contains vague generalizations like "Erlang makes you scalable automatically" or "it is built on the same technology (HTTP) as the web, which has proven to be scalable".
What are the effective strategies to use when the bottleneck is disk I/O? We have memcached to cache data that is frequently accessed. However our usage pattern is such that repeated access of data in a short span of time is rare.
At the very least, make sure each system bus (i.e. PCIe/PCI-X for commodity x86) slot is populated with a host adapter (ideally a hardware RAID controller) which can saturate that slot.
For each host adapter, ensure that each port has enough disks behind it that, even with worst-case contention (not fully random, unless you've somehow managed to magically induce 512-byte transfers), they can saturate the upstream system bus, even after RAID processing is done by the adapter.
After that, upgrade to a server with more slots and repeat the saturation.
Repeat once more for the largest single server available, while starting the project to implement (perhaps write from scratch) a distributed database.
I have yet to see anyone make even a concerted effort on the first step, instead skipping to the second part of the last step, probably because that appears, at first glance, to require only a Small Matter Of Programming.
Generally we stick with caching; when that starts to give diminishing returns we add a web server. Database stuff is quite a bit harder, of course; we've been fortunate there so far :)
horizontally (keep adding cache and load balancers/proxies).
all you have to do is keep making more profits and keep buying more hardware into infinity.
sure, one day you'll have to build a nuclear reactor and tap a dam to provide all the power and cooling for the datacenter space, but it's easier than a redesign.