I think Stack Exchange has a secret weapon which will likely greatly improve their backend systems. Tom Limoncelli, who used to work at Google as a site reliability engineer (SRE), now works at Stack Exchange [1]. He pretty much wrote the bible for sysadmins, entitled "The Practice of System and Network Administration" [2]. I wouldn't be surprised if we start seeing more posts like this!
This sounds oddly like a death knell for Windows. I am not seeing anything here that is not already standard in the LAMP / OSS stack (Graphite, Nagios / Munin).
If we rephrase the blog post as "we could not find any good tools in the Windows devops space, so we wrote them", and add to that the departure of the only CEO willing to dance on stage chanting "developers, developers, developers", then Windows looks less like an ecosystem and more like a hub with a few brave outlying satellites.
I am impressed by the Stack Exchange folks and their story and skills, but it feels like an amazing story of software skill written for one company and never released in the open - it just leaves no legacy.
That's actually a very good point and a major downside to the open source tools. My whole work experience (and that of many sysadmins) is primarily in Windows shops, and many of the OSS tools either don't work on Windows or require hacks to make them work.
As far as what we do at Stack, a lot of our code gets released. Not the core Q&A engine, but our logging framework (StackExchange.Exceptional), a profiler (MiniProfiler), and some other stuff. On the sysadmin side, we are active in developing Desired State Configuration modules and contribute those to the PowerShell.org DSC repo on GitHub. Part of the job description for our SRE developer is that much of what we develop in-house will be targeted to be open sourced.
MiniProfiler was such an inspiration that I built the same thing for the PHP / ExtJS single-page app that I work on. Now I can call up a floating window at any time listing all the recent network requests in the session, formatted much more appropriately than the browser's network view, and the database queries run for those requests, with timing and memory size info. Incredibly convenient while developing or debugging, and I doubt it would have happened if I hadn't seen MiniProfiler first.
You should consider porting that to the official MiniProfiler standard (https://github.com/miniprofiler/ui). We want ports in other languages, and PHP would be welcome. You just have to build the backend; the frontend is done and documented above.
In my experience, Windows tooling for serious admin work is either terrible or massively expensive/enterprise.
For instance, a key problem I had in a prior gig was that I needed to automatically log into a Windows machine, run a job, and then log out. Pretty bog standard; didn't need careful error recovery or anything particularly sophisticated.
In Linux, you configure your SSH keys, then ssh automatedjob@server "./run-my-thing", and that's that. I literally could not find any comparable analog in Windows besides telnet (if anyone knows of a solution here that approximates the Linux one for simplicity, I'd love to know about it). Today I'd probably just requisition a copy of a Windows SSH server and be done with the sorry mess. Better yet, throw Windows out and go full Linux. ;-)
Windows has advanced in a lot of ways over the past several years. We hired Steven Murawski a little while back (a PowerShell MVP) and he has been able to automate just as much as you would expect in the Unix world.
He is also priming our infrastructure for Desired State Configuration (configuration management for Windows, like Puppet/Chef).
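To give a flavor of what DSC looks like, here is a minimal sketch (node name and feature are just placeholders, not our actual configuration):

    # Declare the desired state, compile it to a MOF, and let the Local
    # Configuration Manager enforce it.
    Configuration WebTier {
        Node 'web01' {
            WindowsFeature IIS {
                Ensure = 'Present'
                Name   = 'Web-Server'
            }
        }
    }
    WebTier -OutputPath C:\DSC                          # emits web01.mof
    Start-DscConfiguration -Path C:\DSC -Wait -Verbose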
Not at all. The Windows OS now has command line accessible management points that are similar to Linux. There is still a great deal of difference in their management models (I blogged about this a while back - http://blog.serverfault.com/2013/06/03/cross-platform-config...)
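For instance, the rough PowerShell remoting analog of the ssh one-liner above looks something like this (a sketch - it assumes WinRM remoting is enabled on the target, and the server name and script path are placeholders):

    # Run a job on a remote Windows box, much like ssh user@server "./run-my-thing"
    Invoke-Command -ComputerName server01 -ScriptBlock { & 'C:\jobs\run-my-thing.ps1' }
    # Enter-PSSession server01 gives you the interactive equivalent of plain ssh.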
I was hired as a Windows specialist so that we can go deeper on the OS side and the PowerShell side, just as we have a Linux expert to go deeper on the Linux side. Our sysadmin team was just tilted more toward experience on the Linux side (though you wouldn't know it - almost everyone I work with would qualify as a senior admin in any Windows shop in the world).
I am glad to see that there are comparable capabilities in the modern Windows world and will dig into the WMI side of things next time Windows admin tasks come up.
That's a good question. The best answer is that the configuration needs to be a version controllable artifact just as much as your code is. Further, Windows Scheduler is known to drop jobs from time to time[1].
What was happening is that there was a job controller running on Linux that needed to contact the Windows machine and boot the job into action. One option was to install a Jenkins slave instance, which turned out to be the solution eventually adopted.
[1] It's been a few years since I studied this, but it was a known issue in '08/09 or so.
Ah, PsExec. Yes, I reviewed that. However, the issue with PsExec has to do with credentials. If you don't bother with the user/password, it impersonates your account, which has certain restrictions[1]. That was not feasible due to most interesting things living on network drives. If you do bother with the username/password, you have to store those somewhere (remember, automated system). So now instead of running psexec directly, you have to have a tool that grabs the password file, hopefully decrypts it, then runs psexec.
Also, IIRC, driving psexec from Linux does not work (a wrinkle in the original story is that it was a Linux->Windows connection).
What, in practice, is the difference between creating an account and SSH key that gives passwordless sudo (or other authorization) and creating an account and password that gives appropriate authorization?
I can generate a unique SSH key pair for each client that has to remote into a system; I have to have a consistent password for the same user across all systems (assuming integrated login).
Supposing that a key file is copied, having a separate key per client limits the damage.
This kind of thing is easily handled with Kerberos. Joining Unix servers to AD doesn't require that you manage your entire POSIX user database there. You can just use a Service Principal for this.
I can see how a developer outside of the Microsoft development stack could see it that way, but it is misinformed.
Windows will die, to make way for Azure and Windows Phone.
The Microsoft development ecosystem is strong (C# in top 10 of TIOBE index). It has open source released by Microsoft (ASP.NET MVC), 3rd parties (Mono by Xamarin), and the community (opensourcewindows.org). Microsoft includes open source libraries (jQuery) in their projects.
StackExchange's software will leave a legacy.
From the article: "In order to create more, and open source it, we need help. So we are looking for a full time developer with ops experience to join our SRE team."
We (can I say we? I don't work on the Windows stack stuff here, but whatever) are actually one of the biggest open source contributors in the .NET world. Granted, that's just my opinion and I don't have data behind it, but here's a list of things we've open sourced: http://blog.stackoverflow.com/2012/02/stack-exchange-open-so...
MiniProfiler is my favorite one, and now exists in .NET, Ruby and Node.js due to my insanely smart co-workers.
Please don't take this as a criticism of your work - but the fact that one small company (<100 people?) that does OSS as a side effect (plus a recruitment/retention effort) can be one of the largest OSS contributors in the .NET ecosystem says it's not a vibrant and thriving ecosystem.
SE has a very high reputation as far as I can tell, so it's not a quality issue - it is in this case a problem of quantity.
As someone commented, the idea is to focus on Azure and mobile - but that is no substitute for lots and lots of developers in a culture of releasing stuff outside their immediate company, driving each other to new heights.
The snowball has been growing here for the last few years. The .NET community was historically not very "open" ... but ever since NuGet and GitHub, attitudes have definitely been improving.
It's always nice to see new products in the DevOps space, but be careful not to re-invent the wheel if you do this kind of stuff, as the open source world is coming along in leaps and bounds.
LogStash, ElasticSearch and Kibana are a great open source stack for log management.
StatsD and Graphite are nice tools for metric tracking and visualization.
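Part of StatsD's appeal is that the wire protocol is just a UDP datagram per metric, so instrumenting anything takes only a few lines; a rough sketch in PowerShell (host, port, and metric name are illustrative):

    # Fire-and-forget a counter increment at a StatsD daemon over UDP.
    $udp   = New-Object System.Net.Sockets.UdpClient -ArgumentList 'localhost', 8125
    $bytes = [Text.Encoding]::ASCII.GetBytes('web.requests:1|c')
    $udp.Send($bytes, $bytes.Length) | Out-Null
    $udp.Close()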
There are lots of open source dashboard offerings which combined with a bit of scripting can get you far.
You are also spoiled for choice with SaaS monitoring offerings such as New Relic and Server Density, even if the OP isn't a fan of cloud based tools.
I did an experiment with logstash, elasticsearch, and Kibana for our HAProxy logs. The default with logstash was to store each field and then the whole text, so things got quite large for our web logs. Also, the Kibana interface was pretty buggy. Parsing our logs (~2k entries a second) doesn't work well with regexes, so in our version we do a bunch of substring stuff. I'm excited about Kibana/ES for the rest of our logs, especially with their recent hire.
When I looked at StatsD and Graphite last time I didn't really see an API. I really like the model of the data being queryable and returned in a nicely serialized format like JSON (like OpenTSDB does). I'm also not that fond of the "many files" model and the automatic data summarization as it ages (it does save space, but it makes forecasting difficult as it can skew data).
We're parsing 2k lines/sec with logstash using regexes. We scaled it out to 6 logstash processes across 2 nodes. They pull off a shared redis queue and then insert the results directly into elasticsearch. (That said, I'd like to configure our apache logs to just output the json logstash expects.)
Graphite has a very simple and powerful JSON API [1]. Any graph URL can include &format=json and you'll get back the raw JSON values for the datapoints. I haven't used OpenTSDB yet - I'm curious how its API is better.
You get to choose the levels of summarization. If you want to keep 1 second intervals for a year, you're welcome to do so.
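To make the API point concrete, pulling raw datapoints out of the render endpoint looks roughly like this (host and metric name are placeholders):

    # The same URL you'd use for a graph, with &format=json appended.
    $uri = 'http://graphite.example.com/render?target=stats.timers.web.get.mean&from=-1h&format=json'
    (Invoke-RestMethod $uri).datapoints | Select-Object -First 5   # each datapoint is a [value, timestamp] pair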
jmelloy (https://news.ycombinator.com/item?id=6334778) had a really good point that wasn't covered in the post, since it was a post about what we are doing not necessarily why we are doing it.
One of our major problems with existing monitoring and management systems is the lack of good APIs. We are a shop of developers and sysadmins who all understand that real management systems need to be composable. The system needs to understand that it won't solve every case out of the box and expose hooks into its management and functionality, allowing us to tie disparate systems together and enhance their coverage. I'd rather take a bunch of existing products and put some cool dashboards on top, but most enterprise solutions (and some of the open source ones) don't offer a decent API to work with.
It's quite likely that I'm confusing "decent API" and "ease of extending and integrating" and all that gets wrapped up with my long term familiarity with Nagios.
Nagios check plugins don't have an API per se, but they have a very simple standard for exit codes to be interpreted by the Orchestration component.
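The whole "API" is the exit code plus the first line of output, which is why a check can be a few lines in whatever language is handy. A tiny illustrative sketch of the convention (thresholds made up; on Windows such a check would typically be invoked via an agent like NSClient++):

    # Nagios plugin convention: print one status line, exit 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN.
    $freeGB = [math]::Round((Get-PSDrive C).Free / 1GB, 1)
    if     ($freeGB -lt 5)  { Write-Output "DISK CRITICAL - ${freeGB}GB free"; exit 2 }
    elseif ($freeGB -lt 20) { Write-Output "DISK WARNING - ${freeGB}GB free";  exit 1 }
    else                    { Write-Output "DISK OK - ${freeGB}GB free";       exit 0 }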
Nagios reporting/alerting plugins primarily use RRD, so you always have that API to do interesting things like trend analysis.
(One theory I have is that there is a such a large and growing culture of technologists that are thinking "integration tool" but only know how to say "HTTP API")
Very few companies want to pay for that, fewer still when in-house and third party devs are often begging to do that work if they'd just let them license the code as OSS.
I wasn't aware of ExtraHop, so I will have to look into that. We currently use SolarWinds Orion, which is where Status gets some of its data.
We have definitely outgrown Orion, and a lot of stuff in Orion is very rough, sloppy, and not well integrated.
We don't really like the idea of cloud hosted monitoring (which is a lot of what more modern monitoring systems are). Alternatives also seem very expensive.
So if we are going to make an investment (cash or labor), I would rather we get a system that fulfills all of our needs (fit) and share it with everyone.
Had you tried Nagios and Puppet/Chef when you started developing these?
The logging tool is the only one I'm not sure about, if you're parsing 1200 events a second. I'm not familiar enough with Orion to understand why it was insufficient.
Would you say these in-house tools are primarily about integrating / presenting the data, or are there custom parts doing more heavy lifting too?
We tried Nagios a while back, before going to Orion. Nagios has some great structural ideas, but fell down in other areas (don't know all of them, that was before I came on board).
We use Puppet to manage our Linux infrastructure and are testing Desired State Configuration for our Windows systems. Puppet and DSC solve a different problem than monitoring.
Our tools run the gamut from aggregating and presenting data to doing "heavy lifting" like installing patches, managing our load balancers, and removing bad query plans from our database servers.
Our in-house tooling fills in gaps where existing tooling wasn't responsive enough or made it difficult to deal with certain edge cases. General purpose tools like Orion satisfy 80% of our use cases, but we are fanatical about performance and functionality so we want to fill in that additional 20%. In fact, the majority of our Orion monitors are custom script monitors, which we have to create ourselves as it is.
If what we do can work for others (I've worked in several environments with Orion and had the same issues in each environment), that's a bonus.
We've found we are building a significant number of these projects and that's why we are looking for a dedicated developer for our team.
I'll have to look at the overlap between Puppet's Windows tools and DSC.
These are definitely distinct from monitoring, but monitoring configuration should be populated from the same configuration store as Puppet. (Sometimes Puppet is the configuration store.)
I've never evaluated Orion in particular, but it's slightly puzzling if you're creating a lot of custom monitoring scripts. In the low-touch Nagios deployments I've seen, this is often because someone didn't understand good places to use macros, and centralize more of the parameters.
Configuration systems have historically looked at config as bits-on-a-disk.
Supervisor systems look at bits-in-memory.
And they emerged and evolved independently. So there's pain points and impedance mismatches regardless of which one you start with.
What's needed is a system that sees that configuration and supervision are the same problem: you have a directed graph of what a system can look like, plus a compare-and-repair mechanism to drag the system to that state frequently.
I've personally looked at Chef, Puppet and Cfengine, none of which really handles both the way I would like.
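At its core, what I'm describing reduces to a convergence loop like this (Test-Resource and Set-Resource are hypothetical stand-ins for per-resource implementations; DSC's Test/Set resource model is one concrete take on the same shape):

    # Compare-and-repair: test each declared resource, touch only the ones that drifted.
    # ($desiredState, Test-Resource and Set-Resource are placeholders, not real cmdlets.)
    foreach ($resource in $desiredState) {
        if (-not (Test-Resource $resource)) {   # does reality match the declaration?
            Set-Resource $resource              # no - drag it back to the declared state
        }
    }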
We use Puppet a lot and are generally quite happy. We have recently started leveraging Hiera, which is addressing the main issue we had with Puppet: reusability problems in the code, since modules (code) and our data were not easily separated.
A lot of these tools are about integrating / presenting, but that isn't entirely the case. Status does its own polling of things like Redis and SQL, Realog parses the data and structures it in Redis, etc., and the patching dashboard can kick off updates.
Status handles the scheduling of the polling, or the actual polling, or both?
One thing Nagios handles well, although it isn't exactly well polished, is distribution, scheduling, and aggregation of the polling. Also, anecdotally, "nobody" seems to know how easy it is to do simple-to-intermediate monitoring of MS-SQL in Nagios. I happen to be using this at the moment:
I'll bet one of the reasons is the amount of data. At a previous employer I built a system similar to this which produced nightly, weekly, monthly, and annual reports of weblog analysis for an app that had 6 million+ http requests per day. It was tough enough to consolidate those logs across the load-balanced servers in a single data-center. We never consolidated across data-centers (separate reports for each instead) and I doubt we could have shipped all of that data to a cloud service in a reasonable amount of time.
BTW, we made heavy use of setting and logging http headers too. One trick I liked was capturing performance timing metrics as a request was processed and stuffing it into a response header as the response went out. We then logged the response headers, which gave us the ability to report on the performance metrics. We also had a debug mode in the app on the browser side so we could see the performance metrics from the headers there too.
~70 events per second (6 million+ requests a day works out to roughly 70 per second) doesn't sound like much to capture and aggregate. How much of this parsing did you need to perform in real time? Creating a unique token to pair requests/responses shouldn't add much overhead at all.
Development started around 2000, and stabilized around 2008. As far as I know the reporting scripts are still being run every day. During this period we had purchased a 1TB storage rack from EMC for a million dollars, to give you some perspective on the differences between then and now.
- No real-time parsing; it's all nightly batch processing after devops rotates the Apache server logs to a storage volume. The logs sit there for a while then get compressed and moved to offline tape archives.
- No DB storage of the logs; space was too expensive and the Oracle database we had couldn't have kept up. It was already heavily burdened with a completely separate usage statistics system that fed into user-facing reporting and billing, which had a much higher event rate, about 100x higher, than the http logs.
- We had unique tokens, but they identified a particular user session that tied together all of the user's http requests from login to logoff/abandonment, and which also tied into the Oracle-based statistics for that user, that user's organization, and the customer responsible for the user (often multi-organization). My reports had breakdowns for individual user experiences, session-level metrics, and user type/organization/customer/region/etc metrics.
- I don't recall how long the analysis took; it was between half an hour and two hours, I think. A lot of that time was spent on disk I/O reading the logs. I had optimized the parsing, analysis, and results recording about as much as I could.
- This stuff was written in Perl, and ran on Solaris servers from that time era... probably not a lot more powerful than a handful of smartphones today, though they did have lots of cpus. I don't think traffic has grown much since I left the company (we had pretty full market penetration already) so it's likely those servers haven't been upgraded.
I suspected it would have to be a system of that era.
I think I have a good idea of how businesses (at a high level) have failed to understand Moore's law from 2000-present. I'm curious what those failures of understanding were like from 1985-2000.
We all know that technology has been advancing rapidly, but these specific anecdotes of organizations paying a million dollars just for the backing storage of a system that you can essentially get for free from Google now...
Yeah, it's pretty amazing how much things have changed. That raw log data was about 250GB/year which is nothing today but when we started collecting it we were paying $1000/GB.
Actually, they're probably still paying over $100/GB. The whole datacenter was outsourced to Perot Systems in the mid-2000s, and the storage fees were astronomical. We calculated that Perot must pay a separate tech to stare at each individual hard drive with a replacement in-hand in case any errors were reported. At least, they could afford to do that with what we were paying them for storage.
1. The amount of data would be quite large when it comes to the logs
2. System lock-in, I like to have the data and be able to query it as I see fit
3. If an event happens where the facility is cut off from the Internet, you won't be able to tell what happened inside the facility (unless maybe there is a store-and-forward agent, but even then it is only useful after the event).
4. Latency monitoring (again, maybe an agent can help), but if I see changes in response time I can't tell whether that is the WAN or not.
On point 3: that makes high-level (non-agent-based) cloud monitoring a must in addition to your primary monitoring. In that kind of event, you need local and global monitoring.
I like the idea of a totally independent 3rd party monitoring system. For that we use Pingdom, and its job is basically to tell us if our public services are up from the outside world.
Basically it is an additional perspective and a redundant level of monitoring. But I view it mostly as an up/down layer of monitoring. Extremely important, but simple and not really meant to give insight into the complexity of our system.
We're doing an evaluation of a bunch at work, and we usually come down to:
1) most charge per server per month, making costs go up quickly
2) you don't own the data, and the API may not exist or may be wonky with implementation details
3) most install agents, meaning your production servers have another piece of code on them, and it's hard to test how they behave under load
Where does the patching dashboard pull data from? Is it tracked by hand or is there a scanner? We use Orion at work, and it's got a decent amount of data in it, but is kind of kludgy and slow.
The patching dashboard has scheduled jobs on the clients (PowerShell scripts on the Windows boxes and Ruby scripts on the Linux boxes).
We use Puppet to deploy the client to our Linux boxes and for Windows we deploy the task and scripts with Group Policy (soon to be replaced by Desired State Configuration).
This information isn't something we need to poll for often (and on Windows, there are difficulties interacting with the Windows Update APIs remotely). We add and replace servers often, so each client adds itself to the dashboard and updates its own status.
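Roughly, the client side can look something like this (the dashboard URL and payload shape are illustrative, not our actual API; the Windows Update COM API is queried locally):

    # Ask the local Windows Update agent what's pending, then report in to the dashboard.
    $searcher = (New-Object -ComObject Microsoft.Update.Session).CreateUpdateSearcher()
    $pending  = $searcher.Search("IsInstalled=0 and Type='Software'").Updates
    $body = @{
        host    = $env:COMPUTERNAME
        pending = $pending.Count
        titles  = @($pending | ForEach-Object { $_.Title })
    } | ConvertTo-Json
    # Hypothetical endpoint - the point is that each client registers and updates itself.
    Invoke-RestMethod -Uri 'http://patchdash.internal/api/status' -Method Post -Body $body -ContentType 'application/json'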
We have integrated some reporting into Orion, but that was a side effect of not having the dashboard before.
We have a forwards-only migration runner that takes care of the deployment for us. We don't use many stored procedures, but they're taken care of the same way (in a migration). Our policy is that pushed code must be backwards compatible by however many migrations are being deployed in a production build.
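Stripped to its essentials, a forwards-only runner boils down to something like this (table name, paths, and connection details are illustrative, not our actual schema; Invoke-Sqlcmd ships with the SQL Server tools):

    # Apply, in order, every migration script that hasn't been recorded yet.
    $server = 'db01'; $db = 'AppDb'
    $applied = Invoke-Sqlcmd -ServerInstance $server -Database $db `
        -Query 'SELECT Name FROM SchemaMigrations' | ForEach-Object { $_.Name }
    Get-ChildItem .\migrations\*.sql | Sort-Object Name |
        Where-Object { $applied -notcontains $_.Name } |
        ForEach-Object {
            Invoke-Sqlcmd -ServerInstance $server -Database $db -InputFile $_.FullName
            Invoke-Sqlcmd -ServerInstance $server -Database $db `
                -Query "INSERT INTO SchemaMigrations (Name) VALUES ('$($_.Name)')"
        }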
[1] http://everythingsysadmin.com/2013/09/the-team-im-on-at-stac...
[2] http://www.amazon.com/dp/0321492668/tomontime-20