Homegrown DevOps Tools at Stack Exchange (serverfault.com)
144 points by KyleBrandt on Sept 5, 2013 | 66 comments



I think Stack Exchange has a secret weapon which will likely greatly improve their backend systems. Tom Limoncelli, who used to work at Google as a Site Reliability Engineer (SRE), now works at Stack Exchange [1]. He pretty much wrote the bible for sysadmins, entitled "The Practice of System and Network Administration" [2]. I wouldn't be surprised if we start seeing more posts like this!

[1] http://everythingsysadmin.com/2013/09/the-team-im-on-at-stac...

[2] http://www.amazon.com/dp/0321492668/tomontime-20


Wow, that's cool. I would love to work with those guys, though I wouldn't be much help; I only have basic Linux administration skills.


Keep learning and soon you will be beyond basic.

If you have passion, you will have plenty to offer.


This oddly sounds like a death knell for Windows. I am not seeing anything here that is not standard in the LAMP / OSS stack (Graphite, Nagios / Munin).

If we rephrase the blog post as "we could not find any good tools in the Windows devops space, so we wrote them" and add to that the departure of the only CEO willing to dance on stage chanting "developers, developers, developers", then Windows looks less like an ecosystem and more like a hub with a few brave outlying satellites.

I am impressed by the Stack Exchange folks and their story and skills, but it feels like an amazing story of software skill written for one company and never released in the open; it just leaves no legacy.


That's actually a very good point and a major downside to the open source tools. My whole work experience (and that of many sysadmins) is in primarily Windows shops, and many of the OSS tools either don't work on Windows or require hacks to make them work.

As far as what we do at Stack, a lot of our code gets released. Not the core Q&A engine, but our logging framework (StackExchange.Exceptional), a profiler (MiniProfiler), and some other stuff. On the sysadmin side, we are active in developing Desired State Configuration modules and contribute those to the PowerShell.org DSC repo on GitHub. Part of the job description for our SRE developer includes the fact that much of what we develop in house will be targeted to be open sourced.


MiniProfiler was such an inspiration that I built the same thing for the PHP / ExtJS single-page app that I work on. Now I can call up a floating window at any time listing all the recent network requests in the session, formatted much more appropriately than the browser's network view, and the database queries run for those requests, with timing and memory size info. Incredibly convenient while developing or debugging, and I doubt it would have happened if I hadn't seen MiniProfiler first.


You should consider porting that to the official MiniProfiler standard (https://github.com/miniprofiler/ui). We want ports in other languages, and PHP would be welcome. You just have to build the backend; the frontend is done and documented above.


In my experience, Windows tooling for serious admin work is either terrible or massively expensive/enterprise.

For instance, a key problem I had in a prior gig was that I needed to automatically log into a Windows machine, run a job, and then log out. Pretty bog standard; didn't need careful error recovery or anything particularly sophisticated.

In Linux, you configure your SSH keys, then run ssh automatedjob@server "./run-my-thing", and that's that. I literally could not find a comparable analog in Windows besides telnet (if anyone knows of a solution here that approximates the Linux one for simplicity, I'd love to know about it). Today I'd probably just requisition a copy of a Windows SSH server and be done with the sorry mess. Better yet, throw Windows out and go full Linux. ;-)


Windows has advanced in a lot of ways over the past several years. We hired Steven Murawski a little while back (PowerShell MVP) and he has been able to automate just as much as you would expect in the Unix world.

He is also priming our infrastructure for Desired State Configuration (configuration management, like Puppet/Chef, for Windows).
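
For anyone unfamiliar with DSC, a minimal configuration looks something like the sketch below (the node name and feature are placeholders, not our actual config): you declare the state you want, compile it to a MOF, and the Local Configuration Manager converges the machine to it.

    # Minimal DSC sketch (hypothetical node name and feature)
    Configuration WebServerBaseline {
        Node 'web01' {
            WindowsFeature IIS {
                Ensure = 'Present'    # declarative: "IIS should be installed"
                Name   = 'Web-Server'
            }
        }
    }

    WebServerBaseline -OutputPath 'C:\DSC'                # compile to web01.mof
    Start-DscConfiguration -Path 'C:\DSC' -Wait -Verbose  # apply in push mode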


What I'm deriving from your statement there is, "So we hired a guru to bring our Windows systems up to par with the Linux baseline".

I'm glad it's working for you; I use Stack Exchange daily and am generally quite happy with it!


Not at all. The Windows OS now has command-line-accessible management points that are similar to Linux. There is still a great deal of difference in their management models (I blogged about this a while back: http://blog.serverfault.com/2013/06/03/cross-platform-config...)

I was hired as a Windows specialist so that we can go deeper on the OS side and the PowerShell side, just as we have a Linux expert to go deeper on the Linux side. Our sysadmin team was just tilted more toward experience on the Linux side (though you wouldn't know it; almost everyone I work with would qualify as a senior admin in any Windows shop in the world).


Steven,

Thank you for your response.

I am glad to see that there are comparable capabilities in the modern Windows world and will dig into the WMI side of things next time Windows admin tasks come up.


Why did you need to log in to fire it off instead of having a scheduled job?

Now, I was pretty sure you could use WMI to do this, and looking around I found this:

http://4sysops.com/archives/three-ways-to-run-remote-windows...

and this

http://blog.commandlinekungfu.com/2009/05/episode-31-remote-...

Remote PowerShell is probably 'the way' Microsoft will be pushing now:

http://msdn.microsoft.com/en-us/library/windows/desktop/ee70...
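
A remoting call is roughly a one-liner once WinRM is enabled on the target; for example (hypothetical host name and script path):

    # Run a script block on a remote Windows host over WinRM / PowerShell remoting
    Invoke-Command -ComputerName server01 -ScriptBlock { & 'C:\jobs\run-my-thing.ps1' }

    # Closer to "ssh automatedjob@server ./run-my-thing" with explicit credentials;
    # -FilePath sends the contents of a local script to run on the remote machine
    $cred = Get-Credential automatedjob
    Invoke-Command -ComputerName server01 -Credential $cred -FilePath '.\run-my-thing.ps1'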


That's a good question. The best answer is that the configuration needs to be a version-controllable artifact just as much as your code is. Further, Windows Scheduler is known to drop jobs from time to time [1].

What was happening was that a job controller running on Linux needed to contact the Windows machine and boot the job into action. One option was to install a Jenkins slave instance, which is the solution that was eventually adopted.

[1] It's been a few years since I studied this, but it was a known issue in '08/09 or so.


A decent approach to doing this in the Windows world would be to use PsExec.exe (http://technet.microsoft.com/en-ca/sysinternals/bb897553.asp...).


Ah, PsExec. Yes, I reviewed that. However, the issue with PsExec has to do with credentials. If you don't bother with the user/password, it impersonates your account, which has certain restrictions [1]. That was not feasible because most of the interesting things lived on network drives. If you do bother with the username/password, you have to store those somewhere (remember, automated system). So now instead of running PsExec directly, you have to have a tool that grabs the password file, hopefully decrypts it, then runs PsExec.

Also, IIRC, using the PsExec interface from Linux does not work (a wrinkle in the original story is that it was a Linux->Windows connection).

[1] http://windowsitpro.com/systems-management/psexec


What, in practice, is the difference between creating an account and SSH key that gives passwordless sudo (or other authorization) and creating an account and password that gives appropriate authorization?


I can generate a unique pair of ssh keys for each client that has to remote into a system; I have to have a consistent password for the same user across all systems (assuming integrated login).

Supposing that a key file is copied, the multiplicity of SSH keys limits the damage.


You have to possess the SSH key, but any keylogger could record the password? Passwords can be guessed?


If you have the ability to install a keylogger that can hijack a user session, what prevents you from capturing the SSH key of the user?


You can use remote PowerShell scripts to do the same. Configuring permissions is different, but it is doable.


Psh, I think, is a viable solution (there is a pay-for / shareware package for running a shell command on a remote machine).

I seem to remember doing clever things with WSI for just such a thing; curiously, I am pretty sure there will be examples of both on Stack Overflow ;-)

Edit: oh yes, WMI and PsExec. Amazing how quickly things drop out of your brain.


This kind of thing is easily handled with Kerberos. Joining Unix servers to AD doesn't require that you manage your entire POSIX user database there. You can just use a Service Principal for this.


It's definitely more work, but a lot of provisioning can currently be automated using Chef. It plays surprisingly well with Windows.

We've developed an approach using Chef that spans Linux and Windows boxes.

To log in and run a script on Windows boxes, we use WinRM, which is integrated with Chef's knife tool and works almost as well as SSH.
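
A typical invocation looks something like this (the node query and credentials are placeholders):

    # Run a command on matching Windows nodes over WinRM via the knife-windows plugin
    knife winrm 'name:web*' 'ipconfig /all' -x Administrator -P 'P@ssw0rd'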


Cygwin's sshd allows you to do this, as well as other Unix-y things like rsync, scp, etc.


Windows supports VPN natively.

* Set the firewall to only allow telnet access on a private subnet

* Set the VPN connection to use that subnet

* Log in from any computer that supports telnet and PPTP.


Google PsExec.


I can see how a developer outside of the Microsoft development stack could see it that way, but it is misinformed.

Windows will die, to make way for Azure and Windows Phone.

The Microsoft development ecosystem is strong (C# in top 10 of TIOBE index). It has open source released by Microsoft (ASP.NET MVC), 3rd parties (Mono by Xamarin), and the community (opensourcewindows.org). Microsoft includes open source libraries (jQuery) in their projects.

StackExchange's software will leave a legacy.

From the article: "In order to create more, and open source it we need help. So we are looking a full time developer with ops experience to join our SRE team."


We (can I say we? I don't work on Windows stack stuff here, but whatever) are actually one of the biggest open source contributors in the .NET world. Granted, that's just my opinion and I don't have data behind it, but here's a list of things we've open sourced: http://blog.stackoverflow.com/2012/02/stack-exchange-open-so...

MiniProfiler is my favorite one, and now exists in .NET, Ruby and Node.js due to my insanely smart co-workers.


Please don't take it as a criticism of your work, but the fact that one small company (<100?) that does OSS as a side effect (plus a recruitment/retention effort) can be one of the largest OSS contributors in the .NET ecosystem says it's not a vibrant and thriving ecosystem.

SE has a very high reputation as far as I can tell, so it's not a quality issue; it is, in this case, a problem of quantity.

As someone commented, the idea is to focus on Azure and mobile, but that is no substitute for lots and lots of developers in a culture of releasing stuff outside their immediate company and so driving each other to new heights.


The snowball has been growing here for the last few years. The .NET community was historically not very "open"... but ever since NuGet and GitHub, attitudes have definitely been improving.


I definitely agree. The change has slowed the tide of people wanting to move away from the platform.

ASP.NET MVC has been a big factor in this on the web development side.


It's always nice to see new products in the DevOps space, but be careful not to re-invent the wheel if you do this kind of stuff, as the open source world is coming on in leaps and bounds.

LogStash, ElasticSearch and Kibana are a great open source stack for log management.

StatsD and Graphite are nice tools for metric tracking and visualization.

There are lots of open source dashboard offerings which combined with a bit of scripting can get you far.

You are also spoiled for choice with SaaS monitoring stuff such as New Relic and Server Density, even if the OP isn't a fan of cloud-based tools.


I did an experiment with logstash, Elasticsearch, and Kibana for our HAProxy logs. The default with logstash was to store each field and then the whole text, so things got quite large for our web logs. Also, the Kibana interface was pretty buggy. Parsing our logs (~2k entries a second) doesn't work well with regexes, so in our version we do a bunch of substring stuff. I'm excited about Kibana/ES for the rest of our logs, especially with their recent hire.

When I looked at StatsD and Graphite last time, I didn't really see an API. I really like the model of the data being queryable and returned in a nicely serialized format like JSON (as OpenTSDB does). I'm also not that fond of the "many files" model and the automatic data summarization as it ages (it does save space, but it makes forecasting difficult as it can skew data).


We're parsing 2k lines/sec with logstash using regexes. We scaled it out to 6 logstash processes across 2 nodes. They pull off a shared Redis queue and then insert the results directly into Elasticsearch. (That said, I'd like to configure our Apache logs to just output the JSON logstash expects.)

Graphite has a very simple and powerful JSON API [1]. Any graph URL can include &format=json and you'll get back the raw JSON values for the datapoints. I haven't used OpenTSDB yet; I'm curious how its API is better.

You get to choose the levels of summarization. If you want to keep 1 second intervals for a year, you're welcome to do so.

[1] http://graphite.readthedocs.org/en/0.9.12/render_api.html
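
As a quick illustration (hypothetical Graphite host and metric target), fetching and summarizing a series is just:

    # Append format=json to any render URL to get the raw datapoints back
    $url = 'http://graphite.example.com/render?target=servers.web01.loadavg&from=-1h&format=json'
    Invoke-RestMethod -Uri $url | ForEach-Object {
        '{0}: {1} datapoints' -f $_.target, $_.datapoints.Count
    }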


That dashboard looks really neat. I've been searching for a good Windows dashboard, and I like the patching views. Where's the download link?


None of this is open sourced yet. Nick is investing a lot of time to get what we currently call "status" ready to open source.

Part of the reason for the open position linked to in the post ( http://careers.stackoverflow.com/jobs/39983/developer-site-r... ) is to make it so we have more manpower to get this stuff open sourced.


As it says at the bottom of the article, it's unreleased until they can make it more generic.


jmelloy (https://news.ycombinator.com/item?id=6334778) had a really good point that wasn't covered in the post, since it was a post about what we are doing not necessarily why we are doing it.

One of our major problems with existing monitoring and management systems is the lack of good APIs. We are a shop of developers and sysadmins who all understand that real management systems need to be composable. The system needs to understand that it won't solve every case out of the box and should expose hooks into management and functionality, allowing us to tie disparate systems together and enhance their coverage. I'd rather take a bunch of existing products and put some cool dashboards on top, but most enterprise solutions (and some of the open source ones) don't offer a decent API to work with.


It's quite likely that I'm confusing "decent API" and "ease of extending and integrating" and all that gets wrapped up with my long term familiarity with Nagios.

Nagios check plugins don't have an API per se, but they have a very simple standard for exit codes to be interpreted by the orchestration component.

Nagios reporting/alerting plugins primarily use RRD, so you always have that API to do interesting things like trend analysis.

(One theory I have is that there is such a large and growing culture of technologists that are thinking "integration tool" but only know how to say "HTTP API")


OK, I'll be that guy.

This makes that page much easier to read. Especially the headings, which are all mashed together for some reason...

    #content {
        margin-left: auto;
        margin-right: auto;
        float: none;
        width: 40em;
        font-size: 15pt;
        line-height: 1.4em;
    }

    h1 {
        font-size: 180%;
        line-height: 1.2em;
        margin-top: 2em;
        margin-bottom: 0.5em;
    }

    h2 {
        font-size: 160%;
        line-height: 1.2em;
        margin-top: 2em;
        margin-bottom: 0.5em;
    }

    #wrap {
        width: 100%;
    }


An interesting consulting niche would be to help companies open source software they want to release.

Basically:

- clean up code,

- make sure the infrastructure is sufficient,

- help with marketing and adoption,

- write documentation


Very few companies want to pay for that, and fewer still when in-house and third-party devs are often begging to do that work if they'd just let them license the code as OSS.


I see it more likely as getting someone like the Stack Exchange guys to agree to open source it, if we pay them enough.

A Kickstarter project, for instance.


I wonder how much investment they have in this vs. going with a pre-existing monitoring system like ExtraHop?


I wasn't aware of ExtraHop, so I will have to look into that. We currently use SolarWinds Orion, which is where Status gets some of its data.

We have definitely outgrown Orion, and a lot of stuff in Orion is very rough, sloppy, and not well integrated.

We don't really like the idea of cloud-hosted monitoring (which is what a lot of the more modern monitoring systems are). The alternatives also seem very expensive.

So if we are going to make an investment (cash or labor), I would rather we get a system that fulfills all of our needs (fit) and share it with everyone.


Had you tried Nagios and Puppet/Chef when you started developing these?

The logging tool is the only one I'm not sure about, if you're parsing 1200 events a second. I'm not familiar enough with Orion to understand why it was insufficient.

Would you say these in-house tools are primarily about integrating / presenting the data, or are there custom parts doing more heavy lifting too?


We tried Nagios a while back, before going to Orion. Nagios has some great structural ideas but fell down in other areas (I don't know all of them; that was before I came on board).

We use Puppet to manage our Linux infrastructure and are testing Desired State Configuration for our Windows systems. Puppet and DSC solve a different problem than monitoring.

Our tools span the gamut from aggregating and presenting data to doing "heavy lifting" like installing patches, managing our load balancers, and removing bad query plans from our database servers.

Our in-house tooling fills in gaps where existing tooling wasn't responsive enough or made it difficult to deal with certain edge cases. General-purpose tools like Orion satisfy 80% of our use cases, but we are fanatical about performance and functionality, so we want to fill in that additional 20%. In fact, the majority of our Orion monitors are custom script monitors, which we have to create ourselves as it is.

If what we do can work for others (I've worked in several environments with Orion and had the same issues in each environment), that's a bonus.

We've found we are building a significant number of these projects and that's why we are looking for a dedicated developer for our team.


I'll have to look at the overlap between Puppet's Windows tools and DSC.

These are definitely distinct from monitoring, but monitoring configuration should be populated from the same configuration store as Puppet. (Sometimes Puppet is the configuration store.)

I've never evaluated Orion in particular, but it's slightly puzzling if you're creating a lot of custom monitoring scripts. In the low-touch Nagios deployments I've seen, this is often because someone didn't understand good places to use macros and centralize more of the parameters.


The problem is due to history.

Configuration systems have historically looked at config as bits-on-a-disk.

Supervisor systems look at bits-in-memory.

And they emerged and evolved independently, so there are pain points and impedance mismatches regardless of which one you start with.

What's needed is a system that sees that configuration and supervision are the same problem: you have a directed graph of what a system can look like, plus a compare-and-repair mechanism to drag the system to that state frequently.

I've personally looked at Chef, Puppet, and CFEngine, none of which really does all of this the way I would like.

http://chester.id.au/2012/06/27/a-not-sobrief-aside-on-reign...


We use Puppet a lot and are generally quite happy. We have recently started leveraging Hiera, which is addressing the main issue we had with Puppet: reusability problems in the code, since modules (code) and our data were not easily separated.

A lot of these tools are about integrating / presenting, but that isn't entirely the case. Status does its own polling of things like Redis and SQL, Realog parses the data and structures it in Redis, etc., and the patching dashboard can kick off updates.


Status handles the scheduling of the polling, or the actual polling, or both?

One thing Nagios handles well, although it isn't exactly well polished, is distribution, scheduling, and aggregation of the polling. Also, anecdotally, "nobody" seems to know how easy it is to do simple-to-intermediate monitoring of MS-SQL in Nagios. I happen to be using this at the moment:

https://github.com/scot0357/check_mssql_collection

Edit: And thank you for the concise explanation of Hiera. I had entirely ignored it as just another add-on as Puppet "goes enterprise".


"We don't really like the idea of cloud hosted monitoring" Can you elaborate on the reasons?


I'll bet one of the reasons is the amount of data. At a previous employer I built a system similar to this, which produced nightly, weekly, monthly, and annual reports of weblog analysis for an app that had 6 million+ HTTP requests per day. It was tough enough to consolidate those logs across the load-balanced servers in a single data center. We never consolidated across data centers (separate reports for each instead), and I doubt we could have shipped all of that data to a cloud service in a reasonable amount of time.

BTW, we made heavy use of setting and logging HTTP headers too. One trick I liked was capturing performance timing metrics as a request was processed and stuffing them into a response header as the response went out. We then logged the response headers, which gave us the ability to report on the performance metrics. We also had a debug mode in the app on the browser side so we could see the performance metrics from the headers there too.


When was this?

~70 events per second doesn't sound like much to capture and aggregate. How much of this parsing did you need to perform in real time? Creating a unique token to pair requests/responses shouldn't add much overhead at all.


Development started around 2000, and stabilized around 2008. As far as I know the reporting scripts are still being run every day. During this period we had purchased a 1TB storage rack from EMC for a million dollars, to give you some perspective on the differences between then and now.

- No real-time parsing; it's all nightly batch processing after devops rotates the Apache server logs to a storage volume. The logs sit there for a while then get compressed and moved to offline tape archives.

- No DB storage of the logs; space was too expensive and the Oracle database we had couldn't have kept up. It was already heavily burdened with a completely separate usage statistics system that fed into user-facing reporting and billing, which had a much higher event rate, about 100x higher, than the http logs.

- We had unique tokens, but they identified a particular user session that tied together all of the user's http requests from login to logoff/abandonment, and which also tied into the Oracle-based statistics for that user, that user's organization, and the customer responsible for the user (often multi-organization). My reports had breakdowns for individual user experiences, session-level metrics, and user type/organization/customer/region/etc metrics.

- I don't recall how long the analysis took; it was between half an hour to two hours I think. A lot of that time was spent on disk I/O reading the logs. I had optimized the parsing, analysis, and results recording about as much as I could.

- This stuff was written in Perl, and ran on Solaris servers from that time era... probably not a lot more powerful than a handful of smartphones today, though they did have lots of cpus. I don't think traffic has grown much since I left the company (we had pretty full market penetration already) so it's likely those servers haven't been upgraded.


I suspected it would have to be a system of that era.

I think I have a good idea of how businesses (at a high level) have failed to understand Moore's law from 2000-present. I'm curious what those failures of understanding were like from 1985-2000.

We all know that technology has been advancing rapidly, but these specific anecdotes of organizations paying a million dollars just for the backing storage of a system that you can essentially get for free from Google now...


Yeah, it's pretty amazing how much things have changed. That raw log data was about 250GB/year which is nothing today but when we started collecting it we were paying $1000/GB.

Actually, they're probably still paying over $100/GB. The whole datacenter was outsourced to Perot Systems in the mid-2000s, and the storage fees were astronomical. We calculated that Perot must pay a separate tech to stare at each individual hard drive with a replacement in-hand in case any errors were reported. At least, they could afford to do that with what we were paying them for storage.


1. The amount of data would be quite large when it comes to the logs

2. System lock-in, I like to have the data and be able to query it as I see fit

3. If an event happens where the facility is cut off from the Internet, you won't be able to tell what happened inside the facility (unless maybe there is a store-and-forward agent, but even then it is only useful after the event).

4. Latency monitoring (again, maybe an agent can help), but if I see changes in response time I can't tell if that is the WAN or not.


On point 3, that makes high-level (non-agent-based) cloud monitoring a must in addition to your primary monitoring. In that kind of event, you need local and global monitoring.


I like the idea of a totally independent 3rd-party monitoring system. For that we use Pingdom, and its job is basically to tell us if our public services are up from the outside world.

Basically it is an additional perspective and a redundant level of monitoring. But I view it mostly as an up/down layer of monitoring. Extremely important, but simple and not really meant to give insight into the complexity of our system.


We're doing an evaluation of a bunch at work, and we usually come down to:

1) most charge per server per month, making costs go up quickly;

2) you don't own the data, and the API may not exist or may be wonky with implementation details;

3) most install agents, meaning your production servers have another piece of code on them, and it's hard to test how they behave under load.


Where does the patching dashboard pull data from? Is it tracked by hand or is there a scanner? We use Orion at work, and it's got a decent amount of data in it, but is kind of kludgy and slow.


The patching dashboard has scheduled jobs on the clients (PowerShell scripts on the Windows boxes and Ruby scripts on the Linux boxes).

We use Puppet to deploy the client to our Linux boxes and for Windows we deploy the task and scripts with Group Policy (soon to be replaced by Desired State Configuration).

This information isn't something we need to poll for often (and on Windows, there are difficulties interacting with the Windows Update APIs remotely). We add and replace servers often, so the clients add themselves to the dashboard, as well as updating their status.

We have integrated some reporting into Orion, but that was a side effect of not having the dashboard before.
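
A rough sketch of what such a client-side check might look like on the Windows side (the dashboard endpoint is hypothetical, not our actual API):

    # Count pending updates via the Windows Update Agent COM API
    $session  = New-Object -ComObject Microsoft.Update.Session
    $searcher = $session.CreateUpdateSearcher()
    $pending  = $searcher.Search("IsInstalled=0 and Type='Software'").Updates.Count

    # Report the result to a (hypothetical) dashboard endpoint
    $body = @{ host = $env:COMPUTERNAME; pendingUpdates = $pending } | ConvertTo-Json
    Invoke-RestMethod -Uri 'http://status.example.internal/api/patching' -Method Post `
        -ContentType 'application/json' -Body $body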


I am curious how they deploy database schema and stored procedure updates. That is a much harder problem.


We have a forwards-only migration runner that takes care of the deployment for us. We don't use many stored procedures, but they're taken care of the same way (in a migration). Our policy is that pushed code must be backwards compatible by however many migrations are being deployed in a production build.



