At least they are transparent about it. Certain providers (not necessarily in the git space) would show green across the board and gaslight you if you questioned it.
You might also have noticed that they almost always mention percentages ("less than 3% of users") instead of real numbers ("one million users") to make these incidents seem MUCH less significant.
I love AWS, but they've been struggling lately (probably understandable with the unexpected increase in load this year?). Not going into details to keep on topic, but every time it's a green status page, with rarely any acknowledgement of problems or degradation.
> Separate failure domains is an advantage of self-hosted.
Well, the responsibility falls on an open-source project's infrastructure people as soon as you self-host. But then again, many open-source projects already do this. I'd rather have control over my repositories than host them on someone else's server, which is why I'd go for self-hosted GitLab or Gitea.
This has been my argument for self-hosting over the last few months, since GitHub was going down on a regular basis, and against locking yourself into their ecosystem, which is another risk. GitLab's cloud is no different. [0]
People tend to forget about scale. GitLab needs to work for millions of developers and repositories; your self-hosted instance only needs to work for a couple dozen developers at a smaller company, which doesn't even need a distributed system.
I can agree with this. I maintain a GitLab instance at my place of work, and it's been very resilient over the two years I've been doing it, even with only the mildest attention paid to it.
This sounds to me like you've not really bought into the peripheral software project tools GitLab offers, like boards, issue tracking, CI/CD for running tests, etc. In which case, GitLab seems a bit like overkill. Running your own git remote is pretty straightforward.
We use self-hosted Jenkins for CI/CD and tests. Does GitLab offer anything substantially better in this regard? I am open to moving away from Jenkins, but only if it is better. Jenkins is easy to maintain and relatively simple to configure.
PostgreSQL automatic reindexing identified as the cause, I think?
Database load spiked, which dramatically increased error rates and degraded response times.
Automatic re-indexing was enabled over the weekend. A re-index of the routes table changed the statistics on that table, resulting in much less efficient queries against it. A manual re-analyze of the table fixed the issue. The automatic re-indexing has been disabled.
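For anyone curious what those operations actually look like in PostgreSQL, here's a minimal sketch. The index and column names are placeholders, not GitLab's real schema; only the routes table comes from the write-up above.

    -- Roughly what an automated reindexing job does: build a replacement
    -- index without blocking writes, swap the names, and drop the old one.
    CREATE INDEX CONCURRENTLY index_routes_on_path_new ON routes (path);
    DROP INDEX CONCURRENTLY index_routes_on_path;
    ALTER INDEX index_routes_on_path_new RENAME TO index_routes_on_path;

    -- The manual fix described above: refresh the planner's statistics.
    ANALYZE VERBOSE routes;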
I am more interested in what automatic reindexing is, why it was enabled, how it broke things, and so on. Can someone please enlighten me?
The precise root cause was that the Postgres statistics on a given table were completely off. As you rightly pointed out, this meant a frequent query that normally takes 1-2ms started timing out (the timeout is 15s), which caused an error spike and slowdown across most of the read-only replicas.
There is a correlation between the incident and the reindex operation, which affected the index that provided the necessary speedup for that query.
The reindex creates a new index concurrently on the table and drops the old index afterwards. Exactly how it affected the statistics has not been determined yet.
Update: the cause appears to be that the index was a functional index. In that case, table statistics are not applicable; the index requires its own statistics, which can only be generated via ANALYZE (manually or via autovacuum, of course).
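To make that last point concrete, here's a toy example (the names are invented): an expression/functional index keeps its own statistics, separate from the table's column statistics, and those only exist once the table has been analyzed.

    -- A functional (expression) index: the planner cannot reuse the plain
    -- column statistics for lower(path), so the index needs its own stats.
    CREATE INDEX idx_routes_lower_path ON routes (lower(path));

    -- The expression statistics are only populated by ANALYZE
    -- (run manually here, or by autovacuum in the background).
    ANALYZE routes;

    -- They appear in pg_stats keyed by the index name, not the table name.
    SELECT tablename, attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'idx_routes_lower_path';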
Disclaimer: I work at OnGres, and we help GitLab with PostgreSQL Support.
Tangentially related, but have you ever had to deal with statistics on a table indexed on (x, date), where a few million rows are added each day, x ranges over [0, 1000], and the distribution is skewed so that about 80% of rows belong to only a few values of x?
I ran into a situation like this where, after enough days of data had accumulated, Postgres would fall behind on updating stats: a week could lapse without the stats being refreshed, so the query planner thought no rows existed within that date range. That produced a nested loop instead of a more efficient hash join, and a query that took 2 hours instead of 2 seconds.
Increasing the number of rows sampled with SET STATISTICS didn't seem to help. I wound up running a cron job to inspect pg_stats and manually running ANALYZE when enough days had lapsed without most_common_vals being updated, roughly as sketched below.
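Roughly what that looks like, for anyone in the same spot (the table and column names here are made up):

    -- The per-column sampling target mentioned above (raising it didn't help here):
    ALTER TABLE events ALTER COLUMN created_at SET STATISTICS 1000;

    -- The cron check: see how stale the planner's view of the date column is.
    -- histogram_bounds ends at the newest value sampled by the last ANALYZE,
    -- and most_common_vals only changes when the statistics are refreshed.
    SELECT most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'created_at';

    -- If the newest date there lags real time by more than a day or two:
    ANALYZE events;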
Not sure if you were being sarcastic? I believe it is probably fair to say they are among the most radically transparent.
Given the harsh criticism with which some of their previously proposed architectures, approaches to business continuity, etc. have been received here in years past, I don't think there's any kind of consensus on them being “the best” at providing HA SaaS?