Gitlab was down (status.gitlab.com)
68 points by phoe-krk on Oct 24, 2020 | hide | past | favorite | 31 comments



At least they are transparent about it. Certain providers (not necessarily in the git space) would show green across the board and gaslight you if you question it.


Some users may have experienced slightly elevated error rates.


Thank you for applying to our technical PR department, you're hired!


> Some

Or, typically, "A small portion ..."

You might also have noticed that they almost always mention percentages ("less than 3% of users") instead of real numbers ("one million users") to make these incidents seem MUCH less significant.


100% is indeed an elevated rate compared to 0%.


I love AWS, but they've been struggling lately (probably understandable with the increased, unexpected load this year?). Not going into details to keep on topic, but every time the status is green. Rarely any acknowledgement of problems or degradation.


I doubt it’s related to this year. They’ve been famous for having a status dashboard that straight up lies for years.


Hey, how'd you know exactly who I was thinking about?!


That's why somebody created https://stop.lying.cloud/


gitlab.com is down, but our public/private instances are not. Separate failure domains are an advantage of self-hosting.

Github is great, but when it's down, it impacts developers in many places.

When we spread our eggs across many baskets, we are more resilient.


> Separate failure domains is an advantage of self-hosted.

Well, the responsibility falls on the infrastructure people of an open-source project as soon as you self-host. But then again, many open-source projects already do this. I'd rather have control over my repositories than host them on someone else's server, which is why I'd go for self-hosted Gitlab or Gitea.

This is the point I have been making for self-hosting over the last few months, since GitHub has been down on a regular basis, and against locking yourself into their ecosystem, which is another risk. GitLab's cloud is no different. [0]

[0] https://news.ycombinator.com/item?id=23915707


People tend to forget about scale. GitLab needs to work for millions of developers and repositories; your self-hosted instance needs to work for only a couple dozen developers in the case of a smaller company, which doesn't even need a distributed system.


I've been using self-hosted Gitlab instances for about two years now; maintenance is really low. I've never seen an outage on any of the systems.

Running applications at scale is hard but running Gitlab in a low traffic environment is incredibly stable and low maintenance.


I can agree with this. I maintain a gitlab instance at my place of work and it’s been very resilient for the last two years I’ve been doing it, even with the most mild of attention paid to it.


Our team uses Gitlab. I think we can do fine without Gitlab for a day. It's not like you need it constantly when developing.


> I think we can do fine without Gitlab for a day.

This sounds to me like you've not really bought in to the peripheral software project tools Gitlab offers like boards, issue tracking, CI/CD for running tests, etc. In which case, Gitlab seems a bit like overkill. Running your own git remote is pretty straightforward.


We use a self-hosted Jenkins for CI/CD, tests. Does Gitlab offer anything substantially better in this regard? I am open to moving away from Jenkins but only if it is better. Jenkins is easy to maintain and relatively simple to configure.


Depends on the day (what if it's the day we're supposed to release a new version!), the size of the team, etc. It's a DevOps platform, not just source control.

Disclaimer: we use self-hosted GitLab at work.


Appears to be an outage with jwt_auth, which will affect most of the gitlab components.

> 09:46 - looking at https://log.gprd.gitlab.net/goto/50cd8288ecda46c79ddf7b60579... Looks like a hug of death, not an attack. A user is deleting a large number of tags

Interesting - looks like the issue might be caused by someone deleting too many tags: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2...


PostgreSQL automatic reindexing identified as the cause, I think?

Database load spiked, and this dramatically increased error rates and degraded response times. Automatic re-indexing was enabled over the weekend. A re-index of the routes table changed the statistics on that table, resulting in much less efficient queries against it. A manual re-analyze of the table fixed the issue. The automatic re-indexing has been disabled.

I am more interested in what automatic reindexing is, why it was enabled, how it broke things, and so on. Can someone please enlighten me?


The precise root cause was that Postgres statistics on a given table were completely off. This caused, as you rightly pointed out, a frequent query that normally takes 1-2 ms to start timing out (the timeout is 15 s). That led to an error spike and a slowdown across most of the read-only replicas.

There's a correlation between the incident and the reindex operation, which affected the index that provided the necessary speedup for that query.

The reindex creates a new index concurrently on the table and drops the old index afterwards. How exactly it affected the statistics has not been determined yet.

There's more information on the incident issue. FYI: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2...

Update: the cause seems to be that the index was a functional index. In this case, table statistics are not applicable and the index requires its own statistics, which can only be generated via ANALYZE (manually or via autovacuum, of course).
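
A minimal sketch of that failure mode, using psycopg2 and hypothetical table/index names (not GitLab's actual schema):

    # Minimal sketch only; 'routes' / 'idx_routes_lower_path' are placeholders,
    # not GitLab's real schema. Requires PostgreSQL 12+ for REINDEX CONCURRENTLY.
    import psycopg2

    conn = psycopg2.connect("dbname=example")
    conn.autocommit = True  # REINDEX CONCURRENTLY cannot run inside a transaction block

    with conn.cursor() as cur:
        # An expression ("functional") index: the planner cannot reuse the table's
        # per-column statistics for LOWER(path); the expression needs its own stats.
        cur.execute("CREATE INDEX IF NOT EXISTS idx_routes_lower_path "
                    "ON routes (LOWER(path))")

        # Rebuilding the index concurrently swaps in a brand-new index...
        cur.execute("REINDEX INDEX CONCURRENTLY idx_routes_lower_path")

        # ...whose expression statistics are empty until ANALYZE (or autovacuum)
        # runs, which is why the manual ANALYZE restored the fast plan.
        cur.execute("ANALYZE routes")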

Disclaimer: I work at OnGres, and we help GitLab with PostgreSQL Support.


Tangentially related, but have you ever had to deal with statistics on a table indexed on (x, date), where a few million rows are added each day, x can be anywhere in [0, 1000], and the distribution is skewed so that 80% of rows belong to only a few values of x?

I ran into a situation like this where, after enough days of data had accumulated, Postgres would eventually fall behind on updating stats, such that a week could elapse without stats being updated, causing the query planner to think no rows existed within that time range. This would result in a nested loop instead of a more efficient hash join, making a query take 2 hours instead of 2 seconds.

Increasing the number of rows sampled with SET STATISTICS didn't seem to help. I wound up running a cron job to inspect pg_stats and manually run ANALYZE whenever enough days had elapsed without most_common_vals being updated.
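
For the curious, a rough sketch of that kind of cron check (the table/column names and staleness threshold here are hypothetical, not the actual production schema):

    # Rough sketch of such a cron check; 'events' / 'created_date' and the
    # threshold are made up for illustration.
    import datetime
    import psycopg2

    STALE_AFTER_DAYS = 2

    conn = psycopg2.connect("dbname=example")
    conn.autocommit = True  # run ANALYZE outside of a long-lived transaction

    with conn.cursor() as cur:
        # pg_stats exposes most_common_vals as anyarray; cast via text so the
        # driver can read it, then look at the newest date the planner knows about.
        cur.execute("""
            SELECT most_common_vals::text::text[]
            FROM pg_stats
            WHERE tablename = 'events' AND attname = 'created_date'
        """)
        row = cur.fetchone()
        dates = [datetime.date.fromisoformat(v) for v in (row[0] or [])] if row else []
        newest_known = max(dates) if dates else None

        # If the statistics haven't seen a recent date, the planner assumes the
        # recent range is empty and falls back to a nested loop; force an ANALYZE.
        stale = (newest_known is None or
                 (datetime.date.today() - newest_known).days > STALE_AFTER_DAYS)
        if stale:
            cur.execute("ANALYZE events")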


For context if/when the status page changes:

- Web, API, git, pages, CI/CD, and registry are down, among other services

- Timeline so far: high DB load degrading availability reported as of 09:36 UTC, investigation continuing as of 10:02 UTC

This follows a reported DDoS attempt yesterday.


Why append "?hn-duplicate-disable=true" at the end of the URL?


Because I was unable to add the submission - I was instead redirected to https://news.ycombinator.com/item?id=21656268.



It's a long enough query parameter to bypass HN's duplicate detector. Previous submissions appear to have been auto-flagged.


If Github goes down today, Gitlab will be able to do their usual "but we did it first!" post.


I find this extremely odd, given how the sysadmins of GitLab have a reputation for being among the best.


Not sure if you were being sarcastic? I believe it is probably fair to say they are among the most radically transparent.

Given the harsh criticism some of their previously proposed architectures, approaches to business continuity, etc. have received here in years past, I don't think there's any kind of consensus on them being "the best" at providing HA SaaS.


I find your comment odd, given the high percentage of "XXX hosted SCM is down today" posts that are about Gitlab.



