At least they are transparent about it. Certain providers (not necessarily in the git space) would show green across the board and gaslight you if you questioned it.
You might also have noticed that they almost always mention percentages ("less than 3% of users") instead of real numbers ("one million users") to make these incidents seem MUCH less significant.
I love AWS, but they've been struggling lately (probably understandable with the unexpected increase in load this year?). Not going into details to keep on topic, but every time it's a green status page, with rarely any acknowledgement of problems or degradation.
> Separate failure domains is an advantage of self-hosted.
Well, the responsibility falls on an open-source project's infrastructure people as soon as you self-host. But then again, many open-source projects already do this. I'd rather have control over my repositories than host them on someone else's server, which is why I'd go for self-hosted GitLab or Gitea.
This has been my argument for self-hosting over the last few months, since GitHub was going down on a regular basis, and against locking yourself into their ecosystem, which is another risk. GitLab's cloud is no different. [0]
People tend to forget about scale. GitLab needs to work for millions of developers and repositories; your self-hosted instance only needs to work for a couple dozen developers at a smaller company, which doesn't even need a distributed system.
I can agree with this. I maintain a GitLab instance at my place of work, and it's been very resilient over the two years I've been doing it, even with only the mildest attention paid to it.
This sounds to me like you've not really bought into the peripheral software project tools GitLab offers, like boards, issue tracking, CI/CD for running tests, etc. In which case, GitLab seems a bit like overkill. Running your own git remote is pretty straightforward.
We use self-hosted Jenkins for CI/CD and tests. Does GitLab offer anything substantially better in this regard? I am open to moving away from Jenkins, but only if it is better. Jenkins is easy to maintain and relatively simple to configure.
PostgreSQL automatic reindexing identified as the cause, I think?
Database load spiked, which dramatically increased error rates and degraded response times.
Automatic re-indexing was enabled over the weekend. A re-index of the routes table changed the statistics on that table, resulting in much less efficient queries against it. A manual re-analyze of the table fixed the issue. The automatic re-indexing has been disabled.
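For anyone curious what those operations actually look like in PostgreSQL, here's a minimal sketch. The index and column names are placeholders, not GitLab's real schema; only the routes table comes from the write-up above.

    -- Roughly what an automated reindexing job does: build a replacement
    -- index without blocking writes, swap the names, and drop the old one.
    CREATE INDEX CONCURRENTLY index_routes_on_path_new ON routes (path);
    DROP INDEX CONCURRENTLY index_routes_on_path;
    ALTER INDEX index_routes_on_path_new RENAME TO index_routes_on_path;

    -- The manual fix described above: refresh the planner's statistics.
    ANALYZE VERBOSE routes;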
I am more interested in what automatic reindexing is, why it was enabled, how it broke things, and so on. Can someone please enlighten me?
The precise root cause was that the Postgres statistics on a given table were completely off. As you rightly pointed out, this meant a frequent query that normally takes 1-2ms started timing out (the timeout is 15s), which caused an error spike and slowdown across most of the read-only replicas.
There is a correlation between the incident and the reindex operation, which affected the index that provided the necessary speedup for that query.
The reindex creates a new index concurrently on the table and drops the old index afterwards. Exactly how it affected the statistics has not been determined yet.
Update: the cause appears to be that the index was a functional index. In that case, table statistics are not applicable; the index requires its own statistics, which can only be generated via ANALYZE (manually or via autovacuum, of course).
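To make that last point concrete, here's a toy example (the names are invented): an expression/functional index keeps its own statistics, separate from the table's column statistics, and those only exist once the table has been analyzed.

    -- A functional (expression) index: the planner cannot reuse the plain
    -- column statistics for lower(path), so the index needs its own stats.
    CREATE INDEX idx_routes_lower_path ON routes (lower(path));

    -- The expression statistics are only populated by ANALYZE
    -- (run manually here, or by autovacuum in the background).
    ANALYZE routes;

    -- They appear in pg_stats keyed by the index name, not the table name.
    SELECT tablename, attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'idx_routes_lower_path';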
Disclaimer: I work at OnGres, and we help GitLab with PostgreSQL Support.
Tangentially related, but have you ever had to deal with statistics on a table indexed on (x, date), where a few million rows are added each day, x ranges over [0, 1000], and the distribution is skewed so that about 80% of rows belong to only a few values of x?
I ran into a situation like this where, after enough days of data had accumulated, Postgres would fall behind on updating stats: a week could lapse without the stats being refreshed, so the query planner thought no rows existed within that date range. That produced a nested loop instead of a more efficient hash join, and a query that took 2 hours instead of 2 seconds.
Increasing the number of rows sampled with SET STATISTICS didn't seem to help. I wound up running a cron job to inspect pg_stats and manually running ANALYZE when enough days had lapsed without most_common_vals being updated, roughly as sketched below.
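Roughly what that looks like, for anyone in the same spot (the table and column names here are made up):

    -- The per-column sampling target mentioned above (raising it didn't help here):
    ALTER TABLE events ALTER COLUMN created_at SET STATISTICS 1000;

    -- The cron check: see how stale the planner's view of the date column is.
    -- histogram_bounds ends at the newest value sampled by the last ANALYZE,
    -- and most_common_vals only changes when the statistics are refreshed.
    SELECT most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'events' AND attname = 'created_at';

    -- If the newest date there lags real time by more than a day or two:
    ANALYZE events;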
Not sure if you were being sarcastic? I believe it is probably fair to say they are among the most radically transparent.
Given the harsh criticism with which some of their previously proposed architectures, approaches to business continuity, etc. have been received here in years past, I don't think there's any kind of consensus on them being “the best” at providing HA SaaS?