Hacker News new | past | comments | ask | show | jobs | submit login

The precise root cause was Postgres statistics on a given table were completely off. This caused, as you well pointed out, that a frequent query that normally takes 1-2ms started timing out (timeout is 15s). This caused an error spike and slow down across most of the read-only replicas.

There's correlation with the reindex operation, which affected the index that provided the necessary speedup on the operation, and the incident.

The reindex creates a concurrent index on the table and drops the old index afterwards. How it precisely affected the statistics is not determined yet.

There's more information on the incident issue. FYI: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2...

Update: the cause seems to be determined by the fact that the index was a functional index. In this case, table statistics are not applicable and the index requires its own statistics, which can only be generated via ANALYZE (manual or via autovacuum, ofc).

Disclaimer: I work at OnGres, and we help GitLab with PostgreSQL Support.




tangentially related, but have you ever had to deal with statistics on a table indexed on (x, date), where there are a few million rows added each day, and x could be [0,1000], with a distribution like 80% of rows owing to only a few values of x.

ran into a situation like this where after enough days of data had accumulated, postgres would eventually fall behind on updating stats, such that a week could lapse without stats being updated, causing the query planner to think no rows existed within that time range. This would result in a nested loop instead of a more efficient hash join, causing a query to take 2 hours instead of 2 seconds.

increasing the number of rows sampled with set statistics didn't seem to help. wound up running a cronjob to inspect pg_stats, and manually running analyze when enough days had lapsed without most_common_vals being updated.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: