Gitlab 的 Backgroud jobs 队列如何监控,经常有队列任务积压

我在队列没有积压的情况下,执行了这两个命令
gitlab-rake db:migrate:status
的执行结果
migrate_status_0810_1.log (109.4 KB)
gitlab-psql -c ‘select count(*) from ci_builds;’
的执行结果


@xiangma

使用此工具收集下日志看一下,日志需要覆盖故障发生的时段GitLab SOS

2021-08-27_05:58:16.21828 ERROR: could not obtain lock on row in relation “merge_requests”
2021-08-27_05:58:16.21832 STATEMENT: SELECT “merge_requests”.* FROM “merge_requests” WHERE “merge_requests”.“id” = 8755 LIMIT 1 FOR UPDATE NOWAIT
2021-08-27_05:58:16.43091 ERROR: could not obtain lock on row in relation “merge_requests”
2021-08-27_05:58:16.43098 STATEMENT: SELECT “merge_requests”.* FROM “merge_requests” WHERE “merge_requests”.“id” = 8755 LIMIT 1 FOR UPDATE NOWAIT
2021-08-27_05:58:16.91063 ERROR: could not obtain lock on row in relation “merge_requests”
2021-08-27_05:58:16.91067 STATEMENT: SELECT “merge_requests”.* FROM “merge_requests” WHERE “merge_requests”.“id” = 8755 LIMIT 1 FOR UPDATE NOWAIT
2021-08-27_05:58:17.63512 ERROR: could not obtain lock on row in relation “merge_requests”
2021-08-27_05:58:17.63514 STATEMENT: SELECT “merge_requests”.* FROM “merge_requests” WHERE “merge_requests”.“id” = 8755 LIMIT 1 FOR UPDATE NOWAIT
2021-08-27_05:58:18.72970 ERROR: could not obtain lock on row in relation “merge_requests”
2021-08-27_05:58:18.72975 STATEMENT: SELECT “merge_requests”.* FROM “merge_requests” WHERE “merge_requests”.“id” = 8755 LIMIT 1 FOR UPDATE NOWAIT
2021-08-27_05:59:33.02308 LOG: duration: 1902.888 ms execute : SELECT “internal_ids”.* FROM “internal_ids” WHERE “internal_ids”.“id” = 144 LIMIT 1 FOR UPDATE
2021-08-27_06:00:31.34462 LOG: duration: 2279.728 ms execute : SELECT “internal_ids”.* FROM “internal_ids” WHERE “internal_ids”.“id” = 45 LIMIT 1 FOR UPDATE
2021-08-27_06:02:28.57324 LOG: process 603531 still waiting for ShareLock on transaction 3134463 after 5000.150 ms
2021-08-27_06:02:28.57328 DETAIL: Process holding the lock: 600178. Wait queue: 603531.
2021-08-27_06:02:28.57330 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:02:28.57330 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 3, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:23.564678’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:23.568049’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:02:40.62656 LOG: process 604461 still waiting for ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 5000.114 ms
2021-08-27_06:02:40.62661 DETAIL: Process holding the lock: 603531. Wait queue: 604461.
2021-08-27_06:02:40.62662 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 4, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:35.620458’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:35.622669’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:05.76301 LOG: process 606866 still waiting for ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 5000.285 ms
2021-08-27_06:03:05.76306 DETAIL: Process holding the lock: 603531. Wait queue: 604461, 606866.
2021-08-27_06:03:05.76306 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 19, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:00.749989’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:00.754449’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:23.57233 ERROR: canceling statement due to statement timeout
2021-08-27_06:03:23.57238 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:03:23.57238 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 3, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:23.564678’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:23.568049’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:23.57255 LOG: process 604461 acquired ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 47946.091 ms
2021-08-27_06:03:23.57257 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 4, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:35.620458’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:35.622669’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:28.57279 LOG: process 604461 still waiting for ShareLock on transaction 3134463 after 5000.167 ms
2021-08-27_06:03:28.57285 DETAIL: Process holding the lock: 600178. Wait queue: 604461.
2021-08-27_06:03:28.57285 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:03:28.57287 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 4, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:35.620458’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:35.622669’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:30.99669 LOG: process 606062 still waiting for ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 5000.127 ms
2021-08-27_06:03:30.99674 DETAIL: Process holding the lock: 604461. Wait queue: 606866, 606062.
2021-08-27_06:03:30.99675 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 3, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:25.985672’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:25.990495’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:35.62603 ERROR: canceling statement due to statement timeout
2021-08-27_06:03:35.62607 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:03:35.62607 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 4, “status” = ‘running’, “started_at” = ‘2021-08-27 06:02:35.620458’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:02:35.622669’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:35.62637 LOG: process 606866 acquired ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 34863.685 ms
2021-08-27_06:03:35.62638 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 19, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:00.749989’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:00.754449’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:40.62662 LOG: process 606866 still waiting for ShareLock on transaction 3134463 after 5000.117 ms
2021-08-27_06:03:40.62666 DETAIL: Process holding the lock: 600178. Wait queue: 606866.
2021-08-27_06:03:40.62666 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:03:40.62667 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 19, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:00.749989’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:00.754449’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:41.36040 LOG: process 606866 acquired ShareLock on transaction 3134463 after 5733.883 ms
2021-08-27_06:03:41.36044 CONTEXT: while updating tuple (31469,1) in relation “ci_builds”
2021-08-27_06:03:41.36044 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 19, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:00.749989’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:00.754449’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:41.36061 LOG: process 606062 acquired ExclusiveLock on tuple (31469,1) of relation 18225 of database 16386 after 15364.056 ms
2021-08-27_06:03:41.36063 STATEMENT: UPDATE “ci_builds” SET “runner_id” = 3, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:25.985672’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:25.990495’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:41.36073 LOG: duration: 40598.749 ms execute : UPDATE “ci_builds” SET “runner_id” = 19, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:00.749989’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:00.754449’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:41.36459 LOG: duration: 15368.541 ms execute : UPDATE “ci_builds” SET “runner_id” = 3, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:25.985672’, “processed” = FALSE, “updated_at”= ‘2021-08-27 06:03:25.990495’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:03:41.36912 LOG: duration: 2777.906 ms execute : UPDATE “ci_builds” SET “runner_id” = 4, “status” = ‘running’, “started_at” = ‘2021-08-27 06:03:38.585474’, “processed” = FALSE, “updated_at” = ‘2021-08-27 06:03:38.587386’, “lock_version” = 2 WHERE “ci_builds”.“id” = 261067 AND “ci_builds”.“lock_version” = 1
2021-08-27_06:04:46.02506 LOG: automatic analyze of table “gitlabhq_production.public.merge_requests” system usage: CPU: user: 0.63 s, system: 0.02 s, elapsed: 2.53 s

lock_timeout: 0
idle_in_transaction_session_timeout: 1min
deadlock_timeout: 5s
statement_timeout: 1min
这是我目前的参数,要是不是应该调小一点,有什么建议?

目前看到的是数据库阻塞了,看起来像会话一直不释放,看下下面的连接收集下日志https://gitlab.cn/xiangma0510/toolbox/-/tree/main/logging_collection,先确认下情况然后再看下调整参数

没有现场了,现在收集的日志,看不到什么有用的?

这里有个故障发生日的postgresql日志
0828 (101.6 KB)

您有没有更直接的沟通方式,谢谢!

我调整了pg超时时间,今天还是发生了队列积压,想问下 pipeline_hooks:build_hooks 这个队列是做什么的?我发现一个手工触发的Job,pipeline还没走到这个job,这个job的ID已经放在pipeline_hooks:build_hooks 这个队列里。