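To let the primary and secondary nodes share the same URL, configure a geo-aware DNS service. Such a service resolves the domain name to a different IP address depending on where the user is located. For example, requests from users in China can resolve to the Geo primary node, while requests from users in the United States resolve to the US Geo secondary node.
Note that when code is pulled from a secondary node, the node first checks whether its copy of the repository is up to date. If it is not, the request is automatically forwarded to the primary node. This ensures that code pulled through a secondary node is always current.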
-
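Schedule a maintenance window
Even a planned failover needs a maintenance window, because Geo replicates data asynchronously rather than synchronously, so the secondary lags behind the primary. The maintenance window gives in-flight data time to finish replicating while no new data comes in.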
-
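Transfer other data
Not all GitLab data can be replicated between the primary and secondary nodes through Geo. Anything Geo does not support must be handled or configured separately on the secondary node. If that data is stored in files, consider using rsync to transfer it.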
-
Preflight checks
Run the following command to perform the preflight checks, which verify that data replication has finished:
gitlab-ctl promotion-preflight-checks
In versions before 13.8, the command reports ERROR - Replication is not up-to-date when any data type has 0 items to replicate, even if every item really has been replicated. This bug was fixed in 13.8.
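The zero-item edge case above can be screened for independently of the preflight result. The following is a minimal sketch, assuming the status lines have the `Name: done/total (percent%)` shape that appears in the promotion output later in this guide. The function name `incomplete_types` is hypothetical and not part of GitLab.

```shell
# Hypothetical helper: reads status lines such as "Wikis: 5/5 (100%)" on stdin
# and prints the data types whose done count is lower than the total.
# A "0/0" line counts as complete, since there is nothing to replicate.
incomplete_types() {
  awk -F': ' '
    $2 ~ /^[0-9]+\/[0-9]+/ {
      split($2, counts, "[/ ]")       # counts[1] = done, counts[2] = total
      if (counts[1] + 0 < counts[2] + 0) print $1
    }'
}
```

One possible use is `gitlab-rake gitlab:geo:check_replication_verification_status | incomplete_types`. Empty output would suggest that every listed data type has finished replicating.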
-
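Object storage
If your GitLab installation is large or you cannot tolerate downtime, consider migrating your data to object storage before the failover.
This shortens the maintenance window and lowers the risk of data loss from a mistake during the failover.
From GitLab 12.4 onward, GitLab can be configured to manage object storage replication for secondary nodes.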
-
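Review the configuration files
Database replication between the primary and secondary nodes is automatic, but the configuration file /etc/gitlab/gitlab.rb is maintained by hand on each node and differs between them.
For example, if the primary node has Mattermost, OAuth, or LDAP integration configured and the secondary does not, those integrations are lost after the failover.
Before the failover, review the secondary node's configuration file and make sure the secondary offers the same functionality as the primary.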
-
System checks
Run the following commands on both the primary and secondary nodes. Resolve any reported errors before the failover:
gitlab-rake gitlab:check
gitlab-rake gitlab:geo:check
-
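Check the secrets files
Run the following command on both the primary and secondary nodes and confirm that the output for each file matches. If any file differs, copy it from the primary node to the secondary node.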
sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json
-
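Check the Geo replication status
Open the secondary node under Admin > Geo and confirm that every replicated data type shows 100%. If not, keep waiting. Any item that failed to replicate loses whatever data had not yet been replicated.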
-
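Verify data integrity
If the checksums on the primary and secondary nodes match, data integrity is intact.
See Automatic background verification for how to configure automatic background integrity checks.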
-
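Notify users
Configure a broadcast message under Admin Area > Messages, for example:
Hello! We will perform routine maintenance within the next hour. Planned maintenance window: XX:XX - XX:XX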
-
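Prevent new data from reaching the primary node
First, enable maintenance mode.
Next, disable all non-Geo jobs: go to Admin Area > Monitoring > Background Jobs > Cron and select Disable All. Then select geo_sidekiq_cron_config_worker and click Enable.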
-
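Finish all replication and verify all data
Check any data that Geo does not replicate, such as object storage. This is the time to trigger one final replication.
On both the primary and secondary nodes, check Admin Area > Monitoring > Background Jobs > Queues and make sure every queue is at 0, except the Geo queues.
On the secondary node, run an integrity check on CI artifacts, LFS objects, and attachments: Integrity check Rake task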
-
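Promote the secondary node to the new primary
Follow the sections below to promote the secondary node to primary. Users may need to log in again during the promotion.
When maintenance is complete, remember to remove the broadcast message.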
-
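1. Replicate as much data as possible
If the secondary node can still replicate data from the primary node, follow the planned failover steps above as far as possible to finish replication and avoid data loss.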
-
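2. Permanently disable the primary node
If the primary node went down before some data could be replicated, treat that data as permanently lost.
If you can still SSH into the primary node, run: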
gitlab-ctl stop
systemctl disable gitlab-runsvdir
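If you can no longer log in to the primary node, do whatever you can to keep it from coming back online, including:
- Reconfigure the load balancer.
- Change DNS records (for example, point the primary DNS record at the secondary node so the old primary is no longer used).
- Stop the virtual servers.
- Block traffic through the firewall.
- Revoke the primary node's object storage permissions.
- Physically disconnect the machine.
If you need to change DNS, consider lowering the TTL so the change propagates faster.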
-
3. Promote the secondary node
-
Edit the configuration file /etc/gitlab/gitlab.rb and delete the following two lines:
geo_secondary_role['enable'] = true
roles ['geo_secondary_role']
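Note: do not run gitlab-ctl reconfigure yet.
To promote the secondary node on 14.4 and earlier, use the following command (no longer supported from 15.0 onward):
gitlab-ctl promote-to-primary-node # includes the preflight checks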
Output:
Ensure you have completed the following manual preflight checks:
- Check if you need to migrate to Object Storage
- Review configuration of each secondary node
- Run system checks
- Check that secrets match between nodes
- Notify users of scheduled maintenance
Please read https://docs.gitlab.com/ee/administration/geo/disaster_recovery/planned_failover.html#preflight-checks
Did you perform all manual preflight checks (y/n)?
y
---------------------------------------
WARNING: Make sure your primary is down
If you have more than one secondary please see https://docs.gitlab.com/ee/gitlab-geo/disaster-recovery.html#promoting-secondary-geo-replica-in-multi-secondary-configurations
There may be data saved to the primary that was not been replicated to the secondary before the primary went offline. This data should be treated as lost if you proceed.
---------------------------------------
Is primary down? (N/y): y
Running gitlab-rake gitlab:geo:check_replication_verification_status...
Repositories: 5/5 (100%)
Verified Repositories: 5/5 (100%)
Wikis: 5/5 (100%)
Verified Wikis: 5/5 (100%)
LFS Objects: 0/0 (0%)
Attachments: 1/1 (100%)
CI job artifacts: 4/4 (100%)
Design repositories: 0/0 (0%)
Merge Request Diffs: 0/0 (0%)
Package Files: 0/0 (0%)
Terraform State Versions: 0/0 (0%)
Snippet Repositories: 0/0 (0%)
Group Wiki Repositories: 0/0 (0%)
Repositories Checked: 5/5 (100%)
Package Files Verified: 0/0 (0%)
SUCCESS - Replication is up-to-date.
Repositories: 5/5 (100%)
Verified Repositories: 5/5 (100%)
Wikis: 5/5 (100%)
Verified Wikis: 5/5 (100%)
LFS Objects: 0/0 (0%)
Attachments: 1/1 (100%)
CI job artifacts: 4/4 (100%)
Design repositories: 0/0 (0%)
Merge Request Diffs: 0/0 (0%)
Package Files: 0/0 (0%)
Terraform State Versions: 0/0 (0%)
Snippet Repositories: 0/0 (0%)
Group Wiki Repositories: 0/0 (0%)
Repositories Checked: 5/5 (100%)
Package Files Verified: 0/0 (0%)
SUCCESS - Replication is up-to-date.
All preflight checks have passed. This node can now be promoted.
WARNING: Secondary will now be promoted to primary. Are you sure you want to proceed? (y/n)
y
Disabling the secondary role and enabling the primary in the cluster configuration file...
Promoting the PostgreSQL read-only replica to primary...
could not change directory to "/root": Permission denied
waiting for server to promote.... done
server promoted
The database is successfully promoted!
Reconfiguring...
...
...
...
gitlab Reconfigured!
Running gitlab-rake geo:set_secondary_as_primary...
https://us.alexju.cn/ is now the primary Geo node
You successfully promoted this node!
To skip the preflight checks because you have already run them:
gitlab-ctl promote-to-primary-node --skip-preflight-checks
If the primary node is unreachable, you can force the promotion (unreplicated data will be lost):
gitlab-ctl promote-to-primary-node --force
On 14.5 and later, the following command is recommended for promotion:
sudo gitlab-ctl geo promote
Or, to force the promotion:
sudo gitlab-ctl geo promote --force
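Visit the URL previously used for the secondary node to confirm that the promotion succeeded. The former primary URL only reaches the new primary after the DNS change in the next step.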
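The version split described above (14.4 and earlier versus 14.5 and later) can be captured in a small helper so that a runbook picks the matching command. This is a minimal sketch, assuming the version is supplied as MAJOR.MINOR or MAJOR.MINOR.PATCH. The function name `promote_command` is hypothetical, and the function only prints the command rather than running it.

```shell
# Hypothetical helper: prints the promotion command this guide recommends
# for a given GitLab version (MAJOR.MINOR or MAJOR.MINOR.PATCH).
promote_command() {
  major=${1%%.*}          # text before the first dot
  rest=${1#*.}            # text after the first dot
  minor=${rest%%.*}       # minor version, with any patch level dropped
  if [ "$major" -gt 14 ] || { [ "$major" -eq 14 ] && [ "$minor" -ge 5 ]; }; then
    echo "gitlab-ctl geo promote"
  else
    echo "gitlab-ctl promote-to-primary-node"
  fi
}
```

Printing the command instead of executing it keeps the final decision, including whether to add --force, with the operator.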
-
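4. (Optional) Update DNS records
So that clients do not have to change their URL, update the DNS records to point the domain name at the secondary node.
Then edit the configuration file /etc/gitlab/gitlab.rb on the secondary node (the new primary):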
external_url 'https://<new_external_url>'
Apply the configuration:
gitlab-ctl reconfigure
Update the URL of the new primary node. This command uses the external_url defined in /etc/gitlab/gitlab.rb:
gitlab-rake geo:update_primary_node_url
Output:
Updating primary Geo node with URL https://us.alexju.cn/ ...
https://us.alexju.cn/ is now the primary Geo node URL
Before GitLab 12.8, the primary node's name must also be updated in the database:
gitlab-rails runner 'Gitlab::Geo.primary_node.update!(name: GeoNode.current_node_name)'
Visit the old URL and confirm that it is reachable.
-
5. (Optional) Add a new secondary node
Geo is not enabled automatically on the new primary node. Follow the setup guide above to configure it.
-
6. Remove the former secondary's tracking database
After the secondary node has been promoted to primary, delete the tracking database:
rm -rf /var/opt/gitlab/geo-postgresql
-
7. For other scenarios, refer to:
-
Start the services
gitlab-ctl start
systemctl enable gitlab-runsvdir # start automatically at boot
-
Delete /etc/gitlab/gitlab-cluster.json. This file mainly records the node's Geo role:
{
"primary": true,
"secondary": false
}
-
Stop puma and sidekiq
This prevents anything from running before the secondary node is fully configured.
gitlab-ctl stop puma
gitlab-ctl stop sidekiq
-
Check connectivity to the primary node's PostgreSQL database
gitlab-rake gitlab:tcp_check[43.131.246.172,5432]
Output:
TCP connection from 43.131.246.172:45696 to 43.131.246.172:5432 succeeded
If this step fails, check the IP address (be careful to distinguish internal from external addresses) and the firewall rules.
-
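Install server.crt
This step can be skipped, because the failover did not affect the certificate files. You can still perform it if you want to be safe.
Copy the primary node's certificate ~gitlab-psql/data/server.crt to the secondary node, then install it on the secondary node: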
scp root@43.131.246.172:~gitlab-psql/data/server.crt /etc/gitlab/ssl
install \
-D \
-o gitlab-psql \
-g gitlab-psql \
-m 0400 \
-T /etc/gitlab/ssl/server.crt ~gitlab-psql/.postgresql/root.crt
-
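Test the TLS-encrypted database connection
As the gitlab-psql user, test connectivity to the primary node's database. The primary node's default database is gitlabhq_production.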
sudo \
-u gitlab-psql /opt/gitlab/embedded/bin/psql \
--list \
-U gitlab_replicator \
-d "dbname=gitlabhq_production sslmode=verify-ca" \
-W \
-h 43.131.246.172
Output:
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
---------------------+-------------+----------+---------+-------+---------------------------------
gitlabhq_production | gitlab | UTF8 | C | C |
postgres | gitlab-psql | UTF8 | C | C |
template0 | gitlab-psql | UTF8 | C | C | =c/"gitlab-psql" +
| | | | | "gitlab-psql"=CTc/"gitlab-psql"
template1 | gitlab-psql | UTF8 | C | C | "gitlab-psql"=CTc/"gitlab-psql"+
| | | | | =c/"gitlab-psql"
(4 rows)
-
Configure the old primary node's configuration file as follows:
roles ['geo_secondary_role']
gitlab_rails['geo_node_name'] ='cn'
postgresql['sql_user_password'] = 'a21bac212bxxxxxxxx27472e46e2e2'
gitlab_rails['db_password'] = 'Gxxxxxx3'
postgresql['listen_address'] = '0.0.0.0'
postgresql['md5_auth_cidr_addresses'] = ['127.0.0.1/32','1.13.24.86/32', '43.131.246.172/32']
postgresql['max_replication_slots'] = 5
gitlab_rails['auto_migrate'] = true
gitlab-ctl reconfigure # apply the GitLab configuration
gitlab-ctl restart postgresql # apply the PostgreSQL configuration
-
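Run database replication
If pgbouncer is enabled, disable it in the configuration file first.
Follow Set up database replication to complete database replication.
Note: if the old primary node still holds data, add the --force option to overwrite it: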
gitlab-ctl replicate-geo-database --slot-name=cn --host=43.131.246.172 --force --backup-timeout=7200
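PostgreSQL accepts only lowercase letters, digits, and underscores in replication slot names, so a node name containing other characters cannot be passed to --slot-name unchanged. The following is a minimal sketch of one way to derive a valid slot name. The helper `slot_name_for` is hypothetical, and reusing the node name as the slot name is a convention assumed here rather than a requirement of this guide.

```shell
# Hypothetical helper: turns an arbitrary node name into a string that is
# valid as a PostgreSQL replication slot name (lowercase letters, digits,
# underscores). Every other character is replaced with an underscore.
slot_name_for() {
  printf '%s' "$1" \
    | LC_ALL=C tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C tr -c 'a-z0-9_' '_'
}
```

With the node name used in this guide, `slot_name_for cn` leaves the value unchanged, so the command above would still use --slot-name=cn.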
-
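Add the secondary node in the UI
In the primary node's GitLab UI, open Admin Area > Geo (/geo/sites) and click the New node button.
Fill in Name and URL. Name is the value of gitlab_rails['geo_node_name'] in /etc/gitlab/gitlab.rb, and URL is the external_url. Both must match the configuration file exactly.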
-
Run gitlab-ctl restart on the secondary node
-
Check the Geo configuration on the secondary node
gitlab-rake gitlab:geo:check
Output:
Checking Geo ...
GitLab Geo secondary database is correctly configured ... no
Try fixing it:
Run the tracking database migrations: gitlab-rake db:migrate:geo
For more information see:
doc/gitlab-geo/database.md
Database replication enabled? ... yes
Database replication working? ... yes
GitLab Geo HTTP(S) connectivity ...
Can connect to the primary node ... yes
GitLab Geo is available ...
GitLab Geo is enabled ... yes
This machine's Geo node name matches a database record ... yes, found a secondary node named "cn"
HTTP/HTTPS repository cloning is enabled ... yes
Machine clock is synchronized ... yes
Git user has default SSH configuration? ... yes
OpenSSH configured to use AuthorizedKeysCommand ... skipped
Reason:
Cannot access OpenSSH configuration file
Try fixing it:
This is expected if you are using SELinux. You may want to check configuration manually
For more information see:
doc/administration/operations/fast_ssh_key_lookup.md
GitLab configured to disable writing to authorized_keys file ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking Geo ... Finished
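In a long check report, a single failing item such as the first one above is easy to overlook. The following is a minimal sketch that lists only the failed checks, assuming each check prints its result as a trailing `... yes` or `... no` as in the output above. The function name `failed_checks` is hypothetical.

```shell
# Hypothetical helper: reads `gitlab-rake gitlab:geo:check` output on stdin
# and prints the description of every check whose line ends in "... no".
failed_checks() {
  sed -n 's/ *\.\.\. no$//p'
}
```

One possible use is `gitlab-rake gitlab:geo:check | failed_checks`. For the output above, it would print the secondary database check, which the report says to fix by running the tracking database migrations.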
-
Run the Geo tracking database migrations
gitlab-rake db:migrate:geo
Output:
WARNING: Could not write to the database main: cannot execute UPSERT in a read-only transaction
== 20170206203234 CreateProjectRegistry: migrating ============================
-- create_table(:project_registry, {:id=>:integer})
-> 0.0022s
== 20170206203234 CreateProjectRegistry: migrated (0.0023s) ===================
== 20170223033541 CreateFileRegistry: migrating ===============================
-- create_table(:file_registry, {:id=>:integer})
-> 0.0020s
-- add_index(:file_registry, :file_type)
-> 0.0007s
-- add_index(:file_registry, [:file_type, :file_id], {:unique=>true})
-> 0.0005s
== 20170223033541 CreateFileRegistry: migrated (0.0034s) ======================
...
...
The migration is expected to take at least 0 seconds. Expect all jobs to have completed after 2023-04-23 15:50:04 UTC."
geo: == 20220202101354 MigrateJobArtifactRegistry: migrated (0.0340s) ==============
geo: == 20220617125507 CreateCiSecureFileRegistry: migrating =======================
geo: -- create_table(:ci_secure_file_registry, {:id=>:bigserial, :force=>:cascade})
geo: -- quote_column_name(:verification_failure)
geo: -> 0.0000s
geo: -- quote_column_name(:last_sync_failure)
geo: -> 0.0000s
geo: -> 0.0142s
geo: == 20220617125507 CreateCiSecureFileRegistry: migrated (0.0169s) ==============
-
Check again
gitlab-rake gitlab:geo:check
Output:
Checking Geo ...
GitLab Geo is available ...
GitLab Geo is enabled ... yes
This machine's Geo node name matches a database record ... yes, found a secondary node named "cn"
HTTP/HTTPS repository cloning is enabled ... yes
Machine clock is synchronized ... yes
Git user has default SSH configuration? ... yes
OpenSSH configured to use AuthorizedKeysCommand ... skipped
Reason:
Cannot access OpenSSH configuration file
Try fixing it:
This is expected if you are using SELinux. You may want to check configuration manually
For more information see:
doc/administration/operations/fast_ssh_key_lookup.md
GitLab configured to disable writing to authorized_keys file ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking Geo ... Finished
-
Open the secondary node and confirm that the configuration works. If anything is wrong, review the steps above.