使用nomad实现工作负载版本升级

书接上文

《使用nomad实现集群管理和微服务部署调度》一文中,我们介绍了使用nomad进行集群管理和工作负载调度的轻量级方案(相较于Kubernetes方案)。在本文中,我们继续对方案进行延展,介绍一下在nomad集群中工作负载版本升级的一些常用模式和实现方法,包括滚动升级蓝绿部署金丝雀部署

一. 初始状态

这里我们利用基于tcp+sni路由(listener端口为9996)的httpsbackend-sni-1的job作为演示job,该job的初始部署nomad job文件为:httpsbackend-tcp-sni-1.nomad (注:不同的是,这里将count初始值改为了3)。

当前httpsbackend-sni-1这个job的状态如下:

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T10:57:29+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       3         0

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created    Modified
7ac186b8  7acdd7bc  httpsbackend-sni-1  22       run      running   1m18s ago  1m1s ago
8a79085f  c281658a  httpsbackend-sni-1  22       run      running   1m18s ago  46s ago
f9ffef32  9e3ef19f  httpsbackend-sni-1  22       run      running   1m18s ago  59s ago
0ed95591  9e3ef19f  httpsbackend-sni-1  20       stop     complete  5d19h ago  7m16s ago
604d2151  9e3ef19f  httpsbackend-sni-1  20       stop     complete  5d19h ago  7m16s ago
06404fff  7acdd7bc  httpsbackend-sni-1  20       stop     complete  5d20h ago  7m14s ago

fabio路由表如下:

img{512x368}

# curl -k https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.0

接下来,我们就以这个job为基础,使用各种版本升级模式对其进行更新。

二. 滚动更新(rolling update)

下面是blog.itaysk.com上一篇文章中的有关滚动更新的示意图:

img{512x368}
可以大致看出所谓滚动更新就是对目标环境下老版本的程序进行逐批的替换,每批的数量可以是1,也可以大于1,根据目标实例的个数自定义。替换过程中,新老版本是并存的,直到所有目标实例都被替换为新版本。

nomad支持通过在job描述文件中增加update配置来支持滚动更新。我们创建httpsbackend-tcp-sni-1-rolling-update.nomad,考虑篇幅,这里仅列出与httpsbackend-tcp-sni-1.nomad的差异:

# diff httpsbackend-tcp-sni-1-rolling-update.nomad ./httpsbackend-tcp-sni-1.nomad
14,19d13
<     update {
<       max_parallel = 1
<       min_healthy_time = "30s"
<       healthy_deadline = "5m"
<     }
<
23c17
<         image = "bigwhite/httpsbackendservice:v1.0.1"
---
>         image = "bigwhite/httpsbackendservice:v1.0.0"

新job nomad文件使用了v1.0.1版本的httpsbackendservice image,增加了update {…}配置环节,其中的max_parallel指示的是滚动更新每批更新的数量,这里是1,也就是说一批仅用新版本替换一个老版本实例。

执行滚动更新:

# nomad job run httpsbackend-tcp-sni-1-rolling-update.nomad
==> Monitoring evaluation "8d39ab53"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "348ef16b"
    Allocation "88c1a29e" created: node "7acdd7bc", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "8d39ab53" finished with status "complete"

httpsbackendservice job的task group有三个task实例,因此更新需要一些时间,我们在更新过程中查看job status:

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T13:06:35+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       4         0

Latest Deployment
ID          = 348ef16b
Status      = running
Description = Deployment is running

Deployed
Task Group          Desired  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  3        1       0        0          2019-04-08T13:16:35+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created   Modified
88c1a29e  7acdd7bc  httpsbackend-sni-1  23       run      running   44s ago   41s ago
7ac186b8  7acdd7bc  httpsbackend-sni-1  22       run      running   2h9m ago  2h9m ago
8a79085f  c281658a  httpsbackend-sni-1  22       run      running   2h9m ago  2h9m ago
f9ffef32  9e3ef19f  httpsbackend-sni-1  22       stop     complete  2h9m ago  44s ago

我们看到nomad job status命令输出的信息中多出了“Latest Deployment”一个小节,在该小节中,我们看到了一个ID为348ef16b的deployment正在run。这个deployment对应的就是这次的滚动更新,我们看到下面的allocations列表中,一个version为22的allocation已经stop,一个version为23的allocation已经run,这说明nomad已经完成了一个task实例的版本升级。

我们再来查看一下job执行的最终状态:

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T13:06:35+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       6         0

Latest Deployment
ID          = 348ef16b
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group          Desired  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  3        3       3        0          2019-04-08T13:18:43+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created    Modified
da1b545b  7acdd7bc  httpsbackend-sni-1  23       run      running   34s ago    2s ago
44da5693  9e3ef19f  httpsbackend-sni-1  23       run      running   1m25s ago  36s ago
88c1a29e  7acdd7bc  httpsbackend-sni-1  23       run      running   2m10s ago  1m26s ago
7ac186b8  7acdd7bc  httpsbackend-sni-1  22       stop     complete  2h11m ago  1m24s ago
8a79085f  c281658a  httpsbackend-sni-1  22       stop     complete  2h11m ago  34s ago
f9ffef32  9e3ef19f  httpsbackend-sni-1  22       stop     complete  2h11m ago  2m10s ago

我们看到job执行的最终结果:ID为348ef16b的deployment执行成功;所有version 为23的allocations都处于running状态。task group的三个task实例都处于healthy状态。这说明滚动更新成功了!

我们也可以通过nomad提供的deployment子命令查看deployment的状态,deployment id作为命令参数:

# nomad deployment list
ID        Job ID              Job Version  Status      Description
348ef16b  httpsbackend-sni-1  23           successful  Deployment completed successfully

# nomad deployment status 348ef16b
ID          = 348ef16b
Job ID      = httpsbackend-sni-1
Job Version = 23
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group          Desired  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  3        3       3        0          2019-04-08T13:18:43+08:00

滚动更新后的路由:

img{512x368}

测试一下部署成功的新版本服务:

# curl -k https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.1

三. 金丝雀部署(canary deployment)

金丝雀部署是另外一种十分有用的部署模式,下面示意图来自blog.itaysk.com:

img{512x368}

金丝雀 (Canary)得名于矿工的一个工作习惯:下矿洞前,先会放一只金丝雀进去探测是否有有毒气体,看金丝雀能否活下来。如果金丝雀活下来,则继续下矿操作;否则停止下矿。金丝雀部署亦是先部署少量新版本的服务实例,发布后,开发者可简单地通过手工测试验证新版本实例,又或通过完善的自动化测试基础设施对新版本实例进行详尽验证;甚至是直接接收部分生产流量以充分验证新版本功能、稳定性、性能等,以给予开发者更多信心。如果金丝雀实例通过全部测试验证,则把所有老版本全部升级为新版本。如果金丝雀测试失败,则直接回退金丝雀实例,发布失败。

nomad支持两种模式的canary部署:既支持部署canary实例去直接接收生产流量(按比例权重),也可以将其与生产实例隔离开来(利用路由)单独测试验证,下面分别说说这两种模式。

1. 部署canary实例去直接接收生产流量(按比例权重)

我们创建一个新的nomad job文件:httpsbackend-tcp-sni-1-canary-1.nomad

# diff  httpsbackend-tcp-sni-1-canary-1.nomad  httpsbackend-tcp-sni-1-rolling-update.nomad
18d17
<       canary = 1
24c23
<         image = "bigwhite/httpsbackendservice:v1.0.2"
---
>         image = "bigwhite/httpsbackendservice:v1.0.1"

我们看到除了新版本task使用v1.0.2版image之外,最大的不同就是在update {…}配置区域增加了一行:

canary = 1

我们来plan一下该nomad文件:

# nomad job plan httpsbackend-tcp-sni-1-canary-1.nomad
+/- Job: "httpsbackend-sni-1"
+/- Task Group: "httpsbackend-sni-1" (1 canary, 3 ignore)
  +/- Update {
        AutoRevert:       "false"
    +/- Canary:           "0" => "1"
        HealthCheck:      "checks"
        HealthyDeadline:  "300000000000"
        MaxParallel:      "1"
        MinHealthyTime:   "30000000000"
        ProgressDeadline: "600000000000"
      }
  +/- Task: "httpsbackend-sni-1" (forces create/destroy update)
    +/- Config {
      +/- image:              "bigwhite/httpsbackendservice:v1.0.1" => "bigwhite/httpsbackendservice:v1.0.2"
          logging[0][type]:   "json-file"
          port_map[0][https]: "7777"
        }

Scheduler dry-run:
- All tasks successfully allocated.

... ...

我们看到nomad分析的结果是:需要创建一个canary实例,忽略三个已经存在的旧版本task实例。同时task group的canary属性从“0”变为了“1”。

我们来run该job:

# nomad job run httpsbackend-tcp-sni-1-canary-1.nomad
==> Monitoring evaluation "0494a8a9"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "3e541fb3"
    Allocation "4d678e67" created: node "c281658a", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0494a8a9" finished with status "complete"

查看job的run状态:

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T21:04:49+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         4        0       6         0

Latest Deployment
ID          = 3e541fb3
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  false     3        1         1       0        0          2019-04-08T21:14:49+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created    Modified
4d678e67  c281658a  httpsbackend-sni-1  24       run      running   31s ago    15s ago
da1b545b  7acdd7bc  httpsbackend-sni-1  23       run      running   7h57m ago  7h56m ago
44da5693  9e3ef19f  httpsbackend-sni-1  23       run      running   7h57m ago  7h57m ago
88c1a29e  7acdd7bc  httpsbackend-sni-1  23       run      running   7h58m ago  7h58m ago

# nomad deployment status 3e541fb3
ID          = 3e541fb3
Job ID      = httpsbackend-sni-1
Job Version = 24
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  false     3        1         1       1        0          2019-04-08T21:15:35+08:00

我们看到:

  • 处于running状态的allocations变成了4个,但是只有一个是version = 24的,其余都为version = 23。version = 24这个显然是我们新部署的canary实例,而另外三个则为原有的老版本实例。

  • 在Deployment输出信息中,我们看到了一个描述信息:“Deployment is running but requires promotion”,意思是此次用于部署canary实例的Deployment已经running了,但是还未到最终状态,还需要promote命令。只有promote后,整个的更新工作才算是ok。

下面是canary部署后的fabio的路由:

img{512x368}

我们看到canary实例与其余老版本的路由规则是一致的,并平分的负载权重。也就是说新部署的canary实例与老版本实例一起承载生产流量(canary实例占25%的权重),我们来验证一下:

# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.1
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.1
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.1

我们看到第一个请求的流量就打到了我们部署的Canary实例身上了。

如果经过一段时间的验证后,证明canary实例满足要求,我们就要继续推动部署的进程使得该nomad deployment走向最终状态:即将老版本的实例都升级为新版本。

# nomad deployment promote 3e541fb3
==> Monitoring evaluation "b5e29b1a"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "3e541fb3"
    Allocation "085a518e" created: node "7acdd7bc", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b5e29b1a" finished with status "complete"

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T21:04:49+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       9         0

Latest Deployment
ID          = 3e541fb3
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  true      3        1         3       3        0          2019-04-08T21:30:54+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created     Modified
40276d89  9e3ef19f  httpsbackend-sni-1  24       run      running   56s ago     11s ago
085a518e  7acdd7bc  httpsbackend-sni-1  24       run      running   1m49s ago   58s ago
4d678e67  c281658a  httpsbackend-sni-1  24       run      running   16m17s ago  1m49s ago
da1b545b  7acdd7bc  httpsbackend-sni-1  23       stop     complete  8h12m ago   56s ago
44da5693  9e3ef19f  httpsbackend-sni-1  23       stop     complete  8h13m ago   1m48s ago
88c1a29e  7acdd7bc  httpsbackend-sni-1  23       stop     complete  8h14m ago   1m47s ago

通过deployment promote命令使得canary deployment进程继续推进,直到将所有老版本的实例都用canary实例替换掉。也就是我们最终看到的上面的version = 24的allocations都处于running状态,并且一共是三个实例。

我们再来测试一下升级后的服务:

# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2

我们看到:所有实例都升级到了v1.0.2版本。

2.将canary实例与生产实例隔离开来(利用路由)单独测试验证

如果开发者对自己的代码很有信心,不需要将canary实例暴露在生产流量中去验证,nomad也支持将canary实例与生产实例隔离开来(利用路由)单独测试验证。

我们基于httpsbackend-tcp-sni-1-canary-1.nomad改写出一个httpsbackend-tcp-sni-1-canary-2.nomad:

# diff httpsbackend-tcp-sni-1-canary-2.nomad httpsbackend-tcp-sni-1-canary-1.nomad
24c24
<         image = "bigwhite/httpsbackendservice:v1.0.3"
---
>         image = "bigwhite/httpsbackendservice:v1.0.2"
43d42
<     canary_tags = ["urlprefix-canary.mysite-sni-1.com/ proto=tcp+sni"]

我们看到,在新的job文件中,我们除了将image版本升级为v1.0.3,我们还在service{…}配置区域增加了下面这行:

canary_tags = ["urlprefix-canary.mysite-sni-1.com/ proto=tcp+sni"]

该配置是canary实例专有的,这里我们通过在canary_tags为canary实例单独定义了路由,以免和老版本实例共享路由分担生产流量。

我们照例运行该job并查看job执行后的status:

# nomad job run httpsbackend-tcp-sni-1-canary-2.nomad
==> Monitoring evaluation "44e36161"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "e43d2551"
    Allocation "73319890" created: node "7acdd7bc", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "44e36161" finished with status "complete"

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T21:35:03+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         4        0       9         0

Latest Deployment
ID          = e43d2551
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  false     3        1         1       1        0          2019-04-08T21:45:51+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created     Modified
73319890  7acdd7bc  httpsbackend-sni-1  25       run      running   2m24s ago   1m36s ago
40276d89  9e3ef19f  httpsbackend-sni-1  24       run      running   17m18s ago  16m33s ago
085a518e  7acdd7bc  httpsbackend-sni-1  24       run      running   18m11s ago  17m20s ago
4d678e67  c281658a  httpsbackend-sni-1  24       run      running   32m39s ago  18m11s ago

这个输出信息和之前的canary模式差别不大。但是从fabio路由表上我们看到如下信息:

img{512x368}

fabio单独为canary实例生成了一个新路由,以区别于老版本的三个实例的路由。

开发人员单独测试canary实例时,可以通过下面方式注入流量:

# curl -k  https://canary.mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.3

而生产流量依旧流入老版本的实例中:

# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.2

canary实例经过测试验证后,同样可以通过promote完成对老版本的升级部署:

# nomad deployment promote e43d2551
==> Monitoring evaluation "34a67391"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "e43d2551"
    Allocation "193cbc2f" created: node "c281658a", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "34a67391" finished with status "complete"

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-08T21:35:03+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       12        0

Latest Deployment
ID          = e43d2551
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  true      3        1         3       3        0          2019-04-08T21:58:24+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created     Modified
528a75bd  7acdd7bc  httpsbackend-sni-1  25       run      running   51s ago     10s ago
193cbc2f  c281658a  httpsbackend-sni-1  25       run      running   1m39s ago   52s ago
73319890  7acdd7bc  httpsbackend-sni-1  25       run      running   13m31s ago  1m39s ago
40276d89  9e3ef19f  httpsbackend-sni-1  24       stop     complete  28m25s ago  50s ago
085a518e  7acdd7bc  httpsbackend-sni-1  24       stop     complete  29m18s ago  1m38s ago
4d678e67  c281658a  httpsbackend-sni-1  24       stop     complete  43m46s ago  1m39s ago

同时,canary实例在fabiolb上的路由也会自动删除掉。canary_tags在promote后将不再起作用,fabio使用的是tags。

# curl -k  https://canary.mysite-sni-1.com:9996/
curl: (35) gnutls_handshake() failed: The TLS connection was non-properly terminated.
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.3
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.3
# curl -k  https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.3

四. 蓝绿部署(blue-green deployment)

下面的蓝绿部署模式的示意图同样来自blog.itaysk.com:

img{512x368}

与之前的滚动更新、金丝雀部署不同的是,蓝绿部署需要“两套”环境,通过路由指向来切换流量究竟经过哪套环境。

但是在nomad官方关于blue-green部署的例子中,nomad实际只维护了一套环境,并且例子中是利用nomad的canary机制来实现的蓝绿部署。这种实现方式并非严格遵循“蓝绿部署”的公认的定义。

但nomad官方对于blue-green部署的理解似乎仅限如此。我们也来看一下nomad的这种“全量金丝雀”的蓝绿方案:

我们创建httpsbackend-tcp-sni-1-blue-green.nomad文件,重点内容差异如下:

# diff httpsbackend-tcp-sni-1-blue-green.nomad httpsbackend-tcp-sni-1-canary-1.nomad
18c18
<       canary = 3
---
>       canary = 1
24c24
<         image = "bigwhite/httpsbackendservice:v1.0.4"
---
>         image = "bigwhite/httpsbackendservice:v1.0.2"

我们看到这里canary = 3,与count值相同,这也是将其称为“全量金丝雀”的原因。

使用该文件部署新版本实例:

# nomad job run httpsbackend-tcp-sni-1-blue-green.nomad
==> Monitoring evaluation "7a5074f3"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "3c8740f2"
    Allocation "338ee344" created: node "c281658a", group "httpsbackend-sni-1"
    Allocation "3dec73d2" created: node "9e3ef19f", group "httpsbackend-sni-1"
    Allocation "e6975673" created: node "9e3ef19f", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7a5074f3" finished with status "complete"

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-09T13:38:49+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         6        0       12        0

Latest Deployment
ID          = 3c8740f2
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  false     3        3         3       3        0          2019-04-09T13:49:41+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status   Created     Modified
338ee344  c281658a  httpsbackend-sni-1  26       run      running  57s ago     5s ago
3dec73d2  9e3ef19f  httpsbackend-sni-1  26       run      running  57s ago     11s ago
e6975673  9e3ef19f  httpsbackend-sni-1  26       run      running  57s ago     10s ago
528a75bd  7acdd7bc  httpsbackend-sni-1  25       run      running  15h52m ago  15h51m ago
193cbc2f  c281658a  httpsbackend-sni-1  25       run      running  15h52m ago  15h52m ago
73319890  7acdd7bc  httpsbackend-sni-1  25       run      running  16h4m ago   15h52m ago

部署ok后,6个实例共同接收生产流量。当然我们也可以通过canary_tags为新的部署设定不同路由,选择哪一种要看部署新实例后打算对新实例如何进行测试。

测试验证ok后,像canary deployment一样,通过promote命令用新版本替换老版本。

# nomad deployment promote 3c8740f2
==> Monitoring evaluation "fad3a69b"
    Evaluation triggered by job "httpsbackend-sni-1"
    Evaluation within deployment: "3c8740f2"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "fad3a69b" finished with status "complete"

# nomad job status httpsbackend-sni-1
ID            = httpsbackend-sni-1
Name          = httpsbackend-sni-1
Submit Date   = 2019-04-09T13:38:49+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group          Queued  Starting  Running  Failed  Complete  Lost
httpsbackend-sni-1  0       0         3        0       15        0

Latest Deployment
ID          = 3c8740f2
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group          Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
httpsbackend-sni-1  true      3        3         3       3        0          2019-04-09T13:49:41+08:00

Allocations
ID        Node ID   Task Group          Version  Desired  Status    Created     Modified
338ee344  c281658a  httpsbackend-sni-1  26       run      running   4m43s ago   15s ago
3dec73d2  9e3ef19f  httpsbackend-sni-1  26       run      running   4m43s ago   15s ago
e6975673  9e3ef19f  httpsbackend-sni-1  26       run      running   4m43s ago   15s ago
528a75bd  7acdd7bc  httpsbackend-sni-1  25       stop     complete  15h55m ago  14s ago
193cbc2f  c281658a  httpsbackend-sni-1  25       stop     complete  15h56m ago  15s ago
73319890  7acdd7bc  httpsbackend-sni-1  25       stop     complete  16h8m ago   14s ago

测试结果:

# curl -k https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.4

如果要快速切换回原来的版本,可以使用:

nomad job revert httpsbackend-sni-1 {old_allocation_version}

五. 其他

本文涉及到的nomad job文件源码可在这里下载。


我的网课“Kubernetes实战:高可用集群搭建、配置、运维与应用”在慕课网上线了,感谢小伙伴们学习支持!

我爱发短信:企业级短信平台定制开发专家 https://tonybai.com/
smspush : 可部署在企业内部的定制化短信平台,三网覆盖,不惧大并发接入,可定制扩展; 短信内容你来定,不再受约束, 接口丰富,支持长短信,签名可选。

著名云主机服务厂商DigitalOcean发布最新的主机计划,入门级Droplet配置升级为:1 core CPU、1G内存、25G高速SSD,价格5$/月。有使用DigitalOcean需求的朋友,可以打开这个链接地址:https://m.do.co/c/bff6eed92687 开启你的DO主机之路。

我的联系方式:

微博:https://weibo.com/bigwhite20xx
微信公众号:iamtonybai
博客:tonybai.com
github: https://github.com/bigwhite

微信赞赏:
img{512x368}

商务合作方式:撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。

记一次go panic问题的解决过程

一. Panic问题概述

本周收到客户在bugclose上填写的一个issue:添加一个下发通道后,pushd程序panic并退出了!程序panic时输出的stacktrace信息摘录如下:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8ca449]

goroutine 266900 [running]:
pkg.tonybai.com/smspush/vendor/github.com/bigwhite/gocmpp.(*Client).Connect(0xc42040c7f0, 0xc4203d29c0, 0x11, 0xc420423256, 0x6, 0xc420423260, 0x8, 0x37e11d600, 0x0, 0x0)
        /root/.go/src/pkg.tonybai.com/smspush/vendor/github.com/bigwhite/gocmpp/client.go:79 +0x239
pkg.tonybai.com/smspush/pkg/pushd/pusher.cmpp2Login(0xc4203d29c0, 0x11, 0xc420423256, 0x6, 0xc420423260, 0x8, 0x37e11d600, 0xc4203d29c0, 0x11, 0x73)
        /root/.go/src/pkg.tonybai.com/smspush/pkg/pushd/pusher/cmpp2_handler.go:25 +0x9a
pkg.tonybai.com/smspush/pkg/pushd/pusher.newCMPP2Loop(0xc42071f800, 0x4, 0xaaecd8)
        /root/.go/src/pkg.tonybai.com/smspush/pkg/pushd/pusher/cmpp2_handler.go:65 +0x226
pkg.tonybai.com/smspush/pkg/pushd/pusher.(*tchanSession).Run(0xc42071f800, 0xaba7c3, 0x17)
        /root/.go/src/pkg.tonybai.com/smspush/pkg/pushd/pusher/session.go:52 +0x98
pkg.tonybai.com/smspush/pkg/pushd/pusher.(*gateway).addSession.func1(0xc4200881a0, 0xc42071f800, 0xc42040c700)
        /root/.go/src/pkg.tonybai.com/smspush/pkg/pushd/pusher/gateway.go:61 +0x11e
created by pkg.tonybai.com/smspush/pkg/pushd/pusher.(*gateway).addSession
        /root/.go/src/pkg.tonybai.com/smspush/pkg/pushd/pusher/gateway.go:58 +0x350

印象中近大半年用Go写的程序,遇到panic情况不多。上一次是因为原生map变量的并发访问导致的panic,那次panic一眼就看到问题所在了。但这次又是因为啥呢?

二. 分析和debug过程

这个问题在印象中似乎出现过,不过由于当初没有复现,客户环境中又没有panic信息提供,那时没能定位和解决,后来问题并没有出现,显然这个问题是有一定“随机属性”。

对于panic,我们首先检查直接导致panic发生的那一行代码:

        /root/.go/src/pkg.tonybai.com/smspush/vendor/github.com/bigwhite/gocmpp/client.go:79 +0x239

下面是client.go 79行周围的代码片段:

img{512x368}

也许是疏忽大意,当时瞅了一眼后,就断定这块没有问题(更多从业务协议层面考虑),这也直接导致后面绕了一个大圈子才查到”真凶”。如果您还没看出来问题,那继续往下看。

定式思维让我认为很可能是函数栈中的内存问题,于是我开始调查panic输出的函数调用栈中参数是否正确。

要想知道函数调用栈中参数传递是否有问题,先要知晓panic后输出的栈帧信息都是什么!比如下面panic dump信息中参数中的各种magic number都代表什么!

gocmpp.(*Client).Connect(0xc42040c7f0, 0xc4203d29c0, 0x11, 0xc420423256, 0x6, 0xc420423260, 0x8, 0x37e11d600, 0x0, 0x0)

pusher.cmpp2Login(0xc4203d29c0, 0x11, 0xc420423256, 0x6, 0xc420423260, 0x8, 0x37e11d600, 0xc4203d29c0, 0x11, 0x73)

pusher.newCMPP2Loop(0xc42071f800, 0x4, 0xaaecd8)

在Joe Shaw的《Understanding Go panic output》和William Kennedy的《Stack Traces In Go》中有针对Stack trace输出信息的解析。关于Stack trace输出信息的识别,总体遵循几个要点:

  • stack trace中每个函数/方法后面的“参数数值”个数与函数/方法原型的参数个数不是一一对应的;

  • stack trace中每个函数/方法后面的“参数数值”是按照函数/方法原型参数列表中从左到右的参数类型的内存布局逐一展开的; 每个数值占用一个word(64位平台下面为8字节)

  • 如果是method,则第一个参数是receiver自身。如果reciever是指针类型,则第一个参数数值就是一个指针地址;如果是非指针的实例,则stack trace会按照其内存布局输出;

  • 函数/方法返回值放在stack trace的“参数数值”列表的后面;如果有多个返回值,则同样按从左到右顺序,按照返回值类型的内存布局输出;

  • 指针类型参数:占用stack trace的“参数数值”列表的1个位置;数值表示指针值,也是指针指向的对象的地址;

  • string类型参数:由于string在内存中由两个字(word)表示,第一个字是数据指针,第二个字是string的长度,因此在stack trace的“参数数值”列表中将占用两个位置;

  • slice类型参数:由于slice类型在内存中由三个字表示,第一个word是数据指针,第二个word是len,第三个字是cap,因此在stack trace的“参数数值”列表中将占用三个位置;

  • 内建整型(int,rune,byte):由于按word逐个输出,对于类型长度不足一个Word的参数,会做合并处理;比如:一个函数有5个int16类型的参数,那么在stack trace的信息中,这5个参数将占用stack trace的“参数数值”列表中的两个位置;第一个位置是前4个参数的“合体”,第二个位置则是最后那个int16类型的参数值。

  • struct类型参数: 会按照struct中字段的内存布局顺序在stack trace中展开。

  • interface类型参数:由于interface类型在内存中由两部分组成,一部分是接口类型的参数指针,一部分是接口值的参数指针,因此interface类型参数将用stack trace的“参数数值”列表中的两个位置。

  • stack trace输出的信息是在函数调用过程中的“快照”信息,因此一些输出数值看似不合理,但是由于其并不是最终值,所以问题不一定发生在这些参数身上,比如:返回值参数。

结合上面要点、函数/方法原型以及stack trace的输出,我们来“定位”一下stack trace输出的各个“参数”的含义:

cmpp2Login和Connect的原型以及调用关系如下:

func cmpp2Login(dstAddr, user, password string, connectTimeout time.Duration) (*cmpp.Client, error)

func (cli *Client) Connect(servAddr, user, password string, timeout time.Duration) error

func cmpp2Login(dstAddr, user, password string, connectTimeout time.Duration) (*cmpp.Client, error) {
    c := cmpp.NewClient(cmpp.V21)
    return c, c.Connect(dstAddr, user, password, connectTimeout)
}

对照后,我们得出下面对应关系:

pusher.cmpp2Login(
        0xc4203d29c0,  // dstAddr的data pointer
        0x11,                  // dstAddr string的length
        0xc420423256,  // user 的data pointer
        0x6,                    // user string的length
        0xc420423260,  // password的data pointer
        0x8,                    // password string的length
        0x37e11d600,    // connectTimeout
        0xc4203d29c0,  // 返回值:Client的指针
        0x11,                 // 返回值:error接口的type pointer
        0x73)                 // 返回值:error接口的data pointer

gocmpp.(*Client).Connect(
        0xc42040c7f0,   //cli的指针
        0xc4203d29c0,  //servAddr string的data pointer
        0x11,                  //servAddr string的 length
        0xc420423256,  // user string的data pointer
        0x6,                    // user string的length
        0xc420423260,  // password的data pointer
        0x8,                    // password string的length
        0x37e11d600,   // timeout
        0x0,                   // 返回值:error接口的type pointer
        0x0)                   // 返回值:error接口的data pointer

在这里,cmpp2Login的dstAddr、user、password、connectTimeout这些输入参数值都非常正常;看起来不正常的两个返回值在栈帧中的值其实意义不大,因为connect没有返回,所以这些值处于“非最终态”;而Connect执行到第79行panic,因此其返回值error的两个值也是处于“中间状态”。

这样一来,似乎没有参数是错误的!

三. 回到起点,捉住“真凶”

在反复查看代码和对比stack trace的参数列表后,依然没有找到蛛丝马迹。遂决定平复心情,从头再来,回到起点!

        var ok bool
        var status uint8
        if cli.typ == V20 || cli.typ == V21 {
                var rsp *Cmpp2ConnRspPkt
                rsp, ok = p.(*Cmpp2ConnRspPkt)
                status = rsp.Status
        } else {
                var rsp *Cmpp3ConnRspPkt
                rsp, ok = p.(*Cmpp3ConnRspPkt)
                status = uint8(rsp.Status)   <------ 79行
        }

        if !ok {
                err = ErrRespNotMatch
                return err
        }

又反复看了这段代码!程序正常执行时都是经过这段代码的,都是正常的。为何随机爆出panic呢?79行如果要panic,显然是rsp为nil或其他非法地址。但rsp是由p进行type assertion而来的!难道是type assertion失败了!!!

从正常业务流程来看,这里是不会失败的!这也是当初这里没有立即检查ok这个bool值的原因。但是特殊情况下,也就是当tcp连接建立后,conn包发出后,对方未必返回是conn response包,很可能是其他包回来(比如active test),这样就会导致这块的type assertion失败!这也与这个问题随机发生的情况吻合!

而且当初保留了“ok”,而不是用”_”代替,说明设计思路中是存在返回的包不是conn response包的情况。看来是当初coding时逻辑混乱了:(

这就是问题所在了!教训:type assertion后一定要在检查ok这个bool值之后再决定是否使用assertion之后的变量

四. 其他

借着这个问题的解决过程,再多说一句 stacktrace。在Go 1.11及以后版本中,go compiler做了更深入的优化,很多“简单”的函数或方法会被自动inline(内联)了,函数一旦内联化了,那么在stack trace中我们就无法看到栈帧信息了,就会看到如下在栈帧信息中存在省略号的情况:

 $go run stacktrace.go
panic: panic in foo

goroutine 1 [running]:
main.(*Y).foo(...)
    /Users/tony/test/go/stacktrace/stacktrace2.go:32
main.main()
    /Users/tony/test/go/stacktrace/stacktrace.go:51 +0x39
exit status 2

可以使用-gcflags=”-l”来告诉编译器不要inline。至于是否要这么做,就要看debug和性能之间您是如何权衡的了。


我的网课“Kubernetes实战:高可用集群搭建、配置、运维与应用”在慕课网上线了,感谢小伙伴们学习支持!

我爱发短信:企业级短信平台定制开发专家 https://tonybai.com/
smspush : 可部署在企业内部的定制化短信平台,三网覆盖,不惧大并发接入,可定制扩展; 短信内容你来定,不再受约束, 接口丰富,支持长短信,签名可选。

著名云主机服务厂商DigitalOcean发布最新的主机计划,入门级Droplet配置升级为:1 core CPU、1G内存、25G高速SSD,价格5$/月。有使用DigitalOcean需求的朋友,可以打开这个链接地址:https://m.do.co/c/bff6eed92687 开启你的DO主机之路。

我的联系方式:

微博:https://weibo.com/bigwhite20xx
微信公众号:iamtonybai
博客:tonybai.com
github: https://github.com/bigwhite

微信赞赏:
img{512x368}

商务合作方式:撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。

如发现本站页面被黑,比如:挂载广告、挖矿等恶意代码,请朋友们及时联系我。十分感谢! Go语言第一课 Go语言精进之路1 Go语言精进之路2 Go语言编程指南
商务合作请联系bigwhite.cn AT aliyun.com

欢迎使用邮件订阅我的博客

输入邮箱订阅本站,只要有新文章发布,就会第一时间发送邮件通知你哦!

这里是 Tony Bai的个人Blog,欢迎访问、订阅和留言! 订阅Feed请点击上面图片

如果您觉得这里的文章对您有帮助,请扫描上方二维码进行捐赠 ,加油后的Tony Bai将会为您呈现更多精彩的文章,谢谢!

如果您希望通过微信捐赠,请用微信客户端扫描下方赞赏码:

如果您希望通过比特币或以太币捐赠,可以扫描下方二维码:

比特币:

以太币:

如果您喜欢通过微信浏览本站内容,可以扫描下方二维码,订阅本站官方微信订阅号“iamtonybai”;点击二维码,可直达本人官方微博主页^_^:
本站Powered by Digital Ocean VPS。
选择Digital Ocean VPS主机,即可获得10美元现金充值,可 免费使用两个月哟! 著名主机提供商Linode 10$优惠码:linode10,在 这里注册即可免费获 得。阿里云推荐码: 1WFZ0V立享9折!


View Tony Bai's profile on LinkedIn
DigitalOcean Referral Badge

文章

评论

  • 正在加载...

分类

标签

归档



View My Stats