Loadbalance | Tony Bai

标签 loadbalance 下的文章

使用nomad实现集群管理和微服务部署调度

三月 30, 2019
4 条评论

在“云原生”、“容器化”、“微服务”、“服务网格”等概念大行其道的今天，一提到集群管理、容器工作负载调度，人们首先想到的是Kubernetes。

Kubernetes经过多年的发展，目前已经成为了云原生计算平台的事实标准，得到了诸如谷歌、微软、红帽、亚马逊、IBM、阿里等大厂的大力支持，各大云计算提供商也都提供了专属Kubernetes集群服务。开发人员可以一键在这些大厂的云上创建k8s集群。对于那些不愿被cloud provider绑定的组织或开发人员，Kubernetes也提供了诸如Kubeadm这样的k8s集群引导工具，帮助大家在裸金属机器上搭建自己的k8s集群，当然这样做的门槛较高（如果您想学习自己搭建和管理k8s集群，可以参考我在慕课网上发布的实战课《高可用集群搭建、配置、运维与应用》）。

Kubernetes的学习曲线是公认的较高，尤其是对于应用开发人员。再加上Kubernetes发展很快，越来越多的概念和功能加入到k8s技术栈，这让人们不得不考虑建立和维护这样一套集群所要付出的成本。人们也在考虑是否所有场景都需要部署一个k8s集群，是否有轻量级的且能满足自身需求的集群管理和微服务部署调度方案呢？外国朋友Matthias Endler就在其文章《也许你不需要Kubernetes》中给出一个轻量级的集群管理方案 – 使用hashicorp开源的nomad工具。

这让我想起了去年写的《基于consul实现微服务的服务发现和负载均衡》一文。文中虽然实现了基于consul的服务注册、发现以及负载均衡，但是缺少一个环节：那就是整个集群管理以及工作负载部署调度自动化的缺乏。nomad应该恰好可以补足这一短板，并且它足够轻量。本文我们就来探索和实践一下使用nomad实现集群管理和微服务部署调度。

一. 安装nomad集群

nomad是Hashicorp公司出品的集群管理和工作负荷调度器，支持多种驱动形式的工作负载调度，包括Docker容器、虚拟机、原生可执行程序等，并支持跨数据中心调度。Nomad不负责服务发现或密钥管理等，它将这些功能分别留给了HashiCorp的Consul和Vault。HashiCorp的创始人认为，这会使得Nomad更为轻量级，调度性能更高。

nomad使用Go语言实现，因此其本身仅仅是一个可执行的二进制文件。和Hashicorp其他工具产品(诸如：consul等)类似，nomad一个可执行文件既可以以server模式运行，亦可以client模式运行，甚至可以启动一个实例，既是server，也是client。

下面是nomad集群的架构图(来自hashicorp官方）:

img{512x368}

一个nomad集群至少要包含一个server，作为集群的控制平面；一个或多个client则用于承载工作负荷。通常生产环境nomad集群的控制平面至少要有5个及以上的server才能在高可用上有一定保证。

建立一个nomad集群有多种方法，包括手工建立、基于consul自动建立和基于云自动建立。考虑到后续涉及微服务的注册发现，这里我们采用基于consul自动建立nomad集群的方法，下面是部署示意图：

img{512x368}

我这里的试验环境仅有三台hosts，因此这三台host既承载consul集群，也承载nomad集群（包括server和client），即nomad的控制平面和工作负荷由这三台host一并承担了。

1. consul集群启动

在之前的《基于consul实现微服务的服务发现和负载均衡》一文中，我对consul集群的建立做过详细地说明，因此这里只列出步骤，不详细解释了。注意：这次consul的版本升级到了consul v1.4.4了。

在每个node上分别下载consul 1.4.4：

# wget -c https://releases.hashicorp.com/consul/1.4.4/consul_1.4.4_linux_amd64.zip
# unzip consul_1.4.4_linux_amd64.zip

# cp consul /usr/local/bin

# consul -v

Consul v1.4.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

启动consul集群：(每个node上创建~/.bin/consul-install目录，并进入该目录下执行)

dxnode1:

# nohup consul agent -server -ui -dns-port=53 -bootstrap-expect=3 -data-dir=~/.bin/consul-install/consul-data -node=consul-1 -client=0.0.0.0 -bind=172.16.66.102 -datacenter=dc1 > consul-1.log & 2>&1

dxnode2:

# nohup consul agent -server -ui -dns-port=53  -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-2 -client=0.0.0.0 -bind=172.16.66.103 -datacenter=dc1 -join 172.16.66.102 > consul-2.log & 2>&1

dxnode3:

nohup consul agent -server -ui -dns-port=53  -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-3 -client=0.0.0.0 -bind=172.16.66.104 -datacenter=dc1 -join 172.16.66.102 > consul-3.log & 2>&1

consul集群启动结果查看如下：

# consul members
Node      Address             Status  Type    Build  Protocol  DC   Segment
consul-1  172.16.66.102:8301  alive   server  1.4.4  2         dc1  <all>
consul-2  172.16.66.103:8301  alive   server  1.4.4  2         dc1  <all>
consul-3  172.16.66.104:8301  alive   server  1.4.4  2         dc1  <all>

# consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
consul-3  d048e55b-5f6a-34a4-784c-e6607db0e89e  172.16.66.104:8300  leader    true   3
consul-1  160a7a20-f177-d2f5-0765-e6d1a9a1a9a4  172.16.66.102:8300  follower  true   3
consul-2  6795cd2c-fad5-9d4f-2531-13b0a65e0893  172.16.66.103:8300  follower  true   3

2. DNS设置（可选）

如果采用基于consul DNS的方式进行服务发现，那么在每个nomad client node上设置DNS则很必要。否则如果要是基于consul service catalog的API去查找service，则可忽略这个步骤。设置步骤如下：

在每个node上，创建和编辑/etc/resolvconf/resolv.conf.d/base，填入如下内容：

nameserver {consul-1-ip}
nameserver {consul-2-ip}

然后重启resolvconf服务:

#  /etc/init.d/resolvconf restart
[ ok ] Restarting resolvconf (via systemctl): resolvconf.service.

新的resolv.conf将变成：

# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver {consul-1-ip}
nameserver {consul-2-ip}
nameserver 100.100.2.136
nameserver 100.100.2.138
options timeout:2 attempts:3 rotate single-request-reopen

这样无论是在host上，还是在新启动的container里就都可以访问到xx.xx.consul域名的服务了：

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (172.16.66.103) 56(84) bytes of data.
64 bytes from 172.16.66.103: icmp_seq=1 ttl=64 time=0.227 ms
64 bytes from 172.16.66.103: icmp_seq=2 ttl=64 time=0.158 ms
^C
--- consul.service.dc1.consul ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.158/0.192/0.227/0.037 ms

# docker run busybox ping -c 3 consul.service.dc1.consul

PING consul.service.dc1.consul (172.16.66.104): 56 data bytes
64 bytes from 172.16.66.104: seq=0 ttl=64 time=0.067 ms
64 bytes from 172.16.66.104: seq=1 ttl=64 time=0.061 ms
64 bytes from 172.16.66.104: seq=2 ttl=64 time=0.076 ms

--- consul.service.dc1.consul ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.061/0.068/0.076 ms

3. 基于consul集群引导启动nomad集群

按照之前的拓扑图，我们需先在每个node上分别下载nomad：

# wget -c https://releases.hashicorp.com/nomad/0.8.7/nomad_0.8.7_linux_amd64.zip

# unzip nomad_0.8.7_linux_amd64.zip.zip

# cp ./nomad /usr/local/bin

# nomad -v

Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES)

我们已经建立了consul集群，因为我们将采用基于consul集群引导启动nomad集群这一创建nomad集群的最Easy方式。同时，我们每个node上既要运行nomad server，也要nomad client，于是我们在nomad的配置文件中，对server和client都设置为”enabled = true”。下面是nomad启动的配置文件，每个node上的nomad均将该配置文件作为为输入：

// agent.hcl

data_dir = "/root/.bin/nomad-install/nomad.d"

server {
  enabled = true
  bootstrap_expect = 3
}

client {
  enabled = true
}

下面是在各个节点上启动nomad的操作步骤：

dxnode1:

# nohup nomad agent -config=/root/.bin/nomad-install/agent.hcl  > nomad-1.log & 2>&1

dxnode2:

# nohup nomad agent -config=/root/.bin/nomad-install/agent.hcl  > nomad-2.log & 2>&1

dxnode3:

# nohup nomad agent -config=/root/.bin/nomad-install/agent.hcl  > nomad-3.log & 2>&1

查看nomad集群的启动结果：

#  nomad server members
Name            Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
dxnode1.global  172.16.66.102  4648  alive   true    2         0.8.7  dc1         global
dxnode2.global  172.16.66.103  4648  alive   false   2         0.8.7  dc1         global
dxnode3.global  172.16.66.104  4648  alive   false   2         0.8.7  dc1         global

# nomad operator raft list-peers

Node            ID                  Address             State     Voter  RaftProtocol
dxnode1.global  172.16.66.102:4647  172.16.66.102:4647  leader    true   2
dxnode2.global  172.16.66.103:4647  172.16.66.103:4647  follower  true   2
dxnode3.global  172.16.66.104:4647  172.16.66.104:4647  follower  true   2

# nomad node-status
ID        DC   Name     Class   Drain  Eligibility  Status
7acdd7bc  dc1  dxnode1  <none>  false  eligible     ready
c281658a  dc1  dxnode3  <none>  false  eligible     ready
9e3ef19f  dc1  dxnode2  <none>  false  eligible     ready

以上这些命令的结果都显示nomad集群工作正常！

nomad还提供一个ui界面（http://nomad-node-ip:4646/ui），可以让运维人员以可视化的方式直观看到当前nomad集群的状态，包括server、clients、工作负载(job)的情况：

img{512x368}

nomad ui首页

img{512x368}

nomad server列表和状态

img{512x368}

nomad client列表和状态

二. 部署工作负载

引导启动成功nomad集群后，我们接下来就要向集群中添加“工作负载”了。

在Kubernetes中，我们可以通过创建deployment、pod等向集群添加工作负载；在nomad中我们也可以通过类似的声明式的方法向nomad集群添加工作负载。不过nomad相对简单许多，它仅提供了一种名为job的抽象，并给出了job的specification。nomad集群所有关于工作负载的操作均通过job描述文件和nomad job相关子命令完成。下面是通过job部署工作负载的流程示意图：

img{512x368}

从图中可以看到，我们需要做的仅仅是将编写好的job文件提交给nomad即可。

Job spec定义了：job -> group -> task的层次关系。每个job文件只有一个job，但是一个job可能有多个group，每个group可能有多个task。group包含一组要放在同一个集群中调度的task。一个Nomad task是由其驱动程序（driver）在Nomad client节点上执行的命令、服务、应用程序或其他工作负载。task可以是短时间的批处理作业（batch）或长时间运行的服务(service)，例如web应用程序、数据库服务器或API。

Tasks是在用HCL语法的声明性job规范中定义的。Job文件提交给Nomad服务端，服务端决定在何处以及如何将job文件中定义的task分配给客户端节点。另一种概念化的理解是:job规范表示工作负载的期望状态，Nomad服务端创建并维护其实际状态。

通过job，开发人员还可以为工作负载定义约束和资源。约束（constraint）通过内核类型和版本等属性限制了工作负载在节点上的位置。资源（resources）需求包括运行task所需的内存、网络、CPU等。

有三种类型的job：system、service和batch，它们决定Nomad将用于此job中task的调度器。service 调度器被设计用来调度永远不会宕机的长寿命服务。batch作业对短期性能波动的敏感性要小得多，寿命也很短，几分钟到几天就可以完成。system调度器用于注册应该在满足作业约束的所有nomad client上运行的作业。当某个client加入到nomad集群或转换到就绪状态时也会调用它。

Nomad允许job作者为自动重新启动失败和无响应的任务指定策略，并自动将失败的任务重新调度到其他节点，从而使任务工作负载具有弹性。

如果对应到k8s中的概念，group更像是某种controller，而task更类似于pod，是被真实调度的实体。Job spec对应某个k8s api object的spec，具体体现在某个yaml文件中。

下面我们就来真实地在nomad集群中创建一个工作负载。我们使用之前在《基于consul实现微服务的服务发现和负载均衡》一文中使用过的那几个demo image，这里我们先使用httpbackendservice镜像来创建一个job。

下面是httpbackend的job文件：

// httpbackend-1.nomad

job "httpbackend" {
  datacenters = ["dc1"]
  type = "service"

  group "httpbackend" {
    count = 2

    task "httpbackend" {
      driver = "docker"
      config {
        image = "bigwhite/httpbackendservice:v1.0.0"
        port_map {
          http = 8081
        }
        logging {
          type = "json-file"
        }
      }

      resources {
        network {
          mbits = 10
          port "http" {}
        }
      }

      service {
        name = "httpbackend"
        port = "http"
      }
    }
  }
}

这个文件基本都是自解释的，重点提几个地方：

job type: service ：说明该job创建和调度的是一个service类型的工作负载；
count = 2 ：类似于k8s的replicas字段，期望在nomad集群中运行2个httpbackend服务实例，nomad来保证始终处于期望状态。
关于port：port_map指定了task中容器的监听端口。network中的port “http” {}没有指定静态IP，因此将采用动态主机端口。service中的port则指明使用”http”这个tag的动态主机端口。这和k8s中service中port使用名称匹配的方式映射到具体pod中的port的方法类似。

我们使用nomad job子命令来创建该工作负载。正式创建之前，我们可以先通过nomad job plan来dry-run一下，一是看job文件格式是否ok；二来检查一下nomad集群是否有空余资源创建和调度新的工作负载：

# nomad job plan httpbackend-1.nomad
+/- Job: "httpbackend"
+/- Stop: "true" => "false"
    Task Group: "httpbackend" (2 create)
      Task: "httpbackend"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 4248
To submit the job with version verification run:

nomad job run -check-index 4248 httpbackend-1.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

如果plan的输出结果没有问题，则可以用nomad job run正式创建和调度job：

# nomad job run httpbackend-1.nomad
==> Monitoring evaluation "40c63529"
    Evaluation triggered by job "httpbackend"
    Allocation "6b0b83de" created: node "9e3ef19f", group "httpbackend"
    Allocation "d0710b85" created: node "7acdd7bc", group "httpbackend"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "40c63529" finished with status "complete"

接下来，我们可以使用nomad job status命令查看job的创建情况以及某个job的详细状态信息：

# nomad job status
ID                  Type     Priority  Status   Submit Date
httpbackend         service  50        running  2019-03-30T04:58:09+08:00

# nomad job status httpbackend
ID            = httpbackend
Name          = httpbackend
Submit Date   = 2019-03-30T04:58:09+08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group   Queued  Starting  Running  Failed  Complete  Lost
httpbackend  0       0         2        0       0         0

Allocations
ID        Node ID   Task Group   Version  Desired  Status    Created    Modified
6b0b83de  9e3ef19f  httpbackend  11       run      running   8m ago     7m50s ago
d0710b85  7acdd7bc  httpbackend  11       run      running   8m ago     7m39s ago

前面说过，nomad只是集群管理和负载调度，服务发现它是不管的，并且服务发现的问题早已经被consul解决掉了。所以httpbackend创建后，要想使用该服务，我们还得走consul提供的路线：

DNS方式(前面已经做过铺垫了)：

# dig SRV httpbackend.service.dc1.consul

; <<>> DiG 9.10.3-P4-Ubuntu <<>> SRV httpbackend.service.dc1.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7742
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;httpbackend.service.dc1.consul.    IN    SRV

;; ANSWER SECTION:
httpbackend.service.dc1.consul.    0 IN    SRV    1 1 23578 consul-1.node.dc1.consul.
httpbackend.service.dc1.consul.    0 IN    SRV    1 1 22819 consul-2.node.dc1.consul.

;; ADDITIONAL SECTION:
consul-1.node.dc1.consul. 0    IN    A    172.16.66.102
consul-1.node.dc1.consul. 0    IN    TXT    "consul-network-segment="
consul-2.node.dc1.consul. 0    IN    A    172.16.66.103
consul-2.node.dc1.consul. 0    IN    TXT    "consul-network-segment="

;; Query time: 471 msec
;; SERVER: 172.16.66.102#53(172.16.66.102)
;; WHEN: Sat Mar 30 05:07:54 CST 2019
;; MSG SIZE  rcvd: 251

# curl http://172.16.66.102:23578
this is httpbackendservice, version: v1.0.0

# curl http://172.16.66.103:22819
this is httpbackendservice, version: v1.0.0

或http api方式(可通过官方API查询服务)：

# curl http://127.0.0.1:8500/v1/health/service/httpbackend

[
    {
        "Node": {"ID":"160a7a20-f177-d2f5-0765-e6d1a9a1a9a4","Node":"consul-1","Address":"172.16.66.102","Datacenter":"dc1","TaggedAddresses":{"lan":"172.16.66.102","wan":"172.16.66.102"},"Meta":{"consul-network-segment":""},"CreateIndex":7,"ModifyIndex":10},
        "Service": {"ID":"_nomad-task-5uxc3b7hjzivbklslt4yj5bpsfagibrb","Service":"httpbackend","Tags":[],"Address":"172.16.66.102","Meta":null,"Port":23578,"Weights":{"Passing":1,"Warning":1},"EnableTagOverride":false,"ProxyDestination":"","Proxy":{},"Connect":{},"CreateIndex":30727,"ModifyIndex":30727},
        "Checks": [{"Node":"consul-1","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{},"CreateIndex":7,"ModifyIndex":7}]
    },
    {
        "Node": {"ID":"6795cd2c-fad5-9d4f-2531-13b0a65e0893","Node":"consul-2","Address":"172.16.66.103","Datacenter":"dc1","TaggedAddresses":{"lan":"172.16.66.103","wan":"172.16.66.103"},"Meta":{"consul-network-segment":""},"CreateIndex":5,"ModifyIndex":5},
        "Service": {"ID":"_nomad-task-hvqnbklzqr6q5mpspqcqbnhxdil4su4d","Service":"httpbackend","Tags":[],"Address":"172.16.66.103","Meta":null,"Port":22819,"Weights":{"Passing":1,"Warning":1},"EnableTagOverride":false,"ProxyDestination":"","Proxy":{},"Connect":{},"CreateIndex":30725,"ModifyIndex":30725},
        "Checks": [{"Node":"consul-2","CheckID":"serfHealth","Name":"Serf Health Status","Status":"passing","Notes":"","Output":"Agent alive and reachable","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{},"CreateIndex":8,"ModifyIndex":8}]
    }
]

三. 将服务暴露到外部以及负载均衡

集群内部的东西向流量可以通过consul的服务发现来实现，南北向流量则需要我们将部分服务暴露到外部才能实现流量导入。在《基于consul实现微服务的服务发现和负载均衡》一文中，我们是通过nginx实现服务暴露和负载均衡的，但是需要consul-template的协助，并且自己需要实现一个nginx的配置模板，门槛较高也比较复杂。

nomad的官方文档推荐了fabio这个反向代理和负载均衡工具。fabio最初由位于荷兰的“eBay Classifieds Group”开发，它为荷兰（marktplaats.nl），澳大利亚（gumtree.com.au）和意大利（www.kijiji.it）的一些最大网站提供支持。自2015年9月以来，它为这些站点提供23000个请求/秒的处理能力(性能应对一般中等流量是没有太大问题的)，没有发现重大问题。

与consul-template+nginx的组合不同，fabio无需开发人员做任何二次开发，也不需要自定义模板，它直接从consul读取service list并生成相关路由。至于哪些服务要暴露在外部，路由形式是怎样的，是需要在服务启动时为服务设置特定的tag，fabio定义了一套灵活的路由匹配描述方法。

下面我们就来部署fabio，并将上面的httpbackend暴露到外部。

1. 部署fabio

fabio也是nomad集群的一个工作负载，因此我们可以像普通job那样部署fabio。我们先来使用nomad官方文档中给出fabio.nomad：

//fabio.nomad

job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
        logging {
          type = "json-file"
        }
      }

      resources {
        cpu    = 200
        memory = 128
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

这里有几点值得注意：

fabio job的类型是”system”，也就是说该job会被部署到job可以匹配到（通过设定的约束条件）的所有nomad client上，且每个client上仅部署一个实例，有些类似于k8s的daemonset控制下的pod；
network_mode = “host” 告诉fabio的驱动docker：fabio容器使用host网络，即与主机同网络namespace；
static = 9999和static = 9998，说明fabio在每个nomad client上监听固定的静态端口而不是使用动态端口。这也要求了每个nomad client上不允许存在与fabio端口冲突的应用启动。

我们来plan和run一下这个fabio job：

# nomad job plan fabio.nomad

+ Job: "fabio"
+ Task Group: "fabio" (3 create)
  + Task: "fabio" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 fabio.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

# nomad job run fabio.nomad
==> Monitoring evaluation "97bfc16d"
    Evaluation triggered by job "fabio"
    Allocation "1b77dcfa" created: node "c281658a", group "fabio"
    Allocation "da35a778" created: node "7acdd7bc", group "fabio"
    Allocation "fc915ab7" created: node "9e3ef19f", group "fabio"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "97bfc16d" finished with status "complete"

查看一下fabio job的运行状态：

# nomad job status fabio

ID            = fabio
Name          = fabio
Submit Date   = 2019-03-27T14:30:29+08:00
Type          = system
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
fabio       0       0         3        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1b77dcfa  c281658a  fabio       0        run      running  1m11s ago  58s ago
da35a778  7acdd7bc  fabio       0        run      running  1m11s ago  54s ago
fc915ab7  9e3ef19f  fabio       0        run      running  1m11s ago  58s ago

通过9998端口，可以查看fabio的ui页面，这个页面主要展示的是fabio生成的路由信息：

img{512x368}

由于尚未暴露任何服务，因此fabio的路由表为空。

fabio的流量入口为9999端口，不过由于没有配置路由和upstream service，因此如果此时向9999端口发送http请求，将会得到404的应答。

2. 暴露HTTP服务到外部

接下来，我们就将上面创建的httpbackend服务通过fabiolb暴露到外部，使得特定条件下通过fabiolb进入集群内部的流量可以被准确路由到集群中的httpbackend实例上面。

下面是fabio将nomad集群内部服务暴露在外部的原理图：

img{512x368}

我们看到原理图中最为关键的一点就是service tag，该信息由nomad在创建job时写入到consul集群；fabio监听consul集群service信息变更，读取有新变动的job，解析job的service tag，生成路由规则。fabio关注所有带有”urlprefix-”前缀的service tag。

fabio启动时监听的9999端口，默认是http接入。我们修改一下之前的httpbackend.nomad，为该job中的service增加tag字段：

// httpbackend.nomad

... ...

     service {
        name = "httpbackend"
        tags = ["urlprefix-mysite.com:9999/"]
        port = "http"
        check {
          name     = "alive"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

对于上面httpbackend.nomad中service块的变更，主要有两点：

1) 增加tag：匹配的路由信息为：“mysite.com:9999/”

2) 增加check块：如果没有check设置，该路由信息将不会在fabio中生效

更新一下httpbackend:

# nomad job run httpbackend-2.nomad
==> Monitoring evaluation "c83af3d3"
    Evaluation triggered by job "httpbackend"
    Allocation "6b0b83de" modified: node "9e3ef19f", group "httpbackend"
    Allocation "d0710b85" modified: node "7acdd7bc", group "httpbackend"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c83af3d3" finished with status "complete"

查看fabio的route表，可以看到增加了两条新路由信息：

img{512x368}

我们通过fabio来访问一下httpbackend服务：

# curl http://mysite.com:9999/      --- 注意：事先已经在/etc/hosts中添加了 mysite.com的地址为127.0.0.1
this is httpbackendservice, version: v1.0.0

我们看到httpbackend service已经被成功暴露到lb的外部了。

四. 暴露HTTPS、TCP服务到外部

1. 定制fabio

我们的目标是将https、tcp服务暴露到lb的外部，nomad官方文档中给出的fabio.nomad将不再适用，我们需要让fabio监听多个端口，每个端口有着不同的用途。同时，我们通过给fabio传入适当的命令行参数来帮助我们查看fabio的详细access日志信息，并让fabio支持TRACE机制。

fabio.nomad调整如下：

job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
        logging {
          type = "json-file"
        }
        args = [
          "-proxy.addr=:9999;proto=http,:9997;proto=tcp,:9996;proto=tcp+sni",
          "-log.level=TRACE",
          "-log.access.target=stdout"
        ]
      }

      resources {
        cpu    = 200
        memory = 128
        network {
          mbits = 20
        }
      }
    }
  }
}

我们让fabio监听三个端口：

9999: http端口
9997: tcp端口
9996: tcp+sni端口

后续会针对这三个端口暴露的不同服务做细致说明。

我们将fabio的日志级别调低为TRACE级别，以便能查看到fabio日志中输出的trace信息，帮助我们进行路由匹配的诊断。

重新nomad job run fabio.nomad后，我们来看看TRACE的效果：

//访问后端服务，在http header中添加"Trace: abc"：

# curl -H 'Trace: abc' 'http://mysite.com:9999/'
this is httpbackendservice, version: v1.0.0

//查看fabio的访问日志：

2019/03/30 08:13:15 [TRACE] abc Tracing mysite.com:9999/
2019/03/30 08:13:15 [TRACE] abc Matching hosts: [mysite.com:9999]
2019/03/30 08:13:15 [TRACE] abc Match mysite.com:9999/
2019/03/30 08:13:15 [TRACE] abc Routing to service httpbackend on http://172.16.66.102:23578/
127.0.0.1 - - [30/Mar/2019:08:13:15 +0000] "GET / HTTP/1.1" 200 44

我们可以清晰的看到fabio收到请求后，匹配到一条路由：”mysite.com:9999/”，然后将http请求转发到 172.16.66.102:23578这个httpbackend服务实例上去了。

2. https服务

接下来，我们考虑将一个https服务暴露在lb外部。

一种方案是fabiolb做ssl termination，然后再在与upstream https服务建立的ssl连接上传递数据。这种两段式https通信是比较消耗资源的，fabio要对数据进行两次加解密。

另外一种方案是fabiolb将收到的请求透传给后面的upsteam https服务，由client与upsteam https服务直接建立“安全数据通道”，这个方案我们在后续会提到。

第三种方案，那就是对外依旧暴露http，但是fabiolb与upsteam之间通过https通信。我们先来看一下这种“间接暴露https”的方案。

// httpsbackend-upstreamhttps.nomad

job "httpsbackend" {
  datacenters = ["dc1"]
  type = "service"

  group "httpsbackend" {
    count = 2
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }

    task "httpsbackend" {
      driver = "docker"
      config {
        image = "bigwhite/httpsbackendservice:v1.0.0"
        port_map {
          https = 7777
        }
        logging {
          type = "json-file"
        }
      }

      resources {
        network {
          mbits = 10
          port "https" {}
        }
      }

      service {
        name = "httpsbackend"
        tags = ["urlprefix-mysite-https.com:9999/ proto=https tlsskipverify=true"]
        port = "https"
        check {
          name     = "alive"
          type     = "tcp"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

我们将创建名为httpsbackend的job，job中Task对应的tag为：”urlprefix-mysite-https.com:9999/ proto=https tlsskipverify=true”。解释为：路由mysite-https.com:9999/，上游upstream服务为https服务，fabio不验证upstream服务的公钥数字证书。

我们创建该job：

# nomad job run httpsbackend-upstreamhttps.nomad
==> Monitoring evaluation "ba7af6d4"
    Evaluation triggered by job "httpsbackend"
    Allocation "3127aac8" created: node "7acdd7bc", group "httpsbackend"
    Allocation "b5f1b7a7" created: node "9e3ef19f", group "httpsbackend"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ba7af6d4" finished with status "complete"

我们来通过fabiolb访问一下httpsbackend这个服务：

# curl -H "Trace: abc"  http://mysite-https.com:9999/
this is httpsbackendservice, version: v1.0.0

// fabiolb 日志

2019/03/30 09:35:48 [TRACE] abc Tracing mysite-https.com:9999/
2019/03/30 09:35:48 [TRACE] abc Matching hosts: [mysite-https.com:9999]
2019/03/30 09:35:48 [TRACE] abc Match mysite-https.com:9999/
2019/03/30 09:35:48 [TRACE] abc Routing to service httpsbackend on https://172.16.66.103:29248
127.0.0.1 - - [30/Mar/2019:09:35:48 +0000] "GET / HTTP/1.1" 200 45

3. 基于tcp代理暴露https服务

上面的方案虽然将https暴露在外面，但是client到fabio这个环节的数据传输不是在安全通道中。上面提到的方案2：fabiolb将收到的请求透传给后面的upsteam https服务，由client与upsteam https服务直接建立“安全数据通道”似乎更佳。fabiolb支持tcp端口的反向代理，我们基于tcp代理来暴露https服务到外部。

我们建立httpsbackend-tcp.nomad文件，考虑篇幅有限，我们仅列出差异化的部分：

job "httpsbackend-tcp" {

 ... ...

    service {
        name = "httpsbackend-tcp"
        tags = ["urlprefix-:9997 proto=tcp"]
        port = "https"
        check {
          name     = "alive"
          type     = "tcp"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

... ...

}

从httpsbackend-tcp.nomad文件，我们看到我们在9997这个tcp端口上暴露服务，tag为：“urlprefix-:9997 proto=tcp”，即凡是到达9997端口的流量，无论应用协议类型是什么，都转发到httpsbackend-tcp上，且通过tcp协议转发。

我们创建并测试一下该方案：

# nomad job run httpsbackend-tcp.nomad

# curl -k https://localhost:9997   //由于使用的是自签名证书，所有告诉curl不校验server端公钥数字证书
this is httpsbackendservice, version: v1.0.0

4. 多个https服务共享一个fabio端口

上面的基于tcp代理暴露https服务的方案还有一个问题，那就是每个https服务都要独占一个fabio listen的端口。那是否可以实现多个https服务使用一个fabio端口，并通过host name route呢？fabio支持tcp+sni的route策略。

SNI, 全称Server Name Indication，即服务器名称指示。它是一个扩展的TLS计算机联网协议。该协议允许在握手过程开始时通过客户端告诉它正在连接的服务器的主机名称。这允许服务器在相同的IP地址和TCP端口号上呈现多个证书，也就是允许在相同的IP地址上提供多个安全HTTPS网站（或其他任何基于TLS的服务），而不需要所有这些站点使用相同的证书。

接下来，我们就来看一下如何在fabio中让多个后端https服务共享一个Fabio服务端口(9996)。我们建立两个job：httpsbackend-sni-1和httpsbackend-sni-2。

//httpsbackend-tcp-sni-1.nomad

job "httpsbackend-sni-1" {

... ...

    service {
        name = "httpsbackend-sni-1"
        tags = ["urlprefix-mysite-sni-1.com/ proto=tcp+sni"]
        port = "https"
        check {
          name     = "alive"
          type     = "tcp"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

.... ...

}

//httpsbackend-tcp-sni-2.nomad

job "httpsbackend-sni-2" {

... ...

   task "httpsbackend-sni-2" {
      driver = "docker"
      config {
        image = "bigwhite/httpsbackendservice:v1.0.1"
        port_map {
          https = 7777
        }
        logging {
          type = "json-file"
        }
    }

    service {
        name = "httpsbackend-sni-2"
        tags = ["urlprefix-mysite-sni-2.com/ proto=tcp+sni"]
        port = "https"
        check {
          name     = "alive"
          type     = "tcp"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

.... ...

}

我们看到与之前的server tag不同的是：这里proto=tcp+sni，即告诉fabio建立sni路由。httpsbackend-sni-2 task与httpsbackend-sni-1不同之处在于其使用image为bigwhite/httpsbackendservice:v1.0.1，为的是能通过https的应答结果，将这两个服务区分开来。

除此之外，我们还看到tag中并不包含端口号了，而是直接采用host name作为路由匹配标识。

创建这两个job：

# nomad job run httpsbackend-tcp-sni-1.nomad
==> Monitoring evaluation "af170d98"
    Evaluation triggered by job "httpsbackend-sni-1"
    Allocation "8ea1cc8d" modified: node "7acdd7bc", group "httpsbackend-sni-1"
    Allocation "e16cdc73" modified: node "9e3ef19f", group "httpsbackend-sni-1"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "af170d98" finished with status "complete"

# nomad job run httpsbackend-tcp-sni-2.nomad
==> Monitoring evaluation "a77d3799"
    Evaluation triggered by job "httpsbackend-sni-2"
    Allocation "32df450c" modified: node "c281658a", group "httpsbackend-sni-2"
    Allocation "e1bf4871" modified: node "7acdd7bc", group "httpsbackend-sni-2"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a77d3799" finished with status "complete"

我们来分别访问这两个服务：

# curl -k https://mysite-sni-1.com:9996/
this is httpsbackendservice, version: v1.0.0

# curl -k https://mysite-sni-2.com:9996/
this is httpsbackendservice, version: v1.0.1

从返回的结果我们看到，通过9996，我们成功暴露出两个不同的https服务。

五. 小结

到这里，我们实现了我们的既定目标：

使用nomad实现了工作负载的创建和调度；
东西向流量通过consul机制实现；
通过fabio实现了http、https(through tcp)、多https(though tcp+sni)的服务暴露和负载均衡。

后续我们将进一步探索基于nomad实现负载的多种场景的升降级操作(滚动、金丝雀、蓝绿部署)、对非host网络的支持（比如weave network)等。

本文涉及到的源码文件在这里可以下载。

六. 参考资料

我的网课“Kubernetes实战：高可用集群搭建、配置、运维与应用”在慕课网上线了，感谢小伙伴们学习支持！

我爱发短信：企业级短信平台定制开发专家 https://tonybai.com/
smspush : 可部署在企业内部的定制化短信平台，三网覆盖，不惧大并发接入，可定制扩展；短信内容你来定，不再受约束, 接口丰富，支持长短信，签名可选。

著名云主机服务厂商DigitalOcean发布最新的主机计划，入门级Droplet配置升级为：1 core CPU、1G内存、25G高速SSD，价格5$/月。有使用DigitalOcean需求的朋友，可以打开这个链接地址：https://m.do.co/c/bff6eed92687 开启你的DO主机之路。

我的联系方式：

微博：https://weibo.com/bigwhite20xx
微信公众号：iamtonybai
博客：tonybai.com
github: https://github.com/bigwhite

微信赞赏：
img{512x368}

商务合作方式：撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。

基于consul实现微服务的服务发现和负载均衡

九月 10, 2018
0 条评论

一. 背景

随着2018年年初国务院办公厅联合多个部委共同发布了《国务院办公厅关于促进“互联网+医疗健康”发展的意见(国办发〔2018〕26号)》，国内医疗IT领域又迎来了一波互联网医院建设的高潮。不过互联网医院多基于实体医院建设，虽说挂了一个“互联网”的名号，但互联网医院系统也多与传统的院内系统，比如：HIS、LIS、PACS、EMR等共享院内的IT基础设施。

如果你略微了解过国内医院院内IT系统的现状，你就知道目前的多数医院的IT系统相比于互联网行业、电信等行业来说是相对“落伍”的，这种落伍不仅体现在IT基础设施的专业性和数量上，更体现在对新概念、新技术、新设计理念等应用上。虽然国内医院IT系统在技术层面呈现出“多样性”的特征，但整体上偏陈旧和保守 – - 你可以在全国范围内找到10-15年前的各种主流语言(VB、delphi、c#等实现的IT系统，并且系统架构多为两层C/S结构的。

近几年“互联网+医疗”的兴起的确在一些方面提升了医院的服务效率和水平，但这些互联网医疗系统多部署于院外，并主要集中在“做入口”。它们并不算是医院的核心系统：即没有这些互联网系统，医院的业务也是照常进行的(患者可以在传统的窗口办理所有院内业务，就是效率低罢了)。因此，虽然这些互联网医疗系统采用了先进的互联网系统设计理念和技术，但并没有真正提升院内系统的技术水平，它们也只能与院内那些“陈旧”的、难于扩展的系统做对接。

不过互联网医院与这些系统有所不同，虽然它依然“可有可无”，但它却是部署在院内IT基础设施上的系统，同时也受到了院内IT基础设施条件的限制。在我们即将上线的一个针对医院集团的互联网医院版本中，我们就遇到了“被限制”的问题。我们本想上线的Kubernetes集群因为院方提供的硬件“不足”而无法实施，只能“降级”为手工打造的基于consul的微服务服务发现和负载均衡平台，初步满足我们的系统需要。而从k8s到consul的实践过程，总是让我有一种从工业时代回到的农业时代或是“消费降级”的赶脚^_^。

本文就来说说基于当前较新版本的consul实现微服务的服务发现和负载均衡的过程。

二. 实验环境

这里有三台阿里云的ECS，即用作部署consul集群，也用来承载工作负载的节点（这点与真实生产环境还是蛮像的，医院也仅能提供类似的这点儿可怜的设备）：

consul-1: 192.168.0.129
consul-2: 192.168.0.130
consul-3: 192.168.0.131

操作系统：Ubuntu server 16.04.4 LTS
内核版本：4.4.0-117-generic

实验环境安装有：

实验所用的样例程序镜像：

三. 目标及方案原理

本次实验的最基础、最朴素的两个目标：

所有业务应用均基于容器运行
某业务服务容器启动后，会被自动注册服务，同时其他服务可以自动发现该服务并调用，并且到达这个服务的请求会负载均衡到服务的多个实例。

这里选择了与编程语言技术栈无关的、可搭建微服务的服务发现和负载均衡的Hashicorp的consul。关于consul是什么以及其基本原理和应用，可以参见我多年前写的这篇有关consul的文章。

但是光有consul还不够，我们还需要结合consul-template、gliderlab的registrator以及nginx共同来实现上述目标，原理示意图如下：

img{512x368}

原理说明：

对于每个biz node上启动的容器，位于每个node上的Registrator实例会监听到该节点上容器的创建和停止的event，并将容器的信息以consul service的形式写入consul或从consul删除。
位于每个nginx node上的consul-template实例会watch consul集群，监听到consul service的相关event，并将需要expose到external的service信息获取，按照事先定义好的nginx conf template重新生成nginx.conf并reload本节点的nginx，使得nginx的新配置生效。
对于内部服务来说（不通过nginx暴露到外部)，在被registrator写入consul的同时，也完成了在consul DNS的注册，其他服务可以通过特定域名的方式获取该内部服务的IP列表（A地址)和其他信息，比如端口(SRV)，并进而实现与这些内部服务的通信。

参考该原理，落地到我们实验环境的部署示意图如下：

img{512x368}

四. 步骤

下面说说详细的实验步骤。

1. 安装consul集群

首先我们先来安装consul集群。consul既支持二进制程序直接部署，也支持Docker容器化部署。如果consul集群单独部署在几个专用节点上，那么consul可以使用二种方式的任何一种。但是如果consul所在节点还承载工作负载，考虑consul作为整个分布式平台的核心，降低它与docker engine引擎的耦合（docker engine可能会因各种情况经常restart），还是建议以二进制程序形式直接部署在物理机或vm上。这里的实验环境资源有限，我们采用的是以二进制程序形式直接部署的方式。

consul最新版本是1.2.2（截至发稿时），consul 1.2.x版本与consul 1.1.x版本最大的不同在于consul 1.2.x支持service mesh了，这对于consul来说可是革新性的变化，因此这里担心其初期的稳定性，因此我们选择consul 1.1.0版本。

我们下载consul 1.1.0安装包后，将其解压到/usr/local/bin下。

在$HOME下建立consul-install目录，并在其下面存放consul集群的运行目录consul-data。在consul-install目录下，执行命令启动节点consul-1上的consul：

consul-1 node:

# nohup consul agent -server -ui -dns-port=53 -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-1 -client=0.0.0.0 -bind=192.168.0.129 -datacenter=dc1 > consul-1.log & 2>&1

# tail -100f consul-1.log
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.1.0'
           Node ID: 'd23b9495-4caa-9ef2-a1d5-7f20aa39fd15'
         Node name: 'consul-1'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 53)
      Cluster Addr: 192.168.0.129 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/09/10 10:21:09 [INFO] raft: Initial configuration (index=0): []
    2018/09/10 10:21:09 [INFO] raft: Node at 192.168.0.129:8300 [Follower] entering Follower state (Leader: "")
    2018/09/10 10:21:09 [INFO] serf: EventMemberJoin: consul-1.dc1 192.168.0.129
    2018/09/10 10:21:09 [INFO] serf: EventMemberJoin: consul-1 192.168.0.129
    2018/09/10 10:21:09 [INFO] consul: Adding LAN server consul-1 (Addr: tcp/192.168.0.129:8300) (DC: dc1)
    2018/09/10 10:21:09 [INFO] consul: Handled member-join event for server "consul-1.dc1" in area "wan"
    2018/09/10 10:21:09 [INFO] agent: Started DNS server 0.0.0.0:53 (tcp)
    2018/09/10 10:21:09 [INFO] agent: Started DNS server 0.0.0.0:53 (udp)
    2018/09/10 10:21:09 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2018/09/10 10:21:09 [INFO] agent: started state syncer
==> Newer Consul version available: 1.2.2 (currently running: 1.1.0)
    2018/09/10 10:21:15 [WARN] raft: no known peers, aborting election
    2018/09/10 10:21:17 [ERR] agent: failed to sync remote state: No cluster leader

我们的三个节点的consul都以server角色启动（consul agent -server）,consul集群初始有三个node( -bootstrap-expect=3)，均位于dc1 datacenter(-datacenter=dc1)，服务bind地址为192.168.0.129(-bind=192.168.0.129 )，允许任意client连接（ -client=0.0.0.0）。我们启动了consul ui(-ui)，便于以图形化的方式查看consul集群的状态。我们设置了consul DNS服务的端口号为53（-dns-port=53），这个后续会起到重要作用，这里先埋下小伏笔。

这里我们使用nohup+&符号的方式将consul运行于后台。生产环境建议使用systemd这样的init系统对consul的启停和配置更新进行管理。

从consul-1的输出日志来看，单节点并没有选出leader。我们需要继续在consul-2和consul-3两个节点上也重复consul-1上的操作，启动consul：

consul-2 node:

#nohup consul agent -server -ui -dns-port=53  -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-2 -client=0.0.0.0 -bind=192.168.0.130 -datacenter=dc1 -join 192.168.0.129 > consul-2.log & 2>&1

consul-3 node:

# nohup consul agent -server -ui -dns-port=53  -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-3 -client=0.0.0.0 -bind=192.168.0.131 -datacenter=dc1 -join 192.168.0.129 > consul-3.log & 2>&1

启动后，我们查看到consul-3.log中的日志:

    2018/09/10 10:24:01 [INFO] consul: New leader elected: consul-3
    2018/09/10 10:24:01 [WARN] raft: AppendEntries to {Voter a215865f-dba7-5caa-cfb3-6850316199a3 192.168.0.130:8300} rejected, sending older logs (next: 1)
    2018/09/10 10:24:01 [INFO] raft: pipelining replication to peer {Voter a215865f-dba7-5caa-cfb3-6850316199a3 192.168.0.130:8300}
    2018/09/10 10:24:01 [WARN] raft: AppendEntries to {Voter d23b9495-4caa-9ef2-a1d5-7f20aa39fd15 192.168.0.129:8300} rejected, sending older logs (next: 1)
    2018/09/10 10:24:01 [INFO] raft: pipelining replication to peer {Voter d23b9495-4caa-9ef2-a1d5-7f20aa39fd15 192.168.0.129:8300}
    2018/09/10 10:24:01 [INFO] consul: member 'consul-1' joined, marking health alive
    2018/09/10 10:24:01 [INFO] consul: member 'consul-2' joined, marking health alive
    2018/09/10 10:24:01 [INFO] consul: member 'consul-3' joined, marking health alive
    2018/09/10 10:24:01 [INFO] agent: Synced node info
==> Newer Consul version available: 1.2.2 (currently running: 1.1.0)

consul-3 node上的consul被选为初始leader了。我们可以通过consul提供的子命令查看集群状态：

#  consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
consul-3  0020b7aa-486a-5b44-b5fd-be000a380a89  192.168.0.131:8300  leader  true   3
consul-1  d23b9495-4caa-9ef2-a1d5-7f20aa39fd15  192.168.0.129:8300  follower  true   3
consul-2  a215865f-dba7-5caa-cfb3-6850316199a3  192.168.0.130:8300  follower    true   3

我们还可以通过consul ui以图形化方式查看集群状态和集群内存储的各种配置信息：

img{512x368}

至此，consul集群就搭建ok了。

2. 安装Nginx、consul-template和Registrator

根据前面的“部署示意图”，我们在consul-1和consul-2上安装nginx、consul-template和Registrator，在consul-3上安装Registrator。

a) Nginx的安装

我们使用ubuntu 16.04.4默认源中的nginx版本:1.10.3，通过apt-get install nginx安装nginx，这个无须赘述了。

b) consul-template的安装

consul-template是一个将consul集群中存储的信息转换为文件形式的工具。常用的场景是监听consul集群中数据的变化，并结合模板将数据持久化到某个文件中，再执行某一关联的action。比如我们这里通过consul-template监听consul集群中service信息的变化，并将service信息数据与nginx的配置模板结合，生成nginx可用的nginx.conf配置文件，并驱动nginx重新reload配置文件，使得nginx的配置更新生效。因此一般来说，哪里部署有nginx，我们就应该有一个配对的consul-template部署。

在我们的实验环境中consul-1和consul-2两个节点部署了nginx，因此我们需要在consul-1和consul-2两个节点上部署consul-template。我们直接安装comsul-template的二进制程序（我们使用0.19.5版本），下载安装包并解压后，将consul-template放入/usr/local/bin目录下：

# wget -c https://releases.hashicorp.com/consul-template/0.19.5/consul-template_0.19.5_linux_amd64.zip

# unzip consul-template_0.19.5_linux_amd64.zip
# mv consul-tempate /usr/local/bin
# consul-template -v
consul-template v0.19.5 (57b6c71)

这里先不启动consul-template，后续在注册不同服务的场景中，我们再启动consul-template。

c) Registrator的安装

Registrator是另外一种工具，它监听Docker引擎上发生的容器创建和停止事件，并将启动的容器信息以consul service的形式存储在consul集群中。因此，Registrator和node上的docker engine对应，有docker engine部署的节点上都应该安装有对应的Registator。因此我们要在实验环境的三个节点上都部署Registrator。

Registrator官方推荐的就是以Docker容器方式运行，但这里我并不使用lastest版本，而是用master版本，因为只有最新的master版本才支持service meta数据的写入，而当前的latest版本是v7版本，年头较长，并不支持service meta数据写入。

在所有实验环境节点上执行：

 # docker run --restart=always -d \
    --name=registrator \
    --net=host \
    --volume=/var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator:master\
      consul://localhost:8500

我们看到registrator将node节点上的/var/run/docker.sock映射到容器内部的/tmp/docker.sock上，通过这种方式registrator可以监听到node上docker引擎上的事件变化。registrator的另外一个参数：consul://localhost:8500则是Registrator要写入信息的consul地址（当然Registrator不仅仅支持consul，还支持etcd、zookeeper等），这里传入的是本node上consul server的地址和服务端口。

Registrator的启动日志如下：

# docker logs -f registrator
2018/09/10 05:56:39 Starting registrator v7 ...
2018/09/10 05:56:39 Using consul adapter: consul://localhost:8500
2018/09/10 05:56:39 Connecting to backend (0/0)
2018/09/10 05:56:39 consul: current leader  192.168.0.130:8300
2018/09/10 05:56:39 Listening for Docker events ...
2018/09/10 05:56:39 Syncing services on 1 containers
2018/09/10 05:56:39 ignored: 6ef6ae966ee5 no published ports

在所有节点都启动完Registrator后，我们来先查看一下当前consul集群中service的catelog以及每个catelog下的service的详细信息：

// consul-1:

# curl  http://localhost:8500/v1/catalog/services
{"consul":[]}

目前只有consul自己内置的consul service catelog，我们查看一下consul这个catelog service的详细信息：

// consul-1:

# curl  localhost:8500/v1/catalog/service/consul|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1189  100  1189    0     0   180k      0 --:--:-- --:--:-- --:--:--  193k
[
  {
    "ID": "d23b9495-4caa-9ef2-a1d5-7f20aa39fd15",
    "Node": "consul-1",
    "Address": "192.168.0.129",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.129",
      "wan": "192.168.0.129"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 5,
    "ModifyIndex": 5
  },
  {
    "ID": "a215865f-dba7-5caa-cfb3-6850316199a3",
    "Node": "consul-2",
    "Address": "192.168.0.130",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.130",
      "wan": "192.168.0.130"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 6,
    "ModifyIndex": 6
  },
  {
    "ID": "0020b7aa-486a-5b44-b5fd-be000a380a89",
    "Node": "consul-3",
    "Address": "192.168.0.131",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.131",
      "wan": "192.168.0.131"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 7,
    "ModifyIndex": 7
  }
]

3. 内部http服务的注册和发现

对于微服务而言，有暴露到外面的，也有仅运行在内部，被内部服务调用的。我们先来看看内部服务，这里以一个http服务为例。

对于暴露到外部的微服务而言，可以通过域名、路径、端口等来发现。但是对于内部服务，我们怎么发现呢？k8s中我们可以通过k8s集群的DNS插件进行自动域名解析实现，每个pod中container的DNS server指向的就是k8s dns server。这样service之间可以通过使用固定规则的域名(比如：your_svc.default.svc.cluster.local)来访问到另外一个service(仅需配置一个service name)，再通过service实现该服务请求负载均衡到service关联的后端endpoint(pod container)上。consul集群也可以做到这点，并使用consul提供的DNS服务来实现内部服务的发现。

我们需要对三个节点的DNS配置进行update，将consul DNS server加入到主机DNS resolver(这也是之前在启动consul时将consul DNS的默认监听端口从8600改为53的原因)，步骤如下：

编辑/etc/resolvconf/resolv.conf.d/base，加入一行：

nameserver 127.0.0.1

重启resolveconf服务

 /etc/init.d/resolvconf restart

再查看/etc/resolve.conf文件：

# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.136
nameserver 100.100.2.138
nameserver 127.0.0.1
options timeout:2 attempts:3 rotate single-request-reopen

我们发现127.0.0.1这个DNS server地址已经被加入到/etc/resolv.conf中了（切记：不要直接手工修改/etc/resolve.conf）。

好了！有了consul DNS，我们就可以发现consul中的服务了。consul给其集群内部的service一个默认的域名：your_svc.service.{data-center}.consul. 之前我们查看了cluster中只有一个consul catelog service，我们就来访问一下该consul service：

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.129) 56(84) bytes of data.
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=1 ttl=64 time=0.029 ms
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=2 ttl=64 time=0.025 ms
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=3 ttl=64 time=0.031 ms

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.130) 56(84) bytes of data.
64 bytes from 192.168.0.130: icmp_seq=1 ttl=64 time=0.186 ms
64 bytes from 192.168.0.130: icmp_seq=2 ttl=64 time=0.136 ms
64 bytes from 192.168.0.130: icmp_seq=3 ttl=64 time=0.195 ms

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.131) 56(84) bytes of data.
64 bytes from 192.168.0.131: icmp_seq=1 ttl=64 time=0.149 ms
64 bytes from 192.168.0.131: icmp_seq=2 ttl=64 time=0.184 ms
64 bytes from 192.168.0.131: icmp_seq=3 ttl=64 time=0.179 ms

我们看到consul服务有三个实例，因此DNS轮询在不同ping命令执行时返回了不同的地址。

现在在主机层面上，我们可以发现consul中的service了。如果我们的服务调用者跑在docker container中，我们还能找到consul服务么？

# docker run busybox ping consul.service.dc1.consul
ping: bad address 'consul.service.dc1.consul'

事实告诉我们：不行！

那么我们如何让运行于docker container中的服务调用者也能发现consul中的service呢？我们需要给docker引擎指定DNS：

在/etc/docker/daemon.json中添加下面配置:

{
    "dns": ["node_ip", "8.8.8.8"] //node_ip： consul_1为192.168.0.129、consul_2为192.168.0.130、consul_3为192.168.0.131
}

重启docker引擎后，再尝试在容器内发现consul服务：

# docker run busybox ping consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.131): 56 data bytes
64 bytes from 192.168.0.131: seq=0 ttl=63 time=0.268 ms
64 bytes from 192.168.0.131: seq=1 ttl=63 time=0.245 ms
64 bytes from 192.168.0.131: seq=2 ttl=63 time=0.235 ms

这次就ok了！

接下来我们在三个节点上以容器方式启动我们的一个内部http服务demo httpbackend：

# docker run --restart=always -d  -l "SERVICE_NAME=httpbackend" -p 8081:8081 bigwhite/httpbackendservice:v1.0.0

我们查看一下consul集群内的httpbackend service信息：

# curl  localhost:8500/v1/catalog/service/httpbackend|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1374  100  1374    0     0   519k      0 --:--:-- --:--:-- --:--:--  670k
[
  {
    "ID": "d23b9495-4caa-9ef2-a1d5-7f20aa39fd15",
    "Node": "consul-1",
    "Address": "192.168.0.129",
   ...
  },
  {
    "ID": "a215865f-dba7-5caa-cfb3-6850316199a3",
    "Node": "consul-2",
    "Address": "192.168.0.130",
   ...
  },
  {
    "ID": "0020b7aa-486a-5b44-b5fd-be000a380a89",
    "Node": "consul-3",
    "Address": "192.168.0.131",
   ...
  }
]

再访问一下该服务：

# curl httpbackend.service.dc1.consul:8081
this is httpbackendservice, version: v1.0.0

内部服务发现成功！

4. 暴露外部http服务

说完了内部服务，我们再来说说那些要暴露到外部的服务，这个环节就轮到consul-template登场了！在我们的实验中，consul-template读取consul中service信息，并结合模板生成nginx配置文件。我们基于默认安装的/etc/nginx/nginx.conf文件内容来编写我们的模板。我们先实验暴露http服务到外面。下面是模板样例：

//nginx.conf.template

.... ...

http {
        ... ...
        ##
        # Virtual Host Configs
        ##

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;

        #
        # http server config
        #

        {{range services -}}
        {{$name := .Name}}
        {{$service := service .Name}}
        {{- if in .Tags "http" -}}
        upstream {{$name}} {
          zone upstream-{{$name}} 64k;
          {{range $service}}
          server {{.Address}}:{{.Port}} max_fails=3 fail_timeout=60 weight=1;
          {{end}}
        }{{end}}
        {{end}}

        {{- range services -}} {{$name := .Name}}
        {{- if in .Tags "http" -}}
        server {
          listen 80;
          server_name {{$name}}.tonybai.com;

          location / {
            proxy_pass http://{{$name}};
          }
        }
        {{end}}
        {{end}}

}

consul-template使用的模板采用的是go template的语法。我们看到在http block中，我们要为consul中的每个要expose到外部的catelog service定义一个server block(对应的域名为your_svc.tonybai.com)和一个upstream block。

对上面的模板做简单的解析，弄明白三点，模板基本就全明白了：

{{- range services -}}：标准的{{ range pipeline }}模板语法，services这个pipeline的调用相当于： curl localhost:8500/v1/catalog/services，即获取catelog services列表。这个列表中的每项仅有Name和Tags两个字段可用。
{{- if in .Tags “http” -}}：判断语句，即如果Tags字段中有http这个tag，那么则暴露该catelog service。
{{range $service}}：也是标准的{{ range pipeline }}模板语法，$service这个pipeline调用相当于curl localhost:8500/v1/catalog/service/xxxx，即获取某个service xxx的详细信息，包括Address、Port、Tag、Meta等。

接下来，我们在consul-1和consul-2上启动consul-template：

consul-1:
# nohup  consul-template -template "/root/consul-install/templates/nginx.conf.template:/etc/nginx/nginx.conf:nginx -s reload" > consul-template.log & 2>&1

consul-2:
# nohup  consul-template -template "/root/consul-install/templates/nginx.conf.template:/etc/nginx/nginx.conf:nginx -s reload" > consul-template.log & 2>&1

查看/etc/nginx/nginx.conf，你会发现http server config下面并没有生成任何配置，因为consul集群中还没有满足Tag条件的service（包含tag “http”)。现在我们就来在三个node上创建httpfront services。

# docker run --restart=always -d -l "SERVICE_NAME=httpfront" -l "SERVICE_TAGS=http" -P bigwhite/httpfrontservice:v1.0.0

查看生成的nginx.conf:

upstream httpfront {
      zone upstream-httpfront 64k;

          server 192.168.0.129:32769 max_fails=3 fail_timeout=60 weight=1;

          server 192.168.0.130:32768 max_fails=3 fail_timeout=60 weight=1;

          server 192.168.0.131:32768 max_fails=3 fail_timeout=60 weight=1;

    }

    server {
      listen 80;
          server_name httpfront.tonybai.com;

      location / {
        proxy_pass http://httpfront;
      }
    }

测试一下httpfront.tonybai.com(可通过修改/etc/hosts)，httpfront service会调用内部服务httpbackend(通过httpbackend.service.dc1.consul:8081访问)：

# curl httpfront.tonybai.com
this is httpfrontservice, version: v1.0.0, calling backendservice ok, its resp: [this is httpbackendservice, version: v1.0.0
]

可以在各个节点上查看httpfront的日志：(通过docker logs)，你会发现到httpfront.tonybai.com的请求被均衡到了各个节点上的httpfront service上了：

{GET / HTTP/1.0 1 0 map[Connection:[close] User-Agent:[curl/7.47.0] Accept:[*/*]] {} <nil> 0 [] true httpfront map[] map[] <nil> map[] 192.168.0.129:35184 / <nil> <nil> <nil> 0xc0000524c0}
calling backendservice...
{200 OK 200 HTTP/1.1 1 1 map[Date:[Mon, 10 Sep 2018 08:23:33 GMT] Content-Length:[44] Content-Type:[text/plain; charset=utf-8]] 0xc0000808c0 44 [] false false map[] 0xc000132600 <nil>}
this is httpbackendservice, version: v1.0.0

5. 暴露外部tcp服务

我们的微服务可不仅仅有http服务的，还有直接暴露tcp socket服务的。nginx对tcp的支持是通过stream block支持的。在stream block中，我们来为每个要暴露在外面的tcp service生成server block和upstream block，这部分模板内容如下：

stream {
   {{- range services -}}
   {{$name := .Name}}
   {{$service := service .Name}}
     {{- if in .Tags "tcp" -}}
  upstream {{$name}} {
    least_conn;
    {{- range $service}}
    server {{.Address}}:{{.Port}} max_fails=3 fail_timeout=30s weight=5;
    {{ end }}
  }
     {{end}}
  {{end}}

   {{- range services -}}
   {{$name := .Name}}
   {{$nameAndPort := $name | split "-"}}
    {{- if in .Tags "tcp" -}}
  server {
      listen {{ index $nameAndPort 1 }};
      proxy_pass {{$name}};
  }
    {{end}}
   {{end}}
}

和之前的http服务模板相比，这里的Tag过滤词换为了“tcp”，并且由于端口具有排他性，这里用”名字-端口”串来作为service的name以及upstream block的标识。用一个例子来演示会更加清晰。由于修改了nginx模板，在演示demo前，需要重启一下各个consul-template。

然后我们在各个节点上启动tcpfront service（注意服务名为tcpfront-9999，9999是tcpfrontservice expose到外部的端口）：

# docker run -d --restart=always -l "SERVICE_TAGS=tcp" -l "SERVICE_NAME=tcpfront-9999" -P bigwhite/tcpfrontservice:v1.0.0

启动后，我们查看一下生成的nginx.conf:

stream {

   upstream tcpfront-9999 {
    least_conn;
    server 192.168.0.129:32770 max_fails=3 fail_timeout=30s weight=5;

    server 192.168.0.130:32769 max_fails=3 fail_timeout=30s weight=5;

    server 192.168.0.131:32769 max_fails=3 fail_timeout=30s weight=5;

  }

   server {
      listen 9999;
      proxy_pass tcpfront-9999;
  }

}

nginx对外的9999端口对应到集群内的tcpfront服务！这个tcpfront是一个echo服务，我们来测试一下：

# telnet localhost 9999
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
[v1.0.0]2018-09-10 08:56:15.791728641 +0000 UTC m=+531.620462772 [hello
]
tonybai
[v1.0.0]2018-09-10 08:56:17.658482957 +0000 UTC m=+533.487217127 [tonybai
]

基于暴露tcp服务，我们还可以实现将全透传的https服务暴露到外部。所谓全透传的https服务，即ssl证书配置在服务自身，而不是nginx上面。其实现方式与暴露tcp服务相似，这里就不举例了。

五. 小结

以上基于consul+consul-template+registrator+nginx实现了一个基本的微服务服务发现和负载均衡框架，但要应用到生产环境还需一些进一步的考量。

关于服务治理的一些功能，consul 1.2.x版本已经加入了service mesh的support，后续在成熟后可以考虑upgrade consul cluster。

consul-template在v0.19.5中还不支持servicemeta的，但在master版本中已经支持，后续利用新版本的consul-template可以实现功能更为丰富的模板，比如实现灰度发布等。

51短信平台：企业级短信平台定制开发专家 https://tonybai.com/
smspush : 可部署在企业内部的定制化短信平台，三网覆盖，不惧大并发接入，可定制扩展；短信内容你来定，不再受约束, 接口丰富，支持长短信，签名可选。

我的联系方式：

微博：https://weibo.com/bigwhite20xx
微信公众号：iamtonybai
博客：tonybai.com
github: https://github.com/bigwhite

微信赞赏：
img{512x368}

商务合作方式：撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。

标签 loadbalance 下的文章

使用nomad实现集群管理和微服务部署调度

一. 安装nomad集群

1. consul集群启动

2. DNS设置（可选）

3. 基于consul集群引导启动nomad集群

二. 部署工作负载

三. 将服务暴露到外部以及负载均衡

1. 部署fabio

2. 暴露HTTP服务到外部

四. 暴露HTTPS、TCP服务到外部

1. 定制fabio

2. https服务

3. 基于tcp代理暴露https服务

4. 多个https服务共享一个fabio端口

五. 小结

六. 参考资料

基于consul实现微服务的服务发现和负载均衡

一. 背景

二. 实验环境

三. 目标及方案原理

四. 步骤

1. 安装consul集群

2. 安装Nginx、consul-template和Registrator

3. 内部http服务的注册和发现

4. 暴露外部http服务

5. 暴露外部tcp服务

五. 小结

欢迎使用邮件订阅我的博客

文章

评论

分类

归档

链接

开源项目

翻译项目