Raft | Tony Bai

Articles tagged raft

Service discovery and load balancing for microservices with consul

September 10, 2018

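I. Background

In early 2018 the General Office of the State Council, together with several ministries, issued the "Opinions of the General Office of the State Council on Promoting the Development of 'Internet + Healthcare'" (Guo Ban Fa [2018] No. 26), and China's healthcare IT sector saw another wave of internet hospital construction. Most internet hospitals, however, are built on top of brick-and-mortar hospitals. Despite the "internet" label, an internet hospital system usually shares the in-hospital IT infrastructure with traditional in-hospital systems such as HIS, LIS, PACS and EMR.

If you know a little about the state of in-hospital IT systems in China, you know that most hospitals' IT systems lag behind those of the internet and telecom industries. The lag shows not only in the professionalism and quantity of the IT infrastructure, but even more in the adoption of new concepts, new technologies and new design ideas. Hospital IT systems in China are technically "diverse", but on the whole dated and conservative: across the country you can find IT systems written in every mainstream language of 10 to 15 years ago (VB, Delphi, C# and so on), most of them with a two-tier C/S architecture.

The rise of "Internet + healthcare" in recent years has indeed improved hospital service efficiency and quality in some respects, but these internet healthcare systems are mostly deployed outside the hospital and focus mainly on "being the entry point". They are not core hospital systems: without them, hospital business carries on as usual (patients can handle all in-hospital business at the traditional counters, just less efficiently). So although these systems use advanced internet design ideas and technologies, they have not really raised the technical level of in-hospital systems, and they can only integrate with those "dated", hard-to-extend in-hospital systems.

An internet hospital is different from these systems. It is still "optional", but it is deployed on the in-hospital IT infrastructure and is therefore constrained by it. In an internet hospital release for a hospital group that we are about to launch, we ran into exactly this kind of constraint. The Kubernetes cluster we had planned could not be deployed because the hardware provided by the hospital was "insufficient", so we had to "downgrade" to a hand-built service discovery and load balancing platform for microservices based on consul, which meets our system's needs for now. Going from k8s to consul keeps giving me the feeling of moving from the industrial age back to the agricultural age, or of a "consumption downgrade" ^_^.

This article walks through implementing service discovery and load balancing for microservices on a fairly recent version of consul.

II. Lab environment

We have three Alibaba Cloud ECS instances, used both to run the consul cluster and to carry the workloads (much like the real production environment, where the hospital can only provide a similarly meagre set of machines):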

consul-1: 192.168.0.129
consul-2: 192.168.0.130
consul-3: 192.168.0.131

OS: Ubuntu Server 16.04.4 LTS
Kernel: 4.4.0-117-generic

Installed in the lab environment:

Sample application images used in the experiment:

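III. Goals and how the solution works

The experiment has two basic, plain goals:

- All business applications run in containers.
- When a business service container starts, the service is registered automatically; other services can discover and call it automatically, and requests to the service are load-balanced across its instances.

We chose Hashicorp's consul, which is independent of programming language and tech stack and can provide service discovery and load balancing for microservices. For what consul is, how it works and how it is used, see the article on consul that I wrote years ago.

consul alone is not enough, though. We also need consul-template, gliderlabs' registrator and nginx to reach the goals above. The schematic is as follows: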

img{512x368}

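How it works:

- For every container started on a biz node, the Registrator instance on that node picks up the container's create and stop events and writes the container's information into consul as a consul service, or removes it from consul.
- The consul-template instance on each nginx node watches the consul cluster for service-related events, fetches the information of the services that need to be exposed externally, regenerates nginx.conf from a predefined nginx conf template and reloads the local nginx so that the new configuration takes effect.
- Internal services (those not exposed through nginx) are registered in consul DNS at the moment registrator writes them into consul. Other services can then use a specific domain name to obtain the service's IP list (A records) and other information such as ports (SRV records), and communicate with the internal service.

Applying this design to our lab environment gives the following deployment diagram: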

img{512x368}

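IV. Steps

The detailed steps follow.

1. Installing the consul cluster

First we install the consul cluster. consul can be deployed directly as a binary or as a Docker container. If the consul cluster runs on a few dedicated nodes, either way works. But if the consul nodes also carry workloads, then, since consul is the core of the whole distributed platform, it is better to decouple it from the docker engine (which may restart often for various reasons) and deploy it as a binary directly on the physical machine or VM. Resources are limited in this lab environment, so we deploy the binary directly.

The latest consul version at the time of writing is 1.2.2. The biggest difference between consul 1.2.x and 1.1.x is that 1.2.x supports service mesh, a major change for consul. Being wary of its early stability, we chose consul 1.1.0.

After downloading the consul 1.1.0 package, we unzip it into /usr/local/bin.

Create a consul-install directory under $HOME and keep the cluster's data directory, consul-data, inside it. From the consul-install directory, run the following command to start consul on node consul-1: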

consul-1 node:

# nohup consul agent -server -ui -dns-port=53 -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-1 -client=0.0.0.0 -bind=192.168.0.129 -datacenter=dc1 > consul-1.log 2>&1 &

# tail -100f consul-1.log
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.1.0'
           Node ID: 'd23b9495-4caa-9ef2-a1d5-7f20aa39fd15'
         Node name: 'consul-1'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 53)
      Cluster Addr: 192.168.0.129 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2018/09/10 10:21:09 [INFO] raft: Initial configuration (index=0): []
    2018/09/10 10:21:09 [INFO] raft: Node at 192.168.0.129:8300 [Follower] entering Follower state (Leader: "")
    2018/09/10 10:21:09 [INFO] serf: EventMemberJoin: consul-1.dc1 192.168.0.129
    2018/09/10 10:21:09 [INFO] serf: EventMemberJoin: consul-1 192.168.0.129
    2018/09/10 10:21:09 [INFO] consul: Adding LAN server consul-1 (Addr: tcp/192.168.0.129:8300) (DC: dc1)
    2018/09/10 10:21:09 [INFO] consul: Handled member-join event for server "consul-1.dc1" in area "wan"
    2018/09/10 10:21:09 [INFO] agent: Started DNS server 0.0.0.0:53 (tcp)
    2018/09/10 10:21:09 [INFO] agent: Started DNS server 0.0.0.0:53 (udp)
    2018/09/10 10:21:09 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2018/09/10 10:21:09 [INFO] agent: started state syncer
==> Newer Consul version available: 1.2.2 (currently running: 1.1.0)
    2018/09/10 10:21:15 [WARN] raft: no known peers, aborting election
    2018/09/10 10:21:17 [ERR] agent: failed to sync remote state: No cluster leader

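All three consul nodes start in the server role (consul agent -server). The cluster initially expects three nodes (-bootstrap-expect=3), all in the dc1 datacenter (-datacenter=dc1). The bind address on this node is 192.168.0.129 (-bind=192.168.0.129), and the client interfaces listen on all addresses (-client=0.0.0.0), so clients can connect from anywhere. We enable the consul UI (-ui) so the cluster state can be viewed graphically. We set the consul DNS port to 53 (-dns-port=53); this will matter later, so consider it a small piece of foreshadowing.

Here we run consul in the background with nohup and &. In production, use an init system such as systemd to manage starting, stopping and reconfiguring consul.

The log of consul-1 shows that a single node does not elect a leader. We need to repeat the procedure on consul-2 and consul-3 and start consul there as well: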

consul-2 node:

# nohup consul agent -server -ui -dns-port=53 -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-2 -client=0.0.0.0 -bind=192.168.0.130 -datacenter=dc1 -join 192.168.0.129 > consul-2.log 2>&1 &

consul-3 node:

# nohup consul agent -server -ui -dns-port=53 -bootstrap-expect=3 -data-dir=/root/consul-install/consul-data -node=consul-3 -client=0.0.0.0 -bind=192.168.0.131 -datacenter=dc1 -join 192.168.0.129 > consul-3.log 2>&1 &

After startup, consul-3.log shows:

    2018/09/10 10:24:01 [INFO] consul: New leader elected: consul-3
    2018/09/10 10:24:01 [WARN] raft: AppendEntries to {Voter a215865f-dba7-5caa-cfb3-6850316199a3 192.168.0.130:8300} rejected, sending older logs (next: 1)
    2018/09/10 10:24:01 [INFO] raft: pipelining replication to peer {Voter a215865f-dba7-5caa-cfb3-6850316199a3 192.168.0.130:8300}
    2018/09/10 10:24:01 [WARN] raft: AppendEntries to {Voter d23b9495-4caa-9ef2-a1d5-7f20aa39fd15 192.168.0.129:8300} rejected, sending older logs (next: 1)
    2018/09/10 10:24:01 [INFO] raft: pipelining replication to peer {Voter d23b9495-4caa-9ef2-a1d5-7f20aa39fd15 192.168.0.129:8300}
    2018/09/10 10:24:01 [INFO] consul: member 'consul-1' joined, marking health alive
    2018/09/10 10:24:01 [INFO] consul: member 'consul-2' joined, marking health alive
    2018/09/10 10:24:01 [INFO] consul: member 'consul-3' joined, marking health alive
    2018/09/10 10:24:01 [INFO] agent: Synced node info
==> Newer Consul version available: 1.2.2 (currently running: 1.1.0)

The consul on node consul-3 has been elected the initial leader. We can check the cluster state with a consul subcommand:

#  consul operator raft list-peers
Node      ID                                    Address             State     Voter  RaftProtocol
consul-3  0020b7aa-486a-5b44-b5fd-be000a380a89  192.168.0.131:8300  leader    true   3
consul-1  d23b9495-4caa-9ef2-a1d5-7f20aa39fd15  192.168.0.129:8300  follower  true   3
consul-2  a215865f-dba7-5caa-cfb3-6850316199a3  192.168.0.130:8300  follower  true   3
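
Why consul-1 could not elect a leader on its own comes down to two rules: with -bootstrap-expect=3 the servers hold off bootstrapping until three of them know about each other, and once the cluster exists Raft needs a majority of voters to elect a leader or commit a write. The arithmetic can be sketched in Python as follows (the function names are mine and not part of consul):

```python
def quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with `voters` voting servers."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """How many voters may be lost while a majority still remains."""
    return voters - quorum(voters)


for n in (1, 3, 5):
    print(f"{n} voters: quorum={quorum(n)}, tolerated failures={failures_tolerated(n)}")
```

For the three voters listed above, the quorum is two, so the cluster survives the loss of one server; losing two leaves it without a leader.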

We can also use the consul UI to view the cluster state and the configuration stored in the cluster graphically:

img{512x368}

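With that, the consul cluster is up.

2. Installing Nginx, consul-template and Registrator

Following the deployment diagram above, we install nginx, consul-template and Registrator on consul-1 and consul-2, and Registrator on consul-3.

a) Installing Nginx

We use nginx 1.10.3 from the default Ubuntu 16.04.4 repositories, installed with apt-get install nginx. Nothing more to say here.

b) Installing consul-template

consul-template is a tool that turns information stored in the consul cluster into files. A common use is to watch for data changes in the consul cluster, render the data through a template into a file, and then run an associated action. Here, for example, consul-template watches for changes to service information in the consul cluster, combines the service data with an nginx configuration template to produce a usable nginx.conf, and makes nginx reload its configuration so the update takes effect. As a rule, wherever nginx is deployed there should be a matching consul-template.

In our lab environment nginx runs on consul-1 and consul-2, so consul-template must be deployed on those two nodes. We install the consul-template binary directly (version 0.19.5): download the package, unzip it and put consul-template in /usr/local/bin: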

# wget -c https://releases.hashicorp.com/consul-template/0.19.5/consul-template_0.19.5_linux_amd64.zip

# unzip consul-template_0.19.5_linux_amd64.zip
# mv consul-template /usr/local/bin
# consul-template -v
consul-template v0.19.5 (57b6c71)

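We do not start consul-template yet; we will start it later, in the scenarios where different services are registered.

c) Installing Registrator

Registrator is another tool. It listens for container create and stop events on the Docker engine and stores the information of started containers in the consul cluster as consul services. Registrator therefore goes hand in hand with the docker engine on a node: every node that runs a docker engine should have a Registrator installed. So we deploy Registrator on all three nodes of the lab environment.

The Registrator project recommends running it as a Docker container. Here I use the master tag rather than latest, because only the newest master build supports writing service meta data. The current latest tag is v7, which is quite old and does not support it.

Run on every node of the lab environment: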

# docker run --restart=always -d \
    --name=registrator \
    --net=host \
    --volume=/var/run/docker.sock:/tmp/docker.sock \
    gliderlabs/registrator:master \
      consul://localhost:8500

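We can see that registrator maps the node's /var/run/docker.sock to /tmp/docker.sock inside the container, which lets registrator listen to events on the node's docker engine. The other argument, consul://localhost:8500, is the address of the consul that Registrator writes to (Registrator supports not only consul but also etcd, zookeeper and others); here it is the address and port of the consul server on the same node.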
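
To make the data flow concrete, here is a minimal Python sketch of the kind of request body a Registrator-like tool could send to the local agent's /v1/agent/service/register endpoint when a container starts. It is not Registrator's actual code: the function name build_registration, the ID layout and the container port 8080 are assumptions made for illustration, while SERVICE_NAME and SERVICE_TAGS are the container labels used later in this article.

```python
def build_registration(hostname, container, host_ip, host_port, container_port, labels):
    """Build a Consul service-registration body from container metadata.

    The field names (ID, Name, Tags, Address, Port) are those accepted by
    Consul's agent service registration API; everything else is illustrative.
    """
    raw_tags = labels.get("SERVICE_TAGS", "")
    return {
        # Assumed ID layout: unique per node, container and container port.
        "ID": f"{hostname}:{container}:{container_port}",
        # Fall back to the container name when no SERVICE_NAME label is set.
        "Name": labels.get("SERVICE_NAME", container),
        # SERVICE_TAGS is treated as a comma-separated list; drop empty entries.
        "Tags": [t.strip() for t in raw_tags.split(",") if t.strip()],
        "Address": host_ip,   # node address that other services connect to
        "Port": host_port,    # published host port, not the container port
    }


body = build_registration(
    "consul-1", "httpfront", "192.168.0.129", 32769, 8080,  # 8080 is hypothetical
    {"SERVICE_NAME": "httpfront", "SERVICE_TAGS": "http"},
)
print(body)
```

The point of the sketch is the mapping: labels on the container become the service name and tags, and the node address plus the published host port become the endpoint that consul hands out to callers.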

Registrator's startup log:

# docker logs -f registrator
2018/09/10 05:56:39 Starting registrator v7 ...
2018/09/10 05:56:39 Using consul adapter: consul://localhost:8500
2018/09/10 05:56:39 Connecting to backend (0/0)
2018/09/10 05:56:39 consul: current leader  192.168.0.130:8300
2018/09/10 05:56:39 Listening for Docker events ...
2018/09/10 05:56:39 Syncing services on 1 containers
2018/09/10 05:56:39 ignored: 6ef6ae966ee5 no published ports

With Registrator running on every node, let's look at the service catalog in the consul cluster and the details of each service in it:

// consul-1:

# curl  http://localhost:8500/v1/catalog/services
{"consul":[]}

At the moment the catalog contains only consul's own built-in consul service. Let's look at its details:

// consul-1:

# curl  localhost:8500/v1/catalog/service/consul|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1189  100  1189    0     0   180k      0 --:--:-- --:--:-- --:--:--  193k
[
  {
    "ID": "d23b9495-4caa-9ef2-a1d5-7f20aa39fd15",
    "Node": "consul-1",
    "Address": "192.168.0.129",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.129",
      "wan": "192.168.0.129"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 5,
    "ModifyIndex": 5
  },
  {
    "ID": "a215865f-dba7-5caa-cfb3-6850316199a3",
    "Node": "consul-2",
    "Address": "192.168.0.130",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.130",
      "wan": "192.168.0.130"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 6,
    "ModifyIndex": 6
  },
  {
    "ID": "0020b7aa-486a-5b44-b5fd-be000a380a89",
    "Node": "consul-3",
    "Address": "192.168.0.131",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.0.131",
      "wan": "192.168.0.131"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceID": "consul",
    "ServiceName": "consul",
    "ServiceTags": [],
    "ServiceAddress": "",
    "ServiceMeta": {},
    "ServicePort": 8300,
    "ServiceEnableTagOverride": false,
    "CreateIndex": 7,
    "ModifyIndex": 7
  }
]

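3. Registering and discovering an internal http service

Some microservices are exposed externally; others run only internally and are called by other internal services. Let's look at internal services first, using an http service as the example.

Externally exposed microservices can be discovered by domain name, path, port and so on. But how do we discover internal services? In k8s, automatic name resolution is provided by the cluster's DNS add-on: the DNS server of the containers in each pod points at the k8s DNS server. Services can then reach one another through domain names that follow a fixed rule (for example your_svc.default.svc.cluster.local), so only a service name needs to be configured, and the service load-balances requests across its backend endpoints (pod containers). A consul cluster can do the same, using the DNS service that consul provides for internal service discovery.

We need to update the DNS configuration on all three nodes and add the consul DNS server to the host's DNS resolver. This is why we changed consul's DNS port from the default 8600 to 53 when starting consul: a nameserver entry in resolv.conf cannot carry a port. Steps:

Edit /etc/resolvconf/resolv.conf.d/base and add one line: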

nameserver 127.0.0.1

Restart the resolvconf service:

 /etc/init.d/resolvconf restart

Then check /etc/resolv.conf:

# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.136
nameserver 100.100.2.138
nameserver 127.0.0.1
options timeout:2 attempts:3 rotate single-request-reopen

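127.0.0.1 has now been added to /etc/resolv.conf as a DNS server (remember: do not edit /etc/resolv.conf by hand).

Good! With consul DNS in place we can discover the services in consul. consul gives each service in the cluster a default domain name: your_svc.service.{data-center}.consul. Earlier we saw that the cluster has only the consul service in its catalog, so let's access that: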

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.129) 56(84) bytes of data.
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=1 ttl=64 time=0.029 ms
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=2 ttl=64 time=0.025 ms
64 bytes from iZbp15tvx7it019hvy750tZ (192.168.0.129): icmp_seq=3 ttl=64 time=0.031 ms

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.130) 56(84) bytes of data.
64 bytes from 192.168.0.130: icmp_seq=1 ttl=64 time=0.186 ms
64 bytes from 192.168.0.130: icmp_seq=2 ttl=64 time=0.136 ms
64 bytes from 192.168.0.130: icmp_seq=3 ttl=64 time=0.195 ms

# ping -c 3 consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.131) 56(84) bytes of data.
64 bytes from 192.168.0.131: icmp_seq=1 ttl=64 time=0.149 ms
64 bytes from 192.168.0.131: icmp_seq=2 ttl=64 time=0.184 ms
64 bytes from 192.168.0.131: icmp_seq=3 ttl=64 time=0.179 ms

The consul service has three instances, so DNS round-robin returned a different address on different runs of ping.
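
The name used above follows the lookup form documented for consul DNS: an optional tag, the service name, the literal label "service", an optional datacenter, and the domain (consul by default). A small helper shows how such names are assembled; the function name consul_dns_name is hypothetical and the sketch only builds strings, it does not query DNS:

```python
def consul_dns_name(service, datacenter=None, tag=None, domain="consul"):
    """Assemble a consul DNS service lookup name from its parts."""
    labels = []
    if tag:
        labels.append(tag)             # optional tag filter comes first
    labels.extend([service, "service"])
    if datacenter:
        labels.append(datacenter)      # omitted: the agent's own datacenter
    labels.append(domain)
    return ".".join(labels)


print(consul_dns_name("consul", datacenter="dc1"))
print(consul_dns_name("consul"))
```

An A query for such a name returns the addresses of the service's instances; an SRV query additionally carries the port of each instance.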

So at the host level we can now discover services in consul. If the service caller runs inside a docker container, can it still find the consul service?

# docker run busybox ping consul.service.dc1.consul
ping: bad address 'consul.service.dc1.consul'

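The facts say: no!

So how do we let service callers running in docker containers discover services in consul as well? We need to point the docker engine at our DNS.

Add the following to /etc/docker/daemon.json, replacing node_ip with the address of the node itself (192.168.0.129 on consul-1, 192.168.0.130 on consul-2, 192.168.0.131 on consul-3). JSON does not allow comments, so the file must contain only the configuration:

{
    "dns": ["node_ip", "8.8.8.8"]
}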
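
Since daemon.json has to be strict JSON and differs on every node only by the node's own address, one way to avoid hand-editing mistakes is to generate it. The following Python sketch prints the content for each node; the helper name daemon_json is hypothetical, and it assumes the file holds no other docker settings (if it does, merge the "dns" key into the existing content instead):

```python
import json

# Node addresses from the lab environment described above.
NODE_IPS = {
    "consul-1": "192.168.0.129",
    "consul-2": "192.168.0.130",
    "consul-3": "192.168.0.131",
}


def daemon_json(node_ip, fallback_dns="8.8.8.8"):
    """Return daemon.json content that puts the node's consul DNS first."""
    return json.dumps({"dns": [node_ip, fallback_dns]}, indent=4)


for node, ip in NODE_IPS.items():
    print(f"# {node}")
    print(daemon_json(ip))
```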

After restarting the docker engine, try discovering the consul service from inside a container again:

# docker run busybox ping consul.service.dc1.consul
PING consul.service.dc1.consul (192.168.0.131): 56 data bytes
64 bytes from 192.168.0.131: seq=0 ttl=63 time=0.268 ms
64 bytes from 192.168.0.131: seq=1 ttl=63 time=0.245 ms
64 bytes from 192.168.0.131: seq=2 ttl=63 time=0.235 ms

This time it works!

Next, on each of the three nodes, we start our internal http demo service, httpbackend, as a container:

# docker run --restart=always -d  -l "SERVICE_NAME=httpbackend" -p 8081:8081 bigwhite/httpbackendservice:v1.0.0

Check the httpbackend service information in the consul cluster:

# curl  localhost:8500/v1/catalog/service/httpbackend|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1374  100  1374    0     0   519k      0 --:--:-- --:--:-- --:--:--  670k
[
  {
    "ID": "d23b9495-4caa-9ef2-a1d5-7f20aa39fd15",
    "Node": "consul-1",
    "Address": "192.168.0.129",
   ...
  },
  {
    "ID": "a215865f-dba7-5caa-cfb3-6850316199a3",
    "Node": "consul-2",
    "Address": "192.168.0.130",
   ...
  },
  {
    "ID": "0020b7aa-486a-5b44-b5fd-be000a380a89",
    "Node": "consul-3",
    "Address": "192.168.0.131",
   ...
  }
]

Then access the service:

# curl httpbackend.service.dc1.consul:8081
this is httpbackendservice, version: v1.0.0

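Internal service discovery works!

4. Exposing an external http service

Having covered internal services, let's turn to services that are exposed externally. This is where consul-template comes in! In our experiment consul-template reads the service information in consul and combines it with a template to generate the nginx configuration file. We base our template on the /etc/nginx/nginx.conf that ships with the default installation. We first try exposing an http service. The template sample follows: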

//nginx.conf.template

.... ...

http {
        ... ...
        ##
        # Virtual Host Configs
        ##

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;

        #
        # http server config
        #

        {{range services -}}
        {{$name := .Name}}
        {{$service := service .Name}}
        {{- if in .Tags "http" -}}
        upstream {{$name}} {
          zone upstream-{{$name}} 64k;
          {{range $service}}
          server {{.Address}}:{{.Port}} max_fails=3 fail_timeout=60 weight=1;
          {{end}}
        }{{end}}
        {{end}}

        {{- range services -}} {{$name := .Name}}
        {{- if in .Tags "http" -}}
        server {
          listen 80;
          server_name {{$name}}.tonybai.com;

          location / {
            proxy_pass http://{{$name}};
          }
        }
        {{end}}
        {{end}}

}

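consul-template templates use go template syntax. In the http block we define, for each catalog service in consul that is to be exposed externally, one server block (with the domain name your_svc.tonybai.com) and one upstream block.

A brief walk through the template: once the following three points are clear, the rest of the template is too.

- {{- range services -}}: standard {{ range pipeline }} template syntax. Calling the services pipeline is equivalent to curl localhost:8500/v1/catalog/services, i.e. it fetches the list of catalog services. Only the Name and Tags fields are available on each item.
- {{- if in .Tags "http" -}}: a conditional. The catalog service is exposed only if its Tags contain the tag http.
- {{range $service}}: also standard {{ range pipeline }} syntax. Calling the $service pipeline is equivalent to curl localhost:8500/v1/catalog/service/xxxx, i.e. it fetches the details of a given service xxxx, including Address, Port, Tags, Meta and so on.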
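
The logic of the template can be restated as an ordinary program. The Python sketch below is only an analogy for what the template does, not how consul-template is implemented: the catalog argument mirrors the shape of the /v1/catalog/services response (service name mapped to its tags), instances maps a service name to (address, port) pairs, and the function name render_http_blocks is hypothetical.

```python
def render_http_blocks(catalog, instances, tag="http", domain="tonybai.com"):
    """Render upstream and server blocks for every service carrying `tag`."""
    exposed = [name for name, tags in sorted(catalog.items()) if tag in tags]
    lines = []
    # First pass: one upstream block per exposed service, one server line per instance.
    for name in exposed:
        lines.append(f"upstream {name} {{")
        lines.append(f"  zone upstream-{name} 64k;")
        for address, port in instances.get(name, []):
            lines.append(f"  server {address}:{port} max_fails=3 fail_timeout=60 weight=1;")
        lines.append("}")
    # Second pass: one virtual host per exposed service, proxying to its upstream.
    for name in exposed:
        lines += [
            "server {",
            "  listen 80;",
            f"  server_name {name}.{domain};",
            "  location / {",
            f"    proxy_pass http://{name};",
            "  }",
            "}",
        ]
    return "\n".join(lines)


catalog = {"consul": [], "httpbackend": [], "httpfront": ["http"]}
instances = {"httpfront": [("192.168.0.129", 32769), ("192.168.0.130", 32768)]}
print(render_http_blocks(catalog, instances))
```

Services without the http tag, such as consul and httpbackend here, never reach the output, which is why internal services stay invisible to nginx.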

Next, start consul-template on consul-1 and consul-2:

consul-1:
# nohup consul-template -template "/root/consul-install/templates/nginx.conf.template:/etc/nginx/nginx.conf:nginx -s reload" > consul-template.log 2>&1 &

consul-2:
# nohup consul-template -template "/root/consul-install/templates/nginx.conf.template:/etc/nginx/nginx.conf:nginx -s reload" > consul-template.log 2>&1 &

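Look at /etc/nginx/nginx.conf and you will find nothing generated under "http server config", because no service in the consul cluster satisfies the tag condition yet (carrying the tag "http"). Now let's create the httpfront service on all three nodes.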

# docker run --restart=always -d -l "SERVICE_NAME=httpfront" -l "SERVICE_TAGS=http" -P bigwhite/httpfrontservice:v1.0.0

Check the generated nginx.conf:

upstream httpfront {
      zone upstream-httpfront 64k;

          server 192.168.0.129:32769 max_fails=3 fail_timeout=60 weight=1;

          server 192.168.0.130:32768 max_fails=3 fail_timeout=60 weight=1;

          server 192.168.0.131:32768 max_fails=3 fail_timeout=60 weight=1;

    }

    server {
      listen 80;
          server_name httpfront.tonybai.com;

      location / {
        proxy_pass http://httpfront;
      }
    }

Test httpfront.tonybai.com (you can point the name at a node in /etc/hosts). The httpfront service calls the internal service httpbackend (through httpbackend.service.dc1.consul:8081):

# curl httpfront.tonybai.com
this is httpfrontservice, version: v1.0.0, calling backendservice ok, its resp: [this is httpbackendservice, version: v1.0.0
]

Check the httpfront logs on each node (with docker logs) and you will see that requests to httpfront.tonybai.com are balanced across the httpfront instances on the nodes:

{GET / HTTP/1.0 1 0 map[Connection:[close] User-Agent:[curl/7.47.0] Accept:[*/*]] {} <nil> 0 [] true httpfront map[] map[] <nil> map[] 192.168.0.129:35184 / <nil> <nil> <nil> 0xc0000524c0}
calling backendservice...
{200 OK 200 HTTP/1.1 1 1 map[Date:[Mon, 10 Sep 2018 08:23:33 GMT] Content-Length:[44] Content-Type:[text/plain; charset=utf-8]] 0xc0000808c0 44 [] false false map[] 0xc000132600 <nil>}
this is httpbackendservice, version: v1.0.0

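5. Exposing an external tcp service

Our microservices are not all http services; some expose a raw tcp socket. nginx supports tcp through the stream block. Inside the stream block we generate a server block and an upstream block for each tcp service to be exposed. That part of the template looks like this: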

stream {
   {{- range services -}}
   {{$name := .Name}}
   {{$service := service .Name}}
     {{- if in .Tags "tcp" -}}
  upstream {{$name}} {
    least_conn;
    {{- range $service}}
    server {{.Address}}:{{.Port}} max_fails=3 fail_timeout=30s weight=5;
    {{ end }}
  }
     {{end}}
  {{end}}

   {{- range services -}}
   {{$name := .Name}}
   {{$nameAndPort := $name | split "-"}}
    {{- if in .Tags "tcp" -}}
  server {
      listen {{ index $nameAndPort 1 }};
      proxy_pass {{$name}};
  }
    {{end}}
   {{end}}
}

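Compared with the http template, the tag filter here is "tcp". And because a port can be used by only one listener, we use a "name-port" string as the service name and as the upstream block identifier. An example makes this clearer. Since the nginx template has changed, restart each consul-template before running the demo.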
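
The "name-port" convention can be illustrated with a few lines of Python. The template splits the name on "-" and takes the element at index 1, so it relies on the name part containing no further hyphen. The sketch below (the helper name split_name_port is hypothetical) takes the last segment instead and rejects names without a numeric port, which is one way to make the convention more forgiving:

```python
def split_name_port(service_name):
    """Split a 'name-port' service name into (name, port)."""
    name, sep, port = service_name.rpartition("-")
    if not sep or not name or not port.isdigit():
        raise ValueError(f"{service_name!r} does not follow the name-port convention")
    return name, int(port)


print(split_name_port("tcpfront-9999"))
```

With this reading, a name such as my-tcp-9999 still yields port 9999, whereas splitting on every hyphen and taking index 1 would yield "tcp".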

Then start the tcpfront service on each node (note that the service name is tcpfront-9999, where 9999 is the port that tcpfrontservice exposes externally):

# docker run -d --restart=always -l "SERVICE_TAGS=tcp" -l "SERVICE_NAME=tcpfront-9999" -P bigwhite/tcpfrontservice:v1.0.0

After startup, check the generated nginx.conf:

stream {

   upstream tcpfront-9999 {
    least_conn;
    server 192.168.0.129:32770 max_fails=3 fail_timeout=30s weight=5;

    server 192.168.0.130:32769 max_fails=3 fail_timeout=30s weight=5;

    server 192.168.0.131:32769 max_fails=3 fail_timeout=30s weight=5;

  }

   server {
      listen 9999;
      proxy_pass tcpfront-9999;
  }

}

nginx's external port 9999 now maps to the tcpfront service inside the cluster! tcpfront is an echo service, so let's test it:

# telnet localhost 9999
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
[v1.0.0]2018-09-10 08:56:15.791728641 +0000 UTC m=+531.620462772 [hello
]
tonybai
[v1.0.0]2018-09-10 08:56:17.658482957 +0000 UTC m=+533.487217127 [tonybai
]

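Building on tcp exposure, we can also expose fully passed-through https services. "Fully passed-through" means the SSL certificate is configured on the service itself rather than on nginx. The approach is much the same as exposing a tcp service, so no example is given here.

V. Summary

The above builds a basic service discovery and load balancing framework for microservices on consul + consul-template + registrator + nginx, but applying it in production needs some further thought.

As for service governance features, consul 1.2.x has added service mesh support; once it matures, upgrading the consul cluster is worth considering.

consul-template v0.19.5 does not yet support ServiceMeta, but the master branch already does. A newer consul-template would allow richer templates, for example to implement canary releases.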


An introduction to using weed-fs

August 22, 2015

weed-fs, full name Seaweed-fs, is a simple, highly available distributed file system written in golang. It has two goals:

- 存储billions of files
- serve the files fast

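weed-fs started out as an implementation of Facebook's Haystack paper. Haystack was designed to optimize how Facebook stores and serves photos internally. On that basis the weed-fs author added a number of features, resulting in today's weed-fs.

This article does not dig into the weed-fs source code. It only introduces how to use weed-fs from a black-box point of view and explores its features, strengths and weaknesses.

I. The weed-fs cluster

The topology of a weed-fs cluster consists of DataCenters, Racks and Machines (also called Nodes). Early versions of weed-fs could apparently describe the whole cluster topology in an XML configuration file; the official sample is as follows:

In the current version, however, the help text marks this configuration file as "Deprecating!":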

$weed master -help
…
-conf="/etc/weedfs/weedfs.conf": Deprecating! xml configuration file
…

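In weed-fs 0.70 the masters maintain the cluster topology, assembling it in real time from the master-to-master and volume-to-master connections.

weed-fs itself runs in one of two modes: Master or Volume. The masters maintain the cluster and guarantee strong consistency among themselves through the raft protocol. A Volume is the running instance that actually manages and stores data. Data reliability is provided by the replication mechanism of weed-fs.

weed-fs offers several replication strategies (a rack is a logical concept):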

000 no replication, just one copy
001 replicate once on the same rack
010 replicate once on a different rack in the same data center
100 replicate once on a different data center
200 replicate twice on two other different data center
110 replicate once on a different rack, and once on a different data center

Choosing a more reliable strategy costs some performance; it is always a trade-off.
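
The three digits of a replication code can be read position by position, as the list above suggests: extra copies in other data centers, extra copies on other racks in the same data center, and extra copies elsewhere on the same rack. A minimal Python sketch of that reading (decode_replication is a hypothetical helper, not a weed-fs API):

```python
def decode_replication(code):
    """Decode a three-digit replication code such as '100' or '110'."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"replication code must be three digits, got {code!r}")
    other_dcs, other_racks, same_rack = (int(c) for c in code)
    return {
        "other_data_centers": other_dcs,   # first digit
        "other_racks": other_racks,        # second digit
        "same_rack": same_rack,            # third digit
        # The original copy plus every replica.
        "total_copies": 1 + other_dcs + other_racks + same_rack,
    }


for code in ("000", "001", "010", "100", "200", "110"):
    print(code, decode_replication(code))
```

Every extra copy is another write that has to succeed before a store is acknowledged, which is where the performance cost comes from.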

More details, along with scaling and data migration, are covered one by one below.

II. Starting the weed-fs cluster

For ease of experimenting, we define the following weed-fs cluster topology:

Three masters:
    master1 - localhost:9333
    master2 - localhost:9334
    master3 - localhost:9335

Replication strategy: 100 (one replica in a different data center)

Three volumes:
    volume1 - localhost:8081 dc1
    volume2 - localhost:8082 dc1
    volume3 - localhost:8083 dc2

Start the masters first, in the order master1, master2, master3:

master1:

$ weed -v=3 master -port=9333 -mdir=./m1 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:17 07606 file_util.go:20] Folder ./m1 Permission: -rwxrwxr-x
I0820 14:37:17 07606 topology.go:86] Using default configurations.
I0820 14:37:17 07606 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:17 07606 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9333
I0820 14:37:17 07606 raft_server.go:50] Starting RaftServer with IP:localhost:9333:
I0820 14:37:17 07606 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9334/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9334/cluster/join: dial tcp 127.0.0.1:9334: connection refused
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9335/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9335/cluster/join: dial tcp 127.0.0.1:9335: connection refused
I0820 14:37:17 07606 raft_server.go:78] No existing server found. Starting as leader in the new cluster.
I0820 14:37:17 07606 master_server.go:93] [ localhost:9333 ] I am the leader!

I0820 14:37:52 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[]
I0820 14:37:52 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9334","connectionString":"http://localhost:9334"}
I0820 14:37:52 07606 raft_server_handlers.go:27] join command from Name localhost:9334 Connection http://localhost:9334

I0820 14:38:02 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[localhost:9334:0xc20800f730]
I0820 14:38:02 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9335","connectionString":"http://localhost:9335"}
I0820 14:38:02 07606 raft_server_handlers.go:27] join command from Name localhost:9335 Connection http://localhost:9335

master2:

$ weed -v=3 master -port=9334 -mdir=./m2 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:52 07616 file_util.go:20] Folder ./m2 Permission: -rwxrwxr-x
I0820 14:37:52 07616 topology.go:86] Using default configurations.
I0820 14:37:52 07616 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:52 07616 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9334
I0820 14:37:52 07616 raft_server.go:50] Starting RaftServer with IP:localhost:9334:
I0820 14:37:52 07616 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:52 07616 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:37:52 07616 raft_server.go:179] Post returned status: 200

master3:

$ weed -v=3 master -port=9335 -mdir=./m3 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:38:02 07626 file_util.go:20] Folder ./m3 Permission: -rwxrwxr-x
I0820 14:38:02 07626 topology.go:86] Using default configurations.
I0820 14:38:02 07626 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:38:02 07626 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9335
I0820 14:38:02 07626 raft_server.go:50] Starting RaftServer with IP:localhost:9335:
I0820 14:38:02 07626 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:38:02 07626 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:38:03 07626 raft_server.go:179] Post returned status: 200

After starting, master1 finds that the other two peer masters are not up yet, so it elects itself leader. master2 and master3 then start and join the master cluster led by master1.

Next we start the volume servers:

volume1:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
I0820 14:44:29 07642 file_util.go:20] Folder ./v1 Permission: -rwxrwxr-x
I0820 14:44:29 07642 store.go:225] Store started on dir: ./v1 with 0 volumes max 7
I0820 14:44:29 07642 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8081
I0820 14:44:29 07642 volume_server.go:70] Volume server bootstraps with master localhost:9333
I0820 14:44:29 07642 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335"]}
I0820 14:44:29 07642 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9333 localhost:9333], lastNode:3

The other volume servers start in much the same way, so the logs of volume2 and volume3 are not listed in detail.

volume2:

$weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9334 -dataCenter=dc1

volume3:

$weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2

Once the three volume servers are up, the leader master (9333) logs the following:

I0820 14:44:29 07606 node.go:208] topo adds child dc1
I0820 14:44:29 07606 node.go:208] topo:dc1 adds child DefaultRack
I0820 14:44:29 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8081
I0820 14:47:09 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8082
I0820 14:47:21 07606 node.go:208] topo adds child dc2
I0820 14:47:21 07606 node.go:208] topo:dc2 adds child DefaultRack
I0820 14:47:21 07606 node.go:208] topo:dc2:DefaultRack adds child 127.0.0.1:8083

The whole weed-fs cluster is now running. After its first start, a master creates some directories and files under -mdir:

$ ls m1
conf log snapshot

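A volume server, however, does nothing under -dir yet; it creates the corresponding .idx and .dat files on the first write.

III. Basic operations: storing, fetching and deleting files

Create a file hello.txt containing "hello weed-fs!" for testing the basic operations. weed-fs provides an HTTP REST API, which makes its basic functions easy to use (curl is the client here).

1. Storing

Let's store hello.txt in the weed-fs file system through the submit API provided by the master: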

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,01fc4a422c","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,01fc4a422c","size":39}

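The master returns a line of JSON, in which:

fid is a comma-separated string. According to the documentation in the repository, it is made up of the volume id, a uint64 key and a cookie code. The 6 before the comma is the volume id, and 01fc4a422c is the key and cookie together. fid is the unique ID of hello.txt in the cluster, and it is needed later to look up, fetch or delete the file.

fileUrl is one address (not the only one) at which the file can be accessed in weed-fs, here 127.0.0.1:8082/6,01fc4a422c. We can see that weed-fs stored a copy of hello.txt on volume server 2.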
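
The structure of a fid can be taken apart in a few lines of Python. The sketch assumes, as I understand the weed-fs documentation, that the last 8 hex digits after the comma are the 32-bit cookie and whatever precedes them is the key; parse_fid is a hypothetical helper, not part of any weed-fs client. The assumption is consistent with the volume server log shown later in the delete test, which prints Cookie:3706276466 and Id:4 for the fid 6,04dce94a72.

```python
def parse_fid(fid):
    """Split a fid like '6,01fc4a422c' into (volume_id, key, cookie)."""
    volume, sep, rest = fid.partition(",")
    if not sep or len(rest) <= 8:
        raise ValueError(f"malformed fid: {fid!r}")
    key_hex, cookie_hex = rest[:-8], rest[-8:]   # cookie: last 8 hex digits
    return int(volume), int(key_hex, 16), int(cookie_hex, 16)


volume_id, key, cookie = parse_fid("6,01fc4a422c")
print(volume_id, key, hex(cookie))
```

The volume id tells a master which volume servers to redirect to, the key locates the file inside the volume, and the cookie makes the URL hard to guess from the key alone.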

This store operation triggered the creation of physical volumes. The -dir of each volume server has changed and now holds a number of .idx and .dat files:

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx

v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx

The creation is controlled by the master leader:

I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc1:DefaultRack:127.0.0.1:8081
I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc2:DefaultRack:127.0.0.1:8083

From the file sizes we can tell that hello.txt was stored in the volume with id 6 (6.dat and 6.idx) under v2 and v3:

v2:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

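6.dat is identical in v2 and v3, and so is 6.idx (this is extremely important later, when migrating data).

2. Fetching

As mentioned, the master returned the fid 6,01fc4a422c and the fileUrl 127.0.0.1:8082/6,01fc4a422c.

With this fileUrl we can fetch the contents of hello.txt: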

$ curl http://127.0.0.1:8082/6,01fc4a422c
hello weed-fs!

Under our replication strategy, hello.txt should also be stored under v3. Switching to the volume server on 8083 should return the same data:

$ curl http://127.0.0.1:8083/6,01fc4a422c
hello weed-fs!

If we query volume1 (8081), we should get no data:

$ curl http://127.0.0.1:8081/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.

This looks like a redirect. Let's add curl's follow-redirect option and try again:

$ curl -L http://127.0.0.1:8081/6,01fc4a422c
hello weed-fs!

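We get the data here too. Judging from volume1's log, volume1 can look up the correct address of hello.txt and returns a redirect, so curl fetches the data from the right machine.

What happens if we fetch hello.txt through a master?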

$ curl -L http://127.0.0.1:9335/6,01fc4a422c
hello weed-fs!

The master likewise returns a redirect and curl fetches the data from a volume node. How does the master choose the redirect address?

$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.
$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8083/6,01fc4a422c">Moved Permanently</a>.

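The master balances the load automatically, returning 8082 and 8083 in round-robin fashion. Before 0.70, a non-leader master could not return the right result and only the leader master could; 0.70 fixed this.

3. Deleting

Delete hello.txt directly through the fileUrl: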

$ curl -X DELETE http://127.0.0.1:8082/6,01fc4a422c
{"size":39}

After the delete succeeds, try to get hello.txt again:

$ curl -i http://127.0.0.1:8082/6,01fc4a422c
HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:28 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

$ curl -i -L http://127.0.0.1:9335/6,01fc4a422c
HTTP/1.1 301 Moved Permanently
Content-Length: 69
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Aug 2015 08:13:56 GMT
Location: http://127.0.0.1:8082/6,01fc4a422c

HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:56 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

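Whether we go directly to the volume server or through a master, hello.txt can no longer be fetched. It has been deleted.

After hello.txt is deleted, however, the data files under the volume servers do not shrink. Don't worry, this is how weed-fs works: the holes left behind by deleted data have to be reclaimed manually (by compacting the data files by hand):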

$ curl "http://localhost:9335/vol/vacuum"
{"Topology":{"DataCenters":[{"Free":8,"Id":"dc1","Max":14,"Racks":[{"DataNodes":[{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8081","Url":"127.0.0.1:8081","Volumes":3},{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8082","Url":"127.0.0.1:8082","Volumes":3}],"Free":8,"Id":"DefaultRack","Max":14}]},{"Free":1,"Id":"dc2","Max":7,"Racks":[{"DataNodes":[{"Free":1,"Max":7,"PublicUrl":"127.0.0.1:8083","Url":"127.0.0.1:8083","Volumes":6}],"Free":1,"Id":"DefaultRack","Max":7}]}],"Free":9,"Max":21,"layouts":[{"collection":"","replication":"100","ttl":"","writables":[1,2,3,4,5,6]}]},"Version":"0.70 beta"}

After compaction, check the file sizes under v1, v2 and v3 again: they really have shrunk.
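Why deleting a file leaves a hole that only compaction removes can be illustrated with a toy model. This is a minimal sketch, assuming an append-only data file plus an index; the class AppendOnlyVolume and its methods are hypothetical names for illustration and do not mirror the real weed-fs on-disk format.

```python
class AppendOnlyVolume:
    """Toy model of an append-only data file with an in-memory index."""

    def __init__(self):
        self.data = bytearray()   # stands in for the .dat file
        self.index = {}           # key -> (offset, size), stands in for .idx

    def put(self, key, payload: bytes):
        # New records are always appended at the end of the data file.
        self.index[key] = (len(self.data), len(payload))
        self.data += payload

    def get(self, key):
        offset, size = self.index[key]
        return bytes(self.data[offset:offset + size])

    def delete(self, key):
        # Only the index entry goes away; the bytes stay in place as a hole.
        del self.index[key]

    def vacuum(self):
        # Rewrite the data file, keeping only records still in the index.
        compacted = bytearray()
        new_index = {}
        for key, (offset, size) in self.index.items():
            new_index[key] = (len(compacted), size)
            compacted += self.data[offset:offset + size]
        self.data, self.index = compacted, new_index
```

In this model, delete is cheap because nothing is rewritten, and the cost of reclaiming space is deferred to an explicit vacuum step, which matches the behaviour observed above.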

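IV. Consistency

In distributed systems, consistency is a perennial hard problem. weed-fs supports replication, so the consistency of its multiple replicas has to be guaranteed.

In theory, weed-fs adopts a "strong consistency" strategy, namely:

When storing a file, success is returned only after all replicas have been stored successfully; if any replica fails to store, the store operation returns failure.
When deleting a file, success is returned only after all replicas have been deleted successfully; if any replica fails to delete, the delete operation returns failure.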
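The all-or-nothing rule can be sketched as follows. This is a minimal illustration, not weed-fs's actual implementation: replicated_write and DownStore are hypothetical names, replicas are modelled as plain dicts, and an unreachable volume server is simulated by raising OSError.

```python
def replicated_write(replicas, key, value):
    """Store value on every replica; undo everything if any replica fails.

    `replicas` is a list of dict-like stores; a store may raise OSError
    to simulate an unreachable volume server.
    """
    written = []
    try:
        for store in replicas:
            store[key] = value
            written.append(store)
    except OSError:
        for store in written:
            store.pop(key, None)  # roll back the partial writes
        return False
    return True


class DownStore(dict):
    """A replica that is unreachable: every write fails."""

    def __setitem__(self, key, value):
        raise OSError("volume server down")
```

The rollback branch is the important part: reporting failure is not enough on its own, because a replica that was already modified must be brought back in line with the others.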

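Let's verify whether weed-fs actually does both:

1. Storage consistency guarantee

First we stop volume3 (that is, dc2). With the replication strategy set to 100, storing hello.txt into weed-fs now gives the following result: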

$ curl -F file=@hello.txt http://localhost:9333/submit
{"error":"Cannot grow volume group! Not enough data node found!"}

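Under the 100 strategy, the master needs to pick a volume in dc2 to hold a replica of hello.txt, but every machine in dc2 is down, so there is no storage space available. The master therefore decides the operation cannot proceed and returns failure. This satisfies the storage consistency requirement.

2. Deletion consistency guarantee

Bring dc2 back and store hello.txt: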

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,04dce94a72","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,04dce94a72","size":39}

Stop dc2 again, then try to delete hello.txt (through the master):

$ curl -L -X DELETE http://127.0.0.1:9333/6,04dce94a72
{"error":"Deletion Failed."}

Although the response says the deletion failed, the log on 8082 suggests that 8082 has already deleted hello.txt:

I0820 17:32:20 07653 volume_server_handlers_write.go:53] deleting Cookie:3706276466, Id:4, Size:0, DataSize:0, Name: , Mime:
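The Cookie and Id in this log line can be cross-checked against the fid 6,04dce94a72. The sketch below assumes the fid layout as I understand it from the weed-fs documentation: a volume id, a comma, then a hex string whose last 8 digits are a 32-bit cookie and whose leading digits are the file key. parse_fid is a hypothetical helper, not part of any client library.

```python
def parse_fid(fid: str):
    """Split a fid such as '6,04dce94a72' into (volume_id, file_key, cookie)."""
    volume_part, hex_part = fid.split(",")
    # Assumption: the last 8 hex digits are the 32-bit cookie,
    # and everything before them is the file key.
    key_hex, cookie_hex = hex_part[:-8], hex_part[-8:]
    return int(volume_part), int(key_hex, 16), int(cookie_hex, 16)


if __name__ == "__main__":
    print(parse_fid("6,04dce94a72"))  # (6, 4, 3706276466)
```

The decoded file key 4 and cookie 3706276466 match the Id and Cookie fields in the log, which confirms that the entry refers to the file we just tried to delete.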

Let's fetch hello.txt from 8082 again:

$ curl http://127.0.0.1:8082/6,04dce94a72

Nothing is returned.

The 8082 log shows:

I0820 17:33:24 07653 volume_server_handlers_read.go:53] read error: File Entry Not Found. Needle 70 Memory 0 /6,04dce94a72

hello.txt really has been deleted!

Now restart dc2 (8083) and try to fetch hello.txt from 8083:

$ curl http://127.0.0.1:8083/6,04dce94a72
hello weed-fs!

hello.txt still exists on 8083 and can be read.

Now try fetching hello.txt through the master:

$ curl -L http://127.0.0.1:9333/6,04dce94a72
$ curl -L http://127.0.0.1:9333/6,04dce94a72
hello weed-fs!

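Sometimes the content of hello.txt comes back and sometimes it doesn't. This is clearly related to the master's automatic load balancing: when it redirects to 8082, curl gets nothing; when it redirects to 8083, we get the content of hello.txt.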
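A client can hide this intermittent behaviour by trying every replica location instead of trusting a single redirect. The following is a minimal sketch under stated assumptions: read_with_fallback is a hypothetical helper, fetch is an injected callable rather than a real HTTP client, and the replica_state dict simulates the situation after the failed delete.

```python
def read_with_fallback(locations, fetch):
    """Try each replica location in turn and return the first hit.

    `fetch(url)` is an injected callable returning bytes, or None when the
    replica has no such file (for example, an HTTP 404).
    """
    for url in locations:
        body = fetch(url)
        if body is not None:
            return body
    return None


# Simulated state after the failed delete: 8082 lost the file, 8083 kept it.
replica_state = {
    "127.0.0.1:8082/6,04dce94a72": None,
    "127.0.0.1:8083/6,04dce94a72": b"hello weed-fs!\n",
}
```

Note that this only masks the symptom on the read path. The replicas still disagree, and the space held by the surviving copy is not reclaimed.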

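So for now, the delete operation in weed-fs cannot guarantee strong consistency. Several issues on the weed-fs GitHub repository (#172, #179, #182) are about this problem. With large data volumes (TB or PB scale), the biggest consequence of this inconsistency is a storage leak: space stays occupied and cannot be reclaimed, and the volumes gradually fill up one by one. Let's hope a fix arrives later.

V. Directory support

weed-fs also supports managing files under directories like a traditional file system, and storing, fetching and deleting files by path. Directory support is provided by a separate server: the filer server. In other words, if you want directory support you must start one (or several) filer servers, and all operations must go through the filer server.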

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100
I0820 22:09:40 08238 file_util.go:20] Folder ./f1 Permission: -rwxrwxr-x
I0820 22:09:40 08238 filer.go:88] Start Seaweed Filer 0.70 beta at port 8888

1. Store

$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

2. Get

$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!

3. List the files in a directory

$ curl "http://localhost:8888/foo/?pretty=y"
{
"Directory": "/foo/",
"Files": [
    {
      "name": "hello.txt",
      "fid": "6,067281a126"
    }
],
"Subdirectories": null
}

4. Delete

$ curl -X DELETE http://localhost:8888/foo/hello.txt
{"error":""}

Try to fetch hello.txt again:

$ curl http://localhost:8888/foo/hello.txt
Nothing is returned: hello.txt has been deleted.

5. Multiple filer servers

A single weed filer server is a single point of failure, so let's start another one.

$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100
I0821 13:47:52 08973 file_util.go:20] Folder ./f2 Permission: -rwxrwxr-x
I0821 13:47:52 08973 filer.go:88] Start Seaweed Filer 0.70 beta at port 8889

Do the two filer nodes coordinate with each other? Let's test: store a file through 8888, then fetch it from 8889:

$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}
$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!
$ curl http://localhost:8889/foo/hello.txt
(empty)

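Judging from the test, the two work independently with no connection between them; that is, they do not share the index that maps a file's full path to its fid. By default, filer servers run in standalone mode.

weed-fs officially offers a clustering scheme for the filer: use redis or Cassandra as the backend so that multiple filer nodes share the full-path-to-fid index.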
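The difference between the two modes comes down to where the path-to-fid index lives. Here is a minimal sketch, assuming a toy Filer class (a hypothetical name) and a plain dict standing in for the redis or Cassandra backend; it models the idea only, not the real filer.

```python
class Filer:
    """Toy filer: maps a full path to a fid using a pluggable index store."""

    def __init__(self, index=None):
        # Standalone mode: each filer owns a private index.
        # Distributed mode: several filers are handed the same store.
        self.index = {} if index is None else index

    def store(self, path, fid):
        self.index[path] = fid

    def lookup(self, path):
        return self.index.get(path)


# Standalone: 8889 cannot see what 8888 stored.
f8888, f8889 = Filer(), Filer()
f8888.store("/foo/hello.txt", "6,067281a126")
standalone_result = f8889.lookup("/foo/hello.txt")

# Shared backend (a plain dict standing in for redis or Cassandra).
shared = {}
g8888, g8889 = Filer(shared), Filer(shared)
g8888.store("/foo/hello.txt", "6,067281a126")
shared_result = g8889.lookup("/foo/hello.txt")
```

With private indexes the lookup on the second filer finds nothing, while with a shared store both filers resolve the same path to the same fid.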

We start a redis-server (2.8.21) listening on the default port 6379, then restart the two filer server nodes with the following commands:

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379
$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379

Repeat the test steps above:
$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

$ curl http://localhost:8889/foo/hello.txt
hello weed-fs!

The file stored through 8888 can now be fetched from 8889.

Now delete the file:
$ curl -X DELETE http://localhost:8889/foo/hello.txt
{"error":"Invalid fileId "}

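An error is reported, but the file has in fact been deleted! This may be a small bug (#183).

Although the filer is now clustered, the redis behind it is still a single point of failure; for high reliability, redis obviously needs to be clustered as well.

VI. Collection

A collection, as the name suggests, is a "set"; in weed-fs it refers to a set of physical volumes. Earlier we stored files without specifying a collection, so weed-fs used the default (empty) collection. What happens if we specify one?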

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=picture"
{"fid":"7,0c4f5dc90f","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/7,0c4f5dc90f","size":39}

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx picture_7.dat picture_7.idx
v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx
v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx picture_7.dat picture_7.idx

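As you can see, each volume server created, under its own -dir, idx and dat files prefixed with the collection name. In the example above hello.txt was assigned to volume servers 8081 and 8083, so those two each created picture_7.dat and picture_7.idx. The idx and dat files prefixed with picture only hold data for files stored with collection=picture; other data lives either in the default collection or in collections with other names.

A collection is much like naming a drive volume on Windows: C: might be called "System", D: "Programs" and E: "Data". Here, picture_7.dat and picture_7.idx on each volume server are named the picture volume. If there were also a video collection, it might consist of video_8.dat and video_8.idx on each volume server.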
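The naming convention is regular enough to be parsed mechanically. The sketch below assumes the pattern seen in the listing above, an optional collection prefix followed by an underscore, the volume id and the extension; parse_volume_file and volumes_by_collection are hypothetical helpers.

```python
from collections import defaultdict


def parse_volume_file(name: str):
    """Split 'picture_7.dat' into (collection, volume_id, extension)."""
    stem, ext = name.rsplit(".", 1)
    # Files in the default collection have no prefix, e.g. '3.idx'.
    collection, _, vid = stem.rpartition("_")
    return collection, int(vid), ext


def volumes_by_collection(filenames):
    """Group volume ids by collection name ('' is the default collection)."""
    groups = defaultdict(set)
    for name in filenames:
        collection, vid, _ = parse_volume_file(name)
        groups[collection].add(vid)
    return {c: sorted(v) for c, v in groups.items()}


v1_files = "3.dat 3.idx 4.dat 4.idx 5.dat 5.idx picture_7.dat picture_7.idx".split()
print(volumes_by_collection(v1_files))
```

Run against the v1 listing, this groups volumes 3, 4 and 5 under the default collection and volume 7 under picture.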

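However, weed volume defaults to -max="7", so in this experimental environment each volume server can create at most 7 physical volumes (seven pairs of .idx and .dat files) under -dir. What happens if I now also want to create a video volume?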

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"error":"Cannot grow volume group! Not enough data node found!"}

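The request fails, reporting that the volume group cannot grow any further. At this point you need to restart each volume server with a larger -max value, say 100.

For example: $ weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2 -max=100

After restarting, let's create the video collection again: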

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"fid":"11,0ee98ca54d","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/11,0ee98ca54d","size":39}

$ ls v1 v2 v3
v1:
3.dat 4.dat 5.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_9.dat
3.idx 4.idx 5.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_9.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx video_8.dat video_8.idx

v3:
1.dat 2.dat 3.dat 4.dat 5.dat 6.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_8.dat video_9.dat
1.idx 2.idx 3.idx 4.idx 5.idx 6.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_8.idx video_9.idx

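As you can see, the volume servers in each datacenter allocated 6 volumes at once as storage volumes for the video collection.

VII. Scaling

For a distributed system, scaling is something you cannot avoid thinking about, and it is also an extremely common operation.

1. Scaling up

weed-fs supports scaling up very well. Let's go through it role by role.

【master】
The masters use the raft protocol among themselves, so adding a master is one of the most basic cluster operations:

$ weed -v=3 master -port=9336 -mdir=./m4 -peers=localhost:9333,localhost:9334,localhost:9335,localhost:9336 -defaultReplication=100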
I0821 15:45:47 12398 file_util.go:20] Folder ./m4 Permission: -rwxrwxr-x
I0821 15:45:47 12398 topology.go:86] Using default configurations.
I0821 15:45:47 12398 master_server.go:59] Volume Size Limit is 30000 MB
I0821 15:45:47 12398 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9336
I0821 15:45:47 12398 raft_server.go:50] Starting RaftServer with IP:localhost:9336:
I0821 15:45:47 12398 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335,localhost:9336
I0821 15:45:48 12398 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0821 15:45:49 12398 raft_server.go:179] Post returned status: 200

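Once the new master node starts, it automatically joins, via the raft protocol, the master cluster whose leader is 9333.

【volume】

Adding a volume server is just as easy. Volume servers are managed by the master and have no relationship with one another, so all you need to do is start a new one: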

$ weed -v=3 volume -port=8084 -dir=./v4 -mserver=localhost:9335 -dataCenter=dc2
I0821 15:48:21 12412 file_util.go:20] Folder ./v4 Permission: -rwxrwxr-x
I0821 15:48:21 12412 store.go:225] Store started on dir: ./v4 with 0 volumes max 7
I0821 15:48:21 12412 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8084
I0821 15:48:21 12412 volume_server.go:70] Volume server bootstraps with master localhost:9335
I0821 15:48:22 12412 list_masters.go:18] list masters result :
I0821 15:48:22 12412 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335","localhost:9336"]}
I0821 15:48:22 12412 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9336 localhost:9333 localhost:9333], lastNode:4
I0821 15:48:22 12412 volume_server.go:82] Volume Server Connected with master at localhost:9333

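Once the new volume server node starts, it likewise joins the cluster automatically, and the master will subsequently place data on it.

【filer】

As discussed earlier, filers can be added or removed at will in either standalone or distributed mode, so I won't repeat that here.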

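2. Scaling down

Scaling down masters is extremely simple: just shut down the node in question. If that master is the leader, the other masters detect that the leader has shut down and automatically elect a new one. During the leader election, however, the whole cluster briefly stops serving until a leader is chosen.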
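How many masters can be shut down before an election becomes impossible follows from the usual Raft majority rule. The sketch below applies that general rule; it is an assumption that it carries over unchanged to the Raft library used by weed-fs 0.70, and quorum and tolerated_failures are hypothetical helper names.

```python
def quorum(n_masters: int) -> int:
    """Votes needed to elect a raft leader among n_masters nodes."""
    return n_masters // 2 + 1


def tolerated_failures(n_masters: int) -> int:
    """How many masters can be down while a leader can still be elected."""
    return n_masters - quorum(n_masters)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Under this rule a 4-master cluster tolerates no more failures than a 3-master one, which is why odd cluster sizes are the common choice.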

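For a filer in standalone mode, talking about scaling is meaningless; in distributed mode, a filer node is scaled down the same way as a master node: just shut it down.

The only troublesome part is the volume nodes. Because the data is stored on them, we cannot simply stop a volume server; we have to consider whether and how data can be migrated under each replication strategy. That is what the next section describes in detail.

VIII. Data migration

Now let's look at volume data migration in weed-fs.

1. Data migration under the 000 replication strategy

To make testing easier, I simplify the environment (one master + 3 volume servers):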

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=000

volume:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9333 -dataCenter=dc1

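As before, right after startup the v1, v2 and v3 directories are empty; volumes are only created when the first piece of data is stored. The 000 strategy means no replicas: a file you store exists as a single copy in weed-fs.

Let's upload a file: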

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"1,01655ab58e","fileName":"hello1.txt","fileUrl":"127.0.0.1:8081/1,01655ab58e","size":40}

$ ll v1 v2 v3

v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:31 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:31 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 7.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 7.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 2.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:31 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:31 5.idx

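We can see that hello1.txt is stored under v1, and that the different physical volumes are spread across different nodes (since no replication is needed).

In this case (000), to migrate v1's data to v2 or v3, just stop v1, mv the files under v1 into v2 or v3, and restart volume server 2 or volume server 3.

2. Data migration under the 001 replication strategy

The 001 replication strategy is the default replication strategy of weed-fs: weed-fs keeps one replica of each file within the same rack. We reuse the environment above, but we need to stop weed-fs, empty the directories and restart; don't forget -defaultReplication=001.

Let's store three files in a row: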

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"2,01ea84980d","fileName":"hello1.txt","fileUrl":"127.0.0.1:8082/2,01ea84980d","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"1,027883baa8","fileName":"hello2.txt","fileUrl":"127.0.0.1:8083/1,027883baa8","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"6,03220f577e","fileName":"hello3.txt","fileUrl":"127.0.0.1:8081/6,03220f577e","size":40}

The three files were stored in volumes 2, 1 and 6 respectively. Let's look at the files under v1, v2 and v3:

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx

v2:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:56 5.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 5.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx

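Suppose we now want to shut down v3 and migrate its data to the other volume servers. We have three options:

1) Don't migrate
2) mv all files under v3 into v2 or v1
3) Copy all files under v3 over both v1 and v2, one after the other

Let's analyze the consequences of each:

1) Don't migrate

Under the 001 strategy every piece of data has two copies, and whatever v3 holds is always also present somewhere in v1+v2. So even without migration, v1+v2 together still hold one copy of all the data. You can test this after shutting down volume3: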

$ curl -L "http://localhost:9333/2,01ea84980d"
hello weed-fs1!
$ curl -L "http://localhost:9333/1,027883baa8"
hello weed-fs2!
$ curl -L "http://localhost:9333/6,03220f577e"
hello weed-fs3!

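For each file, you can get it several times and always receive the correct result. The drawback is just as obvious: existing data no longer has a second copy.

2) mv all files under v3 into v2 or v1

Still under the 001 strategy, what happens when v3's data is mv'ed into v2 or v1? Take v3 to v1 as the example:

- For volume ids that both v1 and v3 have, such as 1, the two sides' 1.idx and 1.dat are identical, as the 001 strategy dictates. But after the move, the system goes from 2 copies of that data to 1.
- For those that v1 has and v3 doesn't, nothing needs to be said.
- For those that v1 lacks and v3 has, the moved files simply become v1's data.

So this approach is still not ideal.

3) Copy all files under v3 over both v1 and v2

Given the above, only this approach ensures that after migration no data is lost and each piece of data still has the 2 copies that the 001 strategy calls for. This is the correct method.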
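The three options can be compared by counting how many servers hold each volume id afterwards. This is a small worked example using the volume ids from the v1, v2 and v3 listing for the 001 test; the helper names copies and drop are hypothetical, and the model only tracks which volumes exist where, not file contents.

```python
from collections import Counter

# Volume ids held by each volume server, taken from the 001 listing.
layout = {
    "v1": {1, 3, 4, 6},
    "v2": {2, 5},
    "v3": {1, 2, 3, 4, 5, 6},
}


def copies(layout):
    """Count how many servers hold each volume id."""
    return Counter(vid for vids in layout.values() for vid in vids)


def drop(layout, server):
    """Return a copy of the layout with one server removed."""
    return {name: set(vids) for name, vids in layout.items() if name != server}


# 1) Shut down v3 without migrating anything.
no_migration = drop(layout, "v3")

# 2) mv v3's files into v1 only.
mv_to_v1 = drop(layout, "v3")
mv_to_v1["v1"] |= layout["v3"]

# 3) Copy v3's files into both v1 and v2.
copy_to_both = drop(layout, "v3")
copy_to_both["v1"] |= layout["v3"]
copy_to_both["v2"] |= layout["v3"]

for name, result in [("no migration", no_migration),
                     ("mv to v1", mv_to_v1),
                     ("copy to both", copy_to_both)]:
    print(name, dict(sorted(copies(result).items())))
```

Only the third option leaves every volume id with two copies; the first leaves one copy of everything, and the second restores the second copy only for the volumes that v2 already held.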

Let's test it:

   - Stop volume3;
   - Stop volume1, copy the files under v3 into v1, and start volume1;
   - Stop volume2, copy the files under v3 into v2, and start volume2.

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8081/6,03220f577e">Moved Permanently</a>.

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8082/6,03220f577e">Moved Permanently</a>.

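As you can see, the master returns the redirect addresses 8081 and 8082, which shows that the data migrated from 8083 to 8082 has taken effect as well.

3. Data migration under the 100 replication strategy

The test environment changes slightly: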

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=100

volume:

As before, we upload three files:

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"4,01d937dd30","fileName":"hello1.txt","fileUrl":"127.0.0.1:8083/4,01d937dd30","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"2,025efbef14","fileName":"hello2.txt","fileUrl":"127.0.0.1:8082/2,025efbef14","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"2,03be936488","fileName":"hello3.txt","fileUrl":"127.0.0.1:8082/2,03be936488","size":40}

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

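Because the 100 strategy keeps one copy in each of two different DataCenters, data should not be migrated across data centers, and migration within the same data center reduces to the "000" case.

The other strategies can be analyzed the same way, so I won't go on at length here.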
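To analyze another strategy, it helps to decode the three digits first. The sketch below assumes the digit meanings as I understand them from the weed-fs documentation (extra copies in other data centers, in other racks of the same data center, and on other servers in the same rack); parse_replication is a hypothetical helper.

```python
def parse_replication(code: str):
    """Interpret a three-digit replication string such as '100' or '001'."""
    other_dcs, other_racks, same_rack = (int(ch) for ch in code)
    return {
        "other_datacenters": other_dcs,  # extra copies in other data centers
        "other_racks": other_racks,      # extra copies in other racks, same DC
        "same_rack": same_rack,          # extra copies on other servers, same rack
        "total_copies": 1 + other_dcs + other_racks + same_rack,
    }


for code in ("000", "001", "100"):
    print(code, parse_replication(code))
```

For the three strategies used in this article this gives 1, 2 and 2 total copies respectively, consistent with the file listings in each test.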

IX. Benchmark

On an HP ProLiant DL380 G4 with a 4-core Intel(R) Xeon(TM) CPU 3.60GHz and 6G of memory (non-SSD disks), run the benchmark test:

$ weed benchmark -server=localhost:9333

This is SeaweedFS version 0.70 beta linux amd64

------------ Writing Benchmark ----------
Concurrency Level:      16
Time taken for tests:   831.583 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106794545 bytes
Requests per second:    1260.94 [#/sec]
Transfer rate:          1299.75 [Kbytes/sec]

Connection Times (ms)
min avg max std
Total: 2.2 12.5 1118.4 9.3

Percentage of the requests served within a certain time (ms)
   50%     11.4 ms
   66%     13.3 ms
   75%     14.8 ms
   80%     15.9 ms
   90%     19.2 ms
   95%     22.6 ms
   98%     27.4 ms
   99%     31.2 ms
100%    1118.4 ms

------------ Randomly Reading Benchmark ----------
Concurrency Level:      16
Time taken for tests:   151.480 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106791113 bytes
Requests per second:    6922.22 [#/sec]
Transfer rate:          7135.28 [Kbytes/sec]

Connection Times (ms)
min avg max std
Total: 0.1 2.2 116.7 3.9

Percentage of the requests served within a certain time (ms)
   50%      1.6 ms
   66%      2.1 ms
   75%      2.5 ms
   80%      2.8 ms
   90%      3.7 ms
   95%      4.8 ms
   98%      7.4 ms
   99%     11.1 ms
100%    116.7 ms

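This seems somewhat worse than the results the weed-fs author got on a Mac laptop (SSD). Of course, this time we used the 100 strategy, and the server was running other programs as well. Even so, it feels like weed-fs still has considerable room for optimization.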
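The headline figures in the benchmark output can be re-derived from the raw totals, which also gives the average request size and the read-to-write ratio. This is a small arithmetic check using only the numbers printed in the output; summarize is a hypothetical helper.

```python
# Figures copied from the benchmark output.
write = {"requests": 1048576, "seconds": 831.583, "bytes": 1106794545}
read = {"requests": 1048576, "seconds": 151.480, "bytes": 1106791113}


def summarize(run):
    """Return (requests/sec, KB/sec, average bytes per request)."""
    rps = run["requests"] / run["seconds"]
    kbytes_per_sec = run["bytes"] / run["seconds"] / 1024
    avg_object_bytes = run["bytes"] / run["requests"]
    return rps, kbytes_per_sec, avg_object_bytes


w_rps, w_kbps, w_avg = summarize(write)
r_rps, r_kbps, r_avg = summarize(read)
print(f"write: {w_rps:.2f} req/s, {w_kbps:.2f} KB/s, {w_avg:.1f} bytes/request")
print(f"read:  {r_rps:.2f} req/s, {r_kbps:.2f} KB/s, {r_avg:.1f} bytes/request")
print(f"read/write throughput ratio: {r_rps / w_rps:.2f}")
```

Each request carries roughly 1 KB, so this benchmark measures small-object throughput, and random reads come out at about 5.5 times the write rate on this machine.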

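On the official site, the author gives a brief comparison of weed-fs with other distributed file systems such as Ceph and hdfs, emphasizing the advantages of weed-fs over them.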

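X. Miscellaneous

weed-fs uses google glog, so setting log levels and redirecting logs works exactly as in glog.

weed-fs provides a backup command for backing up a volume server's data on the same machine.

weed-fs does not provide an official client package, but the wiki lists many third-party client packages (in various languages). As far as Go client packages go, there doesn't seem to be a particularly good one yet.

weed-fs has no web console yet and can only be operated from the command line.

When using weed-fs, don't forget to raise the open files limit, otherwise the volume server may crash.

XI. Summary

weed-fs offers a new option for anyone looking for an open-source distributed file system, especially for storing large numbers of small images: weed-fs itself is based on the haystack paper, which describes a design optimized for image storage. weed-fs is also genuinely simple to use: you can set up a distributed system in minutes, deployment is easy, and almost no configuration is needed. Its biggest problem right now seems to be the lack of heavyweight production use cases, and it still has quite a few shortcomings of its own. Still, I hope this article helps more people get to know weed-fs, use it, and help improve it.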