High-Available | Tony Bai

Posts tagged High-Available

Building a Highly Available Kubernetes Cluster with Kubeadm, Step by Step: Part 1

May 15, 2017

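The core of a Kubernetes cluster is its master node, but by default there is only one. If that master node fails, the cluster is effectively paralyzed: cluster management and Pod scheduling stop working, even though some users' Pods may keep running normally. That clearly does not meet what we expect of a production Kubernetes cluster; we need a highly available one.

However, official Kubernetes support for building high-availability clusters is still very limited. There are only rough deployment methods for a few cloud providers, such as the kube-up.sh script on GCE or kops on AWS.

Highly available clusters are an inevitable direction for Kubernetes. The official article "Building High-Availability Clusters" gives a rough outline of how to set up an HA cluster today. Kubeadm has also put HA on the milestone plan for later releases, and a draft proposal for deploying a highly available cluster with kubeadm has already been published.

Until kubeadm truly supports automatically bootstrapping an HA Kubernetes cluster, how should we build one? This article explores, step by step, an approach and the concrete steps for building an HA k8s cluster. Note that the HA k8s cluster built here has only been tested in a lab and has not run in production, so the approach may have gaps in details not yet discovered.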

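I. Test Environment

High availability of a Kubernetes cluster is mainly about high availability of the master node, so we requested three Alibaba Cloud ECS instances in the US West region to serve as three master nodes. Using hostnamectl, we set the static hostnames of the three nodes to shaolin, wudang, and emei: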

shaolin: 10.27.53.32
wudang: 10.24.138.208
emei: 10.27.52.72

All three hosts run Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-63-generic x86_64), and we work as the root user.

The Docker version is as follows:

root@shaolin:~# docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64
 Experimental: false

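For the steps to install Docker CE on Ubuntu, see the Docker CE installation guide. My servers are in the US West region, so the Great Firewall is not an issue. If your hosts are in mainland China, decide whether you need to configure a registry mirror based on whether the installation prints error logs. Also, the Docker version used here is fairly new. The version mentioned most often on the Kubernetes site, and the best supported, is still Docker 1.12.x, and you can install that version directly instead.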

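II. The Approach to Master Node High Availability

From our earlier exploration of a single-master cluster, we know that the following Kubernetes components run on the master node:

kube-apiserver: the core of the cluster. It exposes the cluster API, acts as the communication hub for all cluster components, and enforces cluster security.
etcd: the cluster's data store.
kube-scheduler: the scheduling center for the cluster's Pods.
kube-controller-manager (kcm): the cluster state manager. When the cluster state differs from the desired state, kcm works to restore it. For example, when a Pod dies, kcm creates a new Pod to restore the state the corresponding ReplicaSet expects.
kubelet: the Kubernetes node agent, responsible for talking to the Docker engine on the node.
kube-proxy: one per node, responsible for forwarding traffic from a service VIP to the endpoint Pods, currently implemented mainly by setting iptables rules.

High availability of a Kubernetes cluster means high availability of the master node, which in the end means high availability of the components above that run on it. So our approach is to work out how to make each of these components highly available. From the official Kubernetes material and some draft proposals, building everything from scratch "the hard way" does not look very sensible ^0^; converting a k8s cluster created by kubeadm into an HA k8s cluster looks more feasible. Here is my plan: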

img{512x368}

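As mentioned, our approach starts from a Kubernetes cluster bootstrapped by kubeadm and, by gradually modifying or replacing configuration, arrives at the final HA k8s cluster. The figure above shows the final picture of the HA cluster, in which:

kube-apiserver: because the apiserver is stateless, the apiserver on every master node is active and handles the traffic the load balancer distributes to it.
etcd: the central store for state. The etcd instances on the master nodes form one etcd cluster, so that all apiservers share the cluster's state and data.
kube-controller-manager: kcm has built-in leader election. The kcm instances on the masters form a group, but only the one elected leader does any work. The kcm on each master node connects to the apiserver on its own node.
kube-scheduler: the scheduler has built-in leader election. The schedulers on the masters form a group, but only the one elected leader does any work. The scheduler on each master node connects to the apiserver on its own node.
kubelet: because every master component runs as a container, the kubelet on a master node that carries no workload mainly manages those master component containers. The kubelet on each master node connects to the apiserver on its own node.
kube-proxy: because the master nodes carry no workload, kube-proxy on a master node likewise only serves a few special services, such as kube-dns. Because kubeadm does not expose kube-proxy configuration that can be adjusted from outside, kube-proxy has to connect to the apiserver port exposed by the load balancer.

Next, following this plan step by step, we convert the single-master k8s cluster bootstrapped by kubeadm and gradually evolve it into the HA cluster we want.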

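III. Step 1: Install a Single-Master k8s Cluster with kubeadm

It has been a while since I first used kubeadm to install a Kubernetes 1.5.1 cluster, and both Kubernetes and kubeadm have changed since then. The latest release of both Kubernetes and kubeadm is currently 1.6.2: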

root@wudang:~# kubeadm version
kubeadm version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

root@wudang:~# docker images
REPOSITORY                                               TAG                 IMAGE ID            CREATED             SIZE
gcr.io/google_containers/kube-proxy-amd64                v1.6.2              7a1b61b8f5d4        3 weeks ago         109 MB
gcr.io/google_containers/kube-controller-manager-amd64   v1.6.2              c7ad09fe3b82        3 weeks ago         133 MB
gcr.io/google_containers/kube-apiserver-amd64            v1.6.2              e14b1d5ee474        3 weeks ago         151 MB
gcr.io/google_containers/kube-scheduler-amd64            v1.6.2              b55f2a2481b9        3 weeks ago         76.8 MB
... ...

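Although kubeadm has been updated, the installation process has not changed much. Only the key steps are listed here, and some detailed output is omitted.

First we install the required packages on the shaolin node: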

root@shaolin:~# apt-get update && apt-get install -y apt-transport-https

root@shaolin:~# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
OK

root@shaolin:~# cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
> deb http://apt.kubernetes.io/ kubernetes-xenial main
> EOF

root@shaolin:~# apt-get update

root@shaolin:~# apt-get install -y kubelet kubeadm kubectl kubernetes-cni

Next, bootstrap the cluster with kubeadm. Note: because the flannel network plugin has never worked well on Alibaba Cloud, we again use weave network here.

root@shaolin:~/k8s-install# kubeadm init --apiserver-advertise-address 10.27.53.32
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.6.2
[init] Using Authorization mode: RBAC
[preflight] Running pre-flight checks
[preflight] WARNING: docker version is greater than the most recently validated version. Docker version: 17.03.1-ce. Max validated version: 1.12
[preflight] Starting the kubelet service
[certificates] Generated CA certificate and key.
[certificates] Generated API server certificate and key.
[certificates] API Server serving cert is signed for DNS names [shaolin kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.27.53.32]
[certificates] Generated API server kubelet client certificate and key.
[certificates] Generated service account token signing key and public key.
[certificates] Generated front-proxy CA certificate and key.
[certificates] Generated front-proxy client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 17.045449 seconds
[apiclient] Waiting for at least one node to register
[apiclient] First node has registered after 5.008588 seconds
[token] Using token: a8dd42.afdb86eda4a8c987
[apiconfig] Created RBAC rules
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run (as a regular user):

  sudo cp /etc/kubernetes/admin.conf $HOME/
  sudo chown $(id -u):$(id -g) $HOME/admin.conf
  export KUBECONFIG=$HOME/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token abcdefghijklmn 10.27.53.32:6443

root@shaolin:~/k8s-install# pods
NAMESPACE     NAME                              READY     STATUS    RESTARTS   AGE       IP            NODE
kube-system   etcd-shaolin                      1/1       Running   0          34s       10.27.53.32   shaolin
kube-system   kube-apiserver-shaolin            1/1       Running   0          35s       10.27.53.32   shaolin
kube-system   kube-controller-manager-shaolin   1/1       Running   0          23s       10.27.53.32   shaolin
kube-system   kube-dns-3913472980-tkr91         0/3       Pending   0          1m        <none>
kube-system   kube-proxy-bzvvk                  1/1       Running   0          1m        10.27.53.32   shaolin
kube-system   kube-scheduler-shaolin            1/1       Running   0          46s       10.27.53.32   shaolin

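Installing weave network on k8s 1.6.2 differs slightly from before, because k8s 1.6 enables a more secure mechanism: by default it uses RBAC to grant limited permissions to workloads running on the cluster. The YAML for the weave network plugin we use is weave-daemonset-k8s-1.6.yaml: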

root@shaolin:~/k8s-install# kubectl apply -f https://git.io/weave-kube-1.6
clusterrole "weave-net" created
serviceaccount "weave-net" created
clusterrolebinding "weave-net" created
daemonset "weave-net" created

If your weave pod fails to start with a log like the following:

Network 172.30.0.0/16 overlaps with existing route 172.16.0.0/12 on host.

you need to change the IPALLOC_RANGE of your weave network (I used 172.32.0.0/16 here):

//weave-daemonset-k8s-1.6.yaml
... ...
spec:
  template:
    metadata:
      labels:
        name: weave-net
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: weave
          env:
            - name: IPALLOC_RANGE
              value: 172.32.0.0/16
... ...
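Before settling on a new IPALLOC_RANGE, it can help to check that the candidate range does not collide with the route named in the error. The following is a minimal sketch using Python's standard ipaddress module; the helper name overlaps is made up for this illustration, and the two ranges come from the log line and the YAML above.

```python
import ipaddress

def overlaps(candidate: str, existing: str) -> bool:
    """Return True if the two CIDR ranges share at least one address."""
    return ipaddress.ip_network(candidate).overlaps(ipaddress.ip_network(existing))

# Route reported in the weave error message above.
HOST_ROUTE = "172.16.0.0/12"

for cidr in ("172.30.0.0/16", "172.32.0.0/16"):
    status = "overlaps" if overlaps(cidr, HOST_ROUTE) else "is clear of"
    print(f"{cidr} {status} {HOST_ROUTE}")
```

172.16.0.0/12 spans 172.16.0.0 through 172.31.255.255, which is why 172.30.0.0/16 collides and 172.32.0.0/16 does not.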

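With the master installed, we join wudang and emei as k8s minion nodes to check that the cluster was set up correctly. This also puts kubelet and kube-proxy in place on wudang and emei, and both components can be reused directly in the later conversion steps.

Taking the emei node as an example: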

root@emei:~# kubeadm join --token abcdefghijklmn 10.27.53.32:6443
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[preflight] Running pre-flight checks
[preflight] WARNING: docker version is greater than the most recently validated version. Docker version: 17.03.1-ce. Max validated version: 1.12
[preflight] Starting the kubelet service
[discovery] Trying to connect to API Server "10.27.53.32:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.27.53.32:6443"
[discovery] Cluster info signature and contents are valid, will use API Server "https://10.27.53.32:6443"
[discovery] Successfully established connection with API Server "10.27.53.32:6443"
[bootstrap] Detected server version: v1.6.2
[bootstrap] The server supports the Certificates API (certificates.k8s.io/v1beta1)
[csr] Created API client to obtain unique certificate for this node, generating keys and certificate signing request
[csr] Received signed certificate from the API server, generating KubeConfig...
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"

Node join complete:
* Certificate signing request sent to master and response
  received.
* Kubelet informed of new secure connection details.

Run 'kubectl get nodes' on the master to see this machine join.

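Create an nginx service with multiple pods to test whether the cluster network works. The details are omitted here.

After installation, the single-master Kubernetes cluster looks like the figure below: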

img{512x368}

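IV. Step 2: Build an etcd Cluster for the HA k8s Cluster

All k8s cluster state and data live in etcd, so a highly available k8s cluster depends on a highly available etcd cluster. We need to provide an HA etcd cluster for the final HA k8s cluster. How?

In the current k8s cluster, etcd on the shaolin master node stores all of the cluster's data and state. We need to create etcd instances on wudang and emei as well and combine them with the existing etcd into a highly available cluster that holds the cluster's data and state. We break this process into several smaller steps:

0. Start the kubelet service on emei and wudang

An etcd cluster can be set up completely independently of the k8s components. Here, however, I do it the same way as on the master: the etcd instances started by kubelet on wudang and emei become two members of the etcd cluster. At this point wudang and emei are k8s minion nodes, so we first need to clean up the data on both: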

root@shaolin:~/k8s-install # kubectl drain wudang --delete-local-data --force --ignore-daemonsets
node "wudang" cordoned
WARNING: Ignoring DaemonSet-managed pods: kube-proxy-mxwp3, weave-net-03jbh; Deleting pods with local storage: weave-net-03jbh
pod "my-nginx-2267614806-fqzph" evicted
node "wudang" drained

root@wudang:~# kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
[reset] No etcd manifest found in "/etc/kubernetes/manifests/etcd.yaml", assuming external etcd.
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

root@shaolin:~/k8s-install # kubectl drain emei --delete-local-data --force --ignore-daemonsets
root@emei:~# kubeadm reset

root@shaolin:~/k8s-install# kubectl delete node/wudang
root@shaolin:~/k8s-install# kubectl delete node/emei

In this small goal, the etcd cluster is started automatically by the kubelet on each node, and kubelet itself is started by systemd at system init with the following configuration:

root@wudang:~# cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_EXTRA_ARGS

We first need to start kubelet on wudang and emei. Taking wudang as an example:

root@wudang:~# systemctl enable kubelet
root@wudang:~# systemctl start kubelet

Check the kubelet service log:

root@wudang:~# journalctl -u kubelet -f

May 10 10:58:41 wudang systemd[1]: Started kubelet: The Kubernetes Node Agent.
May 10 10:58:41 wudang kubelet[27179]: I0510 10:58:41.798507   27179 feature_gate.go:144] feature gates: map[]
May 10 10:58:41 wudang kubelet[27179]: error: failed to run Kubelet: invalid kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory
May 10 10:58:41 wudang systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
May 10 10:58:41 wudang systemd[1]: kubelet.service: Unit entered failed state.
May 10 10:58:41 wudang systemd[1]: kubelet.service: Failed with result 'exit-code'.

kubelet fails to start because the configuration file /etc/kubernetes/kubelet.conf is missing. We turn to the shaolin node for help: copy the file of the same name from shaolin to wudang and emei, and also copy shaolin's /etc/kubernetes/pki directory:

root@wudang:~# kubectl --kubeconfig=/etc/kubernetes/kubelet.conf config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://10.27.53.32:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: system:node:shaolin
  name: system:node:shaolin@kubernetes
current-context: system:node:shaolin@kubernetes
kind: Config
preferences: {}
users:
- name: system:node:shaolin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

root@wudang:~# ls /etc/kubernetes/pki
apiserver.crt  apiserver-kubelet-client.crt  ca.crt  ca.srl              front-proxy-ca.key      front-proxy-client.key  sa.pub
apiserver.key  apiserver-kubelet-client.key ca.key  front-proxy-ca.crt  front-proxy-client.crt  sa.key

After running systemctl daemon-reload; systemctl restart kubelet, check the kubelet service log again and you will find that kubelet is up.

Taking the wudang node as an example:

root@wudang:~# journalctl -u kubelet -f
-- Logs begin at Mon 2017-05-08 15:12:01 CST. --
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.213529   26907 factory.go:54] Registering systemd factory
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.213674   26907 factory.go:86] Registering Raw factory
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.213813   26907 manager.go:1106] Started watching for new ooms in manager
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.216383   26907 oomparser.go:185] oomparser using systemd
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.217415   26907 manager.go:288] Starting recovery of all containers
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.285428   26907 manager.go:293] Recovery completed
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.344425   26907 kubelet_node_status.go:230] Setting node annotation to enable volume controller attach/detach
May 11 10:37:07 wudang kubelet[26907]: E0511 10:37:07.356188   26907 eviction_manager.go:214] eviction manager: unexpected err: failed GetNode: node 'wudang' not found
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.358402   26907 kubelet_node_status.go:77] Attempting to register node wudang
May 11 10:37:07 wudang kubelet[26907]: I0511 10:37:07.363083   26907 kubelet_node_status.go:80] Successfully registered node wudang

For now, we let the kubelets on wudang and emei stay connected to the apiserver on shaolin.

1. Create an etcd cluster on emei and wudang

Using /etc/kubernetes/manifests/etcd.yaml on shaolin as a template, we derive the etcd.yaml for wudang and emei. The main changes are in the containers:command section:

/etc/kubernetes/manifests/etcd.yaml on wudang:

spec:
  containers:
  - command:
    - etcd
    - --name=etcd-wudang
    - --initial-advertise-peer-urls=http://10.24.138.208:2380
    - --listen-peer-urls=http://10.24.138.208:2380
    - --listen-client-urls=http://10.24.138.208:2379,http://127.0.0.1:2379
    - --advertise-client-urls=http://10.24.138.208:2379
    - --initial-cluster-token=etcd-cluster
    - --initial-cluster=etcd-wudang=http://10.24.138.208:2380,etcd-emei=http://10.27.52.72:2380
    - --initial-cluster-state=new
    - --data-dir=/var/lib/etcd
    image: gcr.io/google_containers/etcd-amd64:3.0.17

/etc/kubernetes/manifests/etcd.yaml on emei:

spec:
  containers:
  - command:
    - etcd
    - --name=etcd-emei
    - --initial-advertise-peer-urls=http://10.27.52.72:2380
    - --listen-peer-urls=http://10.27.52.72:2380
    - --listen-client-urls=http://10.27.52.72:2379,http://127.0.0.1:2379
    - --advertise-client-urls=http://10.27.52.72:2379
    - --initial-cluster-token=etcd-cluster
    - --initial-cluster=etcd-emei=http://10.27.52.72:2380,etcd-wudang=http://10.24.138.208:2380
    - --initial-cluster-state=new
    - --data-dir=/var/lib/etcd
    image: gcr.io/google_containers/etcd-amd64:3.0.17

Once each file is placed in the /etc/kubernetes/manifests directory of its node, the kubelet on that node starts the corresponding etcd pod automatically.
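The two manifests differ only in the member name and IP address, so the command list can be derived from a single table of members. The sketch below is only an illustration of that pattern: the MEMBERS table and the etcd_command helper are hypothetical names, while the flags and values are the ones shown in the manifests above.

```python
from typing import Dict, List

# Member name -> node IP, taken from the two manifests above.
MEMBERS: Dict[str, str] = {
    "etcd-wudang": "10.24.138.208",
    "etcd-emei": "10.27.52.72",
}

def etcd_command(name: str, members: Dict[str, str], state: str = "new") -> List[str]:
    """Build the etcd command list for one member of the cluster."""
    ip = members[name]
    # Every member lists all peers, itself included, in --initial-cluster.
    initial_cluster = ",".join(
        f"{member}=http://{addr}:2380" for member, addr in members.items()
    )
    return [
        "etcd",
        f"--name={name}",
        f"--initial-advertise-peer-urls=http://{ip}:2380",
        f"--listen-peer-urls=http://{ip}:2380",
        f"--listen-client-urls=http://{ip}:2379,http://127.0.0.1:2379",
        f"--advertise-client-urls=http://{ip}:2379",
        "--initial-cluster-token=etcd-cluster",
        f"--initial-cluster={initial_cluster}",
        f"--initial-cluster-state={state}",
        "--data-dir=/var/lib/etcd",
    ]

for flag in etcd_command("etcd-emei", MEMBERS):
    print(flag)
```

The order of entries inside --initial-cluster differs between the two hand-written manifests; the sketch emits one fixed order for every member.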

root@shaolin:~# pods
NAMESPACE     NAME                              READY     STATUS    RESTARTS   AGE       IP              NODE
kube-system   etcd-emei                         1/1       Running   0          11s       10.27.52.72     emei
kube-system   etcd-shaolin                      1/1       Running   0          25m       10.27.53.32     shaolin
kube-system   etcd-wudang                       1/1       Running   0          24s       10.24.138.208   wudang

Let's check the current state of the etcd cluster:

# etcdctl endpoint status --endpoints=10.27.52.72:2379,10.24.138.208:2379
10.27.52.72:2379, 6e80adf8cd57f826, 3.0.17, 25 kB, false, 17, 660
10.24.138.208:2379, f3805d1ab19c110b, 3.0.17, 25 kB, true, 17, 660

Note: from left to right, the columns are endpoint URL, ID, version, database size, leadership status, raft term, and raft index.
So we can see that etcd on wudang (10.24.138.208) has been elected cluster leader.
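Reading the leader out of this output by eye works for two or three members; for scripting, the comma-separated lines are easy to parse. The following is a small sketch that assumes the seven-column layout described in the note above. The function names are illustrative, and SAMPLE is copied from the output shown earlier.

```python
from typing import Dict, List, Optional

# Sample copied from the `etcdctl endpoint status` output above.
SAMPLE = """\
10.27.52.72:2379, 6e80adf8cd57f826, 3.0.17, 25 kB, false, 17, 660
10.24.138.208:2379, f3805d1ab19c110b, 3.0.17, 25 kB, true, 17, 660
"""

def parse_endpoint_status(output: str) -> List[Dict[str, object]]:
    """Parse the comma-separated output into one dict per endpoint."""
    rows = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 7:
            raise ValueError(f"unexpected line: {line!r}")
        endpoint, member_id, version, db_size, is_leader, term, index = fields
        rows.append({
            "endpoint": endpoint,
            "id": member_id,
            "version": version,
            "db_size": db_size,
            "is_leader": is_leader == "true",
            "raft_term": int(term),
            "raft_index": int(index),
        })
    return rows

def find_leader(rows: List[Dict[str, object]]) -> Optional[str]:
    """Return the endpoint that reports itself as leader, if any."""
    for row in rows:
        if row["is_leader"]:
            return str(row["endpoint"])
    return None

print(find_leader(parse_endpoint_status(SAMPLE)))
```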

Let's test the etcd cluster by putting a few keys.

On the wudang node (note: export ETCDCTL_API=3):

root@wudang:~# etcdctl put foo bar
OK
root@wudang:~# etcdctl put foo1 bar1
OK
root@wudang:~# etcdctl get foo
foo
bar

On the emei node:

root@emei:~# etcdctl get foo
foo
bar

At this point, the Kubernetes cluster looks like the following diagram:

img{512x368}

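2. Sync the data from etcd on shaolin into the etcd cluster

Kubernetes 1.6.2 uses etcd 3.x by default. etcdctl 3.x provides a make-mirror command for syncing data between etcd clusters, so we can use etcdctl make-mirror to sync the k8s cluster data from etcd on shaolin into the etcd cluster we just created. Run the following command on the emei node: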

root@emei:~# etcdctl make-mirror --no-dest-prefix=true  127.0.0.1:2379  --endpoints=10.27.53.32:2379 --insecure-skip-tls-verify=true
... ...
261
302
341
380
420
459
498
537
577
616
655

... ...

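etcdctl make-mirror prints a log line every 30 seconds, but these lines do not reveal how far the sync has progressed. make-mirror also appears to be a streaming sync with no end point, so you have to judge for yourself whether all the data has been copied, for example by fetching a particular key on both sides and comparing the results: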

# etcdctl get --from-key /api/v2/registry/clusterrolebindings/cluster-admin

.. ..
compact_rev_key
122912

Alternatively, use the endpoint status command to compare the database size on both sides. Once they are roughly equal, you can stop make-mirror.
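If you want something stricter than comparing sizes, one option is to dump the keys and values from both sides and diff them. The sketch below covers only the comparison step. It assumes the two dumps have already been loaded into Python dicts; how they are produced is left out, and the keys in the example are hypothetical.

```python
from typing import Dict, List, Tuple

def diff_snapshots(source: Dict[str, str], mirror: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare two key/value snapshots.

    Returns (missing, stale): keys absent from the mirror, and keys whose
    value in the mirror differs from the source.
    """
    missing = sorted(key for key in source if key not in mirror)
    stale = sorted(
        key for key in source if key in mirror and mirror[key] != source[key]
    )
    return missing, stale

# Hypothetical snapshots, for illustration only.
source = {"/registry/a": "1", "/registry/b": "2", "/registry/c": "3"}
mirror = {"/registry/a": "1", "/registry/b": "old"}

missing, stale = diff_snapshots(source, mirror)
print("missing from mirror:", missing)
print("stale in mirror:", stale)
```

Because the source cluster keeps changing while the mirror runs, a few stale keys at any instant are expected; what matters is that both lists are empty once writes to the source have stopped.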

3. Point the apiserver on shaolin at the etcd cluster, then stop and delete etcd on shaolin

Edit /etc/kubernetes/manifests/kube-apiserver.yaml on the shaolin node so that kube-apiserver on shaolin connects to etcd on the emei node:

Change the following line:
- --etcd-servers=http://10.27.52.72:2379

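After you save the change, kubelet restarts kube-apiserver automatically, and the restarted kube-apiserver works normally.

Next, we stop and delete etcd on shaolin (and delete its data directory):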

root@shaolin:~# rm /etc/kubernetes/manifests/etcd.yaml
root@shaolin:~# rm -fr /var/lib/etcd

List the pods in the k8s cluster again and you will find that etcd-shaolin is gone.

At this point, the k8s cluster looks like the following diagram:

img{512x368}

4. Recreate etcd on shaolin and add it to the etcd cluster as a member

First we add etcd-shaolin as a member of the existing etcd cluster:

root@wudang:~/kubernetes-conf-shaolin/manifests# etcdctl member add etcd-shaolin --peer-urls=http://10.27.53.32:2380
Member 3184cfa57d8ef00c added to cluster 140cec6dd173ab61

Then, on the shaolin node, modify the original etcd.yaml from shaolin as follows:

// /etc/kubernetes/manifests/etcd.yaml
... ...
spec:
  containers:
  - command:
    - etcd
    - --name=etcd-shaolin
    - --initial-advertise-peer-urls=http://10.27.53.32:2380
    - --listen-peer-urls=http://10.27.53.32:2380
    - --listen-client-urls=http://10.27.53.32:2379,http://127.0.0.1:2379
    - --advertise-client-urls=http://10.27.53.32:2379
    - --initial-cluster-token=etcd-cluster
    - --initial-cluster=etcd-shaolin=http://10.27.53.32:2380,etcd-wudang=http://10.24.138.208:2380,etcd-emei=http://10.27.52.72:2380
    - --initial-cluster-state=existing
    - --data-dir=/var/lib/etcd
    image: gcr.io/google_containers/etcd-amd64:3.0.17

After you save the change, kubelet brings up etcd-shaolin automatically:

root@shaolin:~/k8s-install# pods
NAMESPACE     NAME                              READY     STATUS    RESTARTS   AGE       IP              NODE
kube-system   etcd-emei                         1/1       Running   0          3h        10.27.52.72     emei
kube-system   etcd-shaolin                      1/1       Running   0          8s        10.27.53.32     shaolin
kube-system   etcd-wudang                       1/1       Running   0          3h        10.24.138.208   wudang

Check the etcd cluster status:

root@shaolin:~# etcdctl endpoint status --endpoints=10.27.52.72:2379,10.24.138.208:2379,10.27.53.32:2379
10.27.52.72:2379, 6e80adf8cd57f826, 3.0.17, 11 MB, false, 17, 34941
10.24.138.208:2379, f3805d1ab19c110b, 3.0.17, 11 MB, true, 17, 34941
10.27.53.32:2379, 3184cfa57d8ef00c, 3.0.17, 11 MB, false, 17, 34941

The three etcd instances agree on data size and raft index, and etcd on the wudang node is the leader.
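Getting to three members matters because etcd, like any Raft-based store, needs a majority of members to agree before a write commits. The arithmetic can be sketched as follows; the function names are illustrative and not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)

for n in (1, 2, 3):
    print(f"{n} member(s): quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

By this arithmetic, the intermediate two-member cluster (wudang and emei) tolerates no failures at all, while the three-member cluster survives the loss of any one member.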

5. Point the apiserver on shaolin back at etcd-shaolin

// /etc/kubernetes/manifests/kube-apiserver.yaml

... ...
- --etcd-servers=http://127.0.0.1:2379
... ...

After the change takes effect and the apiserver restarts, the Kubernetes cluster looks like the following diagram:

img{512x368}

The series continues in Part 2.

An Introduction to Using weed-fs

August 22, 2015

weed-fs, full name Seaweed-fs, is a simple and highly available distributed file system implemented in Go. The system has two goals:

- 存储billions of files
- serve the files fast

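weed-fs began as an implementation of Facebook's Haystack paper; Haystack was designed to optimize how Facebook stores and retrieves photos internally. The author of weed-fs later added a number of features on top of that base, resulting in today's weed-fs.

This article does not analyze the weed-fs source code in depth. It only describes how to use weed-fs from a black-box perspective and explores its features, strengths, and weaknesses.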

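I. Overview of a weed-fs Cluster

A weed-fs cluster's topology is made up of DataCenters, Racks, and Machines (also called Nodes). The earliest versions of weed-fs apparently allowed the whole cluster topology to be described in an XML configuration file, for which the project provided a sample.

In the current version, however, the help text marks this configuration file as "Deprecating!":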

$weed master -help
…
-conf="/etc/weedfs/weedfs.conf": Deprecating! xml configuration file
…

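weed-fs 0.70 maintains the cluster topology in the master, which builds the topology in real time from the master-to-master and volume-to-master connections.

weed-fs itself can run in two modes: Master and Volume. The masters maintain the cluster and guarantee strong consistency, which they achieve among themselves through the raft protocol. A Volume is the running instance that actually manages and stores data. Data reliability is provided by the replication mechanism in weed-fs.

weed-fs offers several replication strategies (a rack is a logical concept):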

000 no replication, just one copy
001 replicate once on the same rack
010 replicate once on a different rack in the same data center
100 replicate once on a different data center
200 replicate twice on two other different data center
110 replicate once on a different rack, and once on a different data center
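Each strategy is a three-digit code. Reading the table above, the digits count extra copies in other data centers, on other racks in the same data center, and on the same rack, in that order. A minimal sketch of decoding a code under that reading follows; the function and field names are made up for this illustration.

```python
from typing import Dict

def parse_replication(code: str) -> Dict[str, int]:
    """Decode a three-digit replication code such as "100" or "110"."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected three digits, got {code!r}")
    other_dc, other_rack, same_rack = (int(ch) for ch in code)
    return {
        "other_data_centers": other_dc,
        "other_racks": other_rack,
        "same_rack": same_rack,
        # The original copy plus every replica.
        "total_copies": 1 + other_dc + other_rack + same_rack,
    }

for code in ("000", "001", "010", "100", "200", "110"):
    print(code, parse_replication(code))
```

For the strategy 100 used later in this article, that means two copies in total: the original, plus one replica in a different data center.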

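Choosing a more reliable strategy costs some performance; it is always a trade-off.

Further details, along with scaling and data migration, are covered one by one below.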

II. Starting a weed-fs Cluster

To make experimenting easy, we define the following weed-fs cluster topology:

Three masters:
    master1 - localhost:9333
    master2 - localhost:9334
    master3 - localhost:9335

Replication strategy: 100 (one replica in a different data center)

Three volumes:
    volume1 - localhost:8081 dc1
    volume2 - localhost:8082 dc1
    volume3 - localhost:8083 dc2

Start the masters first, in the order master1, master2, master3:

master1:

$ weed -v=3 master -port=9333 -mdir=./m1 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:17 07606 file_util.go:20] Folder ./m1 Permission: -rwxrwxr-x
I0820 14:37:17 07606 topology.go:86] Using default configurations.
I0820 14:37:17 07606 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:17 07606 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9333
I0820 14:37:17 07606 raft_server.go:50] Starting RaftServer with IP:localhost:9333:
I0820 14:37:17 07606 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9334/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9334/cluster/join: dial tcp 127.0.0.1:9334: connection refused
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9335/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9335/cluster/join: dial tcp 127.0.0.1:9335: connection refused
I0820 14:37:17 07606 raft_server.go:78] No existing server found. Starting as leader in the new cluster.
I0820 14:37:17 07606 master_server.go:93] [ localhost:9333 ] I am the leader!

I0820 14:37:52 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[]
I0820 14:37:52 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9334","connectionString":"http://localhost:9334"}
I0820 14:37:52 07606 raft_server_handlers.go:27] join command from Name localhost:9334 Connection http://localhost:9334

I0820 14:38:02 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[localhost:9334:0xc20800f730]
I0820 14:38:02 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9335","connectionString":"http://localhost:9335"}
I0820 14:38:02 07606 raft_server_handlers.go:27] join command from Name localhost:9335 Connection http://localhost:9335

master2:

$ weed -v=3 master -port=9334 -mdir=./m2 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:52 07616 file_util.go:20] Folder ./m2 Permission: -rwxrwxr-x
I0820 14:37:52 07616 topology.go:86] Using default configurations.
I0820 14:37:52 07616 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:52 07616 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9334
I0820 14:37:52 07616 raft_server.go:50] Starting RaftServer with IP:localhost:9334:
I0820 14:37:52 07616 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:52 07616 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:37:52 07616 raft_server.go:179] Post returned status: 200

master3:

$ weed -v=3 master -port=9335 -mdir=./m3 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:38:02 07626 file_util.go:20] Folder ./m3 Permission: -rwxrwxr-x
I0820 14:38:02 07626 topology.go:86] Using default configurations.
I0820 14:38:02 07626 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:38:02 07626 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9335
I0820 14:38:02 07626 raft_server.go:50] Starting RaftServer with IP:localhost:9335:
I0820 14:38:02 07626 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:38:02 07626 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:38:03 07626 raft_server.go:179] Post returned status: 200

After master1 starts, it finds that the other two peer masters are not up yet, so it elects itself leader. When master2 and master3 start, they join the master cluster led by master1.

Next we start the volume servers:

volume1:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
I0820 14:44:29 07642 file_util.go:20] Folder ./v1 Permission: -rwxrwxr-x
I0820 14:44:29 07642 store.go:225] Store started on dir: ./v1 with 0 volumes max 7
I0820 14:44:29 07642 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8081
I0820 14:44:29 07642 volume_server.go:70] Volume server bootstraps with master localhost:9333
I0820 14:44:29 07642 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335"]}
I0820 14:44:29 07642 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9333 localhost:9333], lastNode:3

The volume servers start in much the same way, so the log output of volume2 and volume3 is not listed in detail here.

volume2:

$weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9334 -dataCenter=dc1

volume3:

$weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2

After the three volume servers start, the leader master (9333) logs the following:

I0820 14:44:29 07606 node.go:208] topo adds child dc1
I0820 14:44:29 07606 node.go:208] topo:dc1 adds child DefaultRack
I0820 14:44:29 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8081
I0820 14:47:09 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8082
I0820 14:47:21 07606 node.go:208] topo adds child dc2
I0820 14:47:21 07606 node.go:208] topo:dc2 adds child DefaultRack
I0820 14:47:21 07606 node.go:208] topo:dc2:DefaultRack adds child 127.0.0.1:8083

The whole weed-fs cluster is now up. After its first start, a master creates some directories and files under -mdir:

$ ls m1
conf log snapshot

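A volume server, however, does nothing under -dir yet; it creates the corresponding .idx and .dat files the first time data is written.

III. Basic Operations: Storing, Retrieving, and Deleting Files

Create a file hello.txt with the content "hello weed-fs!" to test the basic operations. weed-fs provides an HTTP REST API, which makes its basic features easy to use (the client here is curl).

1. Store

Let's store hello.txt in weed-fs using the submit API exposed by the master: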

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,01fc4a422c","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,01fc4a422c","size":39}

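The master returns a line of JSON, in which:

fid is a comma-separated string. According to the documentation in the repository, it consists of a volume id, a uint64 key, and a cookie code. The 6 before the comma is the volume id, and 01fc4a422c is the key and cookie concatenated. The fid is the unique ID of hello.txt in the cluster, and you need it later to look up, retrieve, or delete the file's data.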
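To make the fid layout concrete, here is a small parsing sketch. It rests on an assumption the text above does not spell out: that the cookie is the last 8 hexadecimal digits (a 32-bit value) and the key is whatever precedes it, which is how the project's README describes its own example fid. The function name is illustrative.

```python
from typing import Tuple

COOKIE_HEX_DIGITS = 8  # assumption: the cookie is a 32-bit value in hex

def parse_fid(fid: str) -> Tuple[int, int, int]:
    """Split a fid such as "6,01fc4a422c" into (volume_id, key, cookie)."""
    volume, sep, rest = fid.partition(",")
    if not sep or len(rest) <= COOKIE_HEX_DIGITS:
        raise ValueError(f"malformed fid: {fid!r}")
    key_hex, cookie_hex = rest[:-COOKIE_HEX_DIGITS], rest[-COOKIE_HEX_DIGITS:]
    return int(volume), int(key_hex, 16), int(cookie_hex, 16)

volume_id, key, cookie = parse_fid("6,01fc4a422c")
print(f"volume={volume_id} key={key} cookie={cookie:08x}")
```

Under that assumption, the fid returned above decodes to volume 6, key 1, and cookie fc4a422c.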

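fileUrl is one address (not the only one) at which the file can be accessed in weed-fs. Here it is 127.0.0.1:8082/6,01fc4a422c, which shows that weed-fs stored a copy of hello.txt on volume server 2.

This store operation triggered the creation of physical volumes. The -dir of each volume server has changed and now contains a number of .idx and .dat files: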

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx

v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx

This creation process is controlled by the master leader:

I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc1:DefaultRack:127.0.0.1:8081
I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc2:DefaultRack:127.0.0.1:8083

From the file sizes we can see that hello.txt was stored in the volume with id 6 (6.dat and 6.idx) under v2 and v3:

v2:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

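6.dat is identical in v2 and v3, and so is 6.idx (this matters a great deal later, when migrating data).

2. Retrieve

As shown earlier, the master returned the fid 6,01fc4a422c and the fileUrl 127.0.0.1:8082/6,01fc4a422c.

With this fileUrl we can fetch the data of hello.txt: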

$ curl http://127.0.0.1:8082/6,01fc4a422c
hello weed-fs!

Under our replication strategy, hello.txt should also be stored under v3. Switching to the volume server on 8083 should return the hello.txt data as well:

$ curl http://127.0.0.1:8083/6,01fc4a422c
hello weed-fs!

If we query through volume1 (8081), we should not get the data:

$ curl http://127.0.0.1:8081/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.

This looks like a redirect. Let's add curl's follow-redirect option and try again:

$ curl -L http://127.0.0.1:8081/6,01fc4a422c
hello weed-fs!

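Surprisingly, this returns the data too. Judging from volume1's log, volume1 can also look up the correct address of hello.txt and returns a redirect, so curl can fetch the data from the right machine.

What happens if we fetch hello.txt through a master?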

$ curl -L http://127.0.0.1:9335/6,01fc4a422c
hello weed-fs!

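The master likewise returns a redirect address, and curl fetches the correct data from a volume node. Let's look at how the master returns the redirect address: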

$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.
$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8083/6,01fc4a422c">Moved Permanently</a>.

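As you can see, the master balances load automatically, returning 8082 and 8083 in round-robin fashion. Before version 0.70, a non-leader master could not return the correct result and only the leader master could; version 0.70 fixed this.

3. Delete

Delete hello.txt directly through its fileUrl: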

$ curl -X DELETE http://127.0.0.1:8082/6,01fc4a422c
{"size":39}

After the operation succeeds, let's try to get hello.txt again:

$ curl -i http://127.0.0.1:8082/6,01fc4a422c
HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:28 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

$ curl -i -L http://127.0.0.1:9335/6,01fc4a422c
HTTP/1.1 301 Moved Permanently
Content-Length: 69
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Aug 2015 08:13:56 GMT
Location: http://127.0.0.1:8082/6,01fc4a422c

HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:56 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

Whether we go directly to the volume server or indirectly through the master, hello.txt can no longer be fetched. It has been deleted successfully.

After hello.txt is deleted, however, the data files under the volume servers do not shrink. Don't worry: this is how weed-fs works. The holes left behind by deleted data have to be reclaimed manually by compacting (vacuuming) the data files:

$ curl "http://localhost:9335/vol/vacuum"
{"Topology":{"DataCenters":[{"Free":8,"Id":"dc1","Max":14,"Racks":[{"DataNodes":[{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8081","Url":"127.0.0.1:8081","Volumes":3},{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8082","Url":"127.0.0.1:8082","Volumes":3}],"Free":8,"Id":"DefaultRack","Max":14}]},{"Free":1,"Id":"dc2","Max":7,"Racks":[{"DataNodes":[{"Free":1,"Max":7,"PublicUrl":"127.0.0.1:8083","Url":"127.0.0.1:8083","Volumes":6}],"Free":1,"Id":"DefaultRack","Max":7}]}],"Free":9,"Max":21,"layouts":[{"collection":"","replication":"100","ttl":"","writables":[1,2,3,4,5,6]}]},"Version":"0.70 beta"}

After compaction, check the file sizes under v1, v2 and v3 again: they really have shrunk.
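Why a delete does not shrink the file, while a vacuum does, is easier to see with a toy model. The class below is a deliberately simplified sketch of an append-only store with a separate index. It is not weed-fs's actual on-disk format, and all names in it are invented for illustration.

```python
class AppendOnlyVolume:
    """Toy model: a log of (key, payload) records plus an index of live keys."""

    def __init__(self):
        self.log = []    # every record ever appended, including dead ones
        self.index = {}  # key -> position of the live record in self.log

    def put(self, key, payload):
        self.index[key] = len(self.log)
        self.log.append((key, payload))

    def delete(self, key):
        # Only the index entry goes away; the payload bytes stay in the log.
        self.index.pop(key, None)

    def size(self):
        return sum(len(payload) for _, payload in self.log)

    def vacuum(self):
        # Rewrite the log, keeping only records the index still points at.
        live = [self.log[pos] for pos in sorted(self.index.values())]
        self.log = live
        self.index = {key: pos for pos, (key, _) in enumerate(live)}


if __name__ == "__main__":
    vol = AppendOnlyVolume()
    vol.put("a", b"x" * 10)
    vol.put("b", b"y" * 5)
    vol.delete("a")
    print(vol.size())  # still counts the deleted payload
    vol.vacuum()
    print(vol.size())  # only live data remains
```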

IV. Consistency

In distributed systems, consistency is a perennial hard problem. weed-fs supports replication, so it has to keep the data in its replicas consistent.

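In theory, weed-fs adopts a strong consistency policy:

- When storing a file, success is returned only after every replica has been stored successfully; if any replica fails to store, the store operation returns failure.
- When deleting a file, success is returned only after every replica has been deleted successfully; if any replica fails to delete, the delete operation returns failure.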
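The two rules can be written down as a small simulation. This is a sketch under stated assumptions: Replica, replicated_write and replicated_delete are hypothetical names, and the model has no rollback, so it also shows that reporting failure is not the same as undoing work already done on the replicas that were reachable.

```python
class Replica:
    """In-memory stand-in for one volume server."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.files = {}

    def store(self, key, data):
        if not self.up:
            return False
        self.files[key] = data
        return True

    def remove(self, key):
        if not self.up:
            return False
        self.files.pop(key, None)
        return True


def replicated_write(replicas, key, data):
    """Report success only if every replica accepted the write."""
    results = [replica.store(key, data) for replica in replicas]
    return all(results)


def replicated_delete(replicas, key):
    """Report success only if every replica performed the delete."""
    results = [replica.remove(key) for replica in replicas]
    return all(results)


if __name__ == "__main__":
    a, b = Replica("8082"), Replica("8083")
    print(replicated_write([a, b], "hello.txt", b"hello weed-fs!"))
    b.up = False
    print(replicated_delete([a, b], "hello.txt"))   # reported as failed
    print("hello.txt" in a.files, "hello.txt" in b.files)
```

In this model a failed delete still removes the file from the replica that was up, which is exactly the kind of partial state a strongly consistent system has to prevent or repair.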

Let's verify whether weed-fs actually achieves both.

1. Consistency of stores

First stop volume3 (that is, dc2). With replication policy 100, storing hello.txt in weed-fs now gives the following result:

$ curl -F file=@hello.txt http://localhost:9333/submit
{"error":"Cannot grow volume group! Not enough data node found!"}

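Under policy 100, the master has to pick a volume in dc2 to hold a replica of hello.txt, but every machine in dc2 is down, so no storage is available there. The master decides the operation cannot proceed and returns a failure. This satisfies the store consistency requirement.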

2. Consistency of deletes

Bring dc2 back and store hello.txt:

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,04dce94a72","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,04dce94a72","size":39}

Stop dc2 again, then try to delete hello.txt through the master:

$ curl -L -X DELETE http://127.0.0.1:9333/6,04dce94a72
{"error":"Deletion Failed."}

Although the response says the deletion failed, the log on 8082 suggests that 8082 has already deleted hello.txt:

I0820 17:32:20 07653 volume_server_handlers_write.go:53] deleting Cookie:3706276466, Id:4, Size:0, DataSize:0, Name: , Mime:
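The Cookie and Id fields in this log line can be cross-checked against the fid 6,04dce94a72. The sketch below assumes the fid layout described in the SeaweedFS documentation (volume id, a comma, then the file key in hex followed by an 8-hex-digit cookie); parse_fid is a hypothetical helper, not a weed-fs API.

```python
def parse_fid(fid):
    """Split a fid such as '6,04dce94a72' into (volume_id, file_key, cookie)."""
    volume, rest = fid.split(",", 1)
    if len(rest) <= 8:
        raise ValueError("fid is too short to contain a key and a cookie")
    # Assumed layout: the last 8 hex digits are the cookie, the rest is the key.
    key_hex, cookie_hex = rest[:-8], rest[-8:]
    return int(volume), int(key_hex, 16), int(cookie_hex, 16)


if __name__ == "__main__":
    print(parse_fid("6,04dce94a72"))
```

For 6,04dce94a72 this yields volume 6, key 4 and cookie 3706276466, which agrees with the Id:4 and Cookie:3706276466 in the log line.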

Let's fetch hello.txt from 8082 again:

$ curl http://127.0.0.1:8082/6,04dce94a72

Nothing is returned.

The 8082 log shows:

I0820 17:33:24 07653 volume_server_handlers_read.go:53] read error: File Entry Not Found. Needle 70 Memory 0 /6,04dce94a72

hello.txt really has been deleted there.

Now restart dc2 (8083) and try to fetch hello.txt from 8083:

$ curl http://127.0.0.1:8083/6,04dce94a72
hello weed-fs!

hello.txt still exists on 8083 and can be read.

Now try fetching hello.txt through the master:

$ curl -L http://127.0.0.1:9333/6,04dce94a72
$ curl -L http://127.0.0.1:9333/6,04dce94a72
hello weed-fs!

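Sometimes the contents of hello.txt come back and sometimes they do not. This is evidently a consequence of the master's automatic load balancing: when it redirects to 8082, curl gets nothing; when it redirects to 8083, we get the contents of hello.txt.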

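So for now, weed-fs deletes do not guarantee strong consistency. Several issues in the weed-fs repository on GitHub (#172, #179, #182) deal with this problem. With large data sets (TB or PB scale), the biggest consequence of this inconsistency is a storage leak: space stays occupied and cannot be reclaimed, so volumes gradually fill up one after another. Hopefully a fix will arrive later.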

V. Directory Support

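weed-fs can also manage files in directories, like a traditional file system, and store, retrieve and delete files by path. Directory support is provided by a separate server: the filer server. To get directory support you must start one or more filer servers, and all operations must go through a filer server.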

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100
I0820 22:09:40 08238 file_util.go:20] Folder ./f1 Permission: -rwxrwxr-x
I0820 22:09:40 08238 filer.go:88] Start Seaweed Filer 0.70 beta at port 8888

1. Store

$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

2. Retrieve

$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!

3. List the files in a directory

$ curl "http://localhost:8888/foo/?pretty=y"
{
"Directory": "/foo/",
"Files": [
    {
      "name": "hello.txt",
      "fid": "6,067281a126"
    }
],
"Subdirectories": null
}

4. Delete

$ curl -X DELETE http://localhost:8888/foo/hello.txt
{"error":""}

Try to get hello.txt again:

$ curl http://localhost:8888/foo/hello.txt

The response is empty: hello.txt has been deleted.

5. Multiple filer servers

A lone weed filer server is a single point of failure, so let's start a second filer server.

$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100
I0821 13:47:52 08973 file_util.go:20] Folder ./f2 Permission: -rwxrwxr-x
I0821 13:47:52 08973 filer.go:88] Start Seaweed Filer 0.70 beta at port 8889

Do the two filer nodes coordinate with each other? Let's test it: store a file through 8888, then fetch it from 8889:

$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}
$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!
$ curl http://localhost:8889/foo/hello.txt
(empty response)

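The test shows that the two filers work independently, with no connection between them. In other words, they do not share the index that maps a file's full path to its fid. By default, filer servers run in standalone mode.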

weed-fs offers an official clustering approach for filers: use redis or Cassandra as a backend so that multiple filer nodes share the full-path-to-fid index.
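The difference between the two modes comes down to where the path-to-fid index lives. Here is a minimal sketch, with a plain Python dict standing in for redis and a hypothetical Filer class that has nothing to do with the real filer implementation:

```python
class Filer:
    """Toy filer: maps a full file path to a fid through an index."""

    def __init__(self, index=None):
        # Standalone mode: a private index. Distributed mode: a shared one,
        # passed in by the caller (a dict here, redis in the real setup).
        self.index = {} if index is None else index

    def put(self, path, fid):
        self.index[path] = fid

    def get(self, path):
        return self.index.get(path)


if __name__ == "__main__":
    # Standalone: the second filer knows nothing about the first one's files.
    f8888, f8889 = Filer(), Filer()
    f8888.put("/foo/hello.txt", "6,067281a126")
    print(f8889.get("/foo/hello.txt"))

    # Distributed: both filers read and write the same index.
    shared = {}
    g8888, g8889 = Filer(shared), Filer(shared)
    g8888.put("/foo/hello.txt", "6,067281a126")
    print(g8889.get("/foo/hello.txt"))
```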

We start a redis-server (2.8.21) listening on the default port 6379 and restart the two filer nodes with the following commands:

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379
$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379

Repeat the test steps above:
$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

$ curl http://localhost:8889/foo/hello.txt
hello weed-fs!

The file stored through 8888 can now be fetched from 8889.

Now delete the file:
$ curl -X DELETE http://localhost:8889/foo/hello.txt
{"error":"Invalid fileId "}

An error is reported, but the file has in fact been deleted. This may be a minor bug (#183).

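The filers are now clustered, but the redis backend behind them is still a single point of failure. For high reliability, redis clearly needs to be clustered as well.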

VI. Collection

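A collection, as the name suggests, is a set; in weed-fs it means a set of physical volumes. Earlier we stored files without specifying a collection, so weed-fs used the default (empty) collection. What happens if we specify one?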

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=picture"
{"fid":"7,0c4f5dc90f","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/7,0c4f5dc90f","size":39}

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx picture_7.dat picture_7.idx
v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx
v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx picture_7.dat picture_7.idx

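The volume servers created idx and dat files under their -dir directories with the collection name as a prefix. In this example hello.txt was assigned to the volume servers on 8081 and 8083, so each of them created picture_7.dat and picture_7.idx. The picture-prefixed idx and dat files hold only data stored with collection=picture; other data goes either into the default collection or into collections with other names.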

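A collection is much like a volume label on a Windows drive: C: might be labeled "System", D: "Programs" and E: "Data". Here the picture_7.dat and picture_7.idx files on the volume servers make up the volume named picture. If there were also a video collection, it might consist of video_8.dat and video_8.idx on the volume servers.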
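The naming scheme visible in the directory listings can be captured in a few lines. This is a sketch inferred from those listings only; both function names are hypothetical.

```python
def volume_file_names(volume_id, collection=""):
    """Return the (.dat, .idx) file names for a volume, as seen in the listings."""
    prefix = f"{collection}_" if collection else ""
    return f"{prefix}{volume_id}.dat", f"{prefix}{volume_id}.idx"


def group_by_collection(file_names):
    """Group volume ids by collection; '' is the default collection."""
    groups = {}
    for name in file_names:
        stem, _, _ext = name.rpartition(".")
        collection, sep, volume_id = stem.rpartition("_")
        groups.setdefault(collection if sep else "", set()).add(int(volume_id))
    return {collection: sorted(ids) for collection, ids in groups.items()}


if __name__ == "__main__":
    print(volume_file_names(7, "picture"))
    listing = ["3.dat", "3.idx", "4.dat", "4.idx", "picture_7.dat", "picture_7.idx"]
    print(group_by_collection(listing))
```

Feeding it os.listdir("v1") gives a quick overview of which volume ids belong to which collection on a server.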

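However, weed volume defaults to -max=7, so in this test environment each volume server creates at most 7 physical volumes (seven .idx/.dat pairs) under -dir. What happens if I now try to create a video collection?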

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"error":"Cannot grow volume group! Not enough data node found!"}

The request fails with a message saying that no more volumes can be grown. At this point you need to restart each volume server with a larger -max value, such as 100.

For example:

$ weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2 -max=100
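The failure follows from simple slot arithmetic. The sketch below is a simplified model, not the master's real placement algorithm: it assumes that under policy 100 a new volume needs a free slot in each of two different data centers, and it takes the per-server volume counts from the ls v1 v2 v3 listing after the picture upload (4 on v1, 3 on v2, 7 on v3).

```python
def free_slots(max_per_server, volumes_on_servers):
    """Free volume slots in one data center: sum of (max - used) per server."""
    return sum(max(max_per_server - used, 0) for used in volumes_on_servers)


def can_grow_across_dcs(free_by_dc, copies=2):
    """Simplified rule for policy 100: `copies` data centers need a free slot."""
    return sum(1 for free in free_by_dc.values() if free > 0) >= copies


# dc1 holds v1 and v2, dc2 holds v3 (counts read off the listing above).
before = {"dc1": free_slots(7, [4, 3]), "dc2": free_slots(7, [7])}
after = {"dc1": free_slots(100, [4, 3]), "dc2": free_slots(100, [7])}

if __name__ == "__main__":
    print(before, can_grow_across_dcs(before))
    print(after, can_grow_across_dcs(after))
```

With -max=7, dc2 has no free slot left, so no volume pair can be placed even though dc1 still has room; raising -max frees slots in both data centers.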

After the restart, create the video collection again:

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"fid":"11,0ee98ca54d","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/11,0ee98ca54d","size":39}

$ ls v1 v2 v3
v1:
3.dat 4.dat 5.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_9.dat
3.idx 4.idx 5.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_9.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx video_8.dat video_8.idx

v3:
1.dat 2.dat 3.dat 4.dat 5.dat 6.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_8.dat video_9.dat
1.idx 2.idx 3.idx 4.idx 5.idx 6.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_8.idx video_9.idx

As you can see, six volumes were allocated at once in each data center as storage for the video collection.

VII. Scaling

For a distributed system, scaling is a question that cannot be avoided, and it is also a very common operation.

1. Scaling up

weed-fs supports scaling up very well. Let's go through it role by role.

[master]
Masters use the raft protocol among themselves, so adding a master is one of the most basic cluster operations:

$ weed -v=3 master -port=9336 -mdir=./m4 -peers=localhost:9333,localhost:9334,localhost:9335,localhost:9336 -defaultReplication=100
I0821 15:45:47 12398 file_util.go:20] Folder ./m4 Permission: -rwxrwxr-x
I0821 15:45:47 12398 topology.go:86] Using default configurations.
I0821 15:45:47 12398 master_server.go:59] Volume Size Limit is 30000 MB
I0821 15:45:47 12398 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9336
I0821 15:45:47 12398 raft_server.go:50] Starting RaftServer with IP:localhost:9336:
I0821 15:45:47 12398 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335,localhost:9336
I0821 15:45:48 12398 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0821 15:45:49 12398 raft_server.go:179] Post returned status: 200

Once started, the new master node automatically joins, via raft, the master cluster whose leader is 9333.

[volume]

As with masters, adding a volume server is simple. Volume servers are managed by the master and have no direct relationship with one another, so adding one just means starting a new volume server:

$ weed -v=3 volume -port=8084 -dir=./v4 -mserver=localhost:9335 -dataCenter=dc2
I0821 15:48:21 12412 file_util.go:20] Folder ./v4 Permission: -rwxrwxr-x
I0821 15:48:21 12412 store.go:225] Store started on dir: ./v4 with 0 volumes max 7
I0821 15:48:21 12412 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8084
I0821 15:48:21 12412 volume_server.go:70] Volume server bootstraps with master localhost:9335
I0821 15:48:22 12412 list_masters.go:18] list masters result :
I0821 15:48:22 12412 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335","localhost:9336"]}
I0821 15:48:22 12412 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9336 localhost:9333 localhost:9333], lastNode:4
I0821 15:48:22 12412 volume_server.go:82] Volume Server Connected with master at localhost:9333

Once started, the new volume server node also joins the cluster automatically, and the master will start placing data on it.

[filer]

As discussed above, filers can be added or removed freely in either standalone or distributed mode, so there is nothing more to add here.

2. Scaling down

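Scaling down masters is extremely simple: just shut down the node in question. If that master is the leader, the other masters detect the leader's shutdown and automatically elect a new one. While the election is in progress, however, the whole cluster briefly stops serving until a leader has been chosen.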
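How many masters can be shut down before an election becomes impossible follows from Raft's majority rule. The helpers below are a general sketch of that arithmetic, not weed-fs code:

```python
def quorum(masters):
    """Smallest majority of a cluster with `masters` voting members."""
    return masters // 2 + 1


def tolerated_failures(masters):
    """How many masters can be down while a majority is still reachable."""
    return masters - quorum(masters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum(n), tolerated_failures(n))
```

Going from three masters to four, as in the scale-up example, raises the quorum from 2 to 3 but still tolerates only one failure; a fifth master is needed to survive two.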

For filers in standalone mode, scaling is meaningless. In distributed mode, filer nodes are scaled down the same way as master nodes: just shut them down.

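The only troublesome case is volume nodes. Because the data lives on volume nodes, we cannot simply stop one; we have to consider whether, and how, data can be migrated under each replication policy. That is what the next section covers in detail.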

VIII. Data Migration

Let's now look at migrating volume data in weed-fs.

1. Data migration under the 000 replication policy

To make testing easier, I simplified the test environment to one master and three volume servers:

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=000

volume:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9333 -dataCenter=dc1

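As before, the v1, v2 and v3 directories are empty after startup; volumes are only created when the first piece of data is stored. Policy 000 means no replicas: each file you store exists exactly once in weed-fs.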

Upload a file:

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"1,01655ab58e","fileName":"hello1.txt","fileUrl":"127.0.0.1:8081/1,01655ab58e","size":40}

$ ll v1 v2 v3

v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:31 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:31 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 7.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 7.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 2.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:31 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:31 5.idx

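hello1.txt was stored under v1, and each physical volume lives on exactly one node, with different volumes spread across different nodes (since no replication is needed).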

In this case (000), to migrate v1's data to v2 or v3, simply stop volume server 1, mv the files under v1 into v2 or v3, and restart volume server 2 or volume server 3.

2. Data migration under the 001 replication policy

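Under the 001 replication policy, weed-fs keeps one additional replica of each file on another server in the same rack. We reuse the environment above, but weed-fs has to be stopped, the files in the directories cleared, and everything restarted, this time with -defaultReplication=001.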

Store three files in a row:

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"2,01ea84980d","fileName":"hello1.txt","fileUrl":"127.0.0.1:8082/2,01ea84980d","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"1,027883baa8","fileName":"hello2.txt","fileUrl":"127.0.0.1:8083/1,027883baa8","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"6,03220f577e","fileName":"hello3.txt","fileUrl":"127.0.0.1:8081/6,03220f577e","size":40}

The three files were stored in volumes 2, 1 and 6 respectively. Here is what v1, v2 and v3 contain:

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx

v2:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:56 5.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 5.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx

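Suppose we now want to shut down v3 and migrate its data to the other volume servers. There are three options:

1) Do not migrate
2) mv all files under v3 into v2 or v1
3) Copy all files under v3 over both v1 and v2 in turn

Let's analyze the consequences of each.

1) Do not migrate

Under policy 001 every piece of data has two copies, so anything in v3 always has its other copy on v1 or v2. Even without migration, v1 and v2 together still hold one copy of all the data. You can test this after shutting down volume3: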

$ curl -L "http://localhost:9333/2,01ea84980d"
hello weed-fs1!
$ curl -L "http://localhost:9333/1,027883baa8"
hello weed-fs2!
$ curl -L "http://localhost:9333/6,03220f577e"
hello weed-fs3!

You can get each file several times and the result is correct every time. The drawback is equally obvious: the existing data no longer has a second copy.

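2) mv all files under v3 into v2 or v1

Still under policy 001, what happens if v3's data is moved into v2 or v1? Take moving v3 into v1 as an example:

- For volume ids present in both v1 and v3, such as 1, the two sets of files (1.idx and 1.dat) are identical, as policy 001 dictates. After the move, however, the system holds one copy of that data instead of two.
- For volumes that v1 has but v3 does not, nothing changes.
- For volumes that v3 has but v1 does not, the moved files simply become v1's data.

So this approach is still not ideal.

3) Copy all files under v3 over v1 and v2

Building on the analysis above, only this approach ensures that no data is lost after the migration and that every volume still has the two copies policy 001 calls for. This is the correct method.

Let's test it:

- Stop volume3.
- Stop volume1, copy the files under v3 into v1, and start volume1.
- Stop volume2, copy the files under v3 into v2, and start volume2.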

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8081/6,03220f577e">Moved Permanently</a>.

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8082/6,03220f577e">Moved Permanently</a>.

The master returned redirects to both 8081 and 8082, which shows that the data migrated from 8083 to 8082 is being served as well.
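The copy step of option 3 can be scripted. The following is only a sketch: copy_volume_files is a hypothetical helper, it overwrites files of the same name in the destination, and it assumes the affected volume servers are stopped while it runs, as in the steps above. demo() exercises it on temporary directories rather than real volume data.

```python
import os
import shutil
import tempfile


def copy_volume_files(src_dir, dst_dirs):
    """Copy every .dat/.idx file from src_dir into each destination directory."""
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith((".dat", ".idx")):
            continue  # ignore anything that is not a volume file
        for dst in dst_dirs:
            shutil.copy2(os.path.join(src_dir, name), os.path.join(dst, name))
            copied.append((name, dst))
    return copied


def demo():
    with tempfile.TemporaryDirectory() as root:
        dirs = {name: os.path.join(root, name) for name in ("v1", "v2", "v3")}
        for path in dirs.values():
            os.mkdir(path)
        # Pretend volume 2 currently lives only on v3.
        for file_name in ("2.dat", "2.idx"):
            with open(os.path.join(dirs["v3"], file_name), "wb") as f:
                f.write(b"volume 2")
        copy_volume_files(dirs["v3"], [dirs["v1"], dirs["v2"]])
        return sorted(os.listdir(dirs["v1"])), sorted(os.listdir(dirs["v2"]))


if __name__ == "__main__":
    print(demo())
```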

3. Data migration under the 100 replication policy

The test environment changes slightly:

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=100

volume:

As before, upload three files:

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"4,01d937dd30","fileName":"hello1.txt","fileUrl":"127.0.0.1:8083/4,01d937dd30","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"2,025efbef14","fileName":"hello2.txt","fileUrl":"127.0.0.1:8082/2,025efbef14","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"2,03be936488","fileName":"hello3.txt","fileUrl":"127.0.0.1:8082/2,03be936488","size":40}

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

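Because policy 100 keeps one copy in each of two different data centers, data should not be migrated across data centers, and migration within a single data center reduces to the "000" case.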

Other policies can be analyzed the same way, so I won't go through them at length.
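To reason about other policies, it helps to decode the three digits first. The sketch below follows the digit meanings given in the SeaweedFS documentation (replicas in other data centers, in other racks of the same data center, and on other servers in the same rack). Only 000, 001 and 100 are exercised in this article, and parse_replication is a hypothetical helper.

```python
def parse_replication(code):
    """Decode a replication string such as '100' into replica counts."""
    if len(code) != 3 or any(ch not in "0123456789" for ch in code):
        raise ValueError(f"expected three digits, got {code!r}")
    other_dcs, other_racks, same_rack = (int(ch) for ch in code)
    return {
        "other_data_centers": other_dcs,
        "other_racks": other_racks,
        "same_rack": same_rack,
        # The original copy plus every requested replica.
        "total_copies": 1 + other_dcs + other_racks + same_rack,
    }


if __name__ == "__main__":
    for policy in ("000", "001", "100"):
        print(policy, parse_replication(policy))
```

A policy with total_copies of 1 behaves like the 000 case for migration purposes; anything higher means some other node already holds a copy that must not be collapsed onto the same server.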

IX. Benchmark

The benchmark was run on an HP ProLiant DL380 G4 with a 4-core Intel(R) Xeon(TM) CPU at 3.60GHz, 6 GB of RAM and non-SSD disks:

$ weed benchmark -server=localhost:9333

This is SeaweedFS version 0.70 beta linux amd64

------------ Writing Benchmark ----------
Concurrency Level:      16
Time taken for tests:   831.583 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106794545 bytes
Requests per second:    1260.94 [#/sec]
Transfer rate:          1299.75 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        2.2      12.5       1118.4   9.3

Percentage of the requests served within a certain time (ms)
   50%     11.4 ms
   66%     13.3 ms
   75%     14.8 ms
   80%     15.9 ms
   90%     19.2 ms
   95%     22.6 ms
   98%     27.4 ms
   99%     31.2 ms
100%    1118.4 ms

------------ Randomly Reading Benchmark ----------
Concurrency Level:      16
Time taken for tests:   151.480 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106791113 bytes
Requests per second:    6922.22 [#/sec]
Transfer rate:          7135.28 [Kbytes/sec]

Connection Times (ms)
              min      avg        max      std
Total:        0.1      2.2        116.7    3.9

Percentage of the requests served within a certain time (ms)
   50%      1.6 ms
   66%      2.1 ms
   75%      2.5 ms
   80%      2.8 ms
   90%      3.7 ms
   95%      4.8 ms
   98%      7.4 ms
   99%     11.1 ms
100%    116.7 ms

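This appears to be somewhat slower than what the weed-fs author measured on a Mac laptop with an SSD. Admittedly we used policy 100 here, and the server was running other programs at the same time. Even so, it feels like weed-fs still has considerable room for optimization.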
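The headline figures in the benchmark output can be re-derived from the raw counters, which is a useful sanity check when comparing runs. The sketch below uses only the numbers printed above; throughput is a hypothetical helper, and small differences from the printed values come from the rounded elapsed time.

```python
def throughput(requests, seconds, total_bytes):
    """Derive per-second and per-request figures from the raw counters."""
    return {
        "requests_per_second": requests / seconds,
        "kbytes_per_second": total_bytes / seconds / 1024,
        "avg_bytes_per_request": total_bytes / requests,
    }


# Counters copied from the benchmark output above.
write = throughput(1048576, 831.583, 1106794545)
read = throughput(1048576, 151.480, 1106791113)

if __name__ == "__main__":
    print(write)
    print(read)
    print(read["requests_per_second"] / write["requests_per_second"])
```

The derived values land on roughly 1261 writes per second and 6922 reads per second, so random reads ran about five and a half times faster than writes in this test.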

On the project site, the author gives a brief comparison of weed-fs with other distributed file systems such as Ceph and HDFS, emphasizing the advantages of weed-fs.

X. Miscellaneous

weed-fs uses Google's glog, so log levels and log redirection are configured exactly as in glog.

weed-fs provides a backup command for backing up a volume server's data on the same machine.

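weed-fs does not ship an official client package, but its wiki lists a number of third-party clients in various languages. Judging by the Go clients, none seems particularly satisfactory yet.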

weed-fs does not yet have a web console; it can only be operated from the command line.

When using weed-fs, don't forget to raise the open files limit; otherwise the volume server may crash.
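On Unix-like systems the current limit can be inspected from a script before starting the servers. The sketch below uses Python's standard resource module; the threshold of 10240 is an arbitrary illustrative value, not a weed-fs recommendation.

```python
import resource


def open_files_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def needs_raise(soft, wanted=10240):
    """True if the soft limit is finite and below the wanted value (assumed)."""
    return soft != resource.RLIM_INFINITY and soft < wanted


if __name__ == "__main__":
    soft, hard = open_files_limit()
    print(soft, hard, needs_raise(soft))
```

If needs_raise reports True, the limit is typically raised with ulimit -n in the shell that launches weed, or through the system's limits configuration.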

XI. Summary

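weed-fs offers a new option to anyone looking for an open-source distributed file system. It is especially suited to storing large numbers of small images, since weed-fs itself is based on the Haystack paper on optimized photo storage. It is also genuinely simple to use: a distributed system can be up in minutes, deployment is easy, and almost no configuration is needed. The biggest problem with weed-fs at present seems to be the lack of heavyweight production use cases, and it still has quite a few shortcomings of its own. Still, I hope this article helps more people get to know weed-fs, use it, and help improve it.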