标签 Linux 下的文章

使用Ceph RBD为Kubernetes集群提供存储卷

一旦走上使用Kubernetes的道路,你就会发现这条路并不好走,充满荆棘。即便你使用Kubernetes建立起的集群规模不大,也是需要“五脏俱全”的,否则你根本无法真正将kubernetes用起来,或者说一个半拉子Kubernetes集群很可能无法满足你要支撑的业务需求。在目前我正在从事的一个产品就是这样,光有K8s还不够,考虑到”有状态服务”的需求,我们还需要给Kubernetes配一个后端存储以支持Persistent Volume机制,使得Pod在k8s的不同节点间调度迁移时,具有持久化需求的数据不会被清除,且Pod中Container无论被调度到哪个节点,始终都能挂载到同一个Volume。

Kubernetes支持多种Volume类型,这里选择Ceph RBD(Rados Block Device)。选择Ceph大致有三个原因:

  • Ceph经过多年开发,已经逐渐步入成熟;
  • Ceph在Ubuntu 14.04.x上安装方便(仅通过apt-get即可),并且在未经任何调优(调优需要你对Ceph背后的原理十分熟悉)的情况下,性能可以基本满足我们需求;
  • Ceph同时支持对象存储、块存储和文件系统接口,虽然这里我们可能仅需要块存储。

即便这样,Ceph与K8s的集成过程依旧少不了“趟坑”,接下来我们就详细道来。

一、环境和准备条件

我们依然使用两个阿里云ECS Node,操作系统以及内核版本为:Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)。

Ceph采用当前Ubuntu 14.04源中最新的Ceph LTS版本:JEWEL10.2.3

Kubernetes版本为上次安装时的1.3.7版本。

二、Ceph安装原理

Ceph分布式存储集群由若干组件组成,包括:Ceph Monitor、Ceph OSD和Ceph MDS,其中如果你仅使用对象存储和块存储时,MDS不是必须的(本次我们也不需要安装MDS),仅当你要用到Cephfs时,MDS才是需要安装的。

Ceph的安装模型与k8s有些类似,也是通过一个deploy node远程操作其他Node以create、prepare和activate各个Node上的Ceph组件,官方手册中给出的示意图如下:

img{}

映射到我们实际的环境中,我的安装设计是这样的:

admin-node, deploy-node(ceph-deploy):10.47.136.60  iZ25cn4xxnvZ
mon.node1,(mds.node1): 10.47.136.60  iZ25cn4xxnvZ
osd.0: 10.47.136.60   iZ25cn4xxnvZ
osd.1: 10.46.181.146 iZ25mjza4msZ

实际上就是两个Aliyun ECS节点承担以上多种角色。不过像iZ25cn4xxnvZ这样的host name太反人类,长远考虑还是换成node1、node2这样的简单名字更好。通过编辑各个ECS上的/etc/hostname, /etc/hosts,我们将iZ25cn4xxnvZ换成node1,将iZ25mjza4msZ换成node2:

10.47.136.60 (node1):

# cat /etc/hostname
node1

# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1    localhost.localdomain    localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.47.136.60 admin
10.47.136.60 node1
10.47.136.60 iZ25cn4xxnvZ
10.46.181.146 node2

----------------------------------
10.46.181.146 (node2):

# cat /etc/hostname
node2

# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1    localhost.localdomain    localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.46.181.146  node2
10.46.181.146  iZ25mjza4msZ
10.47.136.60  node1

于是上面的环境设计就变成了:

admin-node, deploy-node(ceph-deploy):node1 10.47.136.60
mon.node1, (mds.node1) : node1  10.47.136.60
osd.0:  node1 10.47.136.60
osd.1:  node2 10.46.181.146

三、Ceph安装步骤

1、安装ceph-deploy

Ceph提供了一键式安装工具ceph-deploy来协助Ceph集群的安装,在deploy node上,我们首先要来安装的就是ceph-deploy,Ubuntu 14.04官方源中的ceph-deploy是1.4.0版本,比较old,我们需要添加Ceph源,安装最新的ceph-deploy:

# wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
OK

# echo deb https://download.ceph.com/debian-jewel/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
deb https://download.ceph.com/debian-jewel/ trusty main

#apt-get update
... ...

# apt-get install ceph-deploy
Reading package lists... Done
Building dependency tree
Reading state information... Done
.... ...
The following NEW packages will be installed:
  ceph-deploy
0 upgraded, 1 newly installed, 0 to remove and 105 not upgraded.
Need to get 96.4 kB of archives.
After this operation, 622 kB of additional disk space will be used.
Get:1 https://download.ceph.com/debian-jewel/ trusty/main ceph-deploy all 1.5.35 [96.4 kB]
Fetched 96.4 kB in 1s (53.2 kB/s)
Selecting previously unselected package ceph-deploy.
(Reading database ... 153022 files and directories currently installed.)
Preparing to unpack .../ceph-deploy_1.5.35_all.deb ...
Unpacking ceph-deploy (1.5.35) ...
Setting up ceph-deploy (1.5.35) ...

注意:ceph-deploy只需要在admin/deploy node上安装即可。

2、前置设置

和安装k8s一样,在ceph-deploy真正执行安装之前,需要确保所有Ceph node都要开启NTP,同时建议在每个node节点上为安装过程创建一个安装账号,即ceph-deploy在ssh登录到每个Node时所用的账号。这个账号有两个约束:

  • 具有sudo权限;
  • 执行sudo命令时,无需输入密码。

我们将这一账号命名为cephd,我们需要在每个ceph node上(包括admin node/deploy node)都建立一个cephd用户,并加入到sudo组中。

以下命令在每个Node上都要执行:

useradd -d /home/cephd -m cephd
passwd cephd

添加sudo权限:
echo "cephd ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephd
sudo chmod 0440 /etc/sudoers.d/cephd

在admin node(deploy node)上,登入cephd账号,创建该账号下deploy node到其他各个Node的ssh免密登录设置,密码留空:

在deploy node上执行:

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cephd/.ssh/id_rsa):
Created directory '/home/cephd/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cephd/.ssh/id_rsa.
Your public key has been saved in /home/cephd/.ssh/id_rsa.pub.
The key fingerprint is:
....

将deploy node的公钥copy到其他节点上去:

$ ssh-copy-id cephd@node1
The authenticity of host 'node1 (10.47.136.60)' can't be established.
ECDSA key fingerprint is d2:69:e2:3a:3e:4c:6b:80:15:30:17:8e:df:3b:62:1f.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
cephd@node1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'cephd@node1'"
and check to make sure that only the key(s) you wanted were added.

同样,执行 ssh-copy-id cephd@node2,完成后,测试一下免密登录。

$ ssh node1
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Welcome to aliyun Elastic Compute Service!

最后,在Deploy node上创建并编辑~/.ssh/config,这是Ceph官方doc推荐的步骤,这样做的目的是可以避免每次执行ceph-deploy时都要去指定 –username {username} 参数。

//~/.ssh/config
Host node1
   Hostname node1
   User cephd
Host node2
   Hostname node2
   User cephd
3、安装ceph

这个环节参考的是Ceph官方doc手工部署一节

如果之前安装过ceph,可以先执行如下命令以获得一个干净的环境:

ceph-deploy purgedata node1 node2
ceph-deploy forgetkeys
ceph-deploy purge node1 node2

接下来我们就可以来全新安装Ceph了。在deploy node上,建立cephinstall目录,然后进入cephinstall目录执行相关步骤。

我们首先来创建一个ceph cluster,这个环节需要通过执行ceph-deploy new {initial-monitor-node(s)}命令。按照上面的安装设计,我们的ceph monitor node就是node1,因此我们执行下面命令来创建一个名为ceph的ceph cluster:

$ ceph-deploy new node1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy new node1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  func                          : <function new at 0x7f71d2051938>
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f71d19f5710>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  ssh_copykey                   : True
[ceph_deploy.cli][INFO  ]  mon                           : ['node1']
[ceph_deploy.cli][INFO  ]  public_network                : None
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  cluster_network               : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  fsid                          : None
[ceph_deploy.new][DEBUG ] Creating new cluster named ceph
[ceph_deploy.new][INFO  ] making sure passwordless SSH succeeds
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /bin/ip link show
[node1][INFO  ] Running command: sudo /bin/ip addr show
[node1][DEBUG ] IP addresses found: [u'101.201.78.51', u'192.168.16.1', u'10.47.136.60', u'172.16.99.0', u'172.16.99.1']
[ceph_deploy.new][DEBUG ] Resolving host node1
[ceph_deploy.new][DEBUG ] Monitor node1 at 10.47.136.60
[ceph_deploy.new][DEBUG ] Monitor initial members are ['node1']
[ceph_deploy.new][DEBUG ] Monitor addrs are ['10.47.136.60']
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...

new命令执行完后,ceph-deploy会在当前目录下创建一些辅助文件:

# ls
ceph.conf  ceph-deploy-ceph.log  ceph.mon.keyring

$ cat ceph.conf
[global]
fsid = f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
mon_initial_members = node1
mon_host = 10.47.136.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

由于我们仅有两个OSD节点,因此我们在进一步安装之前,需要先对ceph.conf文件做一些配置调整:
修改配置以进行后续安装:

在[global]标签下,添加下面一行:
osd pool default size = 2

ceph.conf保存退出。接下来,我们执行下面命令在node1和node2上安装ceph运行所需的各个binary包:

# ceph-deploy install nod1 node2
.... ...
[node2][INFO  ] Running command: sudo ceph --version
[node2][DEBUG ] ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

这一过程ceph-deploy会SSH登录到各个node上去,执行apt-get update, 并install ceph的各种组件包,这个环节耗时可能会长一些(依网络情况不同而不同),请耐心等待。

4、初始化ceph monitor node

有了ceph启动的各个程序后,我们首先来初始化ceph cluster的monitor node。在deploy node的工作目录cephinstall下,执行:

# ceph-deploy mon create-initial

[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy mon create-initial
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : create-initial
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f0f7ea2fe60>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x7f0f7ee93de8>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  keyrings                      : None
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts node1
[ceph_deploy.mon][DEBUG ] detecting platform for host node1...
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
....

[iZ25cn4xxnvZ][INFO  ] Running command: ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.iZ25cn4xxnvZ.asok mon_status
[ceph_deploy.mon][INFO  ] mon.iZ25cn4xxnvZ monitor has reached quorum!
[ceph_deploy.mon][INFO  ] all initial monitors are running and have formed quorum
[ceph_deploy.mon][INFO  ] Running gatherkeys...
[ceph_deploy.gatherkeys][INFO  ] Storing keys in temp directory /tmp/tmpP_SmXX
[iZ25cn4xxnvZ][DEBUG ] connected to host: iZ25cn4xxnvZ
[iZ25cn4xxnvZ][DEBUG ] detect platform information from remote host
[iZ25cn4xxnvZ][DEBUG ] detect machine type
[iZ25cn4xxnvZ][DEBUG ] find the location of an executable
[iZ25cn4xxnvZ][INFO  ] Running command: /sbin/initctl version
[iZ25cn4xxnvZ][DEBUG ] get remote short hostname
[iZ25cn4xxnvZ][DEBUG ] fetch remote file
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.iZ25cn4xxnvZ.asok mon_status
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.admin osd allow * mds allow * mon allow *
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-mds mon allow profile bootstrap-mds
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-osd mon allow profile bootstrap-osd
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-rgw mon allow profile bootstrap-rgw
... ...
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.client.admin.keyring
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-mds.keyring
[ceph_deploy.gatherkeys][INFO  ] keyring 'ceph.mon.keyring' already exists
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-osd.keyring
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-rgw.keyring
[ceph_deploy.gatherkeys][INFO  ] Destroy temp directory /tmp/tmpP_SmXX

这一过程很顺利。命令执行完成后我们能看到一些变化:

在当前目录下,出现了若干*.keyring,这是Ceph组件间进行安全访问时所需要的:

# ls -l
total 216
-rw------- 1 root root     71 Nov  3 17:24 ceph.bootstrap-mds.keyring
-rw------- 1 root root     71 Nov  3 17:25 ceph.bootstrap-osd.keyring
-rw------- 1 root root     71 Nov  3 17:25 ceph.bootstrap-rgw.keyring
-rw------- 1 root root     63 Nov  3 17:24 ceph.client.admin.keyring
-rw-r--r-- 1 root root    242 Nov  3 16:40 ceph.conf
-rw-r--r-- 1 root root 192336 Nov  3 17:25 ceph-deploy-ceph.log
-rw------- 1 root root     73 Nov  3 16:28 ceph.mon.keyring
-rw-r--r-- 1 root root   1645 Oct 16  2015 release.asc

在node1(monitor node)上,我们看到ceph-mon已经运行起来了:

cephd@node1:~/cephinstall$ ps -ef|grep ceph
ceph     32326     1  0 14:19 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i node1 -f --setuser ceph --setgroup ceph

如果要手工停止ceph-mon,可以使用stop ceph-mon-all 命令。

5、prepare ceph OSD node

至此,ceph-mon组件程序已经成功启动了,剩下的只有OSD这一关了。启动OSD node分为两步:prepare 和 activate。OSD node是真正存储数据的节点,我们需要为ceph-osd提供独立存储空间,一般是一个独立的disk。但我们环境不具备这个条件,于是在本地盘上创建了个目录,提供给OSD。

在deploy node上执行:

ssh node1
sudo mkdir /var/local/osd0
exit

ssh node2
sudo mkdir /var/local/osd1
exit

接下来,我们就可以执行prepare操作了,prepare操作会在上述的两个osd0和osd1目录下创建一些后续activate激活以及osd运行时所需要的文件:

cephd@node1:~/cephinstall$ ceph-deploy osd prepare node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy osd prepare node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  disk                          : [('node1', '/var/local/osd0', None), ('node2', '/var/local/osd1', None)]
[ceph_deploy.cli][INFO  ]  dmcrypt                       : False
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  bluestore                     : None
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir               : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f072603e8c0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  fs_type                       : xfs
[ceph_deploy.cli][INFO  ]  func                          : <function osd at 0x7f0726492d70>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  zap_disk                      : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node1:/var/local/osd0: node2:/var/local/osd1:
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to node1
[node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host node1 disk /var/local/osd0 journal None activate False
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- /var/local/osd0
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[node1][WARNIN] populate_data_path: Preparing osd data dir /var/local/osd0
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/ceph_fsid.782.tmp
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/fsid.782.tmp
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/magic.782.tmp
[node1][INFO  ] checking OSD status...
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[
ceph_deploy.osd][DEBUG ] Host node1 is now ready for osd use.
[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
... ...
[node2][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host node2 is now ready for osd use.

prepare并不会启动ceph osd,那是activate的职责。

6、激活ceph OSD node

接下来,我们来激活各个OSD node:

$ ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1

... ...
[node1][WARNIN] got monmap epoch 1
[node1][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /var/local/osd0/activate.monmap --osd-data /var/local/osd0 --osd-journal /var/local/osd0/journal --osd-uuid 6def4f7f-4f37-43a5-8699-5c6ab608c89c --keyring /var/local/osd0/keyring --setuser ceph --setgroup ceph
[node1][WARNIN] Traceback (most recent call last):
[node1][WARNIN]   File "/usr/sbin/ceph-disk", line 9, in <module>
[node1][WARNIN]     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5011, in run
[node1][WARNIN]     main(sys.argv[1:])
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4962, in main
[node1][WARNIN]     args.func(args)
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3324, in main_activate
[node1][WARNIN]     init=args.mark_init,
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3144, in activate_dir
[node1][WARNIN]     (osd_id, cluster) = activate(path, activate_key_template, init)
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3249, in activate
[node1][WARNIN]     keyring=keyring,
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2742, in mkfs
[node1][WARNIN]     '--setgroup', get_ceph_group(),
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2689, in ceph_osd_mkfs
[node1][WARNIN]     raise Error('%s failed : %s' % (str(arguments), error))
[node1][WARNIN] ceph_disk.main.Error: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '0', '--monmap', '/var/local/osd0/activate.monmap', '--osd-data', '/var/local/osd0', '--osd-journal', '/var/local/osd0/journal', '--osd-uuid', '6def4f7f-4f37-43a5-8699-5c6ab608c89c', '--keyring', '/var/local/osd0/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed : 2016-11-04 14:25:40.325009 7fd1aa73f800 -1 filestore(/var/local/osd0) mkfs: write_version_stamp() failed: (13) Permission denied
[node1][WARNIN] 2016-11-04 14:25:40.325032 7fd1aa73f800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13
[node1][WARNIN] 2016-11-04 14:25:40.325075 7fd1aa73f800 -1  ** ERROR: error creating empty object store in /var/local/osd0: (13) Permission denied
[node1][WARNIN]
[node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd0

激活没能成功,在激活第一个节点时,就输出了如上错误日志。日志的error含义很明显:权限问题。

ceph-deploy尝试在osd node1上以ceph:ceph启动ceph-osd,但/var/local/osd0目录的权限情况如下:

$ ls -l /var/local
drwxr-sr-x 2 root staff 4096 Nov  4 14:25 osd0

osd0被root拥有,以ceph用户启动的ceph-osd程序自然没有权限在/var/local/osd0目录下创建文件并写入数据了。这个问题在ceph官方issue中有很多人提出来,也给出了临时修正方法:

将osd0和osd1的权限赋予ceph:ceph:

node1:
sudo chown -R ceph:ceph /var/local/osd0

node2:
sudo chown -R ceph:ceph /var/local/osd1

修改完权限后,我们再来执行activate:

$ ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : activate
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f3c90c678c0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function osd at 0x7f3c910bbd70>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  disk                          : [('node1', '/var/local/osd0', None), ('node2', '/var/local/osd1', None)]
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks node1:/var/local/osd0: node2:/var/local/osd1:
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host node1 disk /var/local/osd0
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd0
[node1][WARNIN] main_activate: path = /var/local/osd0
[node1][WARNIN] activate: Cluster uuid is f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node1][WARNIN] activate: Cluster name is ceph
[node1][WARNIN] activate: OSD uuid is 6def4f7f-4f37-43a5-8699-5c6ab608c89c
[node1][WARNIN] activate: OSD id is 0
[node1][WARNIN] activate: Initializing OSD...
[node1][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/local/osd0/activate.monmap
[node1][WARNIN] got monmap epoch 1
[node1][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /var/local/osd0/activate.monmap --osd-data /var/local/osd0 --osd-journal /var/local/osd0/journal --osd-uuid 6def4f7f-4f37-43a5-8699-5c6ab608c89c --keyring /var/local/osd0/keyring --setuser ceph --setgroup ceph
[node1][WARNIN] activate: Marking with init system upstart
[node1][WARNIN] activate: Authorizing OSD key...
[node1][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.0 -i /var/local/osd0/keyring osd allow * mon allow profile osd
[node1][WARNIN] added key for osd.0
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/active.4616.tmp
[node1][WARNIN] activate: ceph osd.0 data dir is ready at /var/local/osd0
[node1][WARNIN] activate_dir: Creating symlink /var/lib/ceph/osd/ceph-0 -> /var/local/osd0
[node1][WARNIN] start_daemon: Starting ceph osd.0...
[node1][WARNIN] command_check_call: Running command: /sbin/initctl emit --no-wait -- ceph-osd cluster=ceph id=0
[node1][INFO  ] checking OSD status...
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[node1][WARNIN] there is 1 OSD down
[node1][WARNIN] there is 1 OSD out

[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
[node2][DEBUG ] detect platform information from remote host
[node2][DEBUG ] detect machine type
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /sbin/initctl version
[node2][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host node2 disk /var/local/osd1
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd1
[node2][WARNIN] main_activate: path = /var/local/osd1
[node2][WARNIN] activate: Cluster uuid is f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
[node2][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node2][WARNIN] activate: Cluster name is ceph
[node2][WARNIN] activate: OSD uuid is 4733f683-0376-4708-86a6-818af987ade2
[node2][WARNIN] allocate_osd_id: Allocating OSD id...
[node2][WARNIN] command: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise 4733f683-0376-4708-86a6-818af987ade2
[node2][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd1/whoami.27470.tmp
[node2][WARNIN] activate: OSD id is 1
[node2][WARNIN] activate: Initializing OSD...
[node2][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/local/osd1/activate.monmap
[node2][WARNIN] got monmap epoch 1
[node2][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 1 --monmap /var/local/osd1/activate.monmap --osd-data /var/local/osd1 --osd-journal /var/local/osd1/journal --osd-uuid 4733f683-0376-4708-86a6-818af987ade2 --keyring /var/local/osd1/keyring --setuser ceph --setgroup ceph
[node2][WARNIN] activate: Marking with init system upstart
[node2][WARNIN] activate: Authorizing OSD key...
[node2][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.1 -i /var/local/osd1/keyring osd allow * mon allow profile osd
[node2][WARNIN] added key for osd.1
[node2][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd1/active.27470.tmp
[node2][WARNIN] activate: ceph osd.1 data dir is ready at /var/local/osd1
[node2][WARNIN] activate_dir: Creating symlink /var/lib/ceph/osd/ceph-1 -> /var/local/osd1
[node2][WARNIN] start_daemon: Starting ceph osd.1...
[node2][WARNIN] command_check_call: Running command: /sbin/initctl emit --no-wait -- ceph-osd cluster=ceph id=1
[node2][INFO  ] checking OSD status...
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json

没有错误报出!但OSD真的运行起来了吗?我们还需要再确认一下。

我们先通过ceph admin命令将各个.keyring同步到各个Node上,以便可以在各个Node上使用ceph命令连接到monitor:

注意:执行ceph admin前,需要在deploy-node的/etc/hosts中添加:

10.47.136.60 admin

执行ceph admin:

$ ceph-deploy admin admin node1 node2
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy admin admin node1 node2
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f072ee3b758>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  client                        : ['admin', 'node1', 'node2']
[ceph_deploy.cli][INFO  ]  func                          : <function admin at 0x7f072f6cf5f0>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to admin
[admin][DEBUG ] connection detected need for sudo
[admin][DEBUG ] connected to host: admin
[admin][DEBUG ] detect platform information from remote host
[admin][DEBUG ] detect machine type
[admin][DEBUG ] find the location of an executable
[admin][INFO  ] Running command: sudo /sbin/initctl version
[admin][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node1
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node2
[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
[node2][DEBUG ] detect platform information from remote host
[node2][DEBUG ] detect machine type
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /sbin/initctl version
[node2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf

$sudo chmod +r /etc/ceph/ceph.client.admin.keyring

接下来,查看一下ceph集群中的OSD节点状态:

$ ceph osd tree
ID WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.07660 root default
-2 0.03830     host node1
 0 0.03830         osd.0            down        0          1.00000
-3 0.03830     host iZ25mjza4msZ
 1 0.03830         osd.1            down        0          1.00000

果不其然,两个osd节点均处于down状态,一个也没有启动起来。问题在哪?

我们来查看一下node1上的日志:/var/log/ceph/ceph-osd.0.log:

016-11-04 15:33:17.088971 7f568d6db800  0 pidfile_write: ignore empty --pid-file
2016-11-04 15:33:17.102052 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) backend generic (magic 0xef53)
2016-11-04 15:33:17.102071 7f568d6db800 -1 filestore(/var/lib/ceph/osd/ceph-0) WARNING: max attr value size (1024) is smaller than osd_max_object_name_len (2048).  Your backend filesystem appears to not support attrs large enough to handle the configured max rados name size.  You may get unexpected ENAMETOOLONG errors on rados operations or buggy behavior
2016-11-04 15:33:17.102410 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-04 15:33:17.102425 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-11-04 15:33:17.102445 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported
2016-11-04 15:33:17.119261 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-11-04 15:33:17.127630 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) limited size xattrs
2016-11-04 15:33:17.128125 7f568d6db800  1 leveldb: Recovering log #38
2016-11-04 15:33:17.136595 7f568d6db800  1 leveldb: Delete type=3 #37
2016-11-04 15:33:17.136656 7f568d6db800  1 leveldb: Delete type=0 #38
2016-11-04 15:33:17.136845 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-11-04 15:33:17.137064 7f568d6db800 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2016-11-04 15:33:17.137068 7f568d6db800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2016-11-04 15:33:17.137897 7f568d6db800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2016-11-04 15:33:17.138243 7f568d6db800  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade
2016-11-04 15:33:17.138453 7f568d6db800 -1 osd.0 0 backend (filestore) is unable to support max object name[space] len
2016-11-04 15:33:17.138481 7f568d6db800 -1 osd.0 0    osd max object name len = 2048
2016-11-04 15:33:17.138485 7f568d6db800 -1 osd.0 0    osd max object namespace len = 256
2016-11-04 15:33:17.138488 7f568d6db800 -1 osd.0 0 (36) File name too long
2016-11-04 15:33:17.138895 7f568d6db800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2016-11-04 15:33:17.140041 7f568d6db800 -1  ** ERROR: osd init failed: (36) File name too long

的确发现了错误日志:

2016-11-04 15:33:17.138481 7f568d6db800 -1 osd.0 0    osd max object name len = 2048
2016-11-04 15:33:17.138485 7f568d6db800 -1 osd.0 0    osd max object namespace len = 256
2016-11-04 15:33:17.138488 7f568d6db800 -1 osd.0 0 (36) File name too long
2016-11-04 15:33:17.138895 7f568d6db800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2016-11-04 15:33:17.140041 7f568d6db800 -1  ** ERROR: osd init failed: (36) File name too long

进一步搜索ceph官方文档,发现在文件系统推荐这个doc中有提到,官方不建议采用ext4文件系统作为ceph的后端文件系统,如果采用,那么对于ext4的filesystem,应该在ceph.conf中添加如下配置:

osd max object name len = 256
osd max object namespace len = 64

由于配置已经分发到个个node上,我们需要到各个Node上同步修改:/etc/ceph/ceph.conf,添加上面两行。然后重新activate osd node,这里不赘述。重新激活后,我们来查看ceph osd状态:

$ ceph osd tree
ID WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.07660 root default
-2 0.03830     host node1
 0 0.03830         osd.0              up  1.00000          1.00000
-3 0.03830     host iZ25mjza4msZ
 1 0.03830         osd.1              up  1.00000          1.00000

 $ceph -s
    cluster f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
     health HEALTH_OK
     monmap e1: 1 mons at {node1=10.47.136.60:6789/0}
            election epoch 3, quorum 0 node1
     osdmap e11: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v29: 64 pgs, 1 pools, 0 bytes data, 0 objects
            37834 MB used, 38412 MB / 80374 MB avail
                  64 active+clean

$ !ps
ps -ef|grep ceph
ceph       17139       1  0 16:20 ?        00:00:00 /usr/bin/ceph-osd --cluster=ceph -i 0 -f --setuser ceph --setgroup ceph

可以看到ceph osd节点上的ceph-osd启动正常,cluster 状态为active+clean,至此,Ceph Cluster集群安装ok(我们暂不需要Ceph MDS组件)。

四、创建一个使用Ceph RBD作为后端Volume的Pod

在这一节中,我们就要将Ceph RBD与Kubenetes做集成了。Kubernetes的官方源码的examples/volumes/rbd目录下,就有一个使用cephrbd作为kubernetes pod volume的例子,我们试着将其跑起来。

例子提供了两个pod描述文件:rbd.json和rbd-with-secret.json。由于我们在ceph install时在ceph.conf中使用默认的安全验证协议cephx – The Ceph authentication protocol了:

auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

因此我们将采用rbd-with-secret.json这个pod描述文件来创建例子中的Pod,限于篇幅,这里仅节选json文件中的volumes部分:

//例子中的rbd-with-secret.json

{
    ... ...
        "volumes": [
            {
                "name": "rbdpd",
                "rbd": {
                    "monitors": [
                           "10.16.154.78:6789",
                           "10.16.154.82:6789",
                           "10.16.154.83:6789"
                                 ],
                    "pool": "kube",
                    "image": "foo",
                    "user": "admin",
                    "secretRef": {
                           "name": "ceph-secret"
                                         },
                    "fsType": "ext4",
                    "readOnly": true
                }
            }
        ]
    }
}

volumes部分是和ceph rbd紧密相关的一些信息,各个字段的大致含义如下:

name:volume名字,这个没什么可说的,顾名思义即可。
rbd.monitors:前面提到过ceph集群的monitor组件,这里填写monitor组件的通信信息,集群里有几个monitor就填几个;
rbd.pool:Ceph中的pool记号,它用来给ceph中存储的对象进行逻辑分区用的。默认的pool是”rbd”;
rbd.image:Ceph磁盘块设备映像文件
rbd.user:ceph client访问ceph storage cluster所使用的用户名。ceph有自己的一套user管理系统,user的写法通常是TYPE.ID,比如client.admin(是不是想到对应的文件:ceph.client.admin.keyring)。client是一种type,而admin则是user。一般来说,Type基本都是client。
secret.Ref:引用的k8s secret对象名称。

上面的字段中,有两个字段值我们无法提供:rbd.image和secret.Ref,现在我们就来“填空”。我们在root用户下建立k8s-cephrbd工作目录,我们首先需要使用ceph提供的rbd工具创建Pod要用到image:

# rbd create foo -s 1024

# rbd list
foo

我们在rbd pool中(在上述命令中未指定pool name,默认image建立在rbd pool中)创建一个大小为1024Mi的ceph image foo,rbd list命令的输出告诉我们foo image创建成功。接下来,我们尝试将foo image映射到内核,并格式化该image:

root@node1:~# rbd map foo
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address

map操作报错。不过从错误提示信息,我们能找到一些蛛丝马迹:“RBD image feature set mismatch”。ceph新版中在map image时,给image默认加上了许多feature,通过rbd info可以查看到:

# rbd info foo
rbd image 'foo':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.10612ae8944a
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

可以看到foo image拥有: layering, exclusive-lock, object-map, fast-diff, deep-flatten。不过遗憾的是我的Ubuntu 14.04的3.19内核仅支持其中的layering feature,其他feature概不支持。我们需要手动disable这些features:

# rbd feature disable foo exclusive-lock, object-map, fast-diff, deep-flatten
root@node1:/var/log/ceph# rbd info foo
rbd image 'foo':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.10612ae8944a
    format: 2
    features: layering
    flags:

不过每次这么来disable可是十分麻烦的,一劳永逸的方法是在各个cluster node的/etc/ceph/ceph.conf中加上这样一行配置:

rbd_default_features = 1 #仅是layering对应的bit码所对应的整数值

设置完后,通过下面命令查看配置变化:

# ceph --show-config|grep rbd|grep features
rbd_default_features = 1

关于image features的这个问题,zphj1987的这篇文章中有较为详细的讲解。

我们再来map一下foo这个image:

# rbd map foo
/dev/rbd0

# ls -l /dev/rbd0
brw-rw---- 1 root disk 251, 0 Nov  5 10:33 /dev/rbd0

map后,我们就可以像格式化一个空image那样对其进行格式化了,这里格成ext4文件系统(格式化这一步大可不必,在后续小节中你会看到):

# mkfs.ext4 /dev/rbd0
mke2fs 1.42.9 (4-Feb-2014)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

接下来我们来创建ceph-secret这个k8s secret对象,这个secret对象用于k8s volume插件访问ceph集群:

获取client.admin的keyring值,并用base64编码:

# ceph auth get-key client.admin
AQBiKBxYuPXiJRAAsupnTBsURoWzb0k00oM3iQ==

# echo "AQBiKBxYuPXiJRAAsupnTBsURoWzb0k00oM3iQ=="|base64
QVFCaUtCeFl1UFhpSlJBQXN1cG5UQnNVUm9XemIwazAwb00zaVE9PQo=

在k8s-cephrbd下建立ceph-secret.yaml文件,data下的key字段值即为上面得到的编码值:

//ceph-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCaUtCeFl1UFhpSlJBQXN1cG5UQnNVUm9XemIwazAwb00zaVE9PQo=

创建ceph-secret:

# kubectl create -f ceph-secret.yaml
secret "ceph-secret" created

# kubectl get secret
NAME                  TYPE                                  DATA      AGE
ceph-secret           Opaque                                1         16s

至此,我们的rbd-with-secret.json全貌如下:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "rbd2"
    },
    "spec": {
        "containers": [
            {
                "name": "rbd-rw",
                "image": "kubernetes/pause",
                "volumeMounts": [
                    {
                        "mountPath": "/mnt/rbd",
                        "name": "rbdpd"
                    }
                ]
            }
        ],
        "volumes": [
            {
                "name": "rbdpd",
                "rbd": {
                    "monitors": [
                        "10.47.136.60:6789"
                                 ],
                    "pool": "rbd",
                    "image": "foo",
                    "user": "admin",
                    "secretRef": {
                        "name": "ceph-secret"
                        },
                    "fsType": "ext4",
                    "readOnly": true
                }
            }
        ]
    }
}

基于该Pod描述文件,创建使用cephrbd作为后端存储的pod:

# kubectl create -f rbd-with-secret.json
pod "rbd2" created

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
rbd2                        1/1       Running   0          16s

# rbd showmapped
id pool image snap device
0  rbd  foo   -    /dev/rbd0

# mount
... ...
/dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-foo type ext4 (rw)

在我的环境中,pod实际被调度到了另外一个k8s node上运行了:

pod被调度到另外一个 node2 上:

# docker ps
CONTAINER ID        IMAGE                                          COMMAND                  CREATED             STATUS              PORTS                    NAMES
32f92243f911        kubernetes/pause                               "/pause"                 2 minutes ago       Up 2 minutes                                 k8s_rbd-rw.c1dc309e_rbd2_default_6b6541b9-a306-11e6-ba01-00163e1625a9_a6bb1b20

#docker inspect 32f92243f911
... ...
"Mounts": [
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/volumes/kubernetes.io~secret/default-token-40z0x",
                "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/etc-hosts",
                "Destination": "/etc/hosts",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/containers/rbd-rw/a6bb1b20",
                "Destination": "/dev/termination-log",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/volumes/kubernetes.io~rbd/rbdpd",
                "Destination": "/mnt/rbd",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
... ...

五、Kubernetes Persistent Volume和Persistent Volume Claim

上面一小节讲解了Kubernetes volume与Ceph RBD的结合,但是k8s volume还不能完全满足实际生产过程对持久化存储的需求,因为k8s volume的lifetime和pod的生命周期相同,一旦pod被delete,那么volume中的数据就不复存在了。于是k8s又推出了Persistent Volume(PV)和Persistent Volume Claim(PVC)组合,故名思意:即便挂载其的pod被delete了,PV依旧存在,PV上的数据依旧存在。

由于有了之前的“铺垫”,这里仅仅给出使用PV和PVC的步骤:

1、创建disk image

$ rbd create ceph-image -s 128 #考虑后续format快捷,这里只用了128M,仅适用于Demo哦。

# rbd create ceph-image -s 128
# rbd info rbd/ceph-image
rbd image 'ceph-image':
    size 128 MB in 32 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.37202ae8944a
    format: 2
    features: layering
    flags:

如果这里不先创建一个ceph-image,后续Pod启动时,会出现如下的一些错误,比如pod始终处于ContainerCreating状态:

# kubectl get pod
NAME                        READY     STATUS              RESTARTS   AGE
ceph-pod1                   0/1       ContainerCreating   0          13s

如果出现这种错误情况,可以查看/var/log/upstart/kubelet.log,你也许能看到如下错误信息:

I1107 06:02:27.500247   22037 operation_executor.go:768] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/01d049c6-9430-11e6-ba01-00163e1625a9-default-token-40z0x" (spec.Name: "default-token-40z0x") pod "01d049c6-9430-11e6-ba01-00163e1625a9" (UID: "01d049c6-9430-11e6-ba01-00163e1625a9").
I1107 06:03:08.499628   22037 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/rbd/ea848a49-a46b-11e6-ba01-00163e1625a9-ceph-pv" (spec.Name: "ceph-pv") to pod "ea848a49-a46b-11e6-ba01-00163e1625a9" (UID: "ea848a49-a46b-11e6-ba01-00163e1625a9").
E1107 06:03:09.532348   22037 disk_manager.go:56] failed to attach disk
E1107 06:03:09.532402   22037 rbd.go:228] rbd: failed to setup mount /var/lib/kubelet/pods/ea848a49-a46b-11e6-ba01-00163e1625a9/volumes/kubernetes.io~rbd/ceph-pv rbd: map failed exit status 2 rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (2) No such file or directory
2、创建PV

我们直接复用之前创建的ceph-secret对象,PV的描述文件ceph-pv.yaml如下:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  rbd:
    monitors:
      - 10.47.136.60:6789
    pool: rbd
    image: ceph-image
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle

执行创建操作:

# kubectl create -f ceph-pv.yaml
persistentvolume "ceph-pv" created

# kubectl get pv
NAME      CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM     REASON    AGE
ceph-pv   1Gi        RWO           Recycle         Available                       7s
3、创建PVC

pvc是Pod对Pv的请求,将请求做成一种资源,便于管理以及pod复用。我们用到的pvc描述文件ceph-pvc.yaml如下:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

执行创建操作:

# kubectl create -f ceph-pvc.yaml
persistentvolumeclaim "ceph-claim" created

# kubectl get pvc
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
ceph-claim   Bound     ceph-pv   1Gi        RWO           12s

4、创建挂载ceph RBD的pod

pod描述文件ceph-pod1.yaml如下:

apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod1
spec:
  containers:
  - name: ceph-busybox1
    image: busybox
    command: ["sleep", "600000"]
    volumeMounts:
    - name: ceph-vol1
      mountPath: /usr/share/busybox
      readOnly: false
  volumes:
  - name: ceph-vol1
    persistentVolumeClaim:
      claimName: ceph-claim

创建pod操作:

# kubectl create -f ceph-pod1.yaml
pod "ceph-pod1" created

# kubectl get pod
NAME                        READY     STATUS              RESTARTS   AGE
ceph-pod1                   0/1       ContainerCreating   0          13s

Pod还处于ContainerCreating状态。pod的创建,尤其是挂载pv的Pod的创建需要一小段时间,耐心等待一下,我们可以查看一下/var/log/upstart/kubelet.log:

I1107 11:44:38.768541   22037 mount_linux.go:272] `fsck` error fsck from util-linux 2.20.1

fsck.ext2: Bad magic number in super-block while trying to open /dev/rbd1
/dev/rbd1:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

E1107 11:44:38.774080   22037 mount_linux.go:110] Mount failed: exit status 32
Mounting arguments: /dev/rbd1 /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-ceph-image ext4 [defaults]
Output: mount: wrong fs type, bad option, bad superblock on /dev/rbd1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

I1107 11:44:38.839148   22037 mount_linux.go:292] Disk "/dev/rbd1" appears to be unformatted, attempting to format as type: "ext4" with options: [-E lazy_itable_init=0,lazy_journal_init=0 -F /dev/rbd1]
I1107 11:44:39.152689   22037 mount_linux.go:297] Disk successfully formatted (mkfs): ext4 - /dev/rbd1 /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-ceph-image
I1107 11:44:39.220223   22037 operation_executor.go:768] MountVolume.SetUp succeeded for volume "kubernetes.io/rbd/811a57ee-a49c-11e6-ba01-00163e1625a9-ceph-pv" (spec.Name: "ceph-pv") pod "811a57ee-a49c-11e6-ba01-00163e1625a9" (UID: "811a57ee-a49c-11e6-ba01-00163e1625a9").

可以看到,k8s通过fsck发现这个image是一个空image,没有fs在里面,于是默认采用ext4为其格式化,成功后,再行挂载。等待一会后,我们看到ceph-pod1成功run起来了:

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod1                   1/1       Running   0          4m

# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS               NAMES
f50bb8c31b0f        busybox                                               "sleep 600000"           4 hours ago         Up 4 hours                              k8s_ceph-busybox1.c0c0379f_ceph-pod1_default_811a57ee-a49c-11e6-ba01-00163e1625a9_9d910a29

# docker exec 574b8069e548 df -h
Filesystem                Size      Used Available Use% Mounted on
none                     39.2G     20.9G     16.3G  56% /
tmpfs                     1.9G         0      1.9G   0% /dev
tmpfs                     1.9G         0      1.9G   0% /sys/fs/cgroup
/dev/vda1                39.2G     20.9G     16.3G  56% /dev/termination-log
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/resolv.conf
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/hostname
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/rbd1               120.0M      1.5M    109.5M   1% /usr/share/busybox
tmpfs                     1.9G     12.0K      1.9G   0% /var/run/secrets/kubernetes.io/serviceaccount
tmpfs                     1.9G         0      1.9G   0% /proc/kcore
tmpfs                     1.9G         0      1.9G   0% /proc/timer_list
tmpfs                     1.9G         0      1.9G   0% /proc/timer_stats
tmpfs                     1.9G         0      1.9G   0% /proc/sched_debug

六、简单测试

这一节我们要对cephrbd作为k8s PV的效用做一个简单测试。测试步骤:

1) 在container中,向挂载的cephrbd写入数据;
2) 删除ceph-pod1
3) 重新创建ceph-pod1,查看数据是否还存在。

我们首先通过touch 、vi等命令向ceph-pod1挂载的cephrbd volume写入数据:我们通过容器f50bb8c31b0f 创建/usr/share/busybox/hello-ceph.txt,并向文件写入”hello ceph”一行字符串并保存。

# docker exec -it f50bb8c31b0f touch /usr/share/busybox/hello-ceph.txt
# docker exec -it f50bb8c31b0f vi /usr/share/busybox/hello-ceph.txt
# docker exec -it f50bb8c31b0f cat /usr/share/busybox/hello-ceph.txt
hello ceph

接下来删除ceph-pod1:

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod1                   1/1       Running   0          4h

# kubectl delete pod/ceph-pod1
pod "ceph-pod1" deleted

# kubectl get pod
NAME                        READY     STATUS        RESTARTS   AGE
ceph-pod1                   1/1       Terminating   0          4h

# kubectl get pv,pvc
NAME         CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                REASON    AGE
pv/ceph-pv   1Gi        RWO           Recycle         Bound     default/ceph-claim             4h
NAME             STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
pvc/ceph-claim   Bound     ceph-pv   1Gi        RWO           4h

可以看到ceph-pod1的删除需要一段时间,这段时间pod一直处于“ Terminating”状态。同时,我们看到pod的删除并没有影响到pv和pvc object,它们依旧存在。

最后,我们再次来创建一下一个使用同一个pvc的pod,为了避免“不必要”的麻烦,我们建立一个名为ceph-pod2.yaml的描述文件:

apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod2
spec:
  containers:
  - name: ceph-busybox2
    image: busybox
    command: ["sleep", "600000"]
    volumeMounts:
    - name: ceph-vol2
      mountPath: /usr/share/busybox
      readOnly: false
  volumes:
  - name: ceph-vol2
    persistentVolumeClaim:
      claimName: ceph-claim

创建ceph-pod2:

# kubectl create -f ceph-pod2.yaml
pod "ceph-pod2" created

root@node1:~/k8stest/k8s-cephrbd# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod2                   1/1       Running   0          14s

root@node1:~/k8stest/k8s-cephrbd# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS               NAMES
574b8069e548        busybox                                               "sleep 600000"           11 seconds ago      Up 10 seconds                           k8s_ceph-busybox2.c5e637a1_ceph-pod2_default_f4aeebd6-a4c3-11e6-ba01-00163e1625a9_fc94c0fe

查看数据是否依旧存在:

# docker exec -it 574b8069e548 cat /usr/share/busybox/hello-ceph.txt
hello ceph

数据完好无损的被ceph-pod2读取到了!

七、小结

至此,对k8s与ceph的集成仅仅才是一个开端,更多的feature和坑等待挖掘。近期发现文章越写越长,原因么?自己赶脚是因为目标系统越来越大,越来越复杂。深入K8s的过程,就是继续给自己挖坑的过程^_^。

我,不是在填坑的路上,就是在坑里:)。

BTW,列一下参考资料:
1、Ceph官方文档
2、OpenShift中的K8s与Ceph RBD集成的文档;
3、Kubernetes官方文档Persistent volumes部分
4、zphj1987博主的这篇文章

一篇文章带你了解Kubernetes安装

由于之前在阿里云上部署的Docker 1.12.2的Swarm集群没能正常展示出其所宣称的Routing mesh和VIP等功能,为了满足项目需要,我们只能转向另外一种容器集群管理和服务编排工具Kubernetes

注:之前Docker1.12集群的Routing mesh和VIP功能失效的问题,经过在github上与Docker开发人员的沟通,目前已经将问题原因缩小在阿里云的网络上面,目前看是用于承载vxlan数据通信的节点4789 UDP端口不通的问题,针对这个问题,我正在通过阿里云售后工程师做进一步沟通,希望能找出真因。

Kubernetes(以下称k8s)是Google开源的一款容器集群管理工具,是Google内部工具Borg的“开源版”。背靠Google这个高大上的亲爹,k8s一出生就吸引了足够的眼球,并得到了诸多知名IT公司的支持。至于Google开源k8s的初衷,美好的说法是Google希望通过输出自己在容器领域长达10多年的丰富经验,帮助容器领域的开发人员和客户提升开发效率和容器管理的档次。但任何一种公司行为都会有其背后的短期或长期的商业目的,Google作为一个商业公司也不会例外。Google推出k8s到底为啥呢?众说纷纭。一种说法是Google通过k8s输出其容器工具的操作和使用方法、API标准等,为全世界的开发人员使用其公有容器预热并提供“零门槛”体验。

k8s目前是公认的最先进的容器集群管理工具,在1.0版本发布后,k8s的发展速度更加迅猛,并且得到了容器生态圈厂商的全力支持,这包括coreosrancher等,诸多提供公有云服务的厂商在提供容器服务时也都基于k8s做二次开发来提供基础设施层的支撑,比如华为。可以说k8s也是Docker进军容器集群管理和服务编排领域最为强劲的竞争对手。

不过和已经原生集成了集群管理工具swarmkit的Docker相比,k8s在文档、安装和集群管理方面的体验还有很大的提升空间。k8s最新发布的1.4版本就是一个着重在这些方面进行改善的版本。比如1.4版本对于Linux主要发行版本Ubuntu Xenial和Red Hat centos7的用户,可以使用熟悉的apt-get和yum来直接安装Kubernetes。再比如,1.4版本引入了kubeadm命令,将集群启动简化为两条命令,不需要再使用复杂的kube-up脚本。

但对于1.4版本以前的1.3.x版本来说,安装起来的赶脚用最近流行的网络词汇来形容就是“蓝瘦,香菇”,但有些时候我们还不得不去挑战这个过程,本文要带大家了解的就是利用阿里云国内区的ECS主机,在Ubuntu 14.04.4操作系统上安装k8s 1.3.7版本的方法和安装过程。

零、心理建设

由于k8s是Google出品,很多组件与google是“打断了骨头还连着筋”,因此在国内网络中安装k8s是需要先进行心理建设的^_^,因为和文档中宣称的k8s 1.4版的安装或docker 1.12.x的安装相比,k8s 1.3.7版本的安装简直就是“灾难级”的。

要想让这一过程适当顺利一些,我们必须准备一个“加速器(你懂的)”。利用加速器应对三件事:慢、断和无法连接。

  • 慢:国内从github或其他国外公有云上下东西简直太慢了,稍大一些的文件,通常都是几个小时或是10几个小时。
  • 断:你说慢就算了,还总断。断了之后,遇到不支持断点续传的,一切还得重来。动不动就上G的文件,重来的时间成本是我们无法承受的。
  • 无法连接:这个你知道的,很多托管在google名下的东西,你总是无法下载的。

总而言之,k8s的安装和容器集群的搭建过程是一个“漫长”且可能反复的过程,需要做好心理准备。

BTW,我在安装过程使用的 网友noah_昨夜星辰推荐的多态加速器,只需配置一个http_proxy即可,尤其适合服务器后台加速,非常方便,速度也很好。

一、安装模型

k8s的文档不可谓不丰富,尤其在k8s安装这个环节,k8s提供了针对各种云平台、裸机、各类OS甚至各类cluster network model实现的安装文档,你着实得费力挑选一个最适合自己情况的。

由于目前国内阿里云尚未提供Ubuntu 16.04LTS版本虚拟机镜像(通过apt-get install可直接安装最新1.4.x版本k8s),我们只能用ubuntu 14.04.x来安装k8s 1.3.x版本,k8s 1.4版本使用了systemd的相关组件,在ubuntu 14.04.x上手工安装k8s 1.4难度估计将是“地狱级”的。网络模型实现我选择coreos提供的flannel,因此我们需要参考的是由国内浙大团队维护的这份k8s安装文档。浙大的这份安装文档针对的是k8s 1.2+的,从文档评分来看,只是二星半,由此推断,完全按照文档中的步骤安装,成功与否要看运气^_^。注意该文档中提到:文档针对ubuntu 14.04是测试ok的,但由于ubuntu15.xx使用systemd替代upstart了,因此无法保证在ubuntu 15.xx上可以安装成功。

关于k8s的安装过程,网上也有很多资料,多数资料一上来就是下载xxx,配置yyy,install zzz,缺少一个k8s安装的总体视图。与内置编排引擎swarmkit的单一docker engine的安装不同,k8s是由一系列核心组件配合协作共同完成容器集群调度和服务编排功能的,安装k8s实际上就是将不同组件安装到承担不同角色的节点上去。

k8s的节点只有两种角色:master和minion,对比Docker swarm集群,master相当于docker swarm集群中的manager,而minion则相当于docker swarm集群中的worker。

在master节点上运行的k8s核心组件包括:

# ls /opt/bin|grep kube
kube-apiserver
kube-controller-manager
kubelet
kube-proxy
kube-scheduler

在minion节点上,k8s核心组件较少,包括:

# ls /opt/bin|grep kube
kubelet
kube-proxy

k8s的安装模型可以概述为:在安装机上将k8s的各个组件分别部署到不同角色的节点上去(通过ssh远程登录到各节点),并启动起来。用下面这个简易图表达起来可能更加形象:

安装机(放置k8s的安装程序和安装脚本) ----- install k8s core components to(via ssh) ---->  master and minion nodes

在安装之前,这里再明确一下我所用的环境信息:

阿里云ECS: Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)

root@iZ25cn4xxnvZ:~# docker version
Client:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built: Tue Oct 11 17:00:50 2016
OS/Arch: linux/amd64

Server:
Version: 1.12.2
API version: 1.24
Go version: go1.6.3
Git commit: bb80604
Built: Tue Oct 11 17:00:50 2016
OS/Arch: linux/amd64

二、先决条件

根据浙大团队的那篇在Ubuntu上安装k8s的文章,在真正安装k8s组件之前,需要先满足一些先决条件:

1、安装Docker

关于Docker的文档,不得不说,写的还是不错的。Docker到目前为止已经发展了许多年了,其在Ubuntu上的安装已经逐渐成熟了。在其官方文档中有针对ubuntu 12.04、14.04和16.04的详细安装说明。如果你的Ubuntu服务器上docker版本较低,还可以用国内Daocloud提供的一键安装服务来安装最新版的Docker。

2、安装bridge-utils

安装网桥管理工具:

[sudo] apt-get install bridge-utils

安装后,可以测试一下安装是否ok:

root@iZ25cn4xxnvZ:~# brctl show
bridge name    bridge id        STP enabled    interfaces
docker0        8000.0242988b938c    no        veth901efcb
docker_gwbridge        8000.0242bffb02d5    no        veth21546ed
                            veth984b294
3、确保master node可以连接互联网并下载必要的文件

这里要提到的是为master node配置上”加速器”。同时如果master node还承担逻辑上的minion node角色,还需要为节点上Docker配置上加速器(如果加速器是通过代理配置的),minion node上亦是如此,比如:

/etc/default/docker

export http_proxy=http://duotai:xxxxx@sheraton.h.xduotai.com:24448
export https_proxy=$http_proxy

4、在安装机上配置自动免密ssh登录各个master node 和minion node

我在阿里云上开了两个ECS(暂成为node1 – 10.47.136.60和node2 – 10.46.181.146),我的k8s集群就由这两个物理node承载,但在逻辑上node1和node2承担着多种角色,逻辑上这是一个由一个master node和两个minion node组成的k8s集群:

安装机:node1
master node:node1
minion node: node1和node2

因此为了满足安装机到各个k8s node免密ssh登录的先决条件,我需要实现从安装机(node1)到master node(node1)和minion node(node1和node2)的免费ssh登录设置。

在安装机node上执行:

# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
... ...

安装机免密登录逻辑意义上的master node(实际上就是登录自己,即node1):

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

安装机免费登录minion node(node2):

将公钥复制到server:
#scp ~/.ssh/id_rsa.pub root@10.46.181.146:/root/id_rsa.pub
The authenticity of host '10.46.181.146 (10.46.181.146)' can't be established.
ECDSA key fingerprint is b7:31:8d:33:f5:6e:ef:a4:a1:cc:72:5f:cf:68:c6:3d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.46.181.146' (ECDSA) to the list of known hosts.
root@10.46.181.146's password:
id_rsa.pub

在minion node,即node2上,导入安装机的公钥并修改访问权限:

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
root@iZ25mjza4msZ:~# chmod 700 ~/.ssh
root@iZ25mjza4msZ:~#     chmod 600 ~/.ssh/authorized_keys

配置完成,你可以在安装机上测试一下到自身(node1)和到node2的免密登录,以免密登录node2为例:

root@iZ25cn4xxnvZ:~/.ssh# ssh 10.46.181.146
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Welcome to aliyun Elastic Compute Service!

Last login: Thu Oct 13 12:55:21 2016 from 218.25.32.210
5、下载pause-amd64镜像

k8s集群启动后,启动容器时会去下载google的gcr.io/google_containers下的一个pause-amd64镜像,为了避免那时出错时不便于查找,这些先下手为强,先通过“加速器”将该镜像下载到各个k8s node上:

修改/etc/default/docker,添加带有加速器的http_proxy/https_proxy,并增加–insecure-registry gcr.io

# If you need Docker to use an HTTP proxy, it can also be specified here.
export http_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448
export https_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448

# This is also a handy place to tweak where Docker's temporary files go.
#export TMPDIR="/mnt/bigdrive/docker-tmp"
DOCKER_OPTS="$DOCKER_OPTS -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375 --insecure-registry gcr.io"

重启docker daemon服务。下载pause-amd64 image:

root@iZ25cn4xxnvZ:~# docker search gcr.io/google_containers/pause-amd64
NAME                            DESCRIPTION   STARS     OFFICIAL   AUTOMATED
google_containers/pause-amd64                 0
root@iZ25cn4xxnvZ:~# docker pull gcr.io/google_containers/pause-amd64
Using default tag: latest
Pulling repository gcr.io/google_containers/pause-amd64
Tag latest not found in repository gcr.io/google_containers/pause-amd64

latest标签居然都没有,尝试下载3.0标签的pause-amd64:

root@iZ25cn4xxnvZ:~# docker pull gcr.io/google_containers/pause-amd64:3.0
3.0: Pulling from google_containers/pause-amd64
a3ed95caeb02: Pull complete
f11233434377: Pull complete
Digest: sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516
Status: Downloaded newer image for gcr.io/google_containers/pause-amd64:3.0

三、设置工作目录,进行安装前的各种配置

到目前为止,所有node上,包括安装机node上还是“一无所有”的。接下来,我们开始在安装机node上做文章。

俗话说:“巧妇不为无米炊”。安装机想在各个node上安装k8s组件,安装机本身就要有”米”才行,这个米就是k8s源码包或release包中的安装脚本。

在官方文档中,这个获取“米”的步骤为clone k8s的源码库。由于之前就下载了k8s 1.3.7的release包,这里我就直接使用release包中的”米”。

解压kubernetes.tar.gz后,在当前目录下将看到kubernetes目录:

root@iZ25cn4xxnvZ:~/k8stest/1.3.7/kubernetes# ls -F
cluster/  docs/  examples/  federation/  LICENSES  platforms/  README.md  server/  third_party/  Vagrantfile  version

这个kubernetes目录就是我们安装k8s的工作目录。由于我们在ubuntu上安装k8s,因此我们实际上要使用的脚本都在工作目录下的cluster/ubuntu下面,后续有详细说明。

在安装机上,我们最终是要执行这样一句脚本的:

KUBERNETES_PROVIDER=ubuntu ./cluster/kube-up.sh

在provider=ubuntu的情况下,./cluster/kube-up.sh最终会调用到./cluster/ubuntu/util.sh中的kube-up shell函数,kube-up函数则会调用./cluster/ubuntu/download-release.sh下载k8s安装所使用到的所有包,包括k8s的安装包(kubernetes.tar.gz)、etcd和flannel等。由于之前我们已经下载完k8s的1.3.7版本release包了,这里我们就需要对down-release.sh做一些修改,防止重新下载,导致安装时间过长。

./cluster/ubuntu/download-release.sh

    # KUBE_VERSION=$(get_latest_version_number | sed 's/^v//')
    #curl -L https://github.com/kubernetes/kubernetes/releases/download/v${KUBE_VERSION}/kubernetes.tar.gz -o kubernetes.tar.gz

这种情况下,你还需要把已经下载的kubernetes.tar.gz文件copy一份,放到./cluster/ubuntu下面。

如果你的网络访问国外主机足够快,你还有足够耐心,那么你大可忽略上面脚本修改的步骤。

在真正执行./cluster/kube-up.sh之前,安装机还需要知道:

1、k8s物理集群都有哪些node组成,node的角色都是什么?
2、k8s的各个依赖程序,比如etcd的版本是什么?

我们需要通过配置./cluster/ubuntu/config-default.sh让./cluster/kube-up.sh获取这些信息。

./cluster/ubuntu/config-default.sh

# node信息,本集群由两个物理node组成,其中第一个node既是master,也是minion
export nodes=${nodes:-"root@10.47.136.60  root@10.46.181.146"}
roles=${roles:-"ai i"}

# minion node个数
export NUM_NODES=${NUM_NODES:-2}

# 为安装脚本配置网络代理,这里主要是为了使用加速器,方便或加速下载一些包
PROXY_SETTING=${PROXY_SETTING:-"http_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448 https_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448"}

通过环境变量设置k8s要下载的依赖程序的版本:

export KUBE_VERSION=1.3.7
export FLANNEL_VERSION=0.5.5
export ETCD_VERSION=3.0.12

如果不设置环境变量,./cluster/ubuntu/download-release.sh中默认的版本号将是:

k8s: 最新版本
etcd:2.3.1
flannel : 0.5.5

四、执行安装

在安装机上,进入./cluster目录,执行如下安装命令:

KUBERNETES_PROVIDER=ubuntu ./kube-up.sh

执行输出如下:

root@iZ25cn4xxnvZ:~/k8stest/1.3.7/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-up.sh
... Starting cluster using provider: ubuntu
... calling verify-prereqs
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
... calling kube-up
~/k8stest/1.3.7/kubernetes/cluster/ubuntu ~/k8stest/1.3.7/kubernetes/cluster

Prepare flannel 0.5.5 release ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   608    0   608    0     0    410      0 --:--:--  0:00:01 --:--:--   409
100 3408k  100 3408k    0     0   284k      0  0:00:11  0:00:11 --:--:--  389k

Prepare etcd 3.0.12 release ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   607    0   607    0     0    388      0 --:--:--  0:00:01 --:--:--   388
  3  9.8M    3  322k    0     0  84238      0  0:02:02  0:00:03  0:01:59  173k
100  9.8M  100  9.8M    0     0   327k      0  0:00:30  0:00:30 --:--:--  344k

Prepare kubernetes 1.3.7 release ...

~/k8stest/1.3.7/kubernetes/cluster/ubuntu/kubernetes/server ~/k8stest/1.3.7/kubernetes/cluster/ubuntu ~/k8stest/1.3.7/kubernetes/cluster
~/k8stest/1.3.7/kubernetes/cluster/ubuntu ~/k8stest/1.3.7/kubernetes/cluster

Done! All your binaries locate in kubernetes/cluster/ubuntu/binaries directory
~/k8stest/1.3.7/kubernetes/cluster

Deploying master and node on machine 10.47.136.60
saltbase/salt/generate-cert/make-ca-cert.sh: No such file or directory
easy-rsa.tar.gz                                                                                                                               100%   42KB  42.4KB/s   00:00
config-default.sh                                                                                                                             100% 5610     5.5KB/s   00:00
util.sh                                                                                                                                       100%   29KB  28.6KB/s   00:00
kubelet.conf                                                                                                                                  100%  644     0.6KB/s   00:00
kube-proxy.conf                                                                                                                               100%  684     0.7KB/s   00:00
kubelet                                                                                                                                       100% 2158     2.1KB/s   00:00
kube-proxy                                                                                                                                    100% 2233     2.2KB/s   00:00
etcd.conf                                                                                                                                     100%  709     0.7KB/s   00:00
kube-scheduler.conf                                                                                                                           100%  674     0.7KB/s   00:00
kube-apiserver.conf                                                                                                                           100%  674     0.7KB/s   00:00
kube-controller-manager.conf                                                                                                                  100%  744     0.7KB/s   00:00
kube-scheduler                                                                                                                                100% 2360     2.3KB/s   00:00
kube-controller-manager                                                                                                                       100% 2672     2.6KB/s   00:00
kube-apiserver                                                                                                                                100% 2358     2.3KB/s   00:00
etcd                                                                                                                                          100% 2073     2.0KB/s   00:00
reconfDocker.sh                                                                                                                               100% 2074     2.0KB/s   00:00
kube-scheduler                                                                                                                                100%   56MB  56.2MB/s   00:01
kube-controller-manager                                                                                                                       100%   95MB  95.4MB/s   00:01
kube-apiserver                                                                                                                                100%  105MB 104.9MB/s   00:00
etcdctl                                                                                                                                       100%   18MB  17.6MB/s   00:00
flanneld                                                                                                                                      100%   16MB  15.8MB/s   00:01
etcd                                                              100% 2074     2.0KB/s   00:00
kube-scheduler                                                                                              100%   56MB  56.2MB/s   0         100%   56MB  56.2MB/s   00:01
kube-controller-manager                                                                                     100%   95MB  95.4MB/s             100%   95MB  95.4MB/s   00:01
kube-apiserver                                                                                             100%  105MB 104.9MB/s              100%  105MB 104.9MB/s   00:00
etcdctl                                                                                                    100%   18MB  17.6MB/s us           100%   18MB  17.6MB/s   00:00
flanneld                 10                                                                                100%   16MB  15.8MB/sge            100%   16MB  15.8MB/s   00:01

... ...

结果中并没有出现代表着安装成功的如下log字样:

Cluster validation succeeded

查看上面安装日志输出,发现在向10.47.136.60 master节点部署组件时,出现如下错误日志:

saltbase/salt/generate-cert/make-ca-cert.sh: No such file or directory

查看一下./cluster下的确没有saltbase目录,这个问题在网上找到了答案,解决方法如下:

k8s安装包目录下,在./server/kubernetes下已经有salt包:kubernetes-salt.tar.gz,解压后,将saltbase整个目录cp到.cluster/下即可。

再次执行:KUBERNETES_PROVIDER=ubuntu ./kube-up.sh,可以看到如下执行输出:

... ...

Deploying master and node on machine 10.47.136.60
make-ca-cert.sh                                                                                                                               100% 4028     3.9KB/s   00:00
easy-rsa.tar.gz                                                                                                                               100%   42KB  42.4KB/s   00:00
config-default.sh                                                                                                                             100% 5632     5.5KB/s   00:00
util.sh                                                                                                                                       100%   29KB  28.6KB/s   00:00
kubelet.conf                                                                                                                                  100%  644     0.6KB/s   00:00
kube-proxy.conf                                                                                                                               100%  684     0.7KB/s   00:00
kubelet                                                                                                                                       100% 2158     2.1KB/s   00:00
kube-proxy                                                                                                                                    100% 2233     2.2KB/s   00:00
etcd.conf                                                                                                                                     100%  709     0.7KB/s   00:00
kube-scheduler.conf                                                                                                                           100%  674     0.7KB/s   00:00
kube-apiserver.conf                                                                                                                           100%  674     0.7KB/s   00:00
kube-controller-manager.conf                                                                                                                  100%  744     0.7KB/s   00:00
kube-scheduler                                                                                                                                100% 2360     2.3KB/s   00:00
kube-controller-manager                                                                                                                       100% 2672     2.6KB/s   00:00
kube-apiserver                                                                                                                                100% 2358     2.3KB/s   00:00
etcd                                                                                                                                          100% 2073     2.0KB/s   00:00
reconfDocker.sh                                                                                                                               100% 2074     2.0KB/s   00:00
kube-scheduler                                                                                                                                100%   56MB  56.2MB/s   00:01
kube-controller-manager                                                                                                                       100%   95MB  95.4MB/s   00:00
kube-apiserver                                                                                                                                100%  105MB 104.9MB/s   00:01
etcdctl                                                                                                                                       100%   18MB  17.6MB/s   00:00
flanneld                                                                                                                                      100%   16MB  15.8MB/s   00:00
etcd                                                                                                                                          100%   19MB  19.3MB/s   00:00
flanneld                                                                                                                                      100%   16MB  15.8MB/s   00:00
kubelet                                                                                                                                       100%  103MB 103.1MB/s   00:01
kube-proxy                                                                                                                                    100%   48MB  48.4MB/s   00:00
flanneld.conf                                                                                                                                 100%  577     0.6KB/s   00:00
flanneld                                                                                                                                      100% 2121     2.1KB/s   00:00
flanneld.conf                                                                                                                                 100%  568     0.6KB/s   00:00
flanneld                                                                                                                                      100% 2131     2.1KB/s   00:00
etcd start/running, process 7997
Error:  dial tcp 127.0.0.1:2379: getsockopt: connection refused
{"Network":"172.16.0.0/16", "Backend": {"Type": "vxlan"}}
{"Network":"172.16.0.0/16", "Backend": {"Type": "vxlan"}}
docker stop/waiting
docker start/running, process 8220
Connection to 10.47.136.60 closed.

Deploying node on machine 10.46.181.146
config-default.sh                                                                                                                             100% 5632     5.5KB/s   00:00
util.sh                                                                                                                                       100%   29KB  28.6KB/s   00:00
reconfDocker.sh                                                                                                                               100% 2074     2.0KB/s   00:00
kubelet.conf                                                                                                                                  100%  644     0.6KB/s   00:00
kube-proxy.conf                                                                                                                               100%  684     0.7KB/s   00:00
kubelet                                                                                                                                       100% 2158     2.1KB/s   00:00
kube-proxy                                                                                                                                    100% 2233     2.2KB/s   00:00
flanneld                                                                                                                                      100%   16MB  15.8MB/s   00:00
kubelet                                                                                                                                       100%  103MB 103.1MB/s   00:01
kube-proxy                                                                                                                                    100%   48MB  48.4MB/s   00:00
flanneld.conf                                                                                                                                 100%  577     0.6KB/s   00:00
flanneld                                                                                                                                      100% 2121     2.1KB/s   00:00
flanneld start/running, process 2365
docker stop/waiting
docker start/running, process 2574
Connection to 10.46.181.146 closed.
Validating master
Validating root@10.47.136.60
Validating root@10.46.181.146
Using master 10.47.136.60
cluster "ubuntu" set.
user "ubuntu" set.
context "ubuntu" set.
switched to context "ubuntu".
Wrote config for ubuntu to /root/.kube/config
... calling validate-cluster

Error from server: an error on the server has prevented the request from succeeding
(kubectl failed, will retry 2 times)

Error from server: an error on the server has prevented the request from succeeding
(kubectl failed, will retry 1 times)

Error from server: an error on the server has prevented the request from succeeding
('kubectl get nodes' failed, giving up)

安装并未成功,至少calling validate-cluster后的validation过程并未成功。

但是和第一次的失败有所不同的是,在master node和minion node上,我们都可以看到已经安装并启动了的k8s核心组件:

master node:

root@iZ25cn4xxnvZ:~/k8stest/1.3.7/kubernetes/cluster# ps -ef|grep kube
root      8006     1  0 16:39 ?        00:00:00 /opt/bin/kube-scheduler --logtostderr=true --master=127.0.0.1:8080
root      8008     1  0 16:39 ?        00:00:01 /opt/bin/kube-apiserver --insecure-bind-address=0.0.0.0 --insecure-port=8080 --etcd-servers=http://127.0.0.1:4001 --logtostderr=true --service-cluster-ip-range=192.168.3.0/24 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota --service-node-port-range=30000-32767 --advertise-address=10.47.136.60 --client-ca-file=/srv/kubernetes/ca.crt --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key
root      8009     1  0 16:39 ?        00:00:02 /opt/bin/kube-controller-manager --master=127.0.0.1:8080 --root-ca-file=/srv/kubernetes/ca.crt --service-account-private-key-file=/srv/kubernetes/server.key --logtostderr=true
root      8021     1  0 16:39 ?        00:00:04 /opt/bin/kubelet --hostname-override=10.47.136.60 --api-servers=http://10.47.136.60:8080 --logtostderr=true --cluster-dns=192.168.3.10 --cluster-domain=cluster.local --config=
root      8023     1  0 16:39 ?        00:00:00 /opt/bin/kube-proxy --hostname-override=10.47.136.60 --master=http://10.47.136.60:8080 --logtostderr=true

minion node:

root@iZ25mjza4msZ:~# ps -ef|grep kube
root      2370     1  0 16:39 ?        00:00:04 /opt/bin/kubelet --hostname-override=10.46.181.146 --api-servers=http://10.47.136.60:8080 --logtostderr=true --cluster-dns=192.168.3.10 --cluster-domain=cluster.local --config=
root      2371     1  0 16:39 ?        00:00:00 /opt/bin/kube-proxy --hostname-override=10.46.181.146 --master=http://10.47.136.60:8080 --logtostderr=true

那为什么安装节点上的安装脚本在验证安装是否成功时一直阻塞、最终超时失败呢?我在安装节点,同时也是master node上执行了一下kubectl get node命令:

root@iZ25cn4xxnvZ:~/k8stest/1.3.7/kubernetes/cluster# kubectl get nodes

Error from server: an error on the server ("<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">\n<html><head>\n<meta type=\"copyright\" content=\"Copyright (C) 1996-2015 The Squid Software Foundation and contributors\">\n<meta http-equiv=\"Content-Type\" CONTENT=\"text/html; charset=utf-8\">\n<title>ERROR: The requested URL could not be retrieved</title>\n<style type=\"text/css\"><!-- \n /*\n * Copyright (C) 1996-2015 The Squid Software Foundation and contributors\n *\n * Squid software is distributed under GPLv2+ license and includes\n * contributions from numerous individuals and organizations.\n * Please see the COPYING and CONTRIBUTORS files for details.\n */\n\n/*\n Stylesheet for Squid Error pages\n Adapted from design by Free CSS Templates\n http://www.freecsstemplates.org\n Released for free under a Creative Commons Attribution 2.5 License\n*/\n\n/* Page basics */\n* {\n\tfont-family: verdana, sans-serif;\n}\n\nhtml body {\n\tmargin: 0;\n\tpadding: 0;\n\tbackground: #efefef;\n\tfont-size: 12px;\n\tcolor: #1e1e1e;\n}\n\n/* Page displayed title area */\n#titles {\n\tmargin-left: 15px;\n\tpadding: 10px;\n\tpadding-left: 100px;\n\tbackground: url('/squid-internal-static/icons/SN.png') no-repeat left;\n}\n\n/* initial title */\n#titles h1 {\n\tcolor: #000000;\n}\n#titles h2 {\n\tcolor: #000000;\n}\n\n/* special event: FTP success page titles */\n#titles ftpsuccess {\n\tbackground-color:#00ff00;\n\twidth:100%;\n}\n\n/* Page displayed body content area */\n#content {\n\tpadding: 10px;\n\tbackground: #ffffff;\n}\n\n/* General text */\np {\n}\n\n/* error brief description */\n#error p {\n}\n\n/* some data which may have caused the problem */\n#data {\n}\n\n/* the error message received from the system or other software */\n#sysmsg {\n}\n\npre {\n    font-family:sans-serif;\n}\n\n/* special event: FTP / Gopher directory listing */\n#dirmsg {\n    font-family: courier;\n    color: black;\n    font-size: 10pt;\n}\n#dirlisting {\n    margin-left: 2%;\n    margin-right: 2%;\n}\n#dirlisting tr.entry td.icon,td.filename,td.size,td.date {\n    border-bottom: groove;\n}\n#dirlisting td.size {\n    width: 50px;\n    text-align: right;\n    padding-right: 5px;\n}\n\n/* horizontal lines */\nhr {\n\tmargin: 0;\n}\n\n/* page displayed footer area */\n#footer {\n\tfont-size: 9px;\n\tpadding-left: 10px;\n}\n\n\nbody\n:lang(fa) { direction: rtl; font-size: 100%; font-family: Tahoma, Roya, sans-serif; float: right; }\n:lang(he) { direction: rtl; }\n --></style>\n</head><body id=ERR_CONNECT_FAIL>\n<div id=\"titles\">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id=\"content\">\n<p>The following error was encountered while trying to retrieve the URL: <a href=\"http://10.47.136.60:8080/api\">http://10.47.136.60:8080/api</a></p>\n\n<blockquote id=\"error\">\n<p><b>Connection to 10.47.136.60 failed.</b></p>\n</blockquote>\n\n<p id=\"sysmsg\">The system returned: <i>(110) Connection timed out</i></p>\n\n<p>The remote host or network may be down. Please try the request again.</p>\n\n<p>Your cache administrator is <a href=\"mailto:webmaster?subject=CacheErrorInfo%20-%20ERR_CONNECT_FAIL&amp;body=CacheHost%3A%20192-241-236-182%0D%0AErrPage%3A%20ERR_CONNECT_FAIL%0D%0AErr%3A%20(110)%20Connection%20timed%20out%0D%0ATimeStamp%3A%20Thu,%2013%20Oct%202016%2008%3A49%3A35%20GMT%0D%0A%0D%0AClientIP%3A%20127.0.0.1%0D%0AServerIP%3A%2010.47.136.60%0D%0A%0D%0AHTTP%20Request%3A%0D%0AGET%20%2Fapi%20HTTP%2F1.1%0AUser-Agent%3A%20kubectl%2Fv1.4.0%20(linux%2Famd64)%20kubernetes%2F4b28af1%0D%0AAccept%3A%20application%2Fjson,%20*%2F*%0D%0AAccept-Encoding%3A%20gzip%0D%0AHost%3A%2010.47.136.60%3A8080%0D%0A%0D%0A%0D%0A\">webmaster</a>.</p>\n\n<br>\n</div>\n\n<hr>\n<div id=\"footer\">\n<p>Generated Thu, 13 Oct 2016 08:49:35 GMT by 192-241-236-182 (squid/3.5.12)</p>\n<!-- ERR_CONNECT_FAIL -->\n</div>\n</body></html>") has prevented the request from succeeding

可以看到kubectl得到一坨信息,这是一个html页面内容的数据,仔细分析body内容,我们可以看到:

<body id=ERR_CONNECT_FAIL>\n<div id=\"titles\">\n<h1>ERROR</h1>\n<h2>The requested URL could not be retrieved</h2>\n</div>\n<hr>\n\n<div id=\"content\">\n<p>The following error was encountered while trying to retrieve the URL: <a href=\"http://10.47.136.60:8080/api\">http://10.47.136.60:8080/api</a></p>\n\n<blockquote id=\"error\">\n<p><b>Connection to 10.47.136.60 failed.</b></p>\n</blockquote>\n\n<p id=\"sysmsg\">The system returned: <i>(110) Connection timed out</i></p>\n\n<p>The remote host or network may be down. Please try the request again.</p>

kubectl在访问http://10.47.136.60:8080/api这个url时出现了timed out错误。在master node上直接执行curl http://10.47.136.60:8080/api也是这个错误。猜想是否是我.bashrc中的http_proxy在作祟。于是在.bashrc中增加no_proxy:

export no_proxy='10.47.136.60,10.46.181.146,localhost,127.0.0.1'

生效后,再在master node上执行curl:

# curl http://10.47.136.60:8080/api
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "10.47.136.60:6443"
    }
  ]
}

看来问题原因就是安装程序的PROXY_SETTING中没有加入no_proxy的设置的缘故,于是修改config-default.sh中的代理设置:

PROXY_SETTING=${PROXY_SETTING:-"http_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448 https_proxy=http://duotai:xxxx@sheraton.h.xduotai.com:24448 no_proxy=10.47.136.60,10.46.181.146,localhost,127.0.0.1"}

然后重新deploy:

root@iZ25cn4xxnvZ:~/k8stest/1.3.7/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-up.sh
... Starting cluster using provider: ubuntu
... calling verify-prereqs
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
... calling kube-up
~/k8stest/1.3.7/kubernetes/cluster/ubuntu ~/k8stest/1.3.7/kubernetes/cluster
Prepare flannel 0.5.5 release ...
Prepare etcd 3.0.12 release ...
Prepare kubernetes 1.3.7 release ...
Done! All your binaries locate in kubernetes/cluster/ubuntu/binaries directory
~/k8stest/1.3.7/kubernetes/cluster

Deploying master and node on machine 10.47.136.60
make-ca-cert.sh                                                                                                                               100% 4028     3.9KB/s   00:00
easy-rsa.tar.gz                                                                                                                               100%   42KB  42.4KB/s   00:00
config-default.sh                                                                                                                             100% 5678     5.5KB/s   00:00
... ...
cp: cannot create regular file ‘/opt/bin/etcd’: Text file busy
cp: cannot create regular file ‘/opt/bin/flanneld’: Text file busy
cp: cannot create regular file ‘/opt/bin/kube-apiserver’: Text file busy
cp: cannot create regular file ‘/opt/bin/kube-controller-manager’: Text file busy
cp: cannot create regular file ‘/opt/bin/kube-scheduler’: Text file busy
Connection to 10.47.136.60 closed.
Deploying master and node on machine 10.47.136.60 failed

重新部署时,由于之前k8s cluster在各个node的组件已经启动,因此failed。我们需要通过

KUBERNETES_PROVIDER=ubuntu kube-down.sh

将k8s集群停止后再尝试up,或者如果不用这个kube-down.sh脚本,也可以在各个节点上手动shutdown各个k8s组件(master上有五个核心组件,minion node上有两个核心组件,另外别忘了停止etcd和flanneld服务),以kube-controller-manager为例:

service kube-controller-manager stop

即可。

再次执行kube-up.sh:

... ...
.. calling validate-cluster
Waiting for 2 ready nodes. 1 ready nodes, 2 registered. Retrying.
Found 2 node(s).
NAME            STATUS    AGE
10.46.181.146   Ready     4h
10.47.136.60    Ready     4h
Validate output:
NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}
Cluster validation succeeded
Done, listing cluster services:

Kubernetes master is running at http://10.47.136.60:8080

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

通过字样:”Cluster validation succeeded”可以证明我们成功安装了k8s集群。

执行kubectl get node可以看到当前集群的节点组成情况:

# kubectl get node
NAME            STATUS    AGE
10.46.181.146   Ready     4h
10.47.136.60    Ready     4h

通过执行kubectl cluster-info dump 可以看到k8s集群更为详尽的信息。

五、测试k8s的service特性

之所以采用k8s,初衷就是因为Docker 1.12在阿里云搭建的swarm集群的VIP和Routing mesh机制不好用。因此,在k8s集群部署成功后,我们需要测试一下这两种机制在k8s上是否能够获得支持。

k8s中一些关于集群的抽象概念,比如node、deployment、pod、service等,这里就不赘述了,需要的话可以参考这里的Concept guide。

1、集群内负载均衡

在k8s集群中,有一个等同于docker swarm vip的概念,成为cluster ip,k8s回为每个service分配一个cluster ip,这个cluster ip在service生命周期中不会改变,并且访问cluster ip的请求会被自动负载均衡到service里的后端container中。

我们来启动一个replicas= 2的nginx service,我们需要先从一个描述文件来部署一个deployment:

//run-my-nginx.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx:1.10.1
        ports:
        - containerPort: 80

启动deployment:

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl create -f ./run-my-nginx.yaml
deployment "my-nginx" created

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl get deployment
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
my-nginx   2         2         2            2           9s

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl get pods -l run=my-nginx -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP            NODE
my-nginx-2395715568-2t6xe   1/1       Running   0          50s       172.16.57.3   10.46.181.146
my-nginx-2395715568-gpljv   1/1       Running   0          50s       172.16.99.2   10.47.136.60

可以看到my-nginx deployment已经成功启动,并且被调度在两个minion node上。

接下来,我们将deployment转化为service:

# kubectl expose deployment/my-nginx
service "my-nginx" exposed

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl get svc my-nginx
NAME       CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
my-nginx   192.168.3.239   <none>        80/TCP    15s

# kubectl describe svc my-nginx
Name:            my-nginx
Namespace:        default
Labels:            run=my-nginx
Selector:        run=my-nginx
Type:            ClusterIP
IP:            192.168.3.239
Port:            <unset>    80/TCP
Endpoints:        172.16.57.3:80,172.16.99.2:80
Session Affinity:    None

我们看到通过expose命令,可以将deployment转化为service,转化后,my-nginx service被分配了一个cluster-ip:192.168.3.239。

我们启动一个client container用于测试内部负载均衡:

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl run myclient --image=registry.cn-hangzhou.aliyuncs.com/mioss/test --replicas=1 --command -- tail -f /var/log/bootstrap.log
deployment "myclient" created

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
my-nginx-2395715568-2t6xe   1/1       Running   0          24m
my-nginx-2395715568-gpljv   1/1       Running   0          24m
myclient-1460251692-g7rnl   1/1       Running   0          21s

通过docker exec -it containerid /bin/bash进入myclient容器内,通过curl向上面的cluster-ip发起http请求:

root@myclient-1460251692-g7rnl:/# curl -v 192.168.3.239:80

同时在两个minion节点上,通过docker logs -f查看my-nginx service下面的两个nginx container实例日志,可以看到两个container轮询收到http request:

root@iZ25cn4xxnvZ:~/k8stest/demo# docker logs -f  ccc2f9bb814a
172.16.57.0 - - [17/Oct/2016:06:35:57 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.0 - - [17/Oct/2016:06:36:13 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.0 - - [17/Oct/2016:06:37:06 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.0 - - [17/Oct/2016:06:37:45 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.0 - - [17/Oct/2016:06:37:46 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.0 - - [17/Oct/2016:06:37:50 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"

root@iZ25mjza4msZ:~# docker logs -f 0e533ec2dc71
172.16.57.4 - - [17/Oct/2016:06:33:14 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.4 - - [17/Oct/2016:06:33:18 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.4 - - [17/Oct/2016:06:34:06 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.4 - - [17/Oct/2016:06:34:09 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.4 - - [17/Oct/2016:06:35:45 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.4 - - [17/Oct/2016:06:36:59 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"

cluster-ip机制有效。

2、nodeport机制

k8s通过nodeport机制实现类似docker的routing mesh,但底层机制和原理是不同的。

k8s的nodePort的原理是在集群中的每个node上开了一个端口,将访问该端口的流量导入到该node上的kube-proxy,然后再由kube-proxy进一步讲流量转发给该对应该nodeport的service的alive的pod上。

我们先来删除掉前面启动的my-nginx service,再重新创建支持nodeport的新my-nginx service。在k8s delete service有点讲究,我们删除service的目的不仅要删除service“索引”,还要stop并删除该service对应的Pod中的所有docker container。但在k8s中,直接删除service或delete pods都无法让对应的container stop并deleted,而是要通过delete service and delete deployment两步才能彻底删除service。

root@iZ25cn4xxnvZ:~# kubectl delete svc my-nginx
service "my-nginx" deleted

root@iZ25cn4xxnvZ:~# kubectl get service my-nginx
Error from server: services "my-nginx" not found

//容器依然在运行
root@iZ25cn4xxnvZ:~# kubectl get deployment my-nginx
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
my-nginx   2         2         2            2           20h

root@iZ25cn4xxnvZ:~# kubectl delete deployment my-nginx
deployment "my-nginx" deleted

再执行docker ps,看看对应docker container应该已经被删除。

重新创建暴露nodeport的my-nginx服务,我们先来创建一个新的service文件:

//my-nginx-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30062
    protocol: TCP
  selector:
    run: my-nginx

创建服务:

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl create -f ./my-nginx-svc.yaml
deployment "my-nginx" created

查看服务信息:

root@iZ25cn4xxnvZ:~/k8stest/demo# kubectl describe service my-nginx
Name:            my-nginx
Namespace:        default
Labels:            run=my-nginx
Selector:        run=my-nginx
Type:            NodePort
IP:            192.168.3.179
Port:            <unset>    80/TCP
NodePort:        <unset>    30062/TCP
Endpoints:        172.16.57.3:80,172.16.99.2:80
Session Affinity:    None

可以看到与上一次的service信息相比,这里多出一个属性:NodePort 30062/TCP,这个就是整个服务暴露到集群外面的端口。

接下来我们通过这两个node的公网地址访问一下这个暴露的nodeport,看看service中的两个ngnix container是否能收到request:

通过公网ip curl 30062端口:

curl -v x.x.x.x:30062
curl -v  y.y.y.y:30062

同样,我们用docker logs -f来监控两个nginx container的日志输出,可以看到:

nginx1:

172.16.57.4 - - [17/Oct/2016:08:19:56 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
172.16.57.1 - - [17/Oct/2016:08:21:55 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.1 - - [17/Oct/2016:08:21:56 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.1 - - [17/Oct/2016:08:21:59 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.1 - - [17/Oct/2016:08:22:07 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.1 - - [17/Oct/2016:08:22:09 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"

nginx2:

172.16.57.0 - - [17/Oct/2016:08:22:05 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.0 - - [17/Oct/2016:08:22:06 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.0 - - [17/Oct/2016:08:22:08 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.0 - - [17/Oct/2016:08:22:09 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"

两个container轮询地收到外部转来的http request。

现在我们将my-nginx服务的scale由2缩减为1:

root@iZ25cn4xxnvZ:~# kubectl scale --replicas=1 deployment/my-nginx
deployment "my-nginx" scaled

再次测试nodeport机制:

curl -v x.x.x.x:30062
curl -v  y.y.y.y:30062

scale后,只有master上的my-nginx存活。由于nodeport机制,没有my-nginx上的node收到请求后,将请求转给kube-proxy,通过内部clusterip机制,发给有my-nginx的container。

master上的nginx container:

172.16.99.1 - - [18/Oct/2016:00:55:04 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"
172.16.57.0 - - [18/Oct/2016:00:55:10 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.30.0" "-"

nodeport机制测试ok。通过netstat我们可以看到30062端口是node上的kube-proxy监听的端口,因此即便该node上没有nginx服务container运行,kube-proxy也会转发request。

root@iZ25cn4xxnvZ:~# netstat -tnlp|grep 30062
tcp6       0      0 :::30062                :::*                    LISTEN      22076/kube-proxy

六、尾声

到这里,k8s集群已经是可用的了。但要用好背后拥有15年容器经验沉淀的k8s,还有很长的路要走,比如安装Addon(DNS plugin等)、比如安装Dashboard等。这些在这里暂不提了,文章已经很长了。后续可能会有单独文章说明。

Docker 1.12 swarm模式下遇到的各种问题

前段时间,由于工作上的原因,与Docker的联系发生了几个月的中断^_^,从10月份开始,工作中又与Docker建立了广泛密切的联系。不过这次,Docker却给我泼了一盆冷水:(。事情的经过请允许多慢慢道来。

经过几年的开发,Docker已经成为轻量级容器领域不二的事实标准,应用范围以及社区都在快速发展和壮大。今年的年中,Docker发布了其里程碑的版本Docker 1.12,该版本最大的变动就在于其引擎自带了swarmkit ,一款Docker开发的容器集群管理工具,可以让用户无需安装第三方公司提供的工具或Docker公司提供的引擎之外的工具,就能搭建并管理好一个容器集群,并兼有负载均衡、服务发现和服务编排管理等功能。这对于容器生态圈内的企业,尤其是那些做容器集群管理和服务编排平台的公司来说,不亚于当年微软在Windows操作系统中集成Internet Explorer。对此,网上和社区对Docker口诛笔伐之声不绝于耳,认为Docker在亲手打击社区,葬送大好前程。关于商业上的是是非非,我们这里暂且不提。不可否认的是,对于容器的普通用户而言,Docker引擎内置集群管理功能带来的更多是便利。

9月末启动的一款新产品的开发中,决定使用容器技术,需要用到容器的集群管理以及服务伸缩、服务发现、负载均衡等特性。鉴于团队的能力和开发时间约束,初期我们确定直接利用Docker 1.12版本提供的这些内置特性,而不是利用第三方,诸如k8sRancher这样的第三方容器集群管理工具或是手工利用各种开源组件“拼凑”出一套满足需求的集群管理系统,如利用consul做服务注册和发现等。于是Docker 1.12的集群模式之旅就开始了。

一、环境准备

这次我们直接使用的是阿里的公有云虚拟主机服务,这里使用两台aliyun ECS:

manager: 10.46.181.146/21(内网)
worker: 10.47.136.60/22 (内网)

系统版本为:

Ubuntu 14.04.4:
Linux iZ25cn4xxnvZ 3.13.0-86-generic #130-Ubuntu SMP Mon Apr 18 18:27:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Docker版本:

# docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64

Ubuntu上Docker的安装日益方便了,我个人习惯于采用daocloud推荐的方式,在这里可以看到。当然你也可以参考Docker官方的doc

如果你的Ubuntu上已经安装了old版的Docker,也可以在docker的github上下载相应平台的二进制包,覆盖本地版本即可(注意1.10.0版本前后的Docker组件有所不同)。

二、Swarm集群搭建

Docker 1.12内置swarm mode,即docker原生支持的docker容器集群管理模式,只要是执行了docker swarm init或docker swarm join到一个swarm cluster中,执行了这些命令的host上的docker engine daemon就进入了swarm mode。

swarm mode中,Docker进行了诸多抽象概念(这些概念与k8s、rancher中的概念大同小异,也不知是谁参考了谁^_^):

- node: 部署了docker engine的host实例,既可以是物理主机,也可以是虚拟主机。
- service: 由一系列运行于集群容器上的tasks组成的。
- task: 在具体某个docker container中执行的具体命令。
- manager: 负责维护docker cluster的docker engine,通常有多个manager在集群中,manager之间通过raft协议进行状态同步,当然manager角色engine所在host也参与负载调度。
- worker: 参与容器集群负载调度,仅用于承载tasks。

swarm mode下,一个Docker原生集群至少要有一个manager,因此第一步我们就要初始化一个swarm cluster:

# docker swarm init --advertise-addr 10.46.181.146
Swarm initialized: current node (c7vo4qtb2m41796b4ji46n9uw) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-1iwaui223jy6ggcsulpfh1bufn0l4oq97zifbg8l5na914vyz5-2mg011xh7vso9hu7x542uizpt \
    10.46.181.146:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

通过一行swarm init命令,我们就创建了一个swarm集群。同时,Docker daemon给出了清晰提示,如果要向swarm集群添加worker node,执行上述提示中的语句。如果其他node要以manager身份加入集群,则需要执行:docker swarm join-token manager以获得下一个“通关密语”^_^。

# docker swarm join-token manager
To add a manager to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-1iwaui223jy6ggcsulpfh1bufn0l4oq97zifbg8l5na914vyz5-8wh5gp043i1cqz4at76wvx29m \
    10.46.181.146:2377

对比两个“通关密语”,我们发现仅是token串的后半部分有所不同(2mg011xh7vso9hu7x542uizpt vs. 8wh5gp043i1cqz4at76wvx29m)。

在未添加新node之前,我们可以通过docker node ls查看当前集群内的node状态:

# docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
c7vo4qtb2m41796b4ji46n9uw *  iZ25mjza4msZ  Ready   Active        Leader

可以看出当前swarm仅有一个node,且该node是manager,状态是manager中的leader。

我们现在将另外一个node以worker身份加入到该swarm:

# docker swarm join \
     --token SWMTKN-1-1iwaui223jy6ggcsulpfh1bufn0l4oq97zifbg8l5na914vyz5-2mg011xh7vso9hu7x542uizpt \
     10.46.181.146:2377
This node joined a swarm as a worker.

在manager上查看node情况:

# docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
8asff8ta70j91myh734os6ihg    iZ25cn4xxnvZ  Ready   Active
c7vo4qtb2m41796b4ji46n9uw *  iZ25mjza4msZ  Ready   Active        Leader

Swarm集群中已经有了两个active node:一个manager和一个worker。这样我们的集群环境初建ok。

三、Service启动

Docker 1.12版本宣称提供服务的Scaling、health check、滚动升级等功能,并提供了内置的dns、vip机制,实现service的服务发现和负载均衡能力。接下来,我们来测试一下docker的“服务能力”:

我们先来创建一个用户承载服务的自定义内部overlay网络:

root@iZ25mjza4msZ:~# docker network create -d overlay mynet1
avjvpxkfg6u8xt0qd5xynoc28
root@iZ25mjza4msZ:~# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
dba1faa24c0d        bridge              bridge              local
a2807d0ec7ed        docker_gwbridge     bridge              local
2b6eb8b95c00        host                host                local
55v43pasf7p9        ingress             overlay             swarm
avjvpxkfg6u8        mynet1              overlay             swarm
6f2d47678226        none                null                local

我们看到在network list中,我们的overlay网络mynet1出现在列表中。这时,在worker node上你还看不到mynet1的存在,因为按照目前docker的机制,只有将归属于mynet1的task调度到worker node上时,mynet1的信息才会同步到worker node上。

接下来就是在mynet1上启动service的时候了,我们先来测试一下:

在manager节点上,用docker service命令启动服务mytest:

# docker service create --replicas 2 --name mytest --network mynet1 alpine:3.3 ping baidu.com
0401ri7rm1bdwfbvhgyuwroqn

似乎启动成功了,我们来查看一下服务状态:

root@iZ25mjza4msZ:~# docker service ps mytest
ID                         NAME          IMAGE       NODE          DESIRED STATE  CURRENT STATE                     ERROR
73hyxfhafguivtrbi8dyosufh  mytest.1      alpine:3.3  iZ25mjza4msZ  Ready          Preparing 1 seconds ago
c5konzyaeq4myzswthm8ax77w   \_ mytest.1  alpine:3.3  iZ25mjza4msZ  Shutdown       Failed 1 seconds ago              "starting container failed: co…"
6umn2qlj34okagb4mldpl6yga   \_ mytest.1  alpine:3.3  iZ25mjza4msZ  Shutdown       Failed 6 seconds ago              "starting container failed: co…"
5y7c1uoi73272uxjp2uscynwi   \_ mytest.1  alpine:3.3  iZ25mjza4msZ  Shutdown       Failed 11 seconds ago             "starting container failed: co…"
4belae8b8mhd054ibhpzbx63q   \_ mytest.1  alpine:3.3  iZ25mjza4msZ  Shutdown       Failed 16 seconds ago             "starting container failed: co…"

似乎服务并没有起来,service ps的结果告诉我:出错了!

但从ps的输出来看,ERROR那行的日志太过简略:“starting container failed: co…” ,无法从这里面分析出失败原因,通过docker logs查看失败容器的日志(实际上日志是空的)以及通过syslog查看docker engine的日志都没有特殊的发现。调查了许久,无意中尝试手动重启一下失败的Service task:

# docker start 4709dbb40a7b
Error response from daemon: could not add veth pair inside the network sandbox: could not find an appropriate master "ov-000101-46gc3" for "vethf72fc59"
Error: failed to start containers: 4709dbb40a7b

从这个Daemon返回的Response Error来看似乎与overlay vxlan的网络驱动有关。又经过搜索引擎的确认,大致确定可能是因为host的kernel version太low导致的,当前kernel是3.13.0-86-generic,记得之前在docker 1.9.1时玩vxlan overlay我是将kernel version升级到3.19以上了。于是决定升级kernel version

升级到15.04 ubuntu版本的内核:

命令:

    apt-get install linux-generic-lts-vivid

升级后:

# uname -a
Linux iZ25cn4xxnvZ 3.19.0-70-generic #78~14.04.1-Ubuntu SMP Fri Sep 23 17:39:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

reboot虚拟机后,重新启动mytest service,这回服务正常启动了。看来升级内核版本这味药是对了症了。

这里issue第一个吐槽:Docker强依赖linux kernel提供的诸多feature,但docker似乎在kernel版本依赖这块并未给出十分明确的对应关系,导致使用者莫名其妙的不断遇坑填坑,浪费了好多时间。

顺便这里把service的基本管理方式也一并提一下:

scale mytest服务的task数量从2到4:

docker service scale mytest=4

删除mytest服务:

docker service rm mytest

服务删除执行后,需要一些时间让docker engine stop and remove container instance。

四、vip机制测试

Docker 1.12通过集群内置的DNS服务实现服务发现,通过vip实现自动负载均衡。单独使用DNS RR机制也可以实现负载均衡,但这种由client端配合实现的机制,无法避免因dns update latency导致的服务短暂不可用的情况。vip机制才是相对理想的方式。

所谓Vip机制,就是docker swarm为每一个启动的service分配一个vip,并在DNS中将service name解析为该vip,发往该vip的请求将被自动分发到service下面的诸多active task上(down掉的task将被自动从vip均衡列表中删除)。

我们用nginx作为backend service来测试这个vip机制,首先在集群内启动mynginx service,内置2个task,一般来说,docker swarm会在manager和worker node上各启动一个container来承载一个task:

# docker service create --replicas 2 --name mynginx --network mynet1 --mount type=bind,source=/root/dockertest/staticcontents,dst=/usr/share/nginx/html,ro=true  nginx:1.10.1
3n7dlr8km9v2xd66bf0mumh1h

一切如预期,swarm在manager和worker上各自启动了一个nginx container:

# docker service ps mynginx
ID                         NAME       IMAGE         NODE          DESIRED STATE  CURRENT STATE               ERROR
bcyffgo1q3i5x0qia26fs703o  mynginx.1  nginx:1.10.1  iZ25mjza4msZ  Running        Running about a minute ago
arkol2l7gpvq42f0qytqf0u85  mynginx.2  nginx:1.10.1  iZ25cn4xxnvZ  Running        Running about a minute ago

接下来,我们尝试在mynet1中启动一个client container,并在client container中使用ping、curl对mynginx service进行vip机制的验证测试。client container的image是基于ubuntu:14.04 commit的本地image,只是在官方image中添加了curl, dig, traceroute等网络探索工具,读者朋友可自行完成。

我们在manager node上尝试启动client container:

# docker run -it --network mynet1 ubuntu:14.04 /bin/bash
docker: Error response from daemon: swarm-scoped network (mynet1) is not compatible with `docker create` or `docker run`. This network can only be used by a docker service.
See 'docker run --help'.

可以看到:直接通过docker run的方式在mynet1网络里启动container的方法失败了,docker提示:docker run与swarm范围的网络不兼容。看来我们还得用docker service create的方式来做。

# docker service create --replicas 1 --name myclient --network mynet1 test/client tail -f /var/log/bootstrap.log
0eippvade7j5e0zdyr5nkkzyo

# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS                    NAMES
4da6700cdf4d        test/client:latest   "tail -f /var/log/boo"   33 seconds ago      Up 32 seconds                                myclient.1.3cew8x46i5b28e2q3kd1zz3mq

我们使用exec命令attach到client container中:

root@iZ25mjza4msZ:~# docker exec -it 4da6700cdf4d /bin/bash
root@4da6700cdf4d:/#

在client container中,我们可以通过dig命令查看mynginx service的vip:

root@4da6700cdf4d:/# dig mynginx

; <<>> DiG 9.9.5-3ubuntu0.9-Ubuntu <<>> mynginx
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34806
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mynginx.            IN    A

;; ANSWER SECTION:
mynginx.        600    IN    A    10.0.0.2

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Tue Oct 11 08:58:58 UTC 2016
;; MSG SIZE  rcvd: 48

可以看到为mynginx service分配的vip是10.0.0.2。

接下来就是见证奇迹的时候了,我们尝试通过curl访问mynginx这个service,预期结果是:请求被轮询转发到不同的nginx container中,返回结果输出不同内容。实际情况如何呢?

root@4da6700cdf4d:/# curl mynginx
^C
root@4da6700cdf4d:/# curl mynginx
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title> 主标题 | 副标题< /title>
</head>
<body>
<p>hello world, i am manager</p>
</body>
</html>
root@4da6700cdf4d:/# curl mynginx
curl: (7) Failed to connect to mynginx port 80: Connection timed out

第一次执行curl mynginx,curl就hang住了。ctrl+c后,再次执行curl mynginx,顺利返回manager节点上的nginx container的response结果:”hello world, i am manager“。

第三次执行curl mynginx,又hang住了,一段时间后显示timed out,这也从侧面说明了,swarm下的docker engine的确按照rr规则将request均衡转发到不同nginx container,但实际看来,从manager node上的client container到worker node上的nginx container的网络似乎不通。我们来验证一下这两个container间的网络是否ok。

我们在两个node上分别用docker inspect获得client container和nginx container的ip地址:

    manager node:
        client container: 10.0.0.6
        nginx container: 10.0.0.4
    worker node:
        nginx container: 10.0.0.3

理论上,位于同一overlay网络中的三个container之间应该是互通的。但实际上通过docker exec -it container_id /bin/bash进入每个docker container内部进行互ping来看,manager node上的两个container可以互相ping通,但无法ping通 worker node上的nginx container,同样,位于worker node上的nginx container也无法ping通位于manager node上的任何container。

通过docker swarm leave将worker节点从swarm cluster中摘出,docker swarm会在manager上再启动一个nginx container,这时如果再再client container测试vip机制,那么测试是ok的。

也就是说我遇到的问题是跨node的swarm network不好用,导致vip机制无法按预期执行。

后续我又试过双swarm manager等方式,vip机制在跨node时均不可用。在docker github的issue中,很多人遇到了同样的问题,涉及的环境也是多种多样(不同内核版本、不同linux发行版,不同公有云提供商或本地虚拟机管理软件),似乎这个问题是随机出现的。 按照docker developer的提示检查了swarm必要端口的开放情况、防火墙、swarm init的传递参数,都是无误的。也尝试过重建swarm,在init和join时全部显式带上–listen-addr和–advertise-addr选项,问题依旧没能解决。

最后,又将docker版本从1.12.1升级到最新发布的docker 1.12.2rc3版本,重建集群,问题依旧没有解决。

自此确定,docker 1.12的vip机制尚不稳定,并且没有临时解决方案能绕过这一问题。

五、Routing mesh机制测试

内部网络的vip机制的测试失败,让我在测试Docker 1.12的另外一个机制:Routing mesh之前心里蒙上了一丝阴影,一个念头油然而生:Routing mesh可能也不好用。

对于外部网络和内部网络的边界,docker 1.12提供了ingress(入口) overlay网络应对,通过routing mesh机制,保证外部的请求可以被任意集群node转发到启动了相应服务container的node中,并保证高
可用。如果有多个container,还可以实现负载均衡的转发。

与vip不同,Routing mesh在启动服务前强调暴露一个node port的概念。既然叫node port,说明这个暴露的port是docker engine listen的,并由docker engine将发到port上的流量转到相应启动了service container的节点上去(如果本node也启动了service task,那么也会负载分担留给自己node上的service task container去处理)。

我们先清除上面的service,还是利用nginx来作为网络入口服务:

# docker service create --replicas 2 --name mynginx --network mynet1 --mount type=bind,source=/root/dockertest/staticcontents,dst=/usr/share/nginx/html,ro=true --publish 8091:80/tcp nginx:1.10.1
cns4gcsrs50n2hbi2o4gpa1tp

看看node上的8091端口状态:

root@iZ25mjza4msZ:~# lsof -i tcp:8091
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dockerd 13909 root   37u  IPv6 121343      0t0  TCP *:8091 (LISTEN)

dockerd负责监听该端口。

接下来,我们在manager node上通过curl来访问10.46.181.146:8091。

# curl 10.46.181.146:8091
^C
root@iZ25mjza4msZ:~# curl 10.46.181.146:8091
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title> 主标题 | 副标题< /title>
</head>
<body>
<p>hello world, i am master</p>
</body>
</html>
root@iZ25mjza4msZ:~# curl 10.46.181.146:8091

在vip测试中的一幕又出现了,docker swarm似乎再将请求负载分担到两个node上,当分担到worker node上时,curl又hang住了。routing mesh机制失效。

理论上再向swarm cluster添加一个worker node,该node上并未启动nginx service,当访问这个新node的8091端口时,流量也会被转到manager node或之前的那个worker node,但实际情况是,跨node流量互转失效,和vip机制测试似乎是一个问题。

六、小结

Docker 1.12的routing mesh和vip均因swarm network的问题而不可用,这一点出乎我的预料。

翻看Docker在github上的issues,发现类似问题从Docker 1.12发布起就出现很多,近期也有不少:

https://github.com/docker/docker/issues/27237

https://github.com/docker/docker/issues/27218

https://github.com/docker/docker/issues/25266

https://github.com/docker/docker/issues/26946

*https://github.com/docker/docker/issues/27016

这里除了27016的issue发起者在issue最后似乎顿悟到了什么(也没了下文):Good news. I believe I discovered the root cause of our issue. Remember above I noted our Swarm spanned across L3 networks? I appears there is some network policy that is blocking VxLAN traffic (4789/udp) across the two L3 networks. I redeployed our same configuration to a single L3 network and can reliably access the published port on all worker nodes (based on a few minutes of testing)。其余的几个issue均未有solution。

不知道我在阿里云的两个node之间是否有阻隔vxlan traffic的什么policy,不过使用nc探测4789 udp端口均是可用的:

nc -vuz 10.47.136.60 4789

无论是配置原因还是代码bug导致的随机问题,Docker日益庞大的身躯和背后日益复杂的网络机制,让开发者(包括docker自己的开发人员)查找问题的难度都变得越来越高。Docker代码的整体质量似乎也呈现出一定下滑的不良趋势。

针对上述问题,尚未找到很好的解决方案。如果哪位读者能发现其中玄机,请不吝赐教。




这里是Tony Bai的个人Blog,欢迎访问、订阅和留言!订阅Feed请点击上面图片

如果您觉得这里的文章对您有帮助,请扫描上方二维码进行捐赠,加油后的Tony Bai将会为您呈现更多精彩的文章,谢谢!

如果您喜欢通过微信App浏览本站内容,可以扫描下方二维码,订阅本站官方微信订阅号“iamtonybai”;点击二维码,可直达本人官方微博主页^_^:



本站Powered by Digital Ocean VPS。

选择Digital Ocean VPS主机,即可获得10美元现金充值,可免费使用两个月哟!

著名主机提供商Linode 10$优惠码:linode10,在这里注册即可免费获得。

阿里云推荐码:1WFZ0V立享9折!

View Tony Bai's profile on LinkedIn


文章

评论

  • 正在加载...

分类

标签

归档











更多