2016年十二月月 发布的文章

使用Kubeadm安装Kubernetes

在《当Docker遇到systemd》一文中,我提到过这两天儿一直在做的一个task:使用kubeadmUbuntu 16.04上安装部署Kubernetes的最新发布版本-k8s 1.5.1

年中,Docker宣布在Docker engine中集成swarmkit工具包,这一announcement在轻量级容器界引发轩然大波。毕竟开发者是懒惰的^0^,有了docker swarmkit,驱动developer去安装其他容器编排工具的动力在哪里呢?即便docker engine还不是当年那个被人们高频使用的IE浏览器。作为针对Docker公司这一市场行为的回应,容器集群管理和服务编排领先者Kubernetes在三个月后发布了Kubernetes1.4.0版本。在这个版本中K8s新增了kubeadm工具。kubeadm的使用方式有点像集成在docker engine中的swarm kit工具,旨在改善开发者在安装、调试和使用k8s时的体验,降低安装和使用门槛。理论上通过两个命令:init和join即可搭建出一套完整的Kubernetes cluster。

不过,和初入docker引擎的swarmkit一样,kubeadm目前也在active development中,也不是那么stable,因此即便在当前最新的k8s 1.5.1版本中,它仍然处于Alpha状态,官方不建议在Production环境下使用。每次执行kubeadm init时,它都会打印如下提醒日志:

[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.

不过由于之前部署的k8s 1.3.7集群运行良好,这给了我们在k8s这条路上继续走下去并走好的信心。但k8s在部署和管理方面的体验的确是太繁琐了,于是我们准备试验一下kubeadm是否能带给我们超出预期的体验。之前在aliyun ubuntu 14.04上安装kubernetes 1.3.7的经验和教训,让我略微有那么一丢丢底气,但实际安装过程依旧是一波三折。这既与kubeadm的unstable有关,同样也与cni、第三方网络add-ons的质量有关。无论哪一方出现问题都会让你的install过程异常坎坷曲折。

一、环境与约束

在kubeadm支持的Ubuntu 16.04+, CentOS 7 or HypriotOS v1.0.1+三种操作系统中,我们选择了Ubuntu 16.04。由于阿里云尚无官方16.04 Image可用,我们新开了两个Ubuntu 14.04ECS实例,并通过apt-get命令手工将其升级到Ubuntu 16.04.1,详细版本是:Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-58-generic x86_64)。

Ubuntu 16.04使用了systemd作为init system,在安装和配置Docker时,可以参考我的这篇《当Docker遇到system》。Docker版本我选择了目前可以得到的lastest stable release: 1.12.5。

# docker version
Client:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

至于Kubernetes版本,前面已经提到过了,我们就使用最新发布的Kubernetes 1.5.1版本。1.5.1是1.5.0的一个紧急fix版本,主要”to address default flag values which in isolation were not problematic, but in concert could result in an insecure cluster”。官方建议skip 1.5.0,直接用1.5.1。

这里再重申一下:Kubernetes的安装、配置和调通是很难的,在阿里云上调通就更难了,有时还需要些运气。Kubernetes、Docker、cni以及各种网络Add-ons都在active development中,也许今天还好用的step、tip和trick,明天就out-dated,因此在借鉴本文的操作步骤时,请谨记这些^0^。

二、安装包准备

我们这次新开了两个ECS实例,一个作为master node,一个作为minion node。Kubeadm默认安装时,master node将不会参与Pod调度,不会承载work load,即不会有非核心组件的Pod在Master node上被创建出来。当然通过kubectl taint命令可以解除这一限制,不过这是后话了。

集群拓扑:

master node:10.47.217.91,主机名:iZ25beglnhtZ
minion node:10.28.61.30,主机名:iZ2ze39jeyizepdxhwqci6Z

本次安装的主参考文档就是Kubernetes官方的那篇《Installing Kubernetes on Linux with kubeadm》。

本小节,我们将进行安装包准备,即将kubeadm以及此次安装所需要的k8s核心组件统统下载到上述两个Node上。注意:如果你有加速器,那么本节下面的安装过程将尤为顺利,反之,… :( 。以下命令,在两个Node上均要执行。

1、添加apt-key

# curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
OK

2、添加Kubernetes源并更新包信息

添加Kubernetes源到sources.list.d目录下:

# cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
  deb http://apt.kubernetes.io/ kubernetes-xenial main
  EOF

# cat /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main

更新包信息:

# apt-get update
... ...
Hit:2 http://mirrors.aliyun.com/ubuntu xenial InRelease
Hit:3 https://apt.dockerproject.org/repo ubuntu-xenial InRelease
Get:4 http://mirrors.aliyun.com/ubuntu xenial-security InRelease [102 kB]
Get:1 https://packages.cloud.google.com/apt kubernetes-xenial InRelease [6,299 B]
Get:5 https://packages.cloud.google.com/apt kubernetes-xenial/main amd64 Packages [1,739 B]
Get:6 http://mirrors.aliyun.com/ubuntu xenial-updates InRelease [102 kB]
Get:7 http://mirrors.aliyun.com/ubuntu xenial-proposed InRelease [253 kB]
Get:8 http://mirrors.aliyun.com/ubuntu xenial-backports InRelease [102 kB]
Fetched 568 kB in 19s (28.4 kB/s)
Reading package lists... Done

3、下载Kubernetes核心组件

在此次安装中,我们通过apt-get就可以下载Kubernetes的核心组件,包括kubelet、kubeadm、kubectl和kubernetes-cni等。

# apt-get install -y kubelet kubeadm kubectl kubernetes-cni
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libtimedate-perl
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  ebtables ethtool socat
The following NEW packages will be installed:
  ebtables ethtool kubeadm kubectl kubelet kubernetes-cni socat
0 upgraded, 7 newly installed, 0 to remove and 0 not upgraded.
Need to get 37.6 MB of archives.
After this operation, 261 MB of additional disk space will be used.
Get:2 http://mirrors.aliyun.com/ubuntu xenial/main amd64 ebtables amd64 2.0.10.4-3.4ubuntu1 [79.6 kB]
Get:6 http://mirrors.aliyun.com/ubuntu xenial/main amd64 ethtool amd64 1:4.5-1 [97.5 kB]
Get:7 http://mirrors.aliyun.com/ubuntu xenial/universe amd64 socat amd64 1.7.3.1-1 [321 kB]
Get:1 https://packages.cloud.google.com/apt kubernetes-xenial/main amd64 kubernetes-cni amd64 0.3.0.1-07a8a2-00 [6,877 kB]
Get:3 https://packages.cloud.google.com/apt kubernetes-xenial/main amd64 kubelet amd64 1.5.1-00 [15.1 MB]
Get:4 https://packages.cloud.google.com/apt kubernetes-xenial/main amd64 kubectl amd64 1.5.1-00 [7,954 kB]
Get:5 https://packages.cloud.google.com/apt kubernetes-xenial/main amd64 kubeadm amd64 1.6.0-alpha.0-2074-a092d8e0f95f52-00 [7,120 kB]
Fetched 37.6 MB in 36s (1,026 kB/s)
... ...
Unpacking kubeadm (1.6.0-alpha.0-2074-a092d8e0f95f52-00) ...
Processing triggers for systemd (229-4ubuntu13) ...
Processing triggers for ureadahead (0.100.0-19) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up ebtables (2.0.10.4-3.4ubuntu1) ...
update-rc.d: warning: start and stop actions are no longer supported; falling back to defaults
Setting up ethtool (1:4.5-1) ...
Setting up kubernetes-cni (0.3.0.1-07a8a2-00) ...
Setting up socat (1.7.3.1-1) ...
Setting up kubelet (1.5.1-00) ...
Setting up kubectl (1.5.1-00) ...
Setting up kubeadm (1.6.0-alpha.0-2074-a092d8e0f95f52-00) ...
Processing triggers for systemd (229-4ubuntu13) ...
Processing triggers for ureadahead (0.100.0-19) ...
... ...

下载后的kube组件并未自动运行起来。在 /lib/systemd/system下面我们能看到kubelet.service:

# ls /lib/systemd/system|grep kube
kubelet.service

//kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=http://kubernetes.io/docs/

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

kubelet的版本:

# kubelet --version
Kubernetes v1.5.1

k8s的核心组件都有了,接下来我们就要boostrap kubernetes cluster了。同时,问题也就随之而来了,而这些问题以及问题的解决才是本篇要说明的重点。

三、初始化集群

前面说过,理论上通过kubeadm使用init和join命令即可建立一个集群,这init就是在master节点对集群进行初始化。和k8s 1.4之前的部署方式不同的是,kubeadm安装的k8s核心组件都是以容器的形式运行于master node上的。因此在kubeadm init之前,最好给master node上的docker engine挂上加速器代理,因为kubeadm要从gcr.io/google_containers repository中pull许多核心组件的images,大约有如下一些:

gcr.io/google_containers/kube-controller-manager-amd64   v1.5.1                     cd5684031720        2 weeks ago         102.4 MB
gcr.io/google_containers/kube-apiserver-amd64            v1.5.1                     8c12509df629        2 weeks ago         124.1 MB
gcr.io/google_containers/kube-proxy-amd64                v1.5.1                     71d2b27b03f6        2 weeks ago         175.6 MB
gcr.io/google_containers/kube-scheduler-amd64            v1.5.1                     6506e7b74dac        2 weeks ago         53.97 MB
gcr.io/google_containers/etcd-amd64                      3.0.14-kubeadm             856e39ac7be3        5 weeks ago         174.9 MB
gcr.io/google_containers/kubedns-amd64                   1.9                        26cf1ed9b144        5 weeks ago         47 MB
gcr.io/google_containers/dnsmasq-metrics-amd64           1.0                        5271aabced07        7 weeks ago         14 MB
gcr.io/google_containers/kube-dnsmasq-amd64              1.4                        3ec65756a89b        3 months ago        5.13 MB
gcr.io/google_containers/kube-discovery-amd64            1.0                        c5e0c9a457fc        3 months ago        134.2 MB
gcr.io/google_containers/exechealthz-amd64               1.2                        93a43bfb39bf        3 months ago        8.375 MB
gcr.io/google_containers/pause-amd64                     3.0                        99e59f495ffa        7 months ago        746.9 kB

在Kubeadm的文档中,Pod Network的安装是作为一个单独的步骤的。kubeadm init并没有为你选择一个默认的Pod network进行安装。我们将首选Flannel 作为我们的Pod network,这不仅是因为我们的上一个集群用的就是flannel,而且表现稳定。更是由于Flannel就是coreos为k8s打造的专属overlay network add-ons。甚至于flannel repository的readme.md都这样写着:“flannel is a network fabric for containers, designed for Kubernetes”。如果我们要使用Flannel,那么在执行init时,按照kubeadm文档要求,我们必须给init命令带上option:–pod-network-cidr=10.244.0.0/16。

1、执行kubeadm init

执行kubeadm init命令:

# kubeadm init --pod-network-cidr=10.244.0.0/16
[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.
[preflight] Running pre-flight checks
[preflight] Starting the kubelet service
[init] Using Kubernetes version: v1.5.1
[tokens] Generated token: "2e7da9.7fc5668ff26430c7"
[certificates] Generated Certificate Authority key and certificate.
[certificates] Generated API Server key and certificate
[certificates] Generated Service Account signing keys
[certificates] Created keys and certificates in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[apiclient] Created API client, waiting for the control plane to become ready //如果没有挂加速器,可能会在这里hang住。
[apiclient] All control plane components are healthy after 54.789750 seconds
[apiclient] Waiting for at least one node to register and become ready
[apiclient] First node is ready after 1.003053 seconds
[apiclient] Creating a test deployment
[apiclient] Test deployment succeeded
[token-discovery] Created the kube-discovery deployment, waiting for it to become ready
[token-discovery] kube-discovery is ready after 62.503441 seconds
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node:

kubeadm join --token=2e7da9.7fc5668ff26430c7 123.56.200.187

init成功后的master node有啥变化?k8s的核心组件均正常启动:

# ps -ef|grep kube
root      2477  2461  1 16:36 ?        00:00:04 kube-proxy --kubeconfig=/run/kubeconfig
root     30860     1 12 16:33 ?        00:01:09 /usr/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --cluster-dns=10.96.0.10 --cluster-domain=cluster.local
root     30952 30933  0 16:33 ?        00:00:01 kube-scheduler --address=127.0.0.1 --leader-elect --master=127.0.0.1:8080
root     31128 31103  2 16:33 ?        00:00:11 kube-controller-manager --address=127.0.0.1 --leader-elect --master=127.0.0.1:8080 --cluster-name=kubernetes --root-ca-file=/etc/kubernetes/pki/ca.pem --service-account-private-key-file=/etc/kubernetes/pki/apiserver-key.pem --cluster-signing-cert-file=/etc/kubernetes/pki/ca.pem --cluster-signing-key-file=/etc/kubernetes/pki/ca-key.pem --insecure-experimental-approve-all-kubelet-csrs-for-group=system:kubelet-bootstrap --allocate-node-cidrs=true --cluster-cidr=10.244.0.0/16
root     31223 31207  2 16:34 ?        00:00:10 kube-apiserver --insecure-bind-address=127.0.0.1 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota --service-cluster-ip-range=10.96.0.0/12 --service-account-key-file=/etc/kubernetes/pki/apiserver-key.pem --client-ca-file=/etc/kubernetes/pki/ca.pem --tls-cert-file=/etc/kubernetes/pki/apiserver.pem --tls-private-key-file=/etc/kubernetes/pki/apiserver-key.pem --token-auth-file=/etc/kubernetes/pki/tokens.csv --secure-port=6443 --allow-privileged --advertise-address=123.56.200.187 --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --anonymous-auth=false --etcd-servers=http://127.0.0.1:2379
root     31491 31475  0 16:35 ?        00:00:00 /usr/local/bin/kube-discovery

而且是多以container的形式启动:

# docker ps
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED                  STATUS                  PORTS               NAMES
c16c442b7eca        gcr.io/google_containers/kube-proxy-amd64:v1.5.1                "kube-proxy --kubecon"   6 minutes ago            Up 6 minutes                                k8s_kube-proxy.36dab4e8_kube-proxy-sb4sm_kube-system_43fb1a2c-cb46-11e6-ad8f-00163e1001d7_2ba1648e
9f73998e01d7        gcr.io/google_containers/kube-discovery-amd64:1.0               "/usr/local/bin/kube-"   8 minutes ago            Up 8 minutes                                k8s_kube-discovery.7130cb0a_kube-discovery-1769846148-6z5pw_kube-system_1eb97044-cb46-11e6-ad8f-00163e1001d7_fd49c2e3
dd5412e5e15c        gcr.io/google_containers/kube-apiserver-amd64:v1.5.1            "kube-apiserver --ins"   9 minutes ago            Up 9 minutes                                k8s_kube-apiserver.1c5a91d9_kube-apiserver-iz25beglnhtz_kube-system_eea8df1717e9fea18d266103f9edfac3_8cae8485
60017f8819b2        gcr.io/google_containers/etcd-amd64:3.0.14-kubeadm              "etcd --listen-client"   9 minutes ago            Up 9 minutes                                k8s_etcd.c323986f_etcd-iz25beglnhtz_kube-system_3a26566bb004c61cd05382212e3f978f_06d517eb
03c2463aba9c        gcr.io/google_containers/kube-controller-manager-amd64:v1.5.1   "kube-controller-mana"   9 minutes ago            Up 9 minutes                                k8s_kube-controller-manager.d30350e1_kube-controller-manager-iz25beglnhtz_kube-system_9a40791dd1642ea35c8d95c9e610e6c1_3b05cb8a
fb9a724540a7        gcr.io/google_containers/kube-scheduler-amd64:v1.5.1            "kube-scheduler --add"   9 minutes ago            Up 9 minutes                                k8s_kube-scheduler.ef325714_kube-scheduler-iz25beglnhtz_kube-system_dc58861a0991f940b0834f8a110815cb_9b3ccda2
.... ...

不过这些核心组件并不是跑在pod network中的(没错,此时的pod network还没有创建),而是采用了host network。以kube-apiserver的pod信息为例:

kube-system   kube-apiserver-iz25beglnhtz            1/1       Running   0          1h        10.47.217.91   iz25beglnhtz

kube-apiserver的IP是host ip,从而推断容器使用的是host网络,这从其对应的pause容器的network属性就可以看出:

# docker ps |grep apiserver
a5a76bc59e38        gcr.io/google_containers/kube-apiserver-amd64:v1.5.1            "kube-apiserver --ins"   About an hour ago   Up About an hour                        k8s_kube-apiserver.2529402_kube-apiserver-iz25beglnhtz_kube-system_25d646be9a0092138dc6088fae6f1656_ec0079fc
ef4d3bf057a6        gcr.io/google_containers/pause-amd64:3.0                        "/pause"                 About an hour ago   Up About an hour                        k8s_POD.d8dbe16c_kube-apiserver-iz25beglnhtz_kube-system_25d646be9a0092138dc6088fae6f1656_bbfd8a31

inspect pause容器,可以看到pause container的NetworkMode的值:

"NetworkMode": "host",

如果kubeadm init执行过程中途出现了什么问题,比如前期忘记挂加速器导致init hang住,你可能会ctrl+c退出init执行。重新配置后,再执行kubeadm init,这时你可能会遇到下面kubeadm的输出:

# kubeadm init --pod-network-cidr=10.244.0.0/16
[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.
[preflight] Running pre-flight checks
[preflight] Some fatal errors occurred:
    Port 10250 is in use
    /etc/kubernetes/manifests is not empty
    /etc/kubernetes/pki is not empty
    /var/lib/kubelet is not empty
    /etc/kubernetes/admin.conf already exists
    /etc/kubernetes/kubelet.conf already exists
[preflight] If you know what you are doing, you can skip pre-flight checks with `--skip-preflight-checks`

kubeadm会自动检查当前环境是否有上次命令执行的“残留”。如果有,必须清理后再行执行init。我们可以通过”kubeadm reset”来清理环境,以备重来。

# kubeadm reset
[preflight] Running pre-flight checks
[reset] Draining node: "iz25beglnhtz"
[reset] Removing node: "iz25beglnhtz"
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/etcd]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf]

2、安装flannel pod网络

kubeadm init之后,如果你探索一下当前cluster的状态或者核心组件的日志,你会发现某些“异常”,比如:从kubelet的日志中我们可以看到一直刷屏的错误信息:

Dec 26 16:36:48 iZ25beglnhtZ kubelet[30860]: E1226 16:36:48.365885   30860 docker_manager.go:2201] Failed to setup network for pod "kube-dns-2924299975-pddz5_kube-system(43fd7264-cb46-11e6-ad8f-00163e1001d7)" using network plugins "cni": cni config unintialized; Skipping pod

通过命令kubectl get pod –all-namespaces -o wide,你也会发现kube-dns pod处于ContainerCreating状态。

这些都不打紧,因为我们还没有为cluster安装Pod network呢。前面说过,我们要使用Flannel网络,因此我们需要执行如下安装命令:

#kubectl apply -f  https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
configmap "kube-flannel-cfg" created
daemonset "kube-flannel-ds" created

稍等片刻,我们再来看master node上的cluster信息:

# ps -ef|grep kube|grep flannel
root      6517  6501  0 17:20 ?        00:00:00 /opt/bin/flanneld --ip-masq --kube-subnet-mgr
root      6573  6546  0 17:20 ?        00:00:00 /bin/sh -c set -e -x; cp -f /etc/kube-flannel/cni-conf.json /etc/cni/net.d/10-flannel.conf; while true; do sleep 3600; done

# kubectl get pods --all-namespaces
NAMESPACE     NAME                                   READY     STATUS    RESTARTS   AGE
kube-system   dummy-2088944543-s0c5g                 1/1       Running   0          50m
kube-system   etcd-iz25beglnhtz                      1/1       Running   0          50m
kube-system   kube-apiserver-iz25beglnhtz            1/1       Running   0          50m
kube-system   kube-controller-manager-iz25beglnhtz   1/1       Running   0          50m
kube-system   kube-discovery-1769846148-6z5pw        1/1       Running   0          50m
kube-system   kube-dns-2924299975-pddz5              4/4       Running   0          49m
kube-system   kube-flannel-ds-5ww9k                  2/2       Running   0          4m
kube-system   kube-proxy-sb4sm                       1/1       Running   0          49m
kube-system   kube-scheduler-iz25beglnhtz            1/1       Running   0          49m

至少集群的核心组件已经全部run起来了。看起来似乎是成功了。

3、minion node:join the cluster

接下来,就该minion node加入cluster了。这里我们用到了kubeadm的第二个命令:kubeadm join。

在minion node上执行(注意:这里要保证master node的9898端口在防火墙是打开的):

# kubeadm join --token=2e7da9.7fc5668ff26430c7 123.56.200.187
[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.
[preflight] Running pre-flight checks
[tokens] Validating provided token
[discovery] Created cluster info discovery client, requesting info from "http://123.56.200.187:9898/cluster-info/v1/?token-id=2e7da9"
[discovery] Cluster info object received, verifying signature using given token
[discovery] Cluster info signature and contents are valid, will use API endpoints [https://123.56.200.187:6443]
[bootstrap] Trying to connect to endpoint https://123.56.200.187:6443
[bootstrap] Detected server version: v1.5.1
[bootstrap] Successfully established connection with endpoint "https://123.56.200.187:6443"
[csr] Created API client to obtain unique certificate for this node, generating keys and certificate signing request
[csr] Received signed certificate from the API server:
Issuer: CN=kubernetes | Subject: CN=system:node:iZ2ze39jeyizepdxhwqci6Z | CA: false
Not before: 2016-12-26 09:31:00 +0000 UTC Not After: 2017-12-26 09:31:00 +0000 UTC
[csr] Generating kubelet configuration
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"

Node join complete:
* Certificate signing request sent to master and response
  received.
* Kubelet informed of new secure connection details.

Run 'kubectl get nodes' on the master to see this machine join.

也很顺利。我们在minion node上看到的k8s组件情况如下:

d85cf36c18ed        gcr.io/google_containers/kube-proxy-amd64:v1.5.1      "kube-proxy --kubecon"   About an hour ago   Up About an hour                        k8s_kube-proxy.36dab4e8_kube-proxy-lsn0t_kube-system_b8eddf1c-cb4e-11e6-ad8f-00163e1001d7_5826f32b
a60e373b48b8        gcr.io/google_containers/pause-amd64:3.0              "/pause"                 About an hour ago   Up About an hour                        k8s_POD.d8dbe16c_kube-proxy-lsn0t_kube-system_b8eddf1c-cb4e-11e6-ad8f-00163e1001d7_46bfcf67
a665145eb2b5        quay.io/coreos/flannel-git:v0.6.1-28-g5dde68d-amd64   "/bin/sh -c 'set -e -"   About an hour ago   Up About an hour                        k8s_install-cni.17d8cf2_kube-flannel-ds-tr8zr_kube-system_06eca729-cb72-11e6-ad8f-00163e1001d7_01e12f61
5b46f2cb0ccf        gcr.io/google_containers/pause-amd64:3.0              "/pause"                 About an hour ago   Up About an hour                        k8s_POD.d8dbe16c_kube-flannel-ds-tr8zr_kube-system_06eca729-cb72-11e6-ad8f-00163e1001d7_ac880d20

我们在master node上查看当前cluster状态:

# kubectl get nodes
NAME                      STATUS         AGE
iz25beglnhtz              Ready,master   1h
iz2ze39jeyizepdxhwqci6z   Ready          21s

k8s cluster创建”成功”!真的成功了吗?“折腾”才刚刚开始:(!

三、Flannel Pod Network问题

Join成功所带来的“余温”还未散去,我就发现了Flannel pod network的问题,troubleshooting正式开始:(。

1、minion node上的flannel时不时地报错

刚join时还好好的,可过了没一会儿,我们就发现在kubectl get pod –all-namespaces中有错误出现:

kube-system   kube-flannel-ds-tr8zr                  1/2       CrashLoopBackOff   189        16h

我们发现这是minion node上的flannel pod中的一个container出错导致的,跟踪到的具体错误如下:

# docker logs bc0058a15969
E1227 06:17:50.605110       1 main.go:127] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-tr8zr': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-tr8zr: dial tcp 10.96.0.1:443: i/o timeout

10.96.0.1是pod network中apiserver service的cluster ip,而minion node上的flannel组件居然无法访问到这个cluster ip!这个问题的奇怪之处还在于,有些时候这个Pod在被调度restart N多次后或者被删除重启后,又突然变为running状态了,行为十分怪异。

在flannel github.com issues中,至少有两个open issue与此问题有密切关系:

https://github.com/coreos/flannel/issues/545

https://github.com/coreos/flannel/issues/535

这个问题暂无明确解。当minion node上的flannel pod自恢复为running状态时,我们又可以继续了。

2、minion node上flannel pod启动失败的一个应对方法

在下面issue中,很多developer讨论了minion node上flannel pod启动失败的一种可能原因以及临时应对方法:

https://github.com/kubernetes/kubernetes/issues/34101

这种说法大致就是minion node上的kube-proxy使用了错误的interface,通过下面方法可以fix这个问题。在minion node上执行:

#  kubectl -n kube-system get ds -l 'component=kube-proxy' -o json | jq '.items[0].spec.template.spec.containers[0].command |= .+ ["--cluster-cidr=10.244.0.0/16"]' | kubectl apply -f - && kubectl -n kube-system delete pods -l 'component=kube-proxy'
daemonset "kube-proxy" configured
pod "kube-proxy-lsn0t" deleted
pod "kube-proxy-sb4sm" deleted

执行后,flannel pod的状态:

kube-system   kube-flannel-ds-qw291                  2/2       Running   8          17h
kube-system   kube-flannel-ds-x818z                  2/2       Running   17         1h

经过17次restart,minion node上的flannel pod 启动ok了。其对应的flannel container启动日志如下:

# docker logs 1f64bd9c0386
I1227 07:43:26.670620       1 main.go:132] Installing signal handlers
I1227 07:43:26.671006       1 manager.go:133] Determining IP address of default interface
I1227 07:43:26.670825       1 kube.go:233] starting kube subnet manager
I1227 07:43:26.671514       1 manager.go:163] Using 59.110.67.15 as external interface
I1227 07:43:26.671575       1 manager.go:164] Using 59.110.67.15 as external endpoint
I1227 07:43:26.746811       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1227 07:43:26.749785       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1227 07:43:26.752343       1 ipmasq.go:47] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
I1227 07:43:26.755126       1 manager.go:246] Lease acquired: 10.244.1.0/24
I1227 07:43:26.755444       1 network.go:58] Watching for L3 misses
I1227 07:43:26.755475       1 network.go:66] Watching for new subnet leases
I1227 07:43:27.755830       1 network.go:153] Handling initial subnet events
I1227 07:43:27.755905       1 device.go:163] calling GetL2List() dev.link.Index: 10
I1227 07:43:27.756099       1 device.go:168] calling NeighAdd: 123.56.200.187, ca:68:7c:9b:cc:67

issue中说到,在kubeadm init时,显式地指定–advertise-address将会避免这个问题。不过目前不要在–advertise-address后面写上多个IP,虽然文档上说是支持的,但实际情况是,当你显式指定–advertise-address的值为两个或两个以上IP时,比如下面这样:

#kubeadm init --api-advertise-addresses=10.47.217.91,123.56.200.187 --pod-network-cidr=10.244.0.0/16

master初始化成功后,当minion node执行join cluster命令时,会panic掉:

# kubeadm join --token=92e977.f1d4d090906fc06a 10.47.217.91
[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.
... ...
[bootstrap] Successfully established connection with endpoint "https://10.47.217.91:6443"
[bootstrap] Successfully established connection with endpoint "https://123.56.200.187:6443"
E1228 10:14:05.405294   28378 runtime.go:64] Observed a panic: "close of closed channel" (close of closed channel)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/local/go/src/runtime/asm_amd64.s:479
/usr/local/go/src/runtime/panic.go:458
/usr/local/go/src/runtime/chan.go:311
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/node/bootstrap.go:85
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:96
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:97
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:52
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/node/bootstrap.go:93
/usr/local/go/src/runtime/asm_amd64.s:2086
[csr] Created API client to obtain unique certificate for this node, generating keys and certificate signing request
panic: close of closed channel [recovered]
    panic: close of closed channel

goroutine 29 [running]:
panic(0x1342de0, 0xc4203eebf0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
k8s.io/kubernetes/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/runtime/runtime.go:56 +0x126
panic(0x1342de0, 0xc4203eebf0)
    /usr/local/go/src/runtime/panic.go:458 +0x243
k8s.io/kubernetes/cmd/kubeadm/app/node.EstablishMasterConnection.func1.1()
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/node/bootstrap.go:85 +0x29d
k8s.io/kubernetes/pkg/util/wait.JitterUntil.func1(0xc420563ee0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:96 +0x5e
k8s.io/kubernetes/pkg/util/wait.JitterUntil(0xc420563ee0, 0x12a05f200, 0x0, 0xc420022e01, 0xc4202c2060)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:97 +0xad
k8s.io/kubernetes/pkg/util/wait.Until(0xc420563ee0, 0x12a05f200, 0xc4202c2060)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/wait/wait.go:52 +0x4d
k8s.io/kubernetes/cmd/kubeadm/app/node.EstablishMasterConnection.func1(0xc4203a82f0, 0xc420269b90, 0xc4202c2060, 0xc4202c20c0, 0xc4203d8d80, 0x401, 0x480, 0xc4201e75e0, 0x17, 0xc4201e7560, ...)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/node/bootstrap.go:93 +0x100
created by k8s.io/kubernetes/cmd/kubeadm/app/node.EstablishMasterConnection
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/node/bootstrap.go:94 +0x3ed

关于join panic这个问题,在这个issue中有详细讨论:https://github.com/kubernetes/kubernetes/issues/36988

3、open /run/flannel/subnet.env: no such file or directory

前面说过,默认情况下,考虑安全原因,master node是不承担work load的,不参与pod调度。我们这里机器少,只能让master node也辛苦一下。通过下面这个命令可以让master node也参与pod调度:

# kubectl taint nodes --all dedicated-
node "iz25beglnhtz" tainted

接下来,我们create一个deployment,manifest描述文件如下:

//run-my-nginx.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx:1.10.1
        ports:
        - containerPort: 80

create后,我们发现调度到master上的my-nginx pod启动是ok的,但minion node上的pod则一直失败,查看到的失败原因如下:

Events:
  FirstSeen    LastSeen    Count    From                    SubObjectPath    Type        Reason        Message
  ---------    --------    -----    ----                    -------------    --------    ------        -------
  28s        28s        1    {default-scheduler }                    Normal        Scheduled    Successfully assigned my-nginx-2560993602-0440x to iz2ze39jeyizepdxhwqci6z
  27s        1s        26    {kubelet iz2ze39jeyizepdxhwqci6z}            Warning        FailedSync    Error syncing pod, skipping: failed to "SetupNetwork" for "my-nginx-2560993602-0440x_default" with SetupNetworkError: "Failed to setup network for pod \"my-nginx-2560993602-0440x_default(ba5ce554-cbf1-11e6-8c42-00163e1001d7)\" using network plugins \"cni\": open /run/flannel/subnet.env: no such file or directory; Skipping pod"

在minion node上的确没有找到/run/flannel/subnet.env该文件。但master node上有这个文件:

// /run/flannel/subnet.env

FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true

于是手动在minion node上创建一份/run/flannel/subnet.env,并复制master node同名文件的内容,保存。稍许片刻,minion node上的my-nginx pod从error变成running了。

4、no IP addresses available in network: cbr0

将之前的一个my-nginx deployment的replicas改为3,并创建基于该deployment中pods的my-nginx service:

//my-nginx-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30062
    protocol: TCP
  selector:
    run: my-nginx

修改后,通过curl localhost:30062测试服务连通性。发现通过VIP负载均衡到master node上的my-nginx pod的request都成功得到了Response,但是负载均衡到minion node上pod的request,则阻塞在那里,直到timeout。查看pod信息才发现,原来新调度到minion node上的my-nginx pod并没有启动ok,错误原因如下:

Events:
  FirstSeen    LastSeen    Count    From                    SubObjectPath    Type        Reason        Message
  ---------    --------    -----    ----                    -------------    --------    ------        -------
  2m        2m        1    {default-scheduler }                    Normal        Scheduled    Successfully assigned my-nginx-1948696469-ph11m to iz2ze39jeyizepdxhwqci6z
  2m        0s        177    {kubelet iz2ze39jeyizepdxhwqci6z}            Warning        FailedSync    Error syncing pod, skipping: failed to "SetupNetwork" for "my-nginx-1948696469-ph11m_default" with SetupNetworkError: "Failed to setup network for pod \"my-nginx-1948696469-ph11m_default(3700d74a-cc12-11e6-8c42-00163e1001d7)\" using network plugins \"cni\": no IP addresses available in network: cbr0; Skipping pod"

查看minion node上/var/lib/cni/networks/cbr0目录,发现该目录下有如下文件:

10.244.1.10   10.244.1.12   10.244.1.14   10.244.1.16   10.244.1.18   10.244.1.2    10.244.1.219  10.244.1.239  10.244.1.3   10.244.1.5   10.244.1.7   10.244.1.9
10.244.1.100  10.244.1.120  10.244.1.140  10.244.1.160  10.244.1.180  10.244.1.20   10.244.1.22   10.244.1.24   10.244.1.30  10.244.1.50  10.244.1.70  10.244.1.90
10.244.1.101  10.244.1.121  10.244.1.141  10.244.1.161  10.244.1.187  10.244.1.200  10.244.1.220  10.244.1.240  10.244.1.31  10.244.1.51  10.244.1.71  10.244.1.91
10.244.1.102  10.244.1.122  10.244.1.142  10.244.1.162  10.244.1.182  10.244.1.201  10.244.1.221  10.244.1.241  10.244.1.32  10.244.1.52  10.244.1.72  10.244.1.92
10.244.1.103  10.244.1.123  10.244.1.143  10.244.1.163  10.244.1.183  10.244.1.202  10.244.1.222  10.244.1.242  10.244.1.33  10.244.1.53  10.244.1.73  10.244.1.93
10.244.1.104  10.244.1.124  10.244.1.144  10.244.1.164  10.244.1.184  10.244.1.203  10.244.1.223  10.244.1.243  10.244.1.34  10.244.1.54  10.244.1.74  10.244.1.94
10.244.1.105  10.244.1.125  10.244.1.145  10.244.1.165  10.244.1.185  10.244.1.204  10.244.1.224  10.244.1.244  10.244.1.35  10.244.1.55  10.244.1.75  10.244.1.95
10.244.1.106  10.244.1.126  10.244.1.146  10.244.1.166  10.244.1.186  10.244.1.205  10.244.1.225  10.244.1.245  10.244.1.36  10.244.1.56  10.244.1.76  10.244.1.96
10.244.1.107  10.244.1.127  10.244.1.147  10.244.1.167  10.244.1.187  10.244.1.206  10.244.1.226  10.244.1.246  10.244.1.37  10.244.1.57  10.244.1.77  10.244.1.97
10.244.1.108  10.244.1.128  10.244.1.148  10.244.1.168  10.244.1.188  10.244.1.207  10.244.1.227  10.244.1.247  10.244.1.38  10.244.1.58  10.244.1.78  10.244.1.98
10.244.1.109  10.244.1.129  10.244.1.149  10.244.1.169  10.244.1.189  10.244.1.208  10.244.1.228  10.244.1.248  10.244.1.39  10.244.1.59  10.244.1.79  10.244.1.99
10.244.1.11   10.244.1.13   10.244.1.15   10.244.1.17   10.244.1.19   10.244.1.209  10.244.1.229  10.244.1.249  10.244.1.4   10.244.1.6   10.244.1.8   last_reserved_ip
10.244.1.110  10.244.1.130  10.244.1.150  10.244.1.170  10.244.1.190  10.244.1.21   10.244.1.23   10.244.1.25   10.244.1.40  10.244.1.60  10.244.1.80
10.244.1.111  10.244.1.131  10.244.1.151  10.244.1.171  10.244.1.191  10.244.1.210  10.244.1.230  10.244.1.250  10.244.1.41  10.244.1.61  10.244.1.81
10.244.1.112  10.244.1.132  10.244.1.152  10.244.1.172  10.244.1.192  10.244.1.211  10.244.1.231  10.244.1.251  10.244.1.42  10.244.1.62  10.244.1.82
10.244.1.113  10.244.1.133  10.244.1.153  10.244.1.173  10.244.1.193  10.244.1.212  10.244.1.232  10.244.1.252  10.244.1.43  10.244.1.63  10.244.1.83
10.244.1.114  10.244.1.134  10.244.1.154  10.244.1.174  10.244.1.194  10.244.1.213  10.244.1.233  10.244.1.253  10.244.1.44  10.244.1.64  10.244.1.84
10.244.1.115  10.244.1.135  10.244.1.155  10.244.1.175  10.244.1.195  10.244.1.214  10.244.1.234  10.244.1.254  10.244.1.45  10.244.1.65  10.244.1.85
10.244.1.116  10.244.1.136  10.244.1.156  10.244.1.176  10.244.1.196  10.244.1.215  10.244.1.235  10.244.1.26   10.244.1.46  10.244.1.66  10.244.1.86
10.244.1.117  10.244.1.137  10.244.1.157  10.244.1.177  10.244.1.197  10.244.1.216  10.244.1.236  10.244.1.27   10.244.1.47  10.244.1.67  10.244.1.87
10.244.1.118  10.244.1.138  10.244.1.158  10.244.1.178  10.244.1.198  10.244.1.217  10.244.1.237  10.244.1.28   10.244.1.48  10.244.1.68  10.244.1.88
10.244.1.119  10.244.1.139  10.244.1.159  10.244.1.179  10.244.1.199  10.244.1.218  10.244.1.238  10.244.1.29   10.244.1.49  10.244.1.69  10.244.1.89

这已经将10.244.1.x段的所有ip占满,自然没有available的IP可供新pod使用了。至于为何占满,这个原因尚不明朗。下面两个open issue与这个问题相关:

https://github.com/containernetworking/cni/issues/306

https://github.com/kubernetes/kubernetes/issues/21656

进入到/var/lib/cni/networks/cbr0目录下,执行下面命令可以释放那些可能是kubelet leak的IP资源:

for hash in $(tail -n +1 * | grep '^[A-Za-z0-9]*$' | cut -c 1-8); do if [ -z $(docker ps -a | grep $hash | awk '{print $1}') ]; then grep -irl $hash ./; fi; done | xargs rm

执行后,目录下的文件列表变成了:

ls -l
total 32
drw-r--r-- 2 root root 12288 Dec 27 17:11 ./
drw-r--r-- 3 root root  4096 Dec 27 13:52 ../
-rw-r--r-- 1 root root    64 Dec 27 17:11 10.244.1.2
-rw-r--r-- 1 root root    64 Dec 27 17:11 10.244.1.3
-rw-r--r-- 1 root root    64 Dec 27 17:11 10.244.1.4
-rw-r--r-- 1 root root    10 Dec 27 17:11 last_reserved_ip

不过pod仍然处于失败状态,但这次失败的原因又发生了变化:

Events:
  FirstSeen    LastSeen    Count    From                    SubObjectPath    Type        Reason        Message
  ---------    --------    -----    ----                    -------------    --------    ------        -------
  23s        23s        1    {default-scheduler }                    Normal        Scheduled    Successfully assigned my-nginx-1948696469-7p4nn to iz2ze39jeyizepdxhwqci6z
  22s        1s        22    {kubelet iz2ze39jeyizepdxhwqci6z}            Warning        FailedSync    Error syncing pod, skipping: failed to "SetupNetwork" for "my-nginx-1948696469-7p4nn_default" with SetupNetworkError: "Failed to setup network for pod \"my-nginx-1948696469-7p4nn_default(a40fe652-cc14-11e6-8c42-00163e1001d7)\" using network plugins \"cni\": \"cni0\" already has an IP address different from 10.244.1.1/24; Skipping pod"

而/var/lib/cni/networks/cbr0目录下的文件又开始迅速增加!问题陷入僵局。

5、flannel vxlan不通,后端换udp,仍然不通

折腾到这里,基本筋疲力尽了。于是在两个node上执行kubeadm reset,准备重新来过。

kubeadm reset后,之前flannel创建的bridge device cni0和网口设备flannel.1依然健在。为了保证环境彻底恢复到初始状态,我们可以通过下面命令删除这两个设备:

# ifconfig  cni0 down
# brctl delbr cni0
# ip link delete flannel.1

有了前面几个问题的“磨炼”后,重新init和join的k8s cluster显得格外顺利。这次minion node没有再出现什么异常。

#  kubectl get nodes -o wide
NAME                      STATUS         AGE       EXTERNAL-IP
iz25beglnhtz              Ready,master   5m        <none>
iz2ze39jeyizepdxhwqci6z   Ready          51s       <none>

# kubectl get pod --all-namespaces
NAMESPACE     NAME                                   READY     STATUS    RESTARTS   AGE
default       my-nginx-1948696469-71h1l              1/1       Running   0          3m
default       my-nginx-1948696469-zwt5g              1/1       Running   0          3m
default       my-ubuntu-2560993602-ftdm6             1/1       Running   0          3m
kube-system   dummy-2088944543-lmlbh                 1/1       Running   0          5m
kube-system   etcd-iz25beglnhtz                      1/1       Running   0          6m
kube-system   kube-apiserver-iz25beglnhtz            1/1       Running   0          6m
kube-system   kube-controller-manager-iz25beglnhtz   1/1       Running   0          6m
kube-system   kube-discovery-1769846148-l5lfw        1/1       Running   0          5m
kube-system   kube-dns-2924299975-mdq5r              4/4       Running   0          5m
kube-system   kube-flannel-ds-9zwr1                  2/2       Running   0          5m
kube-system   kube-flannel-ds-p7xh2                  2/2       Running   0          1m
kube-system   kube-proxy-dwt5f                       1/1       Running   0          5m
kube-system   kube-proxy-vm6v2                       1/1       Running   0          1m
kube-system   kube-scheduler-iz25beglnhtz            1/1       Running   0          6m

接下来我们创建my-nginx deployment和service来测试flannel网络的连通性。通过curl my-nginx service的nodeport,发现可以reach master上的两个nginx pod,但是minion node上的pod依旧不通。

在master上看flannel docker的日志:

I1228 02:52:22.097083       1 network.go:225] L3 miss: 10.244.1.2
I1228 02:52:22.097169       1 device.go:191] calling NeighSet: 10.244.1.2, 46:6c:7a:a6:06:60
I1228 02:52:22.097335       1 network.go:236] AddL3 succeeded
I1228 02:52:55.169952       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:00.801901       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:03.801923       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:04.801764       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:05.801848       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:06.888269       1 network.go:225] L3 miss: 10.244.1.2
I1228 02:53:06.888340       1 device.go:191] calling NeighSet: 10.244.1.2, 46:6c:7a:a6:06:60
I1228 02:53:06.888507       1 network.go:236] AddL3 succeeded
I1228 02:53:39.969791       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:45.153770       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:48.154822       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:49.153774       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:50.153734       1 network.go:220] Ignoring not a miss: 46:6c:7a:a6:06:60, 10.244.1.2
I1228 02:53:52.154056       1 network.go:225] L3 miss: 10.244.1.2
I1228 02:53:52.154110       1 device.go:191] calling NeighSet: 10.244.1.2, 46:6c:7a:a6:06:60
I1228 02:53:52.154256       1 network.go:236] AddL3 succeeded

日志中有大量:“Ignoring not a miss”字样的日志,似乎vxlan网络有问题。这个问题与下面issue中描述颇为接近:

https://github.com/coreos/flannel/issues/427

Flannel默认采用vxlan作为backend,使用kernel vxlan默认的udp 8742端口。Flannel还支持udp的backend,使用udp 8285端口。于是试着更换一下flannel后端。更换flannel后端的步骤如下:

  • 将https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml文件下载到本地;
  • 修改kube-flannel.yml文件内容:主要是针对net-conf.json属性,增加”Backend”字段属性:
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true
      }
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "udp",
        "Port": 8285
      }
    }
---
... ...
  • 卸载并重新安装pod网络
# kubectl delete -f kube-flannel.yml
configmap "kube-flannel-cfg" deleted
daemonset "kube-flannel-ds" deleted

# kubectl apply -f kube-flannel.yml
configmap "kube-flannel-cfg" created
daemonset "kube-flannel-ds" created

# netstat -an|grep 8285
udp        0      0 123.56.200.187:8285     0.0.0.0:*

经过测试发现:udp端口是通的。在两个node上tcpdump -i flannel0 可以看到udp数据包的发送和接收。但是两个node间的pod network依旧不通。

6、failed to register network: failed to acquire lease: node “iz25beglnhtz” not found

正常情况下master node和minion node上的flannel pod的启动日志如下:

master node flannel的运行:

I1227 04:56:16.577828       1 main.go:132] Installing signal handlers
I1227 04:56:16.578060       1 kube.go:233] starting kube subnet manager
I1227 04:56:16.578064       1 manager.go:133] Determining IP address of default interface
I1227 04:56:16.578576       1 manager.go:163] Using 123.56.200.187 as external interface
I1227 04:56:16.578616       1 manager.go:164] Using 123.56.200.187 as external endpoint
E1227 04:56:16.579079       1 network.go:106] failed to register network: failed to acquire lease: node "iz25beglnhtz" not found
I1227 04:56:17.583744       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1227 04:56:17.585367       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1227 04:56:17.587765       1 ipmasq.go:47] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
I1227 04:56:17.589943       1 manager.go:246] Lease acquired: 10.244.0.0/24
I1227 04:56:17.590203       1 network.go:58] Watching for L3 misses
I1227 04:56:17.590255       1 network.go:66] Watching for new subnet leases
I1227 07:43:27.164103       1 network.go:153] Handling initial subnet events
I1227 07:43:27.164211       1 device.go:163] calling GetL2List() dev.link.Index: 5
I1227 07:43:27.164350       1 device.go:168] calling NeighAdd: 59.110.67.15, ca:50:97:1f:c2:ea

minion node上flannel的运行:

# docker logs 1f64bd9c0386
I1227 07:43:26.670620       1 main.go:132] Installing signal handlers
I1227 07:43:26.671006       1 manager.go:133] Determining IP address of default interface
I1227 07:43:26.670825       1 kube.go:233] starting kube subnet manager
I1227 07:43:26.671514       1 manager.go:163] Using 59.110.67.15 as external interface
I1227 07:43:26.671575       1 manager.go:164] Using 59.110.67.15 as external endpoint
I1227 07:43:26.746811       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1227 07:43:26.749785       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1227 07:43:26.752343       1 ipmasq.go:47] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
I1227 07:43:26.755126       1 manager.go:246] Lease acquired: 10.244.1.0/24
I1227 07:43:26.755444       1 network.go:58] Watching for L3 misses
I1227 07:43:26.755475       1 network.go:66] Watching for new subnet leases
I1227 07:43:27.755830       1 network.go:153] Handling initial subnet events
I1227 07:43:27.755905       1 device.go:163] calling GetL2List() dev.link.Index: 10
I1227 07:43:27.756099       1 device.go:168] calling NeighAdd: 123.56.200.187, ca:68:7c:9b:cc:67

但在进行上面问题5的测试过程中,我们发现flannel container的启动日志中有如下错误:

master node:

# docker logs c2d1cee3df3d
I1228 06:53:52.502571       1 main.go:132] Installing signal handlers
I1228 06:53:52.502735       1 manager.go:133] Determining IP address of default interface
I1228 06:53:52.503031       1 manager.go:163] Using 123.56.200.187 as external interface
I1228 06:53:52.503054       1 manager.go:164] Using 123.56.200.187 as external endpoint
E1228 06:53:52.503869       1 network.go:106] failed to register network: failed to acquire lease: node "iz25beglnhtz" not found
I1228 06:53:52.503899       1 kube.go:233] starting kube subnet manager
I1228 06:53:53.522892       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1228 06:53:53.524325       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1228 06:53:53.526622       1 ipmasq.go:47] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
I1228 06:53:53.528438       1 manager.go:246] Lease acquired: 10.244.0.0/24
I1228 06:53:53.528744       1 network.go:58] Watching for L3 misses
I1228 06:53:53.528777       1 network.go:66] Watching for new subnet leases

minion node:

# docker logs dcbfef45308b
I1228 05:28:05.012530       1 main.go:132] Installing signal handlers
I1228 05:28:05.012747       1 manager.go:133] Determining IP address of default interface
I1228 05:28:05.013011       1 manager.go:163] Using 59.110.67.15 as external interface
I1228 05:28:05.013031       1 manager.go:164] Using 59.110.67.15 as external endpoint
E1228 05:28:05.013204       1 network.go:106] failed to register network: failed to acquire lease: node "iz2ze39jeyizepdxhwqci6z" not found
I1228 05:28:05.013237       1 kube.go:233] starting kube subnet manager
I1228 05:28:06.041602       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I1228 05:28:06.042863       1 ipmasq.go:47] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1228 05:28:06.044896       1 ipmasq.go:47] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
I1228 05:28:06.046497       1 manager.go:246] Lease acquired: 10.244.1.0/24
I1228 05:28:06.046780       1 network.go:98] Watching for new subnet leases
I1228 05:28:07.047052       1 network.go:191] Subnet added: 10.244.0.0/24

两个Node都有“注册网络”失败的错误:failed to register network: failed to acquire lease: node “xxxx” not found。很难断定是否是因为这两个错误导致的两个node间的网络不通。从整个测试过程来看,这个问题时有时无。在下面flannel issue中也有类似的问题讨论:

https://github.com/coreos/flannel/issues/435

Flannel pod network的诸多问题让我决定暂时放弃在kubeadm创建的kubernetes cluster中继续使用Flannel。

四、Calico pod network

Kubernetes支持的pod network add-ons中,除了Flannel,还有calicoWeave net等。这里我们试试基于边界网关BGP协议实现的Calico pod network。Calico Project针对在kubeadm建立的K8s集群的Pod网络安装也有专门的文档。文档中描述的需求和约束我们均满足,比如:

master node带有kubeadm.alpha.kubernetes.io/role: master标签:

# kubectl get nodes -o wide --show-labels
NAME           STATUS         AGE       EXTERNAL-IP   LABELS
iz25beglnhtz   Ready,master   3m        <none>        beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubeadm.alpha.kubernetes.io/role=master,kubernetes.io/hostname=iz25beglnhtz

在安装calico之前,我们还是要执行kubeadm reset重置环境,并将flannel创建的各种网络设备删除,可参考上面几个小节中的命令。

1、初始化集群

使用calico的kubeadm init无需再指定–pod-network-cidr=10.244.0.0/16 option:

# kubeadm init --api-advertise-addresses=10.47.217.91
[kubeadm] WARNING: kubeadm is in alpha, please do not use it for production clusters.
[preflight] Running pre-flight checks
[preflight] Starting the kubelet service
[init] Using Kubernetes version: v1.5.1
[tokens] Generated token: "531b3f.3bd900d61b78d6c9"
[certificates] Generated Certificate Authority key and certificate.
[certificates] Generated API Server key and certificate
[certificates] Generated Service Account signing keys
[certificates] Created keys and certificates in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 13.527323 seconds
[apiclient] Waiting for at least one node to register and become ready
[apiclient] First node is ready after 0.503814 seconds
[apiclient] Creating a test deployment
[apiclient] Test deployment succeeded
[token-discovery] Created the kube-discovery deployment, waiting for it to become ready
[token-discovery] kube-discovery is ready after 1.503644 seconds
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:

http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node:

kubeadm join --token=531b3f.3bd900d61b78d6c9 10.47.217.91

2、创建calico network

# kubectl apply -f http://docs.projectcalico.org/v2.0/getting-started/kubernetes/installation/hosted/kubeadm/calico.yaml
configmap "calico-config" created
daemonset "calico-etcd" created
service "calico-etcd" created
daemonset "calico-node" created
deployment "calico-policy-controller" created
job "configure-calico" created

实际创建过程需要一段时间,因为calico需要pull 一些images:

# docker images
REPOSITORY                                               TAG                        IMAGE ID            CREATED             SIZE
quay.io/calico/node                                      v1.0.0                     74bff066bc6a        7 days ago          256.4 MB
calico/ctl                                               v1.0.0                     069830246cf3        8 days ago          43.35 MB
calico/cni                                               v1.5.5                     ada87b3276f3        12 days ago         67.13 MB
gcr.io/google_containers/etcd                            2.2.1                      a6cd91debed1        14 months ago       28.19 MB

calico在master node本地创建了两个network device:

# ip a
... ...
47: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 192.168.91.0/32 scope global tunl0
       valid_lft forever preferred_lft forever
48: califa32a09679f@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 62:39:10:55:44:c8 brd ff:ff:ff:ff:ff:ff link-netnsid 0

3、minion node join

执行下面命令,将minion node加入cluster:

# kubeadm join --token=531b3f.3bd900d61b78d6c9 10.47.217.91

calico在minion node上也创建了一个network device:

57988: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 192.168.136.192/32 scope global tunl0
       valid_lft forever preferred_lft forever

join成功后,我们查看一下cluster status:

# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE       IP             NODE
kube-system   calico-etcd-488qd                          1/1       Running   0          18m       10.47.217.91   iz25beglnhtz
kube-system   calico-node-jcb3c                          2/2       Running   0          18m       10.47.217.91   iz25beglnhtz
kube-system   calico-node-zthzp                          2/2       Running   0          4m        10.28.61.30    iz2ze39jeyizepdxhwqci6z
kube-system   calico-policy-controller-807063459-f21q4   1/1       Running   0          18m       10.47.217.91   iz25beglnhtz
kube-system   dummy-2088944543-rtsfk                     1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   etcd-iz25beglnhtz                          1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   kube-apiserver-iz25beglnhtz                1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   kube-controller-manager-iz25beglnhtz       1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   kube-discovery-1769846148-51wdk            1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   kube-dns-2924299975-fhf5f                  4/4       Running   0          23m       192.168.91.1   iz25beglnhtz
kube-system   kube-proxy-2s7qc                           1/1       Running   0          4m        10.28.61.30    iz2ze39jeyizepdxhwqci6z
kube-system   kube-proxy-h2qds                           1/1       Running   0          23m       10.47.217.91   iz25beglnhtz
kube-system   kube-scheduler-iz25beglnhtz                1/1       Running   0          23m       10.47.217.91   iz25beglnhtz

所有组件都是ok的。似乎是好兆头!但跨node的pod network是否联通,还需进一步探究。

4、探究跨node的pod network联通性

我们依旧利用上面测试flannel网络的my-nginx-svc.yaml和run-my-nginx.yaml,创建my-nginx service和my-nginx deployment。注意:这之前要先在master node上执行一下”kubectl taint nodes –all dedicated-”,以让master node承载work load。

遗憾的是,结果和flannel很相似,分配到master node上http request得到了nginx的响应;minion node上的pod依旧无法联通。

这次我不想在calico这块过多耽搁,我要快速看看下一个候选者:weave net是否满足要求。

五、weave network for pod

经过上面那么多次尝试,结果是令人扫兴的。Weave network似乎是最后一颗救命稻草了。有了前面的铺垫,这里就不详细列出各种命令的输出细节了。Weave network也有专门的官方文档用于指导如何与kubernetes集群集成,我们主要也是参考它。

1、安装weave network add-on

在kubeadm reset后,我们重新初始化了集群。接下来我们安装weave network add-on:

# kubectl apply -f https://git.io/weave-kube
daemonset "weave-net" created

前面无论是Flannel还是calico,在安装pod network add-on时至少都还是顺利的。不过在Weave network这次,我们遭遇“当头棒喝”:(:

# kubectl get pod --all-namespaces -o wide
NAMESPACE     NAME                                   READY     STATUS              RESTARTS   AGE       IP             NODE
kube-system   dummy-2088944543-4kxtk                 1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   etcd-iz25beglnhtz                      1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   kube-apiserver-iz25beglnhtz            1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   kube-controller-manager-iz25beglnhtz   1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   kube-discovery-1769846148-pzv8p        1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   kube-dns-2924299975-09dcb              0/4       ContainerCreating   0          42m       <none>         iz25beglnhtz
kube-system   kube-proxy-z465f                       1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   kube-scheduler-iz25beglnhtz            1/1       Running             0          42m       10.47.217.91   iz25beglnhtz
kube-system   weave-net-3wk9h                        0/2       CrashLoopBackOff    16         17m       10.47.217.91   iz25beglnhtz

安装后,weave-net pod提示:CrashLoopBackOff。追踪其Container log,得到如下错误信息:

“`

docker logs cde899efa0af

time=”2016-12-28T08:25:29Z” level=info msg=”Starting Weaveworks NPC 1.8.2″
ti

当Docker遇到systemd

近期在做Kubernetes集群的升级的相关试验,即从原先的K8s 1.3.7版本升级到最新的K8s 1.5.1版本。k8s自1.4版本开始引入kubeadm,试图简化K8s的安装和使用门槛,提升开发者体验。但kubeadm仅支持16.04及以上的Ubuntu版本,于是我们在升级K8s集群前会遇到另外一个问题:Ubuntu 16.04已经由Upstart初始化系统换成了systemd初始化系统,Ubuntu 16.04上的Docker engine的使用和配置方法与以前在Ubuntu 14.04上将有所不同。Docker是K8s支持的容器引擎之一,也是目前最主流的容器引擎,弄清楚Docker的配置和使用也是后续用好K8s的前提之一。于是这里打算记录一下Docker与Systemd是如何相生共存的^0^。

一、Ubuntu 16.04安装Docker

Aliyun目前上没有提供官方Ubuntu 16.04 ECS,最高仅支持到Ubuntu 14.04.4。因此在Aliyun ECS上用16.04需要手工upgrade到16.04(不过建议在upgrade前做个snapshot,一旦upgrade失败,好恢复)。升级后的Ubuntu环境信息如下:

Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-58-generic x86_64)

kubeadm文档中认为Docker 1.11.2版本与之更配哟,不过对于更新的版本似乎配合起来也没有什么大问题。我们这里安装目前可以找到的最新stable release: docker 1.12.5:

# docker version
Client:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.5
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   7392c3b
 Built:        Fri Dec 16 02:42:17 2016
 OS/Arch:      linux/amd64

上面是你安装docker成功后,才能输出的version信息哦^0^。

安装Docker的方法随着docker的快速演进也在变化中,随着Docker的成熟,其方法趋于稳定。官方提供的在Ubuntu安装Docker的方法成为主流,我们这里也不例外的参考这一方法。不过这一方法有一前提,那就是你最好配备的“加(fan)速(qiang)器(qi)”,否则好慢,甚至是不成功。

详细步骤如下:(熟悉之的观众可略过之^_^)

1、从keyserver获取key

# apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys  58118E89F3A912897C070ADBF76221572C52609D

Executing: /tmp/tmp.OoFaQ0V0gx/gpg.1.sh --keyserver
hkp://p80.pool.sks-keyservers.net:80
--recv-keys
58118E89F3A912897C070ADBF76221572C52609D
gpg: requesting key 2C52609D from hkp server p80.pool.sks-keyservers.net
gpg: key 2C52609D: public key "Docker Release Tool (releasedocker) <docker@docker.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)

2、添加Docker源

创建/etc/apt/sources.list.d/docker.list文件,写入:

deb https://apt.dockerproject.org/repo ubuntu-xenial main

执行apt-get update更新包信息:

... ...
Get:11 https://apt.dockerproject.org/repo ubuntu-xenial InRelease [30.2 kB]
Fetched 30.2 kB in 2s (14.1 kB/s)
Reading package lists... Done

3、安装Docker engine

执行安装命令,安装Docker engine:

# apt install docker-engine
... ...
Setting up docker-engine (1.12.5-0~ubuntu-xenial) ...
Setting up liberror-perl (0.17-1.2) ...
Setting up git-man (1:2.7.4-0ubuntu1) ...
Setting up git (1:2.7.4-0ubuntu1) ...
Processing triggers for libc-bin (2.23-0ubuntu5) ...
Processing triggers for systemd (229-4ubuntu13) ...
Processing triggers for ureadahead (0.100.0-19) ...

验证安装结果:

# which docker
/usr/bin/docker

# docker version
... ... //输出和上一节相同的结果

# ps -ef|grep docker
root     22132     1  0 11:18 ?        00:00:00 /usr/bin/dockerd -H fd://
root     22162 22132  0 11:18 ?        00:00:00 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim docker-containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --runtime docker-runc

安装后,Docker引擎自动启动了。

二、Docker服务的使能与启停

控制Docker服务开机自启以及启停操作的脚本已经由upstart初始化系统的/etc/init.d/docker变为了systemd初始化系统的/lib/systemd/system/docker.service。

在systemd下,docker service的脚本路径通过下面命令可以找到:

# systemctl show --property=FragmentPath docker
FragmentPath=/lib/systemd/system/docker.service

通过下面命令可以查看docker service的是否是开机自启:

# systemctl is-enabled docker
enabled

通过systemctl enable和disable命令可以使能开机自启或取消开机自启。

传统Ubuntu通过service docker start/stop/restart启动、停止或重启服务,换到systemd后,我们需要用systemctl start/stop/restart docker来启动、停止或重启服务。

三、Docker的EnvironmentFile

以前我们给Docker engine设置一个http_proxy、设置–insecure-registry或–registry-mirror、配置一个dns啥的,都可以通过/etc/default/docker中的DOCKER_OPTS以及相关export的环境变量实现。但在Ubuntu 16.04下这个配置文件变成了这样:

# Docker Upstart and SysVinit configuration file

#
# THIS FILE DOES NOT APPLY TO SYSTEMD
#
#   Please see the documentation for "systemd drop-ins":
#   https://docs.docker.com/engine/articles/systemd/
... ...

问题来了!我们怎么配置Docker engine呢?Docker官方推荐在如下路径下面创建配置文件(比如http-proxy.conf),以override默认的docker.service文件中的配置:

/etc/systemd/system/docker.service.d

不过经测试后(after systemctl daemon-reload; systemctl restart docker),发现并不生效。

我们来使用EnvironmentFile对Docker Engine进行配置。编辑/lib/systemd/system/docker.service文件,添加如下内容:

ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS
EnvironmentFile=-/etc/default/docker

习惯了使用/etc/default/docker配置DOCKER_OPTS等配置,于是在EnvironmentFile中直接使用了该文件。

///etc/default/docker
DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4"

# If you need Docker to use an HTTP proxy, it can also be specified here.
#export http_proxy="http://127.0.0.1:3128/"
http_proxy="http://xxxxx"
https_proxy="xxxx"
no_proxy="127.0.0.1,localhost"

保存后,执行:

systemctl daemon-reload
systemctl restart docker

你会发现配置生效了。

经常接触/etc/default/docker的人会发现,上述文件中的http_proxy等变量前面的export关键字没有了。没错,在systemd环境下,不再需要export了,如果加上export,反倒会导致配置不生效。

四、Docker引擎的日志

最后,Docker引擎的日志哪里去了?以前不是在/var/log/upstart/下面么?Ubuntu 16.04中,这个目录下连docker字样的影儿都没看到。

在systemd下面,我们需要搬出journalctl工具。想看docker service的实时日志,请执行:

# journalctl -u docker -f

看历史日志:

# journalctl --since "1 hour ago" -u docker

更多journalctl用法,可以参考其man pages

使用Visual Studio Code辅助Go源码编写

作为VIMer,日常编码中,Vim编辑器依然是我的首选。以前以C语言为主要语言的时候是这样,现在以Go为主要语言时亦是这样。不过近期发现Mac上使用Vim在编写Go代码时,Vim时不时的“抽风”:出现一些“屏幕字符被篡改”的问题,比如下面这幅图中”func”变成了”fknc”:

虽然一段时间后,显示会自动更正过来,但这种“篡改”是会让你产生“幻觉”的。你会想:是不是我真的将”func”写成”fknc”了呢?久而久之,这个瑕疵将会影响你的编码效率。至于为何会出现这个问题,初步怀疑可能是因为vim加载较多插件导致的一些性能问题,我在安装了Ubuntu 16.04的台式机上至今还没发现这个问题(相同的.vimrc配置)。

于是,我打算找一款辅助编辑器,用于在被上面这个问题折磨得开始“厌恶”Vim的某些时候,切换一下,平复一下心情^0^。我看中了Microsoft开源的Visual Studio Code,简称:VSCode。

一、与Microsoft的Visual Studio的渊源

Microsoft做IDE还是很专业的,也是很认真的。大学那时候学C,嫌弃Turbo C太简陋,基本上都是在D版Visual Studio 6.0上完成各种作业和小程序的制作的。后来在2001年微软发布了.net战略,发布了C#语言,同时也发布了Visual Studio .NET IDE。估计我也算是国内第一批使用到Visual Studio.NET IDE的人吧,那时候微软俱乐部在校园里免费发送Vs.net beta版光盘,我拿到了一份,并第一时间体验了vs.net。Visual Studio .NET与之前的VS 6.0有着天壤之别,功能强大,界面也做了重新设计,支持微软的各种语言,包括C#、C/C++(包括managed c++)、VB、ASP.net等,并在一年后的正式版发布后,逐渐在桌面应用程序开发中成为霸主,把那个时候在IDE领域的竞争对手Borland公司彻底打垮。但Visual Studio从此也变得更加庞大和臃肿,安装一个VS,没有几个G空间是不行的。想想那个时候机器的配置,跑个VS.net还真是心有而力不足。

工作之后,进入服务端编程领域,结识了Unix、Linux以及VimGCC,就再也没怎么碰过Visual Studio。随着工作OS也从Windows切换到Ubuntu,基本就和VS绝缘了。之后随着Java语言成为企业级应用的主角、Web时代的到来以及开源IDE(比如:Eclipse)的兴起,微软的Visual Studio不再那么耀眼,或者说是人们对于IDE的关注并不像开发GUI程序那个年代那么强烈了。但鉴于微软自身产品体系的庞大,VS始终在市场中占有一席之地。

而近些年,一些跨平台、轻量级、插件结构、支持智能感知、可随意定制的文本编辑器的出现,比如:Sublime TextAtom等让开发人员喜不自禁。这些编辑器并非定位于IDE,但功能又不输给IDE很多,尤其在支持编码、调试这些环节,它们完全可以与专业IDE媲美,但资源消耗却是像Visual Studio、Eclipse这样大而全的IDE所无法匹敌的。而Visual Studio Code恰是微软在这方面的一个尝试,也是微软最新公司战略的体现之一:拥抱所有开发者(不仅仅是Windows上的哦)。

二、VSCode安装

VSCode发布于2015年4月的Build大会上。发布后,迅速得到开发者响应,大家普遍反映:VSCode性能不错、关注细节、体验良好,虽然当时VSCode的插件还不算丰富。一年多过去后,VSCode已经演化到了1.8.1版本(截至2016年12月末),支持所有主流编程语言的开发,配套的插件也十分丰富了。VSCode的安装简单的很,这一向都是微软的强项,你可以在其官方站上下载到各个平台的安装包(Linux平台也有.deb/.rpm两种包格式供选择,并提供32bit和64bit两种版本)。下载后安装即可。

1、VSCode配置和数据存储路径

VSCode安装后,一般不必关心其配置和数据存储路径的位置。但作为有一些Geek精神的developer来说,弄清楚其安装和配置的来龙去脉还是很有意义的。

在Mac上:

VSCode存储运行数据和配置文件的目录在:~/Library/Application Support/Code下:

 ~/Library/Application Support/Code]$ls
Backups/        CachedData/        Cookies-journal        Local Storage/        User/
Cache/            Cookies            GPUCache/        Preferences        storage.json

$ls User
keybindings.json    locale.json        settings.json        snippets/        workspaceStorage/

在Ubuntu中:

VSCode存储运行数据和配置文件的目录在~/.config/Code下面:

~/.config/Code$ ls
Backups  Cache  CachedData  Cookies  Cookies-journal  GPUCache  Local Storage  storage.json  User

至于Windows平台,请自行探索^_^。

2、启动方式

VSCode有两种启动方式:桌面启动和命令行启动。桌面启动自不必说了。命令行启动的示例如下:

$ code main.go

code命令会打开一个VSCode窗口并加载命令参数中的文件内容,这里是main.go。

三、VSCode的配置

一般来说,VSCode启动即可用了。但要想发挥出VSCode的能量,我们必须对其进行一番配置。VSCode的配置有几十上百项,这里无法全覆盖,仅说明一下我个人比较关注的。

1、安装插件

像VSCode这种小清新文本编辑器要想对编程语言有很好的支持,必须安装相应语言的插件。以Go为例,我们至少要安装vscode-go插件。vscode-go之于VSCode,就好比vim-go之于VIM。并且和vim-go类似,vscode-go实现的各种Features也是依赖诸多已存在的Go周边工具,包括:

gocode: go get -u -v github.com/nsf/gocode
godef: go get -u -v github.com/rogpeppe/godef
gogetdoc: go get -u -v github.com/zmb3/gogetdoc
golint: go get -u -v github.com/golang/lint/golint
go-outline: go get -u -v github.com/lukehoban/go-outline
goreturns: go get -u -v sourcegraph.com/sqs/goreturns
gorename: go get -u -v golang.org/x/tools/cmd/gorename
gopkgs: go get -u -v github.com/tpng/gopkgs
go-symbols: go get -u -v github.com/newhook/go-symbols
guru: go get -u -v golang.org/x/tools/cmd/guru
gotests: go get -u -v github.com/cweill/gotests/...

因此,要想实现vscode-go官网页面中demo中哪些神奇的Feature,你必须将上面的这些依赖工具逐一安装成功。如果缺少一个依赖工具,VSCode会在窗口右下角的状态栏里显示:“Analysis Tools Missing”字样,以提示你安装这些工具。

VSCode当然也支持Vim-mode的编辑模式,如果你也和我一样,喜欢用vim-mode在VSCode中进行编辑,可以安装VSCodeVim插件

VSCode的插件安装方式分为两种:在线安装和VSIX方式安装。

在线安装,顾名思义,即在VSCode的窗口左侧边栏中点击“Extensions”按钮,在打开的Extensions搜索框中搜索你想要的插件名称,或者选择预制的条件获得插件信息。选中你要安装的插件,点击“Install”按钮即可完成安装。

VSIX安装:即到插件官网将插件文件下载到本地(插件安装文件一般以.vsix或.zip结尾),在窗口中选择:”Install from VSIX…”,选择你下载的插件文件即可。

安装后的插件都被放在~/.vscode/extensions目录下(mac和linux)。

2、更改语言设置

VSCode在初次启动时会判断当前系统语言,并以相应的语言作为默认窗口显示语言。比如:我的是中文OS X系统,那么默认VSCode的窗口文字都是中文。如果我要将其改为英文,应该如何操作呢?

F1登场!这里的F1可不是赛车比赛,而是快捷键F1,估计也是整个VSCode最常用的快捷键之一了。敲击F1后,VSCode会显示其“Command Palette”输入框,这里面包含了当前VSCode可以执行的所有操作命令,支持Search。我们输入”language”,在搜索结果中选择“Configure Language”,VSCode打开一个新的编辑窗口,加载~/Library/Application Support/Code/User/locale.json文件:

{
    // 定义 VSCode 的显示语言。
    // 请参阅 https://go.microsoft.com/fwlink/?LinkId=761051,了解支持的语言列表。
    // 要更改值需要重启 VSCode。
    "locale": "zh-cn"
}

当前语言为中文,如果我们要将其改为英文,则修改该文件中的”locale”项:

{
    // 定义 VSCode 的显示语言。
    // 请参阅 https://go.microsoft.com/fwlink/?LinkId=761051,了解支持的语言列表。
    // 要更改值需要重启 VSCode。
    "locale": "en-US"
}

保存,重启VSCode。再次启动的VSCode将会以英文界面示人了。

3、User Settings和Workspace Settings

UserSettings是一种“全局”设置,而Workspace Settings则顾名思义,是一种针对一个特定目录或project的设置。

UserSettings设置后的数据保存在~/Library/Application Support/Code下(以mac为例),而Workspace Setting设置后的数据则保存在某个项目特定目录下的.vscode目录下。

在菜单栏,选择【Preferences -> User Settings】可以打开~/Library/Application Support/Code/User/settings.json文件。默认情况下,该文件为空。VSCode采用默认设置。如果你要个性化设置,那么可将对应的配置项copy一份到settings.json中,并赋予其新值,保存即可。新值将覆盖默认值。以字体大小为例,我们将默认的editor.fontSize 12改为10:

// Place your settings in this file to overwrite the default settings
{
    "editor.fontSize": 10,
}

保存后,可以看到窗口中所有文字的Size都变小了。

在菜单栏,选择【Preferences -> Workspace Settings】可打开当前工作目录下的.vscode的settings.json文件,其工作原理和配置方法与User Settings一样,只是生效范围仅限于该工作区范畴。

4、Color Theme

VSCode内置了主流的配色方案,比如:monokai、solarized dark/light等。F1,输入”color”搜索,选择:“Perefences: Color Theme”(在MAC上也可以用cmd+k, cmd+t打开),在下拉列表中选择你喜欢的配色Theme即可,即可生效。

四、vscode-go的使用

前面说过,和vim-go一样,vscode-go插件实现了Go编码中需要的各种功能:自动format、自动增删import、build on save、lint on save、定义跳转、原型信息快速提示、自动补全、code snippets等。另外它通过带颜色的波浪线提示代码问题(虽然有时候反应有点慢),包括语法问题、不符合idiomatic go规则的问题(比如appId这个命名,它会建议你改为appID)等。

code snippets非常好用,内置的code snippets在~/.vscode/extensions/lukehoban.Go-0.6.51/snippets/go.json中可以找到,类似这样的定义:

//~/.vscode/extensions/lukehoban.Go-0.6.51/snippets/go.json
{
        ".source.go": {
                "single import": {
                        "prefix": "im",
                        "body": "import \"${1:package}\""
                },
                "multiple imports": {
                        "prefix": "ims",
                        "body": "import (\n\t\"${1:package}\"\n)"
                },
                "single constant": {
                        "prefix": "co",
                        "body": "const ${1:name} = ${2:value}"
                },
                "multiple constants": {
                        "prefix": "cos",
                        "body": "const (\n\t${1:name} = ${2:value}\n)"
                },
                "type interface declaration": {
                        "prefix": "tyi",
                        "body": "type ${1:name} interface {\n\t$0\n}"
                },
                "type struct declaration": {
                        "prefix": "tys",
                        "body": "type ${1:name} struct {\n\t$0\n}"
                },
                "package main and main function": {
                        "prefix": "pkgm",
                        "body": "package main\n\nfunc main() {\n\t$0\n}"
                },
... ...

敲入”prefix”的值,比如”ims”,输入tab,vscode-go将为你展开为:

import (
    "package"
)

在使用vscode时遇到过一次代码自动补全“失灵”的问题。vscode-go只会提示:”PANIC,PANIC,PANIC”。经查,这个是gocode daemon的问题,我的解决方法是:

gocode close //关闭gocode daemon
gocode -s &  //重启之。

五、小结

在诸多轻量级编辑器中,我还是比较看好vscode的,毕竟其背后有着Microsoft积淀多年的IDE产品开发经验。并且和Microsoft以往产品最大的不同就是其是开源项目。

关于Vscode的使用和奇技淫巧可以参见其官方的这篇文档“VS Code Tips and Tricks”。

关于Vscode的各种周边工具和资料列表,请参考Awesome-vscode项目

快捷键往往是开发人员的最爱,VSCode官方制作了三个平台的VSCode的快捷键worksheet:

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-windows.pdf

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-macos.pdf

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-linux.pdf

VSCode还在快速发展,离完善还有不小提升空间。比如:在使用过程中也发现了VSCode 窗口无响应或代码编辑错乱之情况。不过作为Go编码的一个辅助编辑器,VSCode还是完全胜任和超出预期的。




这里是Tony Bai的个人Blog,欢迎访问、订阅和留言!订阅Feed请点击上面图片

如果您觉得这里的文章对您有帮助,请扫描上方二维码进行捐赠,加油后的Tony Bai将会为您呈现更多精彩的文章,谢谢!

如果您喜欢通过微信App浏览本站内容,可以扫描下方二维码,订阅本站官方微信订阅号“iamtonybai”;点击二维码,可直达本人官方微博主页^_^:



本站Powered by Digital Ocean VPS。

选择Digital Ocean VPS主机,即可获得10美元现金充值,可免费使用两个月哟!

著名主机提供商Linode 10$优惠码:linode10,在这里注册即可免费获得。

阿里云推荐码:1WFZ0V立享9折!

View Tony Bai's profile on LinkedIn


文章

评论

  • 正在加载...

分类

标签

归档











更多