Posts tagged: Containers

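Deploying a Highly Available Harbor Image Registry on a Kubernetes Cluster

I have written about Harbor-based highly available private image registries on this blog more than once, and I gave a talk on the topic at 源创会 2017 Shenyang. Afterwards, many people asked me through Weibo direct messages, my WeChat official account, and blog comments whether a highly available Harbor registry can be installed on a Kubernetes cluster. This post is my answer.

I. A highly available Harbor design on Kubernetes

First, a definite answer: Harbor can be deployed on Kubernetes. However, the default installation provided by the Harbor project is not highly available; it is a single-instance setup. In my post "Deploying a Highly Available, Enterprise-Grade Private Container Image Registry Based on Harbor", I described a highly available Harbor setup on bare metal or VMs that relies on CephFS shared storage. The approach on Kubernetes is similar, as the diagram below shows: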

img{512x368}

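A brief walk through the design in the diagram:

  • Compute high availability for the Harbor service comes from running multiple replicas of each Harbor component on Kubernetes;
  • Image data high availability comes from mounting CephFS shared storage;
  • Harbor's configuration data and relational data live in an external database cluster, which keeps the data highly available and consistent in real time;
  • An external Redis cluster provides session sharing for the UI component.

With the design settled, let's start deploying.

II. Environment preparation

Harbor's official notes on Kubernetes support state that the Harbor-on-Kubernetes scripts and configuration were verified on Kubernetes v1.6.5 and Harbor v1.2.0, so the lab environment needs Kubernetes v1.6.5 or later. Here are the details of my environment:

Kubernetes is v1.7.3: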

# kubelet --version
Kubernetes v1.7.3

Docker is 17.03.2:

# docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 03:35:14 2017
 OS/Arch:      linux/amd64
 Experimental: false

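For the Harbor scripts, use the ones on the master branch, not those in the v1.2.0 release. This matters: the Kubernetes support scripts in the v1.2.0 source tree simply do not work, and the scripts for the adminserver component are missing altogether. The images of the Harbor components, however, are still v1.2.0:

Harbor source version: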

commit 82d842d77c01657589d67af0ea2d0c66b1f96014
Merge pull request #3741 from wy65701436/add-tc-concourse   on Dec 4, 2017

Harbor component image versions:

REPOSITORY                      TAG                 IMAGE ID
vmware/harbor-jobservice      v1.2.0          1fb18427db11
vmware/harbor-ui              v1.2.0          b7069ac3bd4b
vmware/harbor-adminserver     v1.2.0          a18331f0c1ae
vmware/registry               2.6.2-photon    c38af846a0da
vmware/nginx-photon           1.11.13         2971c92cc1ae

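In addition, a highly available Harbor uses an external DB cluster and a Redis cluster. We use MySQL for the DB cluster; for the MySQL cluster you can use MySQL Galera Cluster or the Group Replication (MGR) that ships with MySQL 5.7 and later.

III. Exploring the Harbor-on-k8s deployment scripts and configuration

Create a local harbor-install-on-k8s directory and download the latest Harbor source into it: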

# mkdir harbor-install-on-k8s
# cd harbor-install-on-k8s
# wget -c https://github.com/vmware/harbor/archive/master.zip
# unzip master.zip
# cd harbor-master
# ls -F
AUTHORS  CHANGELOG.md  contrib/  CONTRIBUTING.md  docs/
LICENSE  make/  Makefile  NOTICE  partners.md  README.md
ROADMAP.md  src/  tests/  tools/  VERSION

The scripts for deploying Harbor on k8s are in the make/kubernetes directory (we are already inside harbor-master at this point):

# cd make
# tree kubernetes
kubernetes
├── adminserver
│   ├── adminserver.rc.yaml
│   └── adminserver.svc.yaml
├── jobservice
│   ├── jobservice.rc.yaml
│   └── jobservice.svc.yaml
├── k8s-prepare
├── mysql
│   ├── mysql.rc.yaml
│   └── mysql.svc.yaml
├── nginx
│   ├── nginx.rc.yaml
│   └── nginx.svc.yaml
├── pv
│   ├── log.pvc.yaml
│   ├── log.pv.yaml
│   ├── registry.pvc.yaml
│   ├── registry.pv.yaml
│   ├── storage.pvc.yaml
│   └── storage.pv.yaml
├── registry
│   ├── registry.rc.yaml
│   └── registry.svc.yaml
├── templates
│   ├── adminserver.cm.yaml
│   ├── jobservice.cm.yaml
│   ├── mysql.cm.yaml
│   ├── nginx.cm.yaml
│   ├── registry.cm.yaml
│   └── ui.cm.yaml
└── ui
    ├── ui.rc.yaml
    └── ui.svc.yaml

8 directories, 25 files

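  • k8s-prepare script: generates the final configmap file for each component (registry, for example) from the template files under templates and the settings in harbor.cfg. It plays the same role as the prepare script used when deploying Harbor with docker-compose;
  • templates directory: holds the configuration templates (configmap templates) for each component, which serve as the input to k8s-prepare;
  • pv directory: the persistent volume configuration used by the Harbor components. The default is hostPath; for a highly available Harbor we use cephfs instead;
  • the other component directories, such as registry: hold each component's service YAML and rc YAML, which are used to start the components on the Kubernetes cluster.

The diagram below shows how the configuration is generated and what role each file plays when the Harbor components start: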

img{512x368}

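Because we use an external MySQL database, Harbor's bundled mysql component is not used, so we can ignore storage.pv.yaml and storage.pvc.yaml under the pv directory.

IV. Deployment steps

1. Configure and create the PV and PVC that mount CephFS

First create a directory for Harbor on the shared distributed storage, CephFS: apps/harbor-k8s, with two subdirectories, log and registry, for the storage needs of jobservice and registry respectively:

# cd /mnt   // the CephFS root is mounted at /mnt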
# mkdir -p apps/harbor-k8s/log
# mkdir -p apps/harbor-k8s/registry
# tree apps/harbor-k8s
apps/harbor-k8s
├── log
└── registry

For the details of mounting CephFS, see my post "Mounting CephFS Across Nodes in a Kubernetes Cluster".

Next, create the ceph-secret that the k8s PV uses to mount CephFS. Write a ceph-secret.yaml file:

//ceph-secret.yaml
apiVersion: v1
data:
  key: {base64 encoding of the ceph admin.secret}
kind: Secret
metadata:
  name: ceph-secret
type: Opaque

Create ceph-secret:

# kubectl create -f ceph-secret.yaml
secret "ceph-secret" created
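
The key field in ceph-secret.yaml must hold the base64 encoding of the Ceph admin key. A minimal sketch of producing that manifest follows; the helper names encodeCephKey and cephSecretYAML and the sample key are illustrative assumptions, not part of Harbor or Ceph tooling.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// encodeCephKey base64-encodes a Ceph key for use in a Secret's data field.
// Surrounding whitespace (such as a trailing newline) is trimmed first,
// because it would otherwise become part of the encoded key.
func encodeCephKey(rawKey string) string {
	return base64.StdEncoding.EncodeToString([]byte(strings.TrimSpace(rawKey)))
}

// cephSecretYAML renders a manifest with the same shape as ceph-secret.yaml.
func cephSecretYAML(rawKey string) string {
	return fmt.Sprintf("apiVersion: v1\ndata:\n  key: %s\nkind: Secret\nmetadata:\n  name: ceph-secret\ntype: Opaque\n",
		encodeCephKey(rawKey))
}

func main() {
	// "example-ceph-key" is a placeholder, not a real Ceph key.
	fmt.Print(cephSecretYAML("example-ceph-key\n"))
}
```

Trimming matters in practice: a key read from a file usually ends with a newline, and encoding that newline yields a secret that Ceph rejects.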

Finally, edit the PV and PVC files and create the corresponding resources. The files to change are pv/log.xxx and pv/registry.xxx; the goal is to replace the original hostPath with cephfs:

//log.pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: log-pv
  labels:
    type: log
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  cephfs:
    monitors:
      - {ceph-mon-node-ip}:6789
    path: /apps/harbor-k8s/log
    user: admin
    secretRef:
      name: ceph-secret
    readOnly: false
  persistentVolumeReclaimPolicy: Retain

//log.pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: log-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      type: log

// registry.pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: registry-pv
  labels:
    type: registry
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  cephfs:
    monitors:
      - 10.47.217.91:6789
    path: /apps/harbor-k8s/registry
    user: admin
    secretRef:
      name: ceph-secret
    readOnly: false
  persistentVolumeReclaimPolicy: Retain

//registry.pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      type: registry

Create the PVs and PVCs:

# kubectl create -f log.pv.yaml
persistentvolume "log-pv" created
# kubectl create -f log.pvc.yaml
persistentvolumeclaim "log-pvc" created
# kubectl create -f registry.pv.yaml
persistentvolume "registry-pv" created
# kubectl create -f registry.pvc.yaml
persistentvolumeclaim "registry-pvc" created
# kubectl get pvc
NAME           STATUS    VOLUME        CAPACITY   ACCESSMODES   STORAGECLASS   AGE
log-pvc        Bound     log-pv        1Gi        RWX                          31s
registry-pvc   Bound     registry-pv   5Gi        RWX                          2s
# kubectl get pv
NAME          CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                  STORAGECLASS   REASON    AGE
log-pv        1Gi        RWX           Retain          Bound     default/log-pvc                                 36s
registry-pv   5Gi        RWX           Retain          Bound     default/registry-pvc                            6s

2. Create and initialize the database for Harbor

In the external DB, create the user that Harbor uses to access the database (harbork8s/harbork8s) and the database itself (registry_k8s):

mysql> create user harbork8s identified  by 'harbork8s';
Query OK, 0 rows affected (0.03 sec)

mysql> GRANT ALL PRIVILEGES ON *.* TO 'harbork8s'@'%' IDENTIFIED BY 'harbork8s' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> create database registry_k8s;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on registry_k8s.* to 'harbork8s' identified by 'harbork8s';
Query OK, 0 rows affected, 1 warning (0.00 sec)

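Harbor does not yet initialize the database automatically, so we have to initialize the new registry_k8s database ourselves. The approach: start a Harbor instance locally with docker-compose, dump the tables from the harbor-db container with mysqldump, and import the dump into registry_k8s in the external DB. The steps are: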

# wget -c http://harbor.orientsoft.cn/harbor-1.2.0/harbor-offline-installer-v1.2.0.tgz
# tar zxvf harbor-offline-installer-v1.2.0.tgz

Enter the harbor directory and set hostname in harbor.cfg:

hostname = hub.tonybai.com:31777

# ./prepare
# docker-compose up -d

Find the container id of harbor_db (77fde71390e7 here), enter the container, and dump the registry database to /tmp:

# docker exec -i -t  77fde71390e7 bash
# mysqldump -u root -pxxx --databases registry > /tmp/registry.dump

Leave the container and copy the registry.dump exported inside it to the local host:

# docker cp 77fde71390e7:/tmp/registry.dump ./

Rename registry.dump to registry_k8s.dump, change the database name registry inside it to registry_k8s, and then import it into the external DB with the mysql client:

# mysql -h external_db_ip -P 3306 -u harbork8s -pharbork8s
mysql> source ./registry_k8s.dump;
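
Renaming the database inside the dump by blanket search and replace risks touching table data that happens to contain the word registry. A minimal sketch of a narrower rewrite follows, assuming the dump was produced with --databases so that it contains CREATE DATABASE and USE statements with the name in backticks; renameDatabaseLine and the sample lines are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// renameDatabaseLine rewrites only the statements that select the database
// (CREATE DATABASE ... and USE ...); every other line is returned unchanged.
func renameDatabaseLine(line, from, to string) string {
	trimmed := strings.TrimSpace(line)
	if strings.HasPrefix(trimmed, "CREATE DATABASE") || strings.HasPrefix(trimmed, "USE ") {
		return strings.Replace(line, "`"+from+"`", "`"+to+"`", 1)
	}
	return line
}

func main() {
	// Sample lines in the style of a mysqldump --databases dump (illustrative).
	sample := []string{
		"CREATE DATABASE /*!32312 IF NOT EXISTS*/ `registry` /*!40100 DEFAULT CHARACTER SET utf8 */;",
		"USE `registry`;",
		"INSERT INTO `properties` VALUES (1,'registry_url','http://registry:5000');",
	}
	for _, line := range sample {
		fmt.Println(renameDatabaseLine(line, "registry", "registry_k8s"))
	}
}
```

Only the first two sample lines change; the INSERT statement keeps its registry values intact.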

3. Configure make/harbor.cfg

harbor.cfg is the key input for generating the configuration. Before running k8s-prepare, edit harbor.cfg to match your needs and environment:

// make/harbor.cfg
hostname = hub.tonybai.com:31777
db_password = harbork8s
db_host = {external_db_ip}
db_user = harbork8s

4. Adjust the configmap templates (*.cm.yaml) under the templates directory

  • templates/adminserver.cm.yaml:
MYSQL_HOST: {external_db_ip}
MYSQL_USR: harbork8s
MYSQL_DATABASE: registry_k8s
RESET: "true"

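Note: adminserver.cm.yaml does not pick up the database settings from harbor.cfg, so they have to be configured here a second time. This will presumably be fixed in a future version.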

  • templates/registry.cm.yaml:
rootcertbundle: /etc/registry/root.crt
  • templates/ui.cm.yaml:

The ui component needs session sharing. It reads the _REDIS_URL environment variable:

//vmware/harbor/src/ui/main.go
... ...
    redisURL := os.Getenv("_REDIS_URL")
    if len(redisURL) > 0 {
        beego.BConfig.WebConfig.Session.SessionProvider = "redis"
        beego.BConfig.WebConfig.Session.SessionProviderConfig = redisURL
    }
... ...

The format of redisURL is documented in the beego source:

// beego/session/redis/sess_redis.go

// SessionInit init redis session
// savepath like redis server addr,pool size,password,dbnum
// e.g. 127.0.0.1:6379,100,astaxie,0
func (rp *Provider) SessionInit(maxlifetime int64, savePath string) error {...}

So we add one line to templates/ui.cm.yaml:

_REDIS_URL: {redis_ip}:6379,100,{redis_password},11

jobservice.cm.yaml and nginx.cm.yaml need no changes.
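
The _REDIS_URL value follows the "addr,pool size,password,dbnum" layout quoted from the beego comment above. A minimal sketch for assembling and sanity-checking such a value follows; redisSavePath and checkSavePath are hypothetical helpers, and the address and password are placeholders.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// redisSavePath joins the four fields in the order the beego comment gives:
// redis server addr, pool size, password, dbnum.
func redisSavePath(addr string, poolSize int, password string, db int) string {
	return strings.Join([]string{addr, strconv.Itoa(poolSize), password, strconv.Itoa(db)}, ",")
}

// checkSavePath rejects values that do not split into exactly four fields
// or whose pool size and dbnum are not integers.
func checkSavePath(s string) error {
	parts := strings.Split(s, ",")
	if len(parts) != 4 {
		return fmt.Errorf("want 4 comma-separated fields, got %d", len(parts))
	}
	if _, err := strconv.Atoi(parts[1]); err != nil {
		return fmt.Errorf("pool size %q is not an integer", parts[1])
	}
	if _, err := strconv.Atoi(parts[3]); err != nil {
		return fmt.Errorf("dbnum %q is not an integer", parts[3])
	}
	return nil
}

func main() {
	// Placeholder address and password, not real credentials.
	url := redisSavePath("10.0.0.5:6379", 100, "s3cret", 11)
	fmt.Println("_REDIS_URL:", url)
	fmt.Println("valid:", checkSavePath(url) == nil)
}
```

Because the fields are comma-separated, a password that itself contains a comma cannot be expressed in this layout; the check above flags that case.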

5. Adjust the xxx.rc.yaml and xxx.svc.yaml files in each component directory

  • adminserver/adminserver.rc.yaml
replicas: 3
  • adminserver/adminserver.svc.yaml

Unchanged.

  • jobservice/jobservice.rc.yaml, jobservice/jobservice.svc.yaml

Unchanged.

  • nginx/nginx.rc.yaml
replicas: 3
  • nginx/nginx.svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: NodePort
  ports:
    - name: http
      port: 80
      nodePort: 31777
      protocol: TCP
  selector:
    name: nginx-apps
  • registry/registry.rc.yaml
replicas: 3
mountPath: /etc/registry

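There is a serious bug here: the default configmap mount path in registry.rc.yaml, /etc/docker/registry, does not match the path of the registry configuration file inside the registry Docker image, /etc/registry. As a result, the registry configmap we configured so carefully has no effect, and the data is still kept in memory rather than on the CephFS we set up. Once a registry container exits, the image data in the repository is lost, and data high availability is impossible. We therefore change every mountPath to match the registry image, i.e. /etc/registry.

  • registry/registry.svc.yaml

Unchanged.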

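  • ui/ui.rc.yaml
replicas: 3

In addition, pass _REDIS_URL from the configmap to the ui container by adding the entry below to the container's env list. An env entry with configMapKeyRef belongs in the pod template in ui.rc.yaml, not in ui.svc.yaml:

- name: _REDIS_URL
  valueFrom:
    configMapKeyRef:
      name: harbor-ui-config
      key: _REDIS_URL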

6. Run k8s-prepare

Run k8s-prepare to generate the configmap file for each component:

# ./k8s-prepare
# git status
 ... ...

    adminserver/adminserver.cm.yaml
    jobservice/jobservice.cm.yaml
    mysql/mysql.cm.yaml
    nginx/nginx.cm.yaml
    registry/registry.cm.yaml
    ui/ui.cm.yaml

7. Start the Harbor components

  • Create the configmaps
# kubectl apply -f jobservice/jobservice.cm.yaml
configmap "harbor-jobservice-config" created
# kubectl apply -f nginx/nginx.cm.yaml
configmap "harbor-nginx-config" created
# kubectl apply -f registry/registry.cm.yaml
configmap "harbor-registry-config" created
# kubectl apply -f ui/ui.cm.yaml
configmap "harbor-ui-config" created
# kubectl apply -f adminserver/adminserver.cm.yaml
configmap "harbor-adminserver-config" created

# kubectl get cm
NAME                        DATA      AGE
harbor-adminserver-config   42        14s
harbor-jobservice-config    8         16s
harbor-nginx-config         3         16s
harbor-registry-config      2         15s
harbor-ui-config            9         15s
  • Create the k8s service for each Harbor component
# kubectl apply -f jobservice/jobservice.svc.yaml
service "jobservice" created
# kubectl apply -f nginx/nginx.svc.yaml
service "nginx" created
# kubectl apply -f registry/registry.svc.yaml
service "registry" created
# kubectl apply -f ui/ui.svc.yaml
service "ui" created
# kubectl apply -f adminserver/adminserver.svc.yaml
service "adminserver" created

# kubectl get svc
NAME               CLUSTER-IP      EXTERNAL-IP   PORT(S)
adminserver        10.103.7.8      <none>        80/TCP
jobservice         10.104.14.178   <none>        80/TCP
nginx              10.103.46.129   <nodes>       80:31777/TCP
registry           10.101.185.42   <none>        5000/TCP,5001/TCP
ui                 10.96.29.187    <none>        80/TCP
  • Create the rc objects and start the pods of each component
# kubectl apply -f registry/registry.rc.yaml
replicationcontroller "registry-rc" created
# kubectl apply -f jobservice/jobservice.rc.yaml
replicationcontroller "jobservice-rc" created
# kubectl apply -f ui/ui.rc.yaml
replicationcontroller "ui-rc" created
# kubectl apply -f nginx/nginx.rc.yaml
replicationcontroller "nginx-rc" created
# kubectl apply -f adminserver/adminserver.rc.yaml
replicationcontroller "adminserver-rc" created

# kubectl get pods
NAMESPACE     NAME                  READY     STATUS    RESTARTS   AGE
default       adminserver-rc-9pc78  1/1       Running   0          3m
default       adminserver-rc-pfqtv  1/1       Running   0          3m
default       adminserver-rc-w55sx  1/1       Running   0          3m
default       jobservice-rc-d18zk   1/1       Running   1          3m
default       nginx-rc-3t5km        1/1       Running   0          3m
default       nginx-rc-6wwtz        1/1       Running   0          3m
default       nginx-rc-dq64p        1/1       Running   0          3m
default       registry-rc-6w3b7     1/1       Running   0          3m
default       registry-rc-dfdld     1/1       Running   0          3m
default       registry-rc-t6fnx     1/1       Running   0          3m
default       ui-rc-0kwrz           1/1       Running   1          3m
default       ui-rc-kzs8d           1/1       Running   1          3m
default       ui-rc-vph6d           1/1       Running   1          3m

V. Verification and troubleshooting

1. Access with the docker CLI

Harbor is served over HTTP by default here, so before docker login we have to add the registry address to insecure-registries in /etc/docker/daemon.json:

// /etc/docker/daemon.json
{
  "insecure-registries": ["hub.tonybai.com:31777"]
}
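
Hand-editing JSON is an easy place to introduce a syntax error that stops the docker daemon from starting. A minimal sketch that generates the file content with encoding/json follows; the daemonConfig type and daemonJSON helper are illustrative, and the sketch emits only the insecure-registries key, so settings already present in a real daemon.json would have to be merged in separately.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// daemonConfig models only the one daemon.json key used in this post.
type daemonConfig struct {
	InsecureRegistries []string `json:"insecure-registries"`
}

// daemonJSON renders a daemon.json body listing the given registries.
func daemonJSON(registries ...string) (string, error) {
	b, err := json.MarshalIndent(daemonConfig{InsecureRegistries: registries}, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := daemonJSON("hub.tonybai.com:31777")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```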

After running systemctl daemon-reload and restarting the docker service, we can log in to the new registry with docker login (initial password: Harbor12345):

# docker login hub.tonybai.com:31777
Username (admin): admin
Password:
Login Succeeded

2. docker push & pull

Test by uploading a busybox image:

# docker pull busybox
Using default tag: latest
latest: Pulling from library/busybox
0ffadd58f2a6: Pull complete
Digest: sha256:bbc3a03235220b170ba48a157dd097dd1379299370e1ed99ce976df0355d24f0
Status: Downloaded newer image for busybox:latest
# docker tag busybox:latest hub.tonybai.com:31777/library/busybox:latest
# docker push hub.tonybai.com:31777/library/busybox:latest
The push refers to a repository [hub.tonybai.com:31777/library/busybox]
0271b8eebde3: Preparing
0271b8eebde3: Pushing [==================================================>] 1.338 MB
0271b8eebde3: Pushed
latest: digest: sha256:179cf024c8a22f1621ea012bfc84b0df7e393cb80bf3638ac80e30d23e69147f size: 527

Pull the busybox image we just pushed:

# docker pull hub.tonybai.com:31777/library/busybox:latest
latest: Pulling from library/busybox
414e5515492a: Pull complete
Digest: sha256:179cf024c8a22f1621ea012bfc84b0df7e393cb80bf3638ac80e30d23e69147f
Status: Downloaded newer image for hub.tonybai.com:31777/library/busybox:latest

3. Access the Harbor UI

Open http://hub.tonybai.com:31777 in a browser and log in with admin/Harbor12345. If you see the page below, the deployment succeeded:

img{512x368}


Weibo: @tonybai_cn
WeChat official account: iamtonybai
github.com: https://github.com/bigwhite


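Understanding Docker Multi-Stage Image Builds

Docker has been around for more than four years since its birth in 2013. For developers who have adopted Docker in their daily work, building Docker images is routine. Does that mean Docker's image build mechanism is close to perfect? No; Docker is still improving it. Starting with Docker 17.05, released in 2017, Docker supports multi-stage builds of container images.

What is a multi-stage image build? Giving the definition straight away would be abrupt, so let's start from the ways images are built in day-to-day development and the problems that come up.

I. Homogeneous image builds

A common scenario: the application is compiled directly on the developer's own machine or server, and the resulting binary is packaged into the image. This generally requires the build environment to be compatible with the image's base image. For example, I compile the application on Ubuntu 14.04 and package it into an image based on an ubuntu base image. I call this a "homogeneous image build", because the environment where the application is compiled is compatible with the one where it is deployed and run: an application I compile on Ubuntu 14.04 runs almost seamlessly in images based on ubuntu:14.04 and later (16.04, 16.10, 17.10, and so on), but may fail in a base image that is not fully compatible, such as centos.

1. An example of a homogeneous image build

Here is an example (later sections build on it). Note: the build environment is an Ubuntu 16.04 x86_64 VM with Go 1.8.3 and docker 17.09.0-ce.

We use the most common kind of Go program, an HTTP server, as the example: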

// github.com/bigwhite/experiments/multi_stage_image_build/isomorphism/httpserver.go
package main

import (
        "net/http"
        "log"
        "fmt"
)

func home(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("Welcome to this website!\n"))
}

func main() {
        http.HandleFunc("/", home)
        fmt.Println("Webserver start")
        fmt.Println("  -> listen on port:1111")
        err := http.ListenAndServe(":1111", nil)
        if err != nil {
                log.Fatal("ListenAndServe:", err)
        }
}

Build and run the program:

# go build -o myhttpserver httpserver.go
# ./myhttpserver
Webserver start
  -> listen on port:1111
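
Before packaging the binary, the handler can be checked without binding port 1111. A minimal sketch using net/http/httptest follows; it repeats the home handler from httpserver.go, and callHome is a hypothetical helper added for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// home is the same handler as in httpserver.go.
func home(w http.ResponseWriter, req *http.Request) {
	w.Write([]byte("Welcome to this website!\n"))
}

// callHome invokes the handler in memory and returns the status code
// and body it produced; no network listener is involved.
func callHome() (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/", nil)
	home(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := callHome()
	fmt.Printf("status=%d body=%q\n", code, body)
}
```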

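The example looks simple and is only a few lines long, but underneath, Go's net/http package does a lot of work, including many system calls, so it reflects the coupling between an application and the operating system; this will show up later. Next, we build a Docker image for the program and start a myhttpserver container from it. We choose ubuntu:14.04 as the base image:

// github.com/bigwhite/experiments/multi_stage_image_build/isomorphism/Dockerfile
FROM ubuntu:14.04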

COPY ./myhttpserver /root/myhttpserver
RUN chmod +x /root/myhttpserver

WORKDIR /root
ENTRYPOINT ["/root/myhttpserver"]

Run the build:

# docker build -t myrepo/myhttpserver:latest .
Sending build context to Docker daemon  5.894MB
Step 1/5 : FROM ubuntu:14.04
 ---> dea1945146b9
Step 2/5 : COPY ./myhttpserver /root/myhttpserver
 ---> 993e5129c081
Step 3/5 : RUN chmod +x /root/myhttpserver
 ---> Running in 104d84838ab2
 ---> ebaeca006490
Removing intermediate container 104d84838ab2
Step 4/5 : WORKDIR /root
 ---> 7afdc2356149
Removing intermediate container 450ccfb09ffd
Step 5/5 : ENTRYPOINT /root/myhttpserver
 ---> Running in 3182766e2a68
 ---> 77f315e15f14
Removing intermediate container 3182766e2a68
Successfully built 77f315e15f14
Successfully tagged myrepo/myhttpserver:latest

# docker images
REPOSITORY            TAG                 IMAGE ID            CREATED             SIZE
myrepo/myhttpserver   latest              77f315e15f14        18 seconds ago      200MB

# docker run myrepo/myhttpserver
Webserver start
  -> listen on port:1111

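That is the most basic way to build an image.

Next, we may run into the following needs:
* Setting up a build environment for a Go program can be time-consuming, especially for Go applications that depend on many third-party open-source packages, where just downloading the packages takes a long time. It is better to pack all of these changeable things into a builder image used for building Go programs;
* The myrepo/myhttpserver image built above is 200MB, which seems rather large. Although the docker daemon on each node can cache image layers, we would still like to build a leaner image.

2. Using a golang builder image

Docker Hub provides an official golang image repository with a Go development environment. We can use this golang builder image directly to help build our application image; for Go applications with many third-party dependencies, we can also build our own dedicated builder image on top of it.

We build our golang-builder image on the golang:latest base image, and write a Dockerfile.build for it: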

// github.com/bigwhite/experiments/multi_stage_image_build/isomorphism/Dockerfile.build
FROM golang:latest

WORKDIR /go/src
COPY httpserver.go .

RUN go build -o myhttpserver ./httpserver.go

Build the golang-builder image in the same directory:

# docker build -t myrepo/golang-builder:latest -f Dockerfile.build .
Sending build context to Docker daemon  5.895MB
Step 1/4 : FROM golang:latest
 ---> 1a34fad76b34
Step 2/4 : WORKDIR /go/src
 ---> 2361824677d3
Removing intermediate container 01d8f4e9f0c4
Step 3/4 : COPY httpserver.go .
 ---> 1ff14bb0bc56
Step 4/4 : RUN go build -o myhttpserver ./httpserver.go
 ---> Running in 37a1b76b7b9e
 ---> 2ac5347bb923
Removing intermediate container 37a1b76b7b9e
Successfully built 2ac5347bb923
Successfully tagged myrepo/golang-builder:latest

# docker images
REPOSITORY              TAG                 IMAGE ID            CREATED             SIZE
myrepo/golang-builder   latest              2ac5347bb923        3 minutes ago       739MB

Next, we build the final application image from the myhttpserver that has already been built inside golang-builder:

# docker create --name appsource myrepo/golang-builder:latest
# docker cp appsource:/go/src/myhttpserver ./
# docker rm -f appsource
# docker rmi myrepo/golang-builder:latest
# docker build -t myrepo/myhttpserver:latest .

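These commands copy the already-built myhttpserver from the appsource container, created from the golang-builder image, to the current directory on the host, then delete the temporary appsource container and the golang-builder image built above. The last step is the same as in the first example: build the final image from the myhttpserver in the local directory. For convenience, you can put this series of commands into a Makefile.

3. Using the smaller alpine image

The builder image does not help the final application image lose weight: the myhttpserver image is still 200MB. To slim it down we need a smaller base image, and we choose alpine. The alpine image is under 4MB; adding the application, the final image should shrink to under 20MB.

Together with the builder image, we only need to change the Dockerfile's base image to alpine:latest: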

// github.com/bigwhite/experiments/multi_stage_image_build/isomorphism/Dockerfile.alpine

FROM alpine:latest

COPY ./myhttpserver /root/myhttpserver
RUN chmod +x /root/myhttpserver

WORKDIR /root
ENTRYPOINT ["/root/myhttpserver"]

Build the alpine version of the application image:

# docker build -t myrepo/myhttpserver-alpine:latest -f Dockerfile.alpine .
Sending build context to Docker daemon  6.151MB
Step 1/5 : FROM alpine:latest
 ---> 053cde6e8953
Step 2/5 : COPY ./myhttpserver /root/myhttpserver
 ---> ca0527a62d39
Step 3/5 : RUN chmod +x /root/myhttpserver
 ---> Running in 28d0a8a577b2
 ---> a3833af97b5e
Removing intermediate container 28d0a8a577b2
Step 4/5 : WORKDIR /root
 ---> 667345b78570
Removing intermediate container fa59883e9fdb
Step 5/5 : ENTRYPOINT /root/myhttpserver
 ---> Running in adcb5b976ca3
 ---> 582fa2aedc64
Removing intermediate container adcb5b976ca3
Successfully built 582fa2aedc64
Successfully tagged myrepo/myhttpserver-alpine:latest

# docker images
REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
myrepo/myhttpserver-alpine   latest              582fa2aedc64        4 minutes ago       16.3MB

16.3MB: the size has indeed come down! Start a container from the image to see whether the application runs properly:

# docker run myrepo/myhttpserver-alpine:latest
standard_init_linux.go:185: exec user process caused "no such file or directory"

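The container failed to start! Why? Because the alpine image is not homogeneous with the ubuntu environment. The details follow.

II. Heterogeneous image builds

Our image builder, myrepo/golang-builder:latest, is based on the golang:latest image. The golang base image has two templates: Dockerfile-debian.template and Dockerfile-alpine.template. golang:latest is based on the debian template and is compatible with ubuntu. The myhttpserver it builds depends on the following shared libraries: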

 # ldd myhttpserver
    linux-vdso.so.1 =>  (0x00007ffd0c355000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffa8b36f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffa8afa5000)
    /lib64/ld-linux-x86-64.so.2 (0x000055605ea5d000)

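Debian-family Linux distributions use glibc. Alpine is different: it uses the musl libc implementation. So when we run the container above, the dynamic loader that myhttpserver requests (/lib64/ld-linux-x86-64.so.2) and the glibc libraries it depends on, such as libc.so.6, are not present in the image, and the process fails to start with "no such file or directory".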
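
The information ldd prints can also be read directly from the binary, which helps when ldd is not available in a minimal image. A minimal sketch using Go's debug/elf package follows; linkInfo is a hypothetical helper, and the program inspects itself when no path argument is given.

```go
package main

import (
	"debug/elf"
	"fmt"
	"io"
	"os"
	"strings"
)

// linkInfo returns the dynamic loader requested by an ELF binary (PT_INTERP)
// and the shared libraries it needs (DT_NEEDED). Both are empty for a
// statically linked binary.
func linkInfo(path string) (interp string, libs []string, err error) {
	f, err := elf.Open(path)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()
	for _, prog := range f.Progs {
		if prog.Type != elf.PT_INTERP {
			continue
		}
		buf := make([]byte, prog.Filesz)
		if _, err := io.ReadFull(prog.Open(), buf); err != nil {
			return "", nil, err
		}
		interp = strings.TrimRight(string(buf), "\x00")
	}
	libs, err = f.ImportedLibraries()
	return interp, libs, err
}

func main() {
	path := os.Args[0] // default: inspect this program itself
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	interp, libs, err := linkInfo(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot inspect:", err)
		return
	}
	if interp == "" && len(libs) == 0 {
		fmt.Println(path, "is statically linked")
		return
	}
	fmt.Println("loader:", interp)
	fmt.Println("needed libraries:", libs)
}
```

Run against the dynamically linked myhttpserver, this should report the same loader path and library names that ldd lists; if that loader path does not exist in the target image, the binary cannot start there.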

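I call this situation, where the build environment and the runtime environment are incompatible, a "heterogeneous image build". How do we solve it? Read on:

1. Static builds

Among mainstream programming languages Go is already one of the most portable. Since Go 1.5 in particular, the C code in the runtime has been rewritten in Go and the dependency on libc is minimal. Still, some features come in two implementations, one in C and one in Go, and by default (that is, with CGO_ENABLED=1) both the program and the precompiled standard library use the C implementation. For a detailed discussion see my earlier post "More on Go's Portability"; I won't repeat it here. So the debian family and alpine, which use different libc implementations, are naturally incompatible. To solve this, the first option is to build the Go program statically and put the statically built application into the alpine image.

Modify Dockerfile.build to add CGO_ENABLED=0 when compiling the Go source: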

// github.com/bigwhite/experiments/multi_stage_image_build/heterogeneous/Dockerfile.build

FROM golang:latest

WORKDIR /go/src
COPY httpserver.go .

RUN CGO_ENABLED=0 go build -o myhttpserver ./httpserver.go

Build this builder image:

# docker build -t myrepo/golang-static-builder:latest -f Dockerfile.build .
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM golang:latest
 ---> 1a34fad76b34
Step 2/4 : WORKDIR /go/src
 ---> 593cd9692019
Removing intermediate container ee005d487ad5
Step 3/4 : COPY httpserver.go .
 ---> a095eb69e716
Step 4/4 : RUN CGO_ENABLED=0 go build -o myhttpserver ./httpserver.go
 ---> Running in d9f3b3a6c36c
 ---> c06fe8dccbad
Removing intermediate container d9f3b3a6c36c
Successfully built c06fe8dccbad
Successfully tagged myrepo/golang-static-builder:latest

# docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
myrepo/golang-static-builder   latest              c06fe8dccbad        31 seconds ago      739MB

Next, build the final application image from the statically linked myhttpserver built inside golang-static-builder:

# docker create --name appsource myrepo/golang-static-builder:latest
# docker cp appsource:/go/src/myhttpserver ./
# ldd myhttpserver
    not a dynamic executable
# docker rm -f appsource
# docker rmi myrepo/golang-static-builder:latest
# docker build -t myrepo/myhttpserver-alpine:latest -f Dockerfile.alpine .

Run the new image:

# docker run myrepo/myhttpserver-alpine:latest
Webserver start
  -> listen on port:1111

Note: we can use strace to show that with static linking Go uses only its own runtime implementation and none of the code in libc.a:

# CGO_ENABLED=0 strace -f go build httpserver.go 2>&1 | grep open | grep -o '/.*\.a'  > go-static-build-strace-file-open.txt

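Open go-static-build-strace-file-open.txt and you will not find libc.a (on Ubuntu, libc.a normally lives under /usr/lib/x86_64-linux-gnu/), which shows that go build never tried to open libc.a to obtain symbol definitions from it.

2. Using the alpine golang builder

Our Go application runs in an alpine-based container, so we can build it with the alpine golang builder (no static linking needed). As mentioned earlier, golang has an alpine template: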

REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
golang                       alpine              9e3f14138abd        7 days ago          269MB

The Dockerfile for the alpine golang builder is:

//github.com/bigwhite/experiments/multi_stage_image_build/heterogeneous/Dockerfile.alpine.build

FROM golang:alpine

WORKDIR /go/src
COPY httpserver.go .

RUN go build -o myhttpserver ./httpserver.go

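The remaining steps are no different from those for the golang builder above: build the application with the alpine golang builder and package it into an alpine image. I won't repeat them here.

III. Multi-stage image builds: a better developer experience

Before Docker 17.05, images were built the way shown above. Even with the heterogeneous image builder pattern, we have to maintain two Dockerfiles and perform extra steps outside docker build, such as copying the application out of a container and cleaning up the build container and build image. The Docker community saw this problem and implemented multi-stage builds.

Here is the Dockerfile that a multi-stage build uses for the example above: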

//github.com/bigwhite/experiments/multi_stage_image_build/multi_stages/Dockerfile

FROM golang:alpine as builder

WORKDIR /go/src
COPY httpserver.go .

RUN go build -o myhttpserver ./httpserver.go

FROM alpine:latest

WORKDIR /root/
COPY --from=builder /go/src/myhttpserver .
RUN chmod +x /root/myhttpserver

ENTRYPOINT ["/root/myhttpserver"]

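After reading this Dockerfile, your first impression is probably that the two earlier Dockerfiles have been merged, each one forming a separate "stage". That is indeed the case, but the Dockerfile also gains some new syntax for linking the stages. For a Dockerfile like this, keep the following in mind:

  • A Dockerfile that supports multi-stage builds connects what used to be separate builds, letting a later stage use the artifacts of an earlier stage, so the stages form a chain;
  • A multi-stage build produces a single final image, rather than several temporary images or temporary containers that have to be cleaned up by hand. That is exactly what we want: only the result.

Build the example with a multi-stage build: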

# docker build -t myrepo/myhttserver-multi-stage:latest .
Sending build context to Docker daemon  3.072kB
Step 1/9 : FROM golang:alpine as builder
 ---> 9e3f14138abd
Step 2/9 : WORKDIR /go/src
 ---> Using cache
 ---> 7a99431d1be6
Step 3/9 : COPY httpserver.go .
 ---> 43a196658e09
Step 4/9 : RUN go build -o myhttpserver ./httpserver.go
 ---> Running in 9e7b46f68e88
 ---> 90dc73912803
Removing intermediate container 9e7b46f68e88
Step 5/9 : FROM alpine:latest
 ---> 053cde6e8953
Step 6/9 : WORKDIR /root/
 ---> Using cache
 ---> 30d95027ee6a
Step 7/9 : COPY --from=builder /go/src/myhttpserver .
 ---> f1620b64c1ba
Step 8/9 : RUN chmod +x /root/myhttpserver
 ---> Running in e62809993a22
 ---> 6be6c28f5fd6
Removing intermediate container e62809993a22
Step 9/9 : ENTRYPOINT /root/myhttpserver
 ---> Running in e4000d1dde3d
 ---> 639cec396c96
Removing intermediate container e4000d1dde3d
Successfully built 639cec396c96
Successfully tagged myrepo/myhttserver-multi-stage:latest

# docker images
REPOSITORY                       TAG                 IMAGE ID            CREATED             SIZE
myrepo/myhttserver-multi-stage   latest              639cec396c96        About an hour ago   16.3MB

Run the image:

# docker run myrepo/myhttserver-multi-stage:latest
Webserver start
  -> listen on port:1111

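IV. Summary

Multi-stage builds let developers build a small image from a single Dockerfile in one step and with less effort. The experience is good, and it is easier to plug into CI/CD and other automation. Note that multi-stage builds are only supported in Docker 17.05 and later. If you want to learn and try the feature but have no environment, you can use the lab environment provided by play-with-docker.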

img{512x368}
Play with Docker labs

All of the sample code above is available under github.com/bigwhite/experiments/multi_stage_image_build.



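Docker Single-Host Container Networking Revisited: Using iptables TRACE and ebtables log

I have been working on Kubernetes for the past six months or so. Every time I set up a Kubernetes cluster, its network plugins give me some trouble. So if you ask what the hardest part of Kubernetes is right now, I would say the Pod network, or at least it is one of the hardest. In addition, the Service- and Pod-centric Kubernetes architecture relies heavily on iptables rules to implement reverse proxying and load balancing for Services, and these rules are mixed in with the Linux bridge and iptables rules that Docker's native single-host container networking is built on, which makes troubleshooting harder still.

Last year I spent some time studying Docker networking, but looking back, my understanding of some key points was still fuzzy, so I used a free weekend to work through Docker single-host container networking again. This time I used the TRACE feature of iptables and ebtables at the data link layer, which let me see the packet flow in single-host container networking much more clearly. A solid understanding of container networking is also a great help for solving K8s Pod network problems later.

In a sense this post is me answering my own questions. I will not start from the very basics of the Docker network structure; readers unfamiliar with Docker single-host container networking can first read my earlier posts "Understanding Docker Single-Host Container Networking" and "Understanding Docker Container Networking: Linux Network Namespace".

I. Lab environment

1. Host environment and tool versions

Docker's default single-host container network has hardly changed since the earliest versions, so in theory the analysis below applies to most Docker versions. My lab environment: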

Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-63-generic x86_64)

# docker version
Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:18 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

# iptables --version
iptables v1.6.0
# ebtables --version
ebtables v2.0.10-4 (December 2011)

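2. Container network and topology

We need a container image for the experiment. Since we only test with ping packets, we build a simple image on the ubuntu:14.04 base image with the necessary network tools installed: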

//Dockerfile

FROM ubuntu:14.04
RUN apt-get update && apt-get install -y curl iptables
ENTRYPOINT ["tail", "-f", "/var/log/bootstrap.log"]

// Build the image:

# docker build -t foo:latest ./

Start two containers:

# docker run --name c1 -d --cap-add=NET_ADMIN foo:latest
7a01a19d9328b39f094c9a9c76340d179baaf93afb52189816bcc79f8319cb64
# docker run --name c2 -d --cap-add=NET_ADMIN foo:latest
94a2f1841f6d95fd0682299b17c0aedb60c1047786c8e75b0f1ab7316a995409

The network information after the containers start is summarized below:

# ifconfig -a
docker0   Link encap:Ethernet  HWaddr 02:42:ff:27:17:4d
          inet addr:192.168.0.1  Bcast:0.0.0.0  Mask:255.255.240.0
          ... ...

eth0      Link encap:Ethernet  HWaddr 00:16:3e:06:3a:3a
          inet addr:10.171.77.0  Bcast:10.171.79.255  Mask:255.255.248.0
          ... ...

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          ... ...

veth0594f4b Link encap:Ethernet  HWaddr 96:5b:d4:80:73:5f
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ... ...

veth57a3dec Link encap:Ethernet  HWaddr 02:52:e9:60:ea:b1
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          ... ...
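
The docker0 address and netmask in the ifconfig output determine the container subnet that shows up later in Docker's NAT rule (192.168.0.0/20). A minimal sketch of that conversion follows; cidrFromIfconfig is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
)

// cidrFromIfconfig converts an IPv4 address and dotted netmask, as printed
// by ifconfig, into the network in CIDR notation.
func cidrFromIfconfig(addr, mask string) (string, error) {
	ip := net.ParseIP(addr).To4()
	m := net.ParseIP(mask).To4()
	if ip == nil || m == nil {
		return "", fmt.Errorf("invalid IPv4 address or mask: %q %q", addr, mask)
	}
	ipMask := net.IPMask(m)
	ones, bits := ipMask.Size()
	if bits == 0 {
		return "", fmt.Errorf("mask %q is not a contiguous netmask", mask)
	}
	return fmt.Sprintf("%s/%d", ip.Mask(ipMask), ones), nil
}

func main() {
	// Values taken from the docker0 entry in the ifconfig output.
	cidr, err := cidrFromIfconfig("192.168.0.1", "255.255.240.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("docker0 network:", cidr)
}
```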

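To make this easier to follow, here is a simple diagram of the container network topology:

img{512x368}

II. Configuring the debugging tools

Docker single-host container networking uses a bridged network by default: every container is attached to the docker0 Linux bridge created by the Docker engine, so the kernel's handling of Linux bridges is the key to understanding Docker container networking.

Unlike a hardware bridge or switch, a Linux bridge also has layer-3 (IP layer) capabilities: docker0 is both a bridge and a network interface capable of layer-3 forwarding. Conventionally, in terms of the OSI seven-layer model, iptables works at layer 3, while a bridge is a layer-2 (data link layer) device. The Linux network stack's bridge implementation, however, hooks the iptables chains into the link-layer chains (ebtables), so IP packets can also be processed at layer 2; this is needed to implement a transparent bridging firewall. The implementation also guarantees that each packet traverses a given iptables chain only once, either at the link layer or at the network layer; a packet never goes through a chain at the link layer and then again at the network layer. The interaction between ebtables and iptables on a Linux bridge is described in detail in the document "ebtables/iptables interaction on a Linux-based bridge" on the netfilter website, which is also an important reference for the analysis that follows. The figure below is the complete netfilter packet flow diagram from that document; the analysis will return to it repeatedly (referred to below as the "full diagram"):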

img{512x368}
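Tip: right-click the image and open it in a new tab to view it at full size.

The path a packet takes through the iptables chains is shown below:

img{512x368}

1. Setting up the iptables TRACE target

In this experiment we mainly want to see the path packets take, so we need to trace the iptables data flow. In the past I used the LOG target or the mark set & match approach to follow packets through iptables, but neither is ideal: you have to insert LOG targets, or matches on a mark set at the entry point, into specific flows, which is intrusive to the iptables rules and makes debugging and observation complicated. iptables itself provides the TRACE feature: once it is set, whenever a traced packet hits a rule in any table on any chain, a log entry with the packet's state at that point is written to the system log (/var/log/syslog).

Let's add TRACE to the iptables rules. The TRACE target can only be added in the raw table, which has two built-in chains: PREROUTING and OUTPUT, representing the entry point for packets arriving on a network interface and the exit point for packets sent by local processes, respectively. We add the TRACE target on both chains: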

# iptables -t raw -A OUTPUT -p icmp -j TRACE
# iptables -t raw -A PREROUTING -p icmp -j TRACE

Note: we test with ICMP (the protocol used by ping), so we only TRACE ICMP request and reply packets.
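
Each TRACE log entry carries a prefix of the form "TRACE: tablename:chainname:type:rulenum", where type is rule, return, or policy. A minimal sketch for pulling that prefix out of a syslog line follows, so the hops can be listed in order; parseTrace and the sample line are illustrative assumptions rather than output captured from this experiment.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// traceHop is one step of a packet through iptables as reported by TRACE.
type traceHop struct {
	Table, Chain, Kind string
	RuleNum            int
}

// parseTrace extracts the "table:chain:type:rulenum" token that follows
// "TRACE: " in a kernel log line. It returns false for lines without
// such a token.
func parseTrace(line string) (traceHop, bool) {
	const marker = "TRACE: "
	i := strings.Index(line, marker)
	if i < 0 {
		return traceHop{}, false
	}
	fields := strings.Fields(line[i+len(marker):])
	if len(fields) == 0 {
		return traceHop{}, false
	}
	parts := strings.Split(fields[0], ":")
	if len(parts) != 4 {
		return traceHop{}, false
	}
	n, err := strconv.Atoi(parts[3])
	if err != nil {
		return traceHop{}, false
	}
	return traceHop{Table: parts[0], Chain: parts[1], Kind: parts[2], RuleNum: n}, true
}

func main() {
	// Hypothetical log line in the TRACE format, shortened for readability.
	line := "kernel: TRACE: raw:OUTPUT:policy:2 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 PROTO=ICMP"
	if hop, ok := parseTrace(line); ok {
		fmt.Printf("%s table, %s chain, %s #%d\n", hop.Table, hop.Chain, hop.Kind, hop.RuleNum)
	}
}
```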

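2. ebtables debugging setup

Our focus is iptables; ebtables is only an aid that helps us see at which layer a packet is hooked into the iptables chains. So we add a LOG rule to every ebtables built-in chain in the full diagram (ebtables does not support TRACE yet):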

# ebtables -t broute -A BROUTING -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:broute:BROUTING" -j ACCEPT
# ebtables -t nat -A OUTPUT -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:nat:OUTPUT"  -j ACCEPT
# ebtables -t nat -A PREROUTING -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:nat:PREROUTING" -j ACCEPT
# ebtables -t filter -A INPUT -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:filter:INPUT" -j ACCEPT
# ebtables -t filter -A FORWARD -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:filter:FORWARD" -j ACCEPT
# ebtables -t filter -A OUTPUT -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:filter:OUTPUT" -j ACCEPT
# ebtables -t nat -A POSTROUTING -p ipv4 --ip-proto 1 --log-level 6 --log-ip --log-prefix "TRACE: eb:nat:POSTROUTING" -j ACCEPT

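Note: --ip-proto 1 here means that only ICMP packets are matched.

3. Full iptables and ebtables rule sets

After starting the two containers and adding the rules above, the current iptables rules are as follows (iptables-save output, organized by table):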

# iptables-save
# Generated by iptables-save v1.6.0 on Sun Nov  5 14:50:46 2017
*raw
:PREROUTING ACCEPT [1564539:108837380]
:OUTPUT ACCEPT [1504962:130805835]
-A PREROUTING -p icmp -j TRACE
-A OUTPUT -p icmp -j TRACE
COMMIT
# Completed on Sun Nov  5 14:50:46 2017
# Generated by iptables-save v1.6.0 on Sun Nov  5 14:50:46 2017
*filter
:INPUT ACCEPT [1564535:108837044]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [1504968:130806627]
:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Sun Nov  5 14:50:46 2017
# Generated by iptables-save v1.6.0 on Sun Nov  5 14:50:46 2017
*nat
:PREROUTING ACCEPT [280:14819]
:INPUT ACCEPT [278:14651]
:OUTPUT ACCEPT [639340:38370263]
:POSTROUTING ACCEPT [639342:38370431]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 192.168.0.0/20 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Sun Nov  5 14:50:46 2017

And the ebtables rules are:

# ebtables-save
# Generated by ebtables-save v1.0 on Sun Nov  5 16:51:50 CST 2017
*nat
:PREROUTING ACCEPT
:OUTPUT ACCEPT
:POSTROUTING ACCEPT
-A PREROUTING -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:nat:PREROUTING" --log-ip -j ACCEPT
-A OUTPUT -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:nat:OUTPUT" --log-ip -j ACCEPT
-A POSTROUTING -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:nat:POSTROUTING" --log-ip -j ACCEPT

*broute
:BROUTING ACCEPT
-A BROUTING -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:broute:BROUTING" --log-ip -j ACCEPT

*filter
:INPUT ACCEPT
:FORWARD ACCEPT
:OUTPUT ACCEPT
-A INPUT -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:filter:INPUT" --log-ip -j ACCEPT
-A FORWARD -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:filter:FORWARD" --log-ip -j ACCEPT
-A OUTPUT -p IPv4 --ip-proto icmp --log-level info --log-prefix "TRACE: eb:filter:OUTPUT" --log-ip -j ACCEPT

iptables can also list the rules in another layout. Here we list the rules of the two important tables, filter and nat (with rule numbers, to make the match analysis below easier to follow):

# iptables -nL --line-numbers -v -t filter
Chain INPUT (policy ACCEPT 2558K packets, 178M bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy DROP 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1       10   840 DOCKER-USER  all  --  *      *       0.0.0.0/0            0.0.0.0/0
2       10   840 DOCKER-ISOLATION  all  --  *      *       0.0.0.0/0            0.0.0.0/0
3        7   588 ACCEPT     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
4        3   252 DOCKER     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0
5        0     0 ACCEPT     all  --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0
6        3   252 ACCEPT     all  --  docker0 docker0  0.0.0.0/0            0.0.0.0/0

Chain OUTPUT (policy ACCEPT 2460K packets, 214M bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain DOCKER (1 references)
num   pkts bytes target     prot opt in     out     source               destination

Chain DOCKER-ISOLATION (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1       10   840 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0

Chain DOCKER-USER (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1       10   840 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0

# iptables -nL --line-numbers -v -t nat
Chain PREROUTING (policy ACCEPT 884 packets, 46522 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1      881 46270 DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT 881 packets, 46270 bytes)
num   pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 1048K packets, 63M bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 DOCKER     all  --  *      *       0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 1048K packets, 63M bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 MASQUERADE  all  --  *      !docker0  192.168.0.0/20       0.0.0.0/0

Chain DOCKER (2 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 RETURN     all  --  docker0 *       0.0.0.0/0            0.0.0.0/0

III. Container to Container

Next we look at how container network packets flow in three scenarios, starting with Container to Container.

img{512x368}

In container c1 we ping c2 three times:

# docker exec c1 ping -c 3 192.168.0.3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.226 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=0.159 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=0.185 ms

--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.159/0.190/0.226/0.027 ms

In container c1 (192.168.0.2), the ICMP request is sent by the ping program (a local process in the c1 namespace). The routing table in the c1 network namespace is:

# docker exec c1 netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         192.168.0.1     0.0.0.0         UG        0 0          0 eth0
192.168.0.0     0.0.0.0         255.255.240.0   U         0 0          0 eth0

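The destination container address, 192.168.0.3, is on c1's directly connected network, so the packet takes the second route (the direct route, not the default route) and is sent out through eth0.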
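The route selection described above can be sketched in a few lines of Python. This is only an illustration of longest-prefix matching using the two routes from the netstat output; the ROUTES list and the lookup function are hypothetical names, and the kernel's real routing lookup is considerably more involved.

```python
import ipaddress

# Routes copied from the `netstat -rn` output of c1:
# (destination network, gateway or None for a direct route, interface)
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0.0.0.0"), "192.168.0.1", "eth0"),
    (ipaddress.ip_network("192.168.0.0/255.255.240.0"), None, "eth0"),
]

def lookup(dst):
    """Return the most specific route (longest prefix) that contains dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [route for route in ROUTES if addr in route[0]]
    return max(candidates, key=lambda route: route[0].prefixlen)

# c2 is on the directly connected /20, so no gateway is involved.
print(lookup("192.168.0.3"))
# An external address only matches the default route via 192.168.0.1.
print(lookup("10.28.61.30"))
```

For 192.168.0.3 both routes match, and the /20 wins over the /0 default route, which is why the frame is addressed directly to c2's MAC rather than to docker0.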

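Because eth0 in the c1 namespace is connected, via a veth pair, to a slave port of the docker0 bridge in the host namespace, the packet enters the docker0 bridge through the slave port veth0594f4b.

A few more words on the Linux bridge device. A bridge in Linux is a virtual device that depends on one or more slave devices. It is not a mirror device at the same level as its slaves; rather, it is a higher-level device created by the kernel, which turns the slave devices into ports and handles sending, receiving, and forwarding for each of them. A bridge device is built on top of its slaves (which may be physical devices, VLAN devices, and so on), and we can assign an IP address to the bridge (by default, the bridge's MAC address is the lowest MAC address among its slaves), so that the host can communicate with other hosts on the network through the bridge device. In addition, once a network device is "plugged into" a Linux bridge it becomes a slave of the bridge and is turned into a port; its own IP and MAC are no longer usable, as if the bridge had stripped it of the right to be handled by the kernel network stack. Slave devices are set to accept any packet, the handling of packets arriving on them is handed to the bridge, and the bridge decides where each packet goes: deliver it locally, forward it, or drop it.

So when the docker0 bridge in the host namespace receives the ICMP request from slave port veth0594f4b, we will not see the veth0594f4b netdev being processed separately by the kernel network stack (for example, making its own pass through the ebtables and iptables chains); the packet enters the bridge processing logic instead (this is a good moment to look back at the full diagram). Now that the packet is in the host namespace, we can follow its path through the TRACE and LOG output of iptables and ebtables: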

1. start -> bridge check -> link layer

TRACE: eb:broute:BROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: eb:nat:PREROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1

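From the first trace logs we can see that after the bridge check (the input device belongs to a Linux bridge), the packet enters the link layer; and after the link layer's BROUTING built-in chain the packet is not handed up to the network layer but continues its journey in the link layer, entering the link layer's nat:PREROUTING.

2. Calling iptables chain rules in the link layer

Putting the full diagram and the log output together: after the link layer's nat:PREROUTING, the link layer calls the upper-layer iptables rules raw:PREROUTING and nat:PREROUTING: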

TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: nat:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1

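The TRACE target writes a log entry whenever the packet matches a rule or a policy in a chain of a table. The log prefix has the format "TRACE: tablename:chainname:type:rulenum". When an ordinary rule is matched, type is "rule"; when the RETURN of a user-defined chain is hit, type is "return"; when the default policy of a built-in chain (PREROUTING, INPUT, OUTPUT, FORWARD, or POSTROUTING) is matched, type is "policy".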
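When reading through long trace logs it helps to pull the prefix fields apart programmatically. The following is a minimal sketch, assuming log lines shaped like the ones shown in this article; TRACE_RE and parse_trace are hypothetical helper names, not part of iptables.

```python
import re

# Matches the iptables TRACE prefix "TRACE: table:chain:type:rulenum ".
# The ebtables LOG lines in this article ("TRACE: eb:nat:PREROUTING ...")
# have no type/rulenum part and are deliberately not matched.
TRACE_RE = re.compile(
    r"TRACE: (?P<table>[\w-]+):(?P<chain>[\w-]+):"
    r"(?P<type>rule|return|policy):(?P<rulenum>\d+) "
)

def parse_trace(line):
    """Return the prefix fields as a dict, or None for non-iptables lines."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["rulenum"] = int(fields["rulenum"])
    return fields

sample = ("TRACE: filter:DOCKER-USER:return:1 IN=docker0 OUT=docker0 "
          "PHYSIN=veth0594f4b PHYSOUT=veth57a3dec SRC=192.168.0.2 DST=192.168.0.3")
print(parse_trace(sample))
print(parse_trace("TRACE: eb:broute:BROUTING IN=veth0594f4b OUT= proto = 0x0800"))
```

Feeding every line of the syslog excerpt through parse_trace gives the ordered list of (table, chain, type, rulenum) steps for a packet, which is easier to compare against the full diagram than the raw log.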

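Judging from the log above, the TRACE rule in the raw table's PREROUTING chain apparently does not produce a trace entry for itself, so the first entry shows the packet hitting the default policy (ACCEPT) of the raw table's built-in PREROUTING chain, with num=2 (the position in the combined ordering of rules and policy). In the nat table's PREROUTING chain the packet likewise only hit the default policy; rule 1 (target: DOCKER) did not match.

One odd thing is that the mangle table produced no output at all, not even for its default policy; the reason is unclear for now.

3. Bridge decision

From the full diagram and the subsequent logs we get the outcome of the bridge decision: the packet keeps being processed in the link layer, moving all the way to the right of the diagram. Along the way, however, iptables rules are still called: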

TRACE: eb:filter:FORWARD IN=veth0594f4b OUT=veth57a3dec MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: filter:FORWARD:rule:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:DOCKER-USER:return:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:FORWARD:rule:2 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:DOCKER-ISOLATION:return:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:FORWARD:rule:4 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:DOCKER:return:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1
TRACE: filter:FORWARD:rule:6 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1

What is the bridge decision based on? "ebtables/iptables interaction on a Linux-based bridge" gives us some answers:

The bridge's decision for a frame can be one of these:

* bridge it, if the destination MAC address is on another side of the bridge;
* flood it over all the forwarding bridge ports, if the position of the box with the destination MAC is unknown to the bridge;
* pass it to the higher protocol code (the IP code), if the destination MAC address is that of the bridge or of one of its ports;
* ignore it, if the destination MAC address is located on the same side of the bridge.

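Even with these rules I am still somewhat puzzled, because what actually happens is that the frame stays in the link layer while rules from the network layer above are mixed into its processing.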
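The four outcomes quoted from the netfilter document can be written down as a small decision function. This is a simplified sketch, not the kernel's implementation: bridge_decision, the fdb dictionary (learned MAC to port), and local_macs are all hypothetical names, and the MAC addresses and port names are the ones that appear in the logs of this article.

```python
def bridge_decision(dst_mac, in_port, fdb, local_macs):
    """Classify a frame the way the quoted rules describe.

    fdb maps a learned MAC address to the bridge port behind which it lives;
    local_macs holds the MAC addresses of the bridge and of its ports.
    """
    if dst_mac in local_macs:
        return ("local", None)      # pass it to the higher protocol (IP) code
    out_port = fdb.get(dst_mac)
    if out_port is None:
        return ("flood", None)      # position of the destination is unknown
    if out_port == in_port:
        return ("ignore", None)     # destination is on the same side
    return ("bridge", out_port)     # destination is on another side

fdb = {
    "02:42:c0:a8:00:02": "veth0594f4b",   # c1
    "02:42:c0:a8:00:03": "veth57a3dec",   # c2
}
local_macs = {"02:42:ff:27:17:4d"}        # docker0, as seen in the logs

# c1 -> c2: stays in the link layer and is bridged to c2's port.
print(bridge_decision("02:42:c0:a8:00:03", "veth0594f4b", fdb, local_macs))
# c1 -> external host: the frame is addressed to docker0, so it goes up to IP.
print(bridge_decision("02:42:ff:27:17:4d", "veth0594f4b", fdb, local_macs))
```

The first call corresponds to the container-to-container trace (eb:filter:FORWARD with OUT=veth57a3dec); the second corresponds to frames whose destination MAC is docker0 itself, which are handed up to the network layer.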

You may also notice that the MAC value in the iptables log looks odd (for example MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00) and is very long. It is actually a concatenation of three fields: the destination MAC, the source MAC, and the frame type.

02:42:c0:a8:00:03 : Destination MAC
02:42:c0:a8:00:02 : Source MAC
08:00 : Type (the Ethernet frame carries an IPv4 datagram)
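The same split can be done mechanically. A minimal sketch, assuming the MAC= field always holds 14 colon-separated bytes (6 destination, 6 source, 2 EtherType) as in the logs above; split_mac_field is a hypothetical helper name.

```python
def split_mac_field(value):
    """Split the MAC= field of an iptables log line into its three parts."""
    parts = value.split(":")
    if len(parts) != 14:
        raise ValueError("expected 14 colon-separated bytes, got %d" % len(parts))
    dst = ":".join(parts[0:6])                   # destination MAC comes first
    src = ":".join(parts[6:12])                  # then the source MAC
    ethertype = int("".join(parts[12:14]), 16)   # 0x0800 means IPv4
    return dst, src, ethertype

dst, src, ethertype = split_mac_field("02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00")
print(dst, src, hex(ethertype))
```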

4. eb:nat:POSTROUTING -> nat:POSTROUTING -> egress (qdisc)

Finally the packet enters the link layer's POSTROUTING built-in chain:

TRACE: eb:nat:POSTROUTING IN= OUT=veth57a3dec MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: nat:POSTROUTING:policy:2 IN= OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47066 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=1

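The packet did not match the rule that the Docker engine added to iptables nat:POSTROUTING (the MASQUERADE rule), so the default policy was logged.

Once in egress (qdisc), the packet has effectively arrived at the other slave port on the bridge (veth57a3dec). It must now be put back on the wire, so it goes into eth0 of container c2. Having left the host namespace, it can no longer be followed in our logs.

Container c2 lives in a network namespace that is independent of the host namespace, so it has its own iptables rules (all default ACCEPT if none are set) and is not affected by the iptables rules in the host namespace.

5. The "missing" iptables nat:PREROUTING and nat:POSTROUTING

The path of c2's ping reply is very similar to that of the request; here is the full log in one go: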

TRACE: eb:broute:BROUTING IN=veth57a3dec OUT= MAC source = 02:42:c0:a8:00:03 MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.3 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:nat:PREROUTING IN=veth57a3dec OUT= MAC source = 02:42:c0:a8:00:03 MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.3 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth57a3dec MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1

TRACE: eb:filter:FORWARD IN=veth57a3dec OUT=veth0594f4b MAC source = 02:42:c0:a8:00:03 MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.3 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: filter:FORWARD:rule:1 IN=docker0 OUT=docker0 PHYSIN=veth57a3dec PHYSOUT=veth0594f4b MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1
TRACE: filter:DOCKER-USER:return:1 IN=docker0 OUT=docker0 PHYSIN=veth57a3dec PHYSOUT=veth0594f4b MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1
TRACE: filter:FORWARD:rule:2 IN=docker0 OUT=docker0 PHYSIN=veth57a3dec PHYSOUT=veth0594f4b MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1
TRACE: filter:DOCKER-ISOLATION:return:1 IN=docker0 OUT=docker0 PHYSIN=veth57a3dec PHYSOUT=veth0594f4b MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1
TRACE: filter:FORWARD:rule:3 IN=docker0 OUT=docker0 PHYSIN=veth57a3dec PHYSOUT=veth0594f4b MAC=02:42:c0:a8:00:02:02:42:c0:a8:00:03:08:00 SRC=192.168.0.3 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=5962 PROTO=ICMP TYPE=0 CODE=0 ID=90 SEQ=1

TRACE: eb:nat:POSTROUTING IN= OUT=veth0594f4b MAC source = 02:42:c0:a8:00:03 MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.3 IP DST=192.168.0.2, IP tos=0x00, IP proto=1

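Looking closely, although the path resembles that of the request, there is a difference: iptables nat:PREROUTING and nat:POSTROUTING have disappeared. Why? That is how iptables is designed. iptables tracks connection state: once the first packet of a connection has passed through, the connection state changes from NEW to ESTABLISHED. For subsequent packets of an ESTABLISHED connection, the kernel automatically applies whatever was decided for the first packet at the nat:PREROUTING and nat:POSTROUTING stages, and those packets no longer go through the nat table logic in these two chains. ebtables does not seem to have such logic.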
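The "nat table only for the first packet" behaviour can be illustrated with a toy model. This is a deliberately simplified sketch and not how the kernel's connection tracking is implemented: ToyConntrack, its flow key (the unordered address pair plus the ICMP ID), and the stored "no NAT" decision are all assumptions made for the illustration.

```python
class ToyConntrack:
    """Toy model: the nat table is consulted only for the first packet of a flow."""

    def __init__(self):
        self.flows = {}        # flow key -> NAT decision remembered from packet 1
        self.nat_lookups = 0   # how many times the nat table was traversed

    @staticmethod
    def _key(src, dst, icmp_id):
        # Both directions of an ICMP echo exchange map to the same key.
        return (frozenset((src, dst)), icmp_id)

    def process(self, src, dst, icmp_id):
        key = self._key(src, dst, icmp_id)
        if key not in self.flows:
            self.nat_lookups += 1          # NEW flow: walk the nat chains once
            self.flows[key] = "no NAT"     # hypothetical outcome of that walk
            return "NEW"
        return "ESTABLISHED"               # reuse the remembered decision

ct = ToyConntrack()
print(ct.process("192.168.0.2", "192.168.0.3", 90))  # request 1
print(ct.process("192.168.0.3", "192.168.0.2", 90))  # reply 1
print(ct.process("192.168.0.2", "192.168.0.3", 90))  # request 2
print(ct.nat_lookups)
```

Only the first call walks the nat chains in this model, which mirrors the logs: nat:PREROUTING and nat:POSTROUTING appear for request 1 but not for the reply or for request 2.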

The second and third rounds of the ping confirm this design. Only ping request packet 2 is listed here:

TRACE: eb:broute:BROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: eb:nat:PREROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: eb:filter:FORWARD IN=veth0594f4b OUT=veth57a3dec MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1
TRACE: filter:FORWARD:rule:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: filter:DOCKER-USER:return:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: filter:FORWARD:rule:2 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: filter:DOCKER-ISOLATION:return:1 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: filter:FORWARD:rule:3 IN=docker0 OUT=docker0 PHYSIN=veth0594f4b PHYSOUT=veth57a3dec MAC=02:42:c0:a8:00:03:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.3 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=47310 DF PROTO=ICMP TYPE=8 CODE=0 ID=90 SEQ=2
TRACE: eb:nat:POSTROUTING IN= OUT=veth57a3dec MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:c0:a8:00:03 proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.3, IP tos=0x00, IP proto=1

For the complete log, see the file docker-bridge-network-demo-iptables-trace-log.txt; it is not repeated here.

IV. Local Process to Container

img{512x368}

Many of the tricky points were already cleared up in the container-to-container analysis above, so the local process to container and container to external flows are described in less detail.

On the host we ping c1 three times:

# ping -c 3 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.160 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.105 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.131 ms

--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.105/0.132/0.160/0.022 ms

1. local process -> routing decision -> iptables OUTPUT chain

The ping request is sent by the local ping process. After routing on the destination address, docker0 is selected as the OUT device:

TRACE: raw:OUTPUT:policy:2 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0
TRACE: mangle:OUTPUT:policy:1 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0
TRACE: nat:OUTPUT:policy:2 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0
TRACE: filter:OUTPUT:policy:1 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0

Oddly, this time the mangle table does produce trace output.

2. Entering the link layer: iptables POSTROUTING -> ebtables OUTPUT -> ebtables POSTROUTING

Because the OUT device is a bridge, the packet has to make a pass through ebtables:

TRACE: mangle:POSTROUTING:policy:1 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0
TRACE: nat:POSTROUTING:policy:2 IN= OUT=docker0 SRC=192.168.0.1 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=18692 DF PROTO=ICMP TYPE=8 CODE=0 ID=30245 SEQ=1 UID=0 GID=0
TRACE: eb:nat:OUTPUT IN= OUT=veth57a3dec MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:filter:OUTPUT IN= OUT=veth57a3dec MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:nat:POSTROUTING IN= OUT=veth57a3dec MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:nat:OUTPUT IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:filter:OUTPUT IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:nat:POSTROUTING IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=192.168.0.1 IP DST=192.168.0.2, IP tos=0x00, IP proto=1

The ICMP reply is similar to the container-to-container case: it enters through the link layer (because docker0 is a bridge device) and, after the bridge decision, goes up to the INPUT chain:

TRACE: eb:broute:BROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:ff:27:17:4d proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.1, IP tos=0x00, IP proto=1
TRACE: eb:nat:PREROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:ff:27:17:4d proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.1, IP tos=0x00, IP proto=1
TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56535 PROTO=ICMP TYPE=0 CODE=0 ID=30245 SEQ=1
TRACE: mangle:PREROUTING:policy:1 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56535 PROTO=ICMP TYPE=0 CODE=0 ID=30245 SEQ=1
TRACE: eb:filter:INPUT IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:ff:27:17:4d proto = 0x0800 IP SRC=192.168.0.2 IP DST=192.168.0.1, IP tos=0x00, IP proto=1
TRACE: mangle:INPUT:policy:1 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56535 PROTO=ICMP TYPE=0 CODE=0 ID=30245 SEQ=1
TRACE: filter:INPUT:policy:1 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=192.168.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56535 PROTO=ICMP TYPE=0 CODE=0 ID=30245 SEQ=1

We can compare the above with a ping that goes out through a non-bridge device. On the host we ping a host on another LAN:

# ping -c 1 10.28.61.30
PING 10.28.61.30 (10.28.61.30) 56(84) bytes of data.
64 bytes from 10.28.61.30: icmp_seq=1 ttl=57 time=1.09 ms

--- 10.28.61.30 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.093/1.093/1.093/0.000 ms

The resulting trace log is:

icmp request:

TRACE: raw:OUTPUT:policy:2 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0
TRACE: mangle:OUTPUT:policy:1 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0
TRACE: nat:OUTPUT:policy:2 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0
TRACE: filter:OUTPUT:policy:1 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0
TRACE: mangle:POSTROUTING:policy:1 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0
TRACE: nat:POSTROUTING:policy:2 IN= OUT=eth0 SRC=10.171.77.0 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=4494 DF PROTO=ICMP TYPE=8 CODE=0 ID=30426 SEQ=1 UID=0 GID=0

icmp response:

TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=61118 PROTO=ICMP TYPE=0 CODE=0 ID=30426 SEQ=1
TRACE: mangle:PREROUTING:policy:1 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=61118 PROTO=ICMP TYPE=0 CODE=0 ID=30426 SEQ=1
TRACE: mangle:INPUT:policy:1 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=61118 PROTO=ICMP TYPE=0 CODE=0 ID=30426 SEQ=1
TRACE: filter:INPUT:policy:1 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=61118 PROTO=ICMP TYPE=0 CODE=0 ID=30426 SEQ=1

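Comparing against the full diagram: when the request goes out, the OUT device is found not to be a bridge, so the packet goes straight through the network-layer iptables rules, leaves via xfrm lookup, and reaches egress (qdisc). When the reply comes back, the bridge check finds that the IN device eth0 does not belong to a bridge, so the packet goes straight up to the network layer and through the iptables chains to the local process. ebtables does not log a single line.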
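Whether a given device is a bridge, a bridge port, or neither can be checked from user space through sysfs, where a bridge device has a "bridge" subdirectory and an enslaved port has a "brport" entry. The sketch below only inspects those paths; classify_netdev is a hypothetical helper name, and the sysfs_root parameter exists purely so the function can be pointed at a different directory.

```python
import os

def classify_netdev(name, sysfs_root="/sys/class/net"):
    """Return 'bridge', 'bridge port', or 'plain' for a network device."""
    base = os.path.join(sysfs_root, name)
    if not os.path.isdir(base):
        raise FileNotFoundError(name)
    if os.path.isdir(os.path.join(base, "bridge")):
        return "bridge"          # e.g. docker0
    if os.path.exists(os.path.join(base, "brport")):
        return "bridge port"     # e.g. a veth attached to docker0
    return "plain"               # e.g. eth0 in this experiment

if __name__ == "__main__":
    root = "/sys/class/net"
    if os.path.isdir(root):
        for dev in sorted(os.listdir(root)):
            if os.path.isdir(os.path.join(root, dev)):
                print(dev, classify_netdev(dev, root))
```

On the host used in this article one would expect docker0 to be reported as a bridge, the two veth devices as bridge ports, and eth0 as plain, which matches which packets do and do not show up in the ebtables logs.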

The next two ICMP request/reply pairs are largely the same, and they still skip nat PREROUTING and nat POSTROUTING because the connection is no longer NEW.

V. Container to External

img{512x368}

In container c1 we ping an external node three times:

# docker exec c1 ping -c 3 10.28.61.30
PING 10.28.61.30 (10.28.61.30) 56(84) bytes of data.
64 bytes from 10.28.61.30: icmp_seq=1 ttl=56 time=1.32 ms
64 bytes from 10.28.61.30: icmp_seq=2 ttl=56 time=1.30 ms
64 bytes from 10.28.61.30: icmp_seq=3 ttl=56 time=1.21 ms

--- 10.28.61.30 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 1.219/1.280/1.323/0.060 ms

1. start -> bridge check -> link layer

The start is very similar to Container to Container: after the bridge check the packet enters the link layer (docker0 is a bridge), where the iptables PREROUTING rules are processed, up to the point just before the bridge decision:

TRACE: eb:broute:BROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:ff:27:17:4d proto = 0x0800 IP SRC=192.168.0.2 IP DST=10.28.61.30, IP tos=0x00, IP proto=1
TRACE: eb:nat:PREROUTING IN=veth0594f4b OUT= MAC source = 02:42:c0:a8:00:02 MAC dest = 02:42:ff:27:17:4d proto = 0x0800 IP SRC=192.168.0.2 IP DST=10.28.61.30, IP tos=0x00, IP proto=1
TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: mangle:PREROUTING:policy:1 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: nat:PREROUTING:policy:2 IN=docker0 OUT= PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1

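2. ebtables filter:INPUT -> routing decision -> iptables FORWARD

The destination is the IP of an external host, so layer 3 has to step in and forward. The packet goes up via eb:filter:INPUT to the routing decision in the network layer and, according to the routing table, is forwarded to eth0: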

TRACE: mangle:FORWARD:policy:1 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:1 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: filter:DOCKER-USER:return:1 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:2 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: filter:DOCKER-ISOLATION:return:1 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:5 IN=docker0 OUT=eth0 PHYSIN=veth0594f4b MAC=02:42:ff:27:17:4d:02:42:c0:a8:00:02:08:00 SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1

3. iptables nat:POSTROUTING matches rule 1

Because the packet is leaving the host, in the final iptables nat:POSTROUTING stage it matches rule 1, i.e. MASQUERADE, which rewrites the packet's source address to the host IP, 10.171.77.0.
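Why rule 1 matches here but not in the earlier scenarios follows directly from its two conditions (-s 192.168.0.0/20 and ! -o docker0). A small sketch that evaluates just those two conditions; masquerade_matches is a hypothetical helper name and ignores everything else iptables does.

```python
import ipaddress

DOCKER_SUBNET = ipaddress.ip_network("192.168.0.0/20")

def masquerade_matches(src, out_iface):
    """Evaluate: -A POSTROUTING -s 192.168.0.0/20 ! -o docker0 -j MASQUERADE"""
    from_container = ipaddress.ip_address(src) in DOCKER_SUBNET
    leaves_bridge = out_iface != "docker0"
    return from_container and leaves_bridge

print(masquerade_matches("192.168.0.2", "eth0"))     # container to external
print(masquerade_matches("192.168.0.2", "docker0"))  # container to container
print(masquerade_matches("10.171.77.0", "eth0"))     # host to external
```

Only the container-to-external case satisfies both conditions, which is consistent with the logs: nat:POSTROUTING:rule:1 appears in this section, while the other scenarios log nat:POSTROUTING:policy:2.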

TRACE: mangle:POSTROUTING:policy:1 IN= OUT=eth0 PHYSIN=veth0594f4b SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1
TRACE: nat:POSTROUTING:rule:1 IN= OUT=eth0 PHYSIN=veth0594f4b SRC=192.168.0.2 DST=10.28.61.30 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=57351 DF PROTO=ICMP TYPE=8 CODE=0 ID=94 SEQ=1

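4. iptables PREROUTING, FORWARD, POSTROUTING -> ebtables OUTPUT, POSTROUTING

The IN device of the returning reply is eth0, so it goes straight up to the network layer for iptables chain processing. After routing, the OUT device is docker0 (a bridge device), so in the final stage the packet has to drop down to the link layer for OUTPUT and POSTROUTING processing: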

TRACE: raw:PREROUTING:policy:2 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: mangle:PREROUTING:policy:1 IN=eth0 OUT= MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=10.171.77.0 LEN=84 TOS=0x00 PREC=0x00 TTL=57 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: mangle:FORWARD:policy:1 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:1 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: filter:DOCKER-USER:return:1 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:2 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: filter:DOCKER-ISOLATION:return:1 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: filter:FORWARD:rule:3 IN=eth0 OUT=docker0 MAC=00:16:3e:06:3a:3a:00:2a:6a:aa:12:7c:08:00 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: mangle:POSTROUTING:policy:1 IN= OUT=docker0 SRC=10.28.61.30 DST=192.168.0.2 LEN=84 TOS=0x00 PREC=0x00 TTL=56 ID=58706 PROTO=ICMP TYPE=0 CODE=0 ID=94 SEQ=1
TRACE: eb:nat:OUTPUT IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=10.28.61.30 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:filter:OUTPUT IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=10.28.61.30 IP DST=192.168.0.2, IP tos=0x00, IP proto=1
TRACE: eb:nat:POSTROUTING IN= OUT=veth0594f4b MAC source = 02:42:ff:27:17:4d MAC dest = 02:42:c0:a8:00:02 proto = 0x0800 IP SRC=10.28.61.30 IP DST=192.168.0.2, IP tos=0x00, IP proto=1

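The subsequent requests and replies are essentially the same; what is missing is again nat PREROUTING and nat POSTROUTING, because the connection is no longer NEW.

VI. Summary

My personal impression: iptables rules are still too complicated, and with the bridge's ebtables rules on top they become rather dizzying. Once kube-proxy's rules are blended in with Docker's, the iptables rule list gets even longer and more complex. The stable kube-proxy still uses iptables as its main implementation mechanism, but ipvs support in kube-proxy is on its way (ipvs is in alpha in Kubernetes 1.8), so hopefully we will have more options in the future.

The complete log of this experiment is in the file docker-bridge-network-demo-iptables-trace-log.txt.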

Weibo: @tonybai_cn
WeChat official account: iamtonybai
github.com: https://github.com/bigwhite



