标签 Linux 下的文章

使用Visual Studio Code辅助Go源码编写

作为VIMer,日常编码中,Vim编辑器依然是我的首选。以前以C语言为主要语言的时候是这样,现在以Go为主要语言时亦是这样。不过近期发现Mac上使用Vim在编写Go代码时,Vim时不时的“抽风”:出现一些“屏幕字符被篡改”的问题,比如下面这幅图中”func”变成了”fknc”:

虽然一段时间后,显示会自动更正过来,但这种“篡改”是会让你产生“幻觉”的。你会想:是不是我真的将”func”写成”fknc”了呢?久而久之,这个瑕疵将会影响你的编码效率。至于为何会出现这个问题,初步怀疑可能是因为vim加载较多插件导致的一些性能问题,我在安装了Ubuntu 16.04的台式机上至今还没发现这个问题(相同的.vimrc配置)。

于是,我打算找一款辅助编辑器,用于在被上面这个问题折磨得开始“厌恶”Vim的某些时候,切换一下,平复一下心情^0^。我看中了Microsoft开源的Visual Studio Code,简称:VSCode。

一、与Microsoft的Visual Studio的渊源

Microsoft做IDE还是很专业的,也是很认真的。大学那时候学C,嫌弃Turbo C太简陋,基本上都是在D版Visual Studio 6.0上完成各种作业和小程序的制作的。后来在2001年微软发布了.net战略,发布了C#语言,同时也发布了Visual Studio .NET IDE。估计我也算是国内第一批使用到Visual Studio.NET IDE的人吧,那时候微软俱乐部在校园里免费发送Vs.net beta版光盘,我拿到了一份,并第一时间体验了vs.net。Visual Studio .NET与之前的VS 6.0有着天壤之别,功能强大,界面也做了重新设计,支持微软的各种语言,包括C#、C/C++(包括managed c++)、VB、ASP.net等,并在一年后的正式版发布后,逐渐在桌面应用程序开发中成为霸主,把那个时候在IDE领域的竞争对手Borland公司彻底打垮。但Visual Studio从此也变得更加庞大和臃肿,安装一个VS,没有几个G空间是不行的。想想那个时候机器的配置,跑个VS.net还真是心有而力不足。

工作之后,进入服务端编程领域,结识了Unix、Linux以及VimGCC,就再也没怎么碰过Visual Studio。随着工作OS也从Windows切换到Ubuntu,基本就和VS绝缘了。之后随着Java语言成为企业级应用的主角、Web时代的到来以及开源IDE(比如:Eclipse)的兴起,微软的Visual Studio不再那么耀眼,或者说是人们对于IDE的关注并不像开发GUI程序那个年代那么强烈了。但鉴于微软自身产品体系的庞大,VS始终在市场中占有一席之地。

而近些年,一些跨平台、轻量级、插件结构、支持智能感知、可随意定制的文本编辑器的出现,比如:Sublime TextAtom等让开发人员喜不自禁。这些编辑器并非定位于IDE,但功能又不输给IDE很多,尤其在支持编码、调试这些环节,它们完全可以与专业IDE媲美,但资源消耗却是像Visual Studio、Eclipse这样大而全的IDE所无法匹敌的。而Visual Studio Code恰是微软在这方面的一个尝试,也是微软最新公司战略的体现之一:拥抱所有开发者(不仅仅是Windows上的哦)。

二、VSCode安装

VSCode发布于2015年4月的Build大会上。发布后,迅速得到开发者响应,大家普遍反映:VSCode性能不错、关注细节、体验良好,虽然当时VSCode的插件还不算丰富。一年多过去后,VSCode已经演化到了1.8.1版本(截至2016年12月末),支持所有主流编程语言的开发,配套的插件也十分丰富了。VSCode的安装简单的很,这一向都是微软的强项,你可以在其官方站上下载到各个平台的安装包(Linux平台也有.deb/.rpm两种包格式供选择,并提供32bit和64bit两种版本)。下载后安装即可。

1、VSCode配置和数据存储路径

VSCode安装后,一般不必关心其配置和数据存储路径的位置。但作为有一些Geek精神的developer来说,弄清楚其安装和配置的来龙去脉还是很有意义的。

在Mac上:

VSCode存储运行数据和配置文件的目录在:~/Library/Application Support/Code下:

 ~/Library/Application Support/Code]$ls
Backups/        CachedData/        Cookies-journal        Local Storage/        User/
Cache/            Cookies            GPUCache/        Preferences        storage.json

$ls User
keybindings.json    locale.json        settings.json        snippets/        workspaceStorage/

在Ubuntu中:

VSCode存储运行数据和配置文件的目录在~/.config/Code下面:

~/.config/Code$ ls
Backups  Cache  CachedData  Cookies  Cookies-journal  GPUCache  Local Storage  storage.json  User

至于Windows平台,请自行探索^_^。

2、启动方式

VSCode有两种启动方式:桌面启动和命令行启动。桌面启动自不必说了。命令行启动的示例如下:

$ code main.go

code命令会打开一个VSCode窗口并加载命令参数中的文件内容,这里是main.go。

三、VSCode的配置

一般来说,VSCode启动即可用了。但要想发挥出VSCode的能量,我们必须对其进行一番配置。VSCode的配置有几十上百项,这里无法全覆盖,仅说明一下我个人比较关注的。

1、安装插件

像VSCode这种小清新文本编辑器要想对编程语言有很好的支持,必须安装相应语言的插件。以Go为例,我们至少要安装vscode-go插件。vscode-go之于VSCode,就好比vim-go之于VIM。并且和vim-go类似,vscode-go实现的各种Features也是依赖诸多已存在的Go周边工具,包括:

gocode: go get -u -v github.com/nsf/gocode
godef: go get -u -v github.com/rogpeppe/godef
gogetdoc: go get -u -v github.com/zmb3/gogetdoc
golint: go get -u -v github.com/golang/lint/golint
go-outline: go get -u -v github.com/lukehoban/go-outline
goreturns: go get -u -v sourcegraph.com/sqs/goreturns
gorename: go get -u -v golang.org/x/tools/cmd/gorename
gopkgs: go get -u -v github.com/tpng/gopkgs
go-symbols: go get -u -v github.com/newhook/go-symbols
guru: go get -u -v golang.org/x/tools/cmd/guru
gotests: go get -u -v github.com/cweill/gotests/...

因此,要想实现vscode-go官网页面中demo中哪些神奇的Feature,你必须将上面的这些依赖工具逐一安装成功。如果缺少一个依赖工具,VSCode会在窗口右下角的状态栏里显示:“Analysis Tools Missing”字样,以提示你安装这些工具。

VSCode当然也支持Vim-mode的编辑模式,如果你也和我一样,喜欢用vim-mode在VSCode中进行编辑,可以安装VSCodeVim插件

VSCode的插件安装方式分为两种:在线安装和VSIX方式安装。

在线安装,顾名思义,即在VSCode的窗口左侧边栏中点击“Extensions”按钮,在打开的Extensions搜索框中搜索你想要的插件名称,或者选择预制的条件获得插件信息。选中你要安装的插件,点击“Install”按钮即可完成安装。

VSIX安装:即到插件官网将插件文件下载到本地(插件安装文件一般以.vsix或.zip结尾),在窗口中选择:”Install from VSIX…”,选择你下载的插件文件即可。

安装后的插件都被放在~/.vscode/extensions目录下(mac和linux)。

2、更改语言设置

VSCode在初次启动时会判断当前系统语言,并以相应的语言作为默认窗口显示语言。比如:我的是中文OS X系统,那么默认VSCode的窗口文字都是中文。如果我要将其改为英文,应该如何操作呢?

F1登场!这里的F1可不是赛车比赛,而是快捷键F1,估计也是整个VSCode最常用的快捷键之一了。敲击F1后,VSCode会显示其“Command Palette”输入框,这里面包含了当前VSCode可以执行的所有操作命令,支持Search。我们输入”language”,在搜索结果中选择“Configure Language”,VSCode打开一个新的编辑窗口,加载~/Library/Application Support/Code/User/locale.json文件:

{
    // 定义 VSCode 的显示语言。
    // 请参阅 https://go.microsoft.com/fwlink/?LinkId=761051,了解支持的语言列表。
    // 要更改值需要重启 VSCode。
    "locale": "zh-cn"
}

当前语言为中文,如果我们要将其改为英文,则修改该文件中的”locale”项:

{
    // 定义 VSCode 的显示语言。
    // 请参阅 https://go.microsoft.com/fwlink/?LinkId=761051,了解支持的语言列表。
    // 要更改值需要重启 VSCode。
    "locale": "en-US"
}

保存,重启VSCode。再次启动的VSCode将会以英文界面示人了。

3、User Settings和Workspace Settings

UserSettings是一种“全局”设置,而Workspace Settings则顾名思义,是一种针对一个特定目录或project的设置。

UserSettings设置后的数据保存在~/Library/Application Support/Code下(以mac为例),而Workspace Setting设置后的数据则保存在某个项目特定目录下的.vscode目录下。

在菜单栏,选择【Preferences -> User Settings】可以打开~/Library/Application Support/Code/User/settings.json文件。默认情况下,该文件为空。VSCode采用默认设置。如果你要个性化设置,那么可将对应的配置项copy一份到settings.json中,并赋予其新值,保存即可。新值将覆盖默认值。以字体大小为例,我们将默认的editor.fontSize 12改为10:

// Place your settings in this file to overwrite the default settings
{
    "editor.fontSize": 10,
}

保存后,可以看到窗口中所有文字的Size都变小了。

在菜单栏,选择【Preferences -> Workspace Settings】可打开当前工作目录下的.vscode的settings.json文件,其工作原理和配置方法与User Settings一样,只是生效范围仅限于该工作区范畴。

4、Color Theme

VSCode内置了主流的配色方案,比如:monokai、solarized dark/light等。F1,输入”color”搜索,选择:“Perefences: Color Theme”(在MAC上也可以用cmd+k, cmd+t打开),在下拉列表中选择你喜欢的配色Theme即可,即可生效。

四、vscode-go的使用

前面说过,和vim-go一样,vscode-go插件实现了Go编码中需要的各种功能:自动format、自动增删import、build on save、lint on save、定义跳转、原型信息快速提示、自动补全、code snippets等。另外它通过带颜色的波浪线提示代码问题(虽然有时候反应有点慢),包括语法问题、不符合idiomatic go规则的问题(比如appId这个命名,它会建议你改为appID)等。

code snippets非常好用,内置的code snippets在~/.vscode/extensions/lukehoban.Go-0.6.51/snippets/go.json中可以找到,类似这样的定义:

//~/.vscode/extensions/lukehoban.Go-0.6.51/snippets/go.json
{
        ".source.go": {
                "single import": {
                        "prefix": "im",
                        "body": "import \"${1:package}\""
                },
                "multiple imports": {
                        "prefix": "ims",
                        "body": "import (\n\t\"${1:package}\"\n)"
                },
                "single constant": {
                        "prefix": "co",
                        "body": "const ${1:name} = ${2:value}"
                },
                "multiple constants": {
                        "prefix": "cos",
                        "body": "const (\n\t${1:name} = ${2:value}\n)"
                },
                "type interface declaration": {
                        "prefix": "tyi",
                        "body": "type ${1:name} interface {\n\t$0\n}"
                },
                "type struct declaration": {
                        "prefix": "tys",
                        "body": "type ${1:name} struct {\n\t$0\n}"
                },
                "package main and main function": {
                        "prefix": "pkgm",
                        "body": "package main\n\nfunc main() {\n\t$0\n}"
                },
... ...

敲入”prefix”的值,比如”ims”,输入tab,vscode-go将为你展开为:

import (
    "package"
)

在使用vscode时遇到过一次代码自动补全“失灵”的问题。vscode-go只会提示:”PANIC,PANIC,PANIC”。经查,这个是gocode daemon的问题,我的解决方法是:

gocode close //关闭gocode daemon
gocode -s &  //重启之。

五、小结

在诸多轻量级编辑器中,我还是比较看好vscode的,毕竟其背后有着Microsoft积淀多年的IDE产品开发经验。并且和Microsoft以往产品最大的不同就是其是开源项目。

关于Vscode的使用和奇技淫巧可以参见其官方的这篇文档“VS Code Tips and Tricks”。

关于Vscode的各种周边工具和资料列表,请参考Awesome-vscode项目

快捷键往往是开发人员的最爱,VSCode官方制作了三个平台的VSCode的快捷键worksheet:

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-windows.pdf

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-macos.pdf

https://code.visualstudio.com/shortcuts/keyboard-shortcuts-linux.pdf

VSCode还在快速发展,离完善还有不小提升空间。比如:在使用过程中也发现了VSCode 窗口无响应或代码编辑错乱之情况。不过作为Go编码的一个辅助编辑器,VSCode还是完全胜任和超出预期的。

Kubernetes集群的安全配置

使用kubernetes/cluster/kube-up.sh脚本在装有Ubuntu操作系统的bare metal上搭建的Kubernetes集群并不安全,甚至可以说是“完全不设防的”,这是因为Kubernetes集群的核心组件:kube-apiserver启用了insecure-port。insecure-port背后的api server默认完全信任访问该端口的流量,内部无任何安全机制。并且监听insecure-port的api server bind的insecure-address为0.0.0.0。也就是说任何内外部请求,都可以通过insecure-port端口任意操作Kubernetes集群。我们的平台虽小,但“裸奔”的k8s集群也并不是我们想看到的,适当的安全配置是需要的。

在本文中,我将和大家一起学习一下Kubernetes提供的安全机制,并通过安全配置调整,实现K8s集群的“有限”安全。

一、集群现状

我们先来“回顾”一下集群现状,为后续配置调整提供一个可回溯和可比对的“基线”。

1、Nodes

集群基本信息:

# kubectl cluster-info
Kubernetes master is running at http://10.47.136.60:8080
KubeDNS is running at http://10.47.136.60:8080/api/v1/proxy/namespaces/kube-system/services/kube-dns

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

当前集群逻辑上由一个master node和两个worker nodes组成:

单master: 10.47.136.60
worker nodes: 10.47.136.60和10.46.181.146

# kubectl get node --show-labels=true
NAME            STATUS    AGE       LABELS
10.46.181.146   Ready     41d       beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=10.46.181.146
10.47.136.60    Ready     41d       beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=10.47.136.60
2、kubernetes核心组件的启动参数

我们再来明确一下当前集群中各k8s核心组件的启动参数,这些参数决定着组件背后的行为:

master node & worker node1 – 10.47.136.60上:

root       22000       1  0 Oct17 ?        03:52:55 /opt/bin/kube-controller-manager --master=127.0.0.1:8080 --root-ca-file=/srv/kubernetes/ca.crt --service-account-private-key-file=/srv/kubernetes/server.key --logtostderr=true

root       22021       1  1 Oct17 ?        17:11:15 /opt/bin/kube-apiserver --insecure-bind-address=0.0.0.0 --insecure-port=8080 --etcd-servers=http://127.0.0.1:4001 --logtostderr=true --service-cluster-ip-range=192.168.3.0/24 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota --service-node-port-range=30000-32767 --advertise-address=10.47.136.60 --client-ca-file=/srv/kubernetes/ca.crt --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key

root       22121       1  0 Oct17 ?        00:22:30 /opt/bin/kube-scheduler --logtostderr=true --master=127.0.0.1:8080

root     2140405       1  0 Nov15 ?        00:05:26 /opt/bin/kube-proxy --hostname-override=10.47.136.60 --master=http://10.47.136.60:8080 --logtostderr=true

root     1912455       1  1 Nov15 ?        03:43:09 /opt/bin/kubelet --hostname-override=10.47.136.60 --api-servers=http://10.47.136.60:8080 --logtostderr=true --cluster-dns=192.168.3.10 --cluster-domain=cluster.local --config=

worker node2 – 10.46.181.146上:

root      7934     1  1 Nov15 ?        03:06:00 /opt/bin/kubelet --hostname-override=10.46.181.146 --api-servers=http://10.47.136.60:8080 --logtostderr=true --cluster-dns=192.168.3.10 --cluster-domain=cluster.local --config=
root     23026     1  0 Nov15 ?        00:04:49 /opt/bin/kube-proxy --hostname-override=10.46.181.146 --master=http://10.47.136.60:8080 --logtostderr=true

从master node的核心组件kube-apiserver 的启动命令行参数也可以看出我们在开篇处所提到的那样:apiserver insecure-port开启,且bind 0.0.0.0:8080,可以任意访问,连basic_auth都没有。当然api server不只是监听这一个端口,在api server源码中,我们可以看到默认情况下,apiserver还监听了另外一个secure port,该端口的默认值是6443,通过lsof命令查看6443端口的监听进程也可以印证这一点:

//master node上

# lsof -i tcp:6443
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
kube-apis 22021 root   46u  IPv6 921529      0t0  TCP *:6443 (LISTEN)
3、私钥文件和公钥证书

通过安装脚本在bare-metal上安装的k8s集群,在master node上你会发现如下文件:

root@node1:/srv/kubernetes# ls
ca.crt  kubecfg.crt  kubecfg.key  server.cert  server.key

这些私钥文件和公钥证书是在k8s(1.3.7)集群安装过程由安装脚本创建的,在kubernetes/cluster/common.sh中你可以发现function create-certs这样一个函数,这些文件就是它创建的。

# Create certificate pairs for the cluster.
# $1: The public IP for the master.
#
# These are used for static cert distribution (e.g. static clustering) at
# cluster creation time. This will be obsoleted once we implement dynamic
# clustering.
#
# The following certificate pairs are created:
#
#  - ca (the cluster's certificate authority)
#  - server
#  - kubelet
#  - kubecfg (for kubectl)
#
# TODO(roberthbailey): Replace easyrsa with a simple Go program to generate
# the certs that we need.
#
# Assumed vars
#   KUBE_TEMP
#
# Vars set:
#   CERT_DIR
#   CA_CERT_BASE64
#   MASTER_CERT_BASE64
#   MASTER_KEY_BASE64
#   KUBELET_CERT_BASE64
#   KUBELET_KEY_BASE64
#   KUBECFG_CERT_BASE64
#   KUBECFG_KEY_BASE64
function create-certs {
  local -r primary_cn="${1}"
  ... ...

}

简单描述一下这些文件的用途:

- ca.crt:the cluster's certificate authority,CA证书,即根证书,内置CA公钥,用于验证某.crt文件,是否是CA签发的证书;
- server.cert:kube-apiserver服务端公钥数字证书;
- server.key:kube-apiserver服务端私钥文件;
- kubecfg.crt 和kubecfg.key:按照 create-certs函数注释中的说法:这两个文件是为kubectl访问apiserver[双向证书验证](http://tonybai.com/2015/04/30/go-and-https/)时使用的。

不过,这里我们没有CA的key,无法签发新证书,如果要用这几个文件,那么就仅能限于这几个文件。我们可以利用kubecfg.crt 和kubecfg.key 作为访问api server的client端的key和crt使用。我们来查看一下这几个文件:

查看ca.crt:

#openssl x509 -noout -text -in ca.crt
... ...
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 16946557986148168970 (0xeb2e44b3a1ebb50a)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=10.47.136.60@1476362758
        Validity
            Not Before: Oct 13 12:45:58 2016 GMT
            Not After : Oct 11 12:45:58 2026 GMT
        Subject: CN=10.47.136.60@1476362758
... ..

查看server.cert:

...
 Data:
        Version: 3 (0x2)
        Serial Number: 1 (0x1)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=10.47.136.60@1476362758
        Validity
            Not Before: Oct 13 12:45:59 2016 GMT
            Not After : Oct 11 12:45:59 2026 GMT
        Subject: CN=kubernetes-master
...

查看kubecfg.crt:

...
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2 (0x2)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=10.47.136.60@1476362758
        Validity
            Not Before: Oct 13 12:45:59 2016 GMT
            Not After : Oct 11 12:45:59 2026 GMT
        Subject: CN=kubecfg
...

再来验证一下server.cert和kubecfg.crt是否是ca.crt签发的:

# openssl verify -CAfile ca.crt kubecfg.crt
kubecfg.crt: OK

# openssl verify -CAfile ca.crt server.cert
server.cert: OK

在前面的apiserver的启动参数展示中,我们已经看到kube-apiserver使用了ca.crt, server.cert和server.key:

/opt/bin/kube-apiserver --insecure-bind-address=0.0.0.0 --insecure-port=8080 --etcd-servers=http://127.0.0.1:4001 --logtostderr=true --service-cluster-ip-range=192.168.3.0/24 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota --service-node-port-range=30000-32767 --advertise-address=10.47.136.60 --client-ca-file=/srv/kubernetes/ca.crt --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key

在后续章节中,我们还会详细说明这些密钥和公钥证书在K8s集群安全中所起到的作用。

二、集群环境

还是那句话,Kubernetes在active development中,老版本和新版本的安全机制可能有较大变动,本篇中的配置方案和步骤都是针对一定环境有效的,我们的环境如下:

OS:
Ubuntu 14.04.4 LTS Kernel:3.19.0-70-generic #78~14.04.1-Ubuntu SMP Fri Sep 23 17:39:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Docker:
# docker version
Client:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:00:50 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 17:00:50 2016
 OS/Arch:      linux/amd64

Kubernetes集群:1.3.7

私有镜像仓库:阿里云镜像仓库

三、目标

目前,我们尚不具备一步迈向“绝对安全”的能力,在目标设定时,我们的一致想法是在当前阶段“有限安全”的K8s集群更适合我们。在这一原则下,我们针对不同情况提出不同的目标设定。

前面说过,k8s针对insecure port(–insecure-bind-address=0.0.0.0 –insecure-port=8080)的流量没有任何安全机制限制,相当于k8s“裸奔”。但是走k8s apiserver secure port(–bind-address=0.0.0.0 –secure-port=6443)的流量,将会遇到验证、授权等安全机制的限制。具体使用哪个端口与API server的交互方式,要视情况而定。

在分情况说明之前,将api server的insecure port的bind address由0.0.0.0改为local address是必须要做的。

1、Cluster -> Master(apiserver)

从集群到Apiserver的流量也可以细分为几种情况:

a) kubernetes component on master node -> apiserver

由于master node上的components与apiserver运行在一台机器上,因此可以通过local address的insecure-port访问apiserver,无需走insecure port。从现状中当前master上的component组件的启动参数来看,目前已经符合要求,于是针对这些components,我们无需再做配置上的调整。

b) kubernetes component on worker node -> apiserver

目标是实现kubernetes components on worker node和运行于master上的apiserver之间的基于https的双向认证。kubernetes的各个组件均支持在命令行参数中传入tls相关参数,比如ca文件路径,比如client端的cert文件和key等。

c) componet in pod for kubernetes -> apiserver

像kube dns和kube dashboard这些运行于pod中的k8s 组件也是在k8s cluster范围内调度的,它们可能运行在任何一个worker node上。理想情况下,它们与master上api server的通信也应该是基于一定安全机制的。不过在本篇中,我们暂时不动它们的设置,以免对其他目标的实现造成一定障碍和更多的工作量,在后续文章中,可能会专门将dns和dashboard拿出来做安全加固说明。因此,dns和dashboard在这里仍然使用的是insecure-port:

root     10531 10515  0 Nov15 ?        00:03:02 /dashboard --port=9090 --apiserver-host=http://10.47.136.60:8080
root     2018255 2018240  0 Nov15 ?        00:03:50 /kube-dns --domain=cluster.local. --dns-port=10053 --kube-master-url=http://10.47.136.60:8080
d) user service in pod -> apiserver

我们的集群管理程序也是以service的形式运行在k8s cluster中的,这些程序如何访问apiserver才是我们关心的重点,我们希望管理程序通过secure-port,在一定的安全机制下与apiserver交互。

2、Master(apiserver) -> Cluster

apiserver作为client端访问Cluster,在k8s文档中,这个访问路径主要包含两种情况:

a) apiserver与各个node上kubelet交互,采集Pod的log;
b) apiserver通过自身的proxy功能访问node、pod以及集群中的各种service。

在“有限安全”的原则下,我们暂不考虑这种情况下的安全机制。

四、Kubernetes的安全机制

kube-apiserver是整个kubernetes集群的核心,无论是kubectl还是通过api管理集群,最终都会落到与kube-apiserver的交互,apiserver是集群管理命令的入口。kube-apiserver同时监听两个端口:insecure-port和secure-port。之前提到过:通过insecure-port进入apiserver的流量可以有控制整个集群的全部权限;而通过secure-port的流量将经过k8s的安全机制的重重考验,这也是这一节我们重要要说明的。insecure-port的存在一般是为了集群bootstrap或集群开发调试使用的。官方文档建议:集群外部流量都应该走secure port。insecure-port可通过firewall rule使外部流量unreachable。

下面这幅官方图示准确解释了通过secure port的流量将要通过的“安全关卡”:

img{512x368}

我们可以看到外界到APIServer的请求先后经过了:

安全通道(tls) -> Authentication(身份验证) -> Authorization(授权)-> Admission Control(入口条件控制)
  • 安全通道:即基于tls的https的安全通道建立,对流量进行加密,防止嗅探、身份冒充和篡改;

  • Authentication:即身份验证,这个环节它面对的输入是整个http request。它负责对来自client的请求进行身份校验,支持的方法包括:client证书验证(https双向验证)、basic auth、普通token以及jwt token(用于serviceaccount)。APIServer启动时,可以指定一种Authentication方法,也可以指定多种方法。如果指定了多种方法,那么APIServer将会逐个使用这些方法对客户端请求进行验证,只要请求数据通过其中一种方法的验证,APIServer就会认为Authentication成功;

  • Authorization:授权。这个阶段面对的输入是http request context中的各种属性,包括:user、group、request path(比如:/api/v1、/healthz、/version等)、request verb(比如:get、list、create等)。APIServer会将这些属性值与事先配置好的访问策略(access policy)相比较。APIServer支持多种authorization mode,包括AlwaysAllow、AlwaysDeny、ABAC、RBAC和Webhook。APIServer启动时,可以指定一种authorization mode,也可以指定多种authorization mode,如果是后者,只要Request通过了其中一种mode的授权,那么该环节的最终结果就是授权成功。

  • Admission Control:从技术的角度看,Admission control就像a chain of interceptors(拦截器链模式),它拦截那些已经顺利通过authentication和authorization的http请求。http请求沿着APIServer启动时配置的admission control chain顺序逐一被拦截和处理,如果某个interceptor拒绝了该http请求,那么request将会被直接reject掉,而不是像authentication或authorization那样有继续尝试其他interceptor的机会。

五、实现安全传输通道(https)与身份校验(authentication)

在建立安全传输通道、身份校验环节,我们根据”目标“设定一节中的分类,也分为三种情况:

a) 运行于master上的核心k8s components走insecure port,这个暂不用修改配置;
b) worker node上的k8s组件配置通过insecure-port访问,并采用https双向认证的身份验证机制;
c) pod in k8s访问apiserver,通过https+ basic auth的方式进行身份验证。

APIServer直接使用了集群创建时创建的ca.crt、server.cert和server.key,由于没有ca.key,所以我们只能直接利用其它两个文件: kubecfg.key和kubecfg.crt作为客户端的私钥文件和公钥证书。当然你也可以手动重新创建ca,并将apiserver使用的.key、.crt以及各个components的client.key和client.crt都生成一份,并用你生成的Ca签发。这里我们就偷个懒儿了。

在开始之前,我们再来看看apiserver的启动参数:

root       22021       1  1 Oct17 ?        17:11:15 /opt/bin/kube-apiserver --insecure-bind-address=0.0.0.0 --insecure-port=8080 --etcd-servers=http://127.0.0.1:4001 --logtostderr=true --service-cluster-ip-range=192.168.3.0/24 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota --service-node-port-range=30000-32767 --advertise-address=10.47.136.60 --client-ca-file=/srv/kubernetes/ca.crt --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key

由于之前简述了Kubernetes的安全机制,于是我们对这些参数又有了进一步认识

https安全通道建立阶段:端口6443(通过 /opt/bin/kube-apiserver --help查看options说明可以得到),公钥证书server.cert ,私钥文件:server.key。
Authentication阶段:从当前启动参数中,我们仅能看到一种机制:--client-ca-file=/srv/kubernetes/ca.crt,也就是client证书校验机制。apiserver会用/srv/kubernetes/ca.crt对client端发过来的client.crt进行验证。
Authorization阶段:通过 /opt/bin/kube-apiserver --help查看options说明可以得到:--authorization-mode="AlwaysAllow",也就是说在这一环节,所有Request都可以顺利通过。
Admission Control阶段:apiserver指定了“NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota”这样一个interceptor链。

我们首先来测试一下通过kubecfg.key和kubecfg.crt访问APIServer的insecure-port,验证一下kubecfg.key和kubecfg.crt作为client端私钥文件和公钥证书的可行性:

# curl https://10.47.136.60:6443/version --cert /srv/kubernetes/kubecfg.crt --key /srv/kubernetes/kubecfg.key --cacert /srv/kubernetes/ca.crt
{
  "major": "1",
  "minor": "3",
  "gitVersion": "v1.3.7",
  "gitCommit": "a2cba278cba1f6881bb0a7704d9cac6fca6ed435",
  "gitTreeState": "clean",
  "buildDate": "2016-09-12T23:08:43Z",
  "goVersion": "go1.6.2",
  "compiler": "gc",
  "platform": "linux/amd64"
}

接下来,我们就来开始调整k8s配置。

第一个场景:components on worker node -> master

worker node上有两个k8s components:kubelet和kube-proxy,当前它们的启动参数为:

root      7934     1  1 Nov15 ?        03:33:35 /opt/bin/kubelet --hostname-override=10.46.181.146 --api-servers=http://10.47.136.60:8080 --logtostderr=true --cluster-dns=192.168.3.10 --cluster-domain=cluster.local --config=
root      8140     1  0 14:59 ?        00:00:00 /opt/bin/kube-proxy --hostname-override=10.46.181.146 --master=http://10.47.136.60:8080 --logtostderr=true

我们将ca.crt、kubecfg.key和kubecfg.crt scp到其他各个Worker node的/srv/kubernetes目录下:

root@node1:/srv/kubernetes# scp ca.crt root@10.46.181.146:/srv/kubernetes
ca.crt                                                                                                                                        100% 1220     1.2KB/s   00:00
root@node1:/srv/kubernetes# scp kubecfg.crt root@10.46.181.146:/srv/kubernetes
kubecfg.crt                                                                                                                                   100% 4417     4.3KB/s   00:00
root@node1:/srv/kubernetes# scp kubecfg.key root@10.46.181.146:/srv/kubernetes
kubecfg.key

在worker node: 10.46.181.146上:

# ls -l
total 16
-rw-r----- 1 root root 1220 Nov 25 15:51 ca.crt
-rw------- 1 root root 4417 Nov 25 15:51 kubecfg.crt
-rw------- 1 root root 1708 Nov 25 15:51 kubecfg.key

创建worker node上kubelet和kube-proxy所要使用的config文件:/root/.kube/config

/root/.kube/config

apiVersion: v1
kind: Config
preferences: {}
users:
- name: kubecfg
  user:
    client-certificate: /srv/kubernetes/kubecfg.crt
    client-key: /srv/kubernetes/kubecfg.key
clusters:
- cluster:
    certificate-authority: /srv/kubernetes/ca.crt
  name: ubuntu
contexts:
- context:
    cluster: ubuntu
    user: kubecfg
  name: ubuntu
current-context: ubuntu

这个文件参考了master node上的/root/.kube/config文件的格式,你也可以在master node上使用kubectl config view查看config文件内容:

# kubectl config view
apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: http://10.47.136.60:8080
  name: ubuntu
contexts:
- context:
    cluster: ubuntu
    user: ubuntu
  name: ubuntu
current-context: ubuntu
kind: Config
preferences: {}
users:
- name: ubuntu
  user:
    password: xxxxxA
    username: admin

Worker node上/root/.kube/config中的user.name使用的是kubecfg,这也是在前面查看kubecfg.crt时,kubecfg.crt在/CN域中使用的值。

接下来我们来修改worker node上的/etc/default/kubelet文件:

KUBELET_OPTS=" --hostname-override=10.46.181.146  --api-servers=https://10.47.136.60:6443 --logtostderr=true  --cluster-dns=192.168.3.10  --cluster-domain=cluster.local  --kubeconfig=/root/.kube/config"
#KUBELET_OPTS=" --hostname-override=10.46.181.146  --api-servers=http://10.47.136.60:8080  --logtostderr=true  --cluster-dns=192.168.3.10  --cluster-domain=cluster.local  --config=  "

在worker node上重启kubelet并查看/var/log/upstart/kubelet.log:

# service kubelet restart
kubelet stop/waiting
kubelet start/running, process 9716

///var/log/upstart/kubelet.log
... ...
I1125 16:12:26.332652    9716 server.go:784] Watching apiserver
W1125 16:12:26.338581    9716 kubelet.go:572] Hairpin mode set to "promiscuous-bridge" but configureCBR0 is false, falling back to "hairpin-veth"
I1125 16:12:26.338641    9716 kubelet.go:393] Hairpin mode set to "hairpin-veth"
I1125 16:12:26.366600    9716 docker_manager.go:235] Setting dockerRoot to /var/lib/docker
I1125 16:12:26.367067    9716 server.go:746] Started kubelet v1.3.7
E1125 16:12:26.369508    9716 kubelet.go:954] Image garbage collection failed: unable to find data for container /
I1125 16:12:26.370534    9716 fs_resource_analyzer.go:66] Starting FS ResourceAnalyzer
I1125 16:12:26.370567    9716 status_manager.go:123] Starting to sync pod status with apiserver
I1125 16:12:26.370601    9716 kubelet.go:2501] Starting kubelet main sync loop.
I1125 16:12:26.370632    9716 kubelet.go:2510] skipping pod synchronization - [network state unknown container runtime is down]
I1125 16:12:26.370981    9716 server.go:117] Starting to listen on 0.0.0.0:10250
I1125 16:12:26.384336    9716 volume_manager.go:227] Starting Kubelet Volume Manager
I1125 16:12:26.480387    9716 factory.go:295] Registering Docker factory
I1125 16:12:26.480483    9716 factory.go:54] Registering systemd factory
I1125 16:12:26.481446    9716 factory.go:86] Registering Raw factory
I1125 16:12:26.482888    9716 manager.go:1072] Started watching for new ooms in manager
I1125 16:12:26.484242    9716 oomparser.go:200] OOM parser using kernel log file: "/var/log/kern.log"
I1125 16:12:26.485330    9716 manager.go:281] Starting recovery of all containers
I1125 16:12:26.562959    9716 kubelet.go:1213] Node 10.46.181.146 was previously registered
I1125 16:12:26.712150    9716 manager.go:286] Recovery completed

一次点亮!

再来修改worker node上kube-proxy的配置:/etc/default/kube-proxy:

// /etc/default/kube-proxy
KUBE_PROXY_OPTS=" --hostname-override=10.46.181.146  --master=https://10.47.136.60:6443  --logtostderr=true --kubeconfig=/root/.kube/config"
#KUBE_PROXY_OPTS=" --hostname-override=10.46.181.146  --master=http://10.47.136.60:8080  --logtostderr=true  "

在worker node上重启kube-proxy并查看/var/log/upstart/kube-proxy.log:

# service kube-proxy restart
kube-proxy stop/waiting
kube-proxy start/running, process 26185

// /var/log/upstart/kube-proxy.log
I1125 16:30:28.224491   26185 server.go:202] Using iptables Proxier.
I1125 16:30:28.228067   26185 server.go:214] Tearing down userspace rules.
I1125 16:30:28.245634   26185 conntrack.go:40] Setting nf_conntrack_max to 65536
I1125 16:30:28.247422   26185 conntrack.go:57] Setting conntrack hashsize to 16384
I1125 16:30:28.249456   26185 conntrack.go:62] Setting nf_conntrack_tcp_timeout_established to 86400

从日志上看不出有啥异常,算是成功!:)

第二个场景:pod in cluster -> master

通过阅读K8s的官方文档“Accessing the api from a pod”,我们知道K8s cluster为Pod访问API Server做了很多“预备”工作,最重要的一点就是在Pod被创建的时候,一个serviceaccount 被自动mount到/var/run/secrets/kubernetes.io/serviceaccount路径下:

#kubectl describe pod/my-golang-1147314274-0qms5

Name:        my-golang-1147314274-0qms5
Namespace:    default
Node:        10.47.136.60/10.47.136.60
Start Time:    Thu, 24 Nov 2016 14:59:52 +0800
Labels:        pod-template-hash=1147314274
        run=my-golang
Status:        Running
IP:        172.16.99.9
... ...

Containers:
  my-golang:
    ... ...
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-40z0x (ro)
    Environment Variables:    <none>
... ...
Volumes:
  default-token-40z0x:
    Type:    Secret (a volume populated by a Secret)
    SecretName:    default-token-40z0x
QoS Class:    BestEffort
Tolerations:    <none>

serviceaccount顾名思义,是Pod中程序访问APIServer所要使用的账户信息,我们来看看都有啥:

# kubectl get serviceaccount
NAME      SECRETS   AGE
default   1         43d

# kubectl describe serviceaccount/default
Name:        default
Namespace:    default
Labels:        <none>

Image pull secrets:    <none>

Mountable secrets:     default-token-40z0x

Tokens:                default-token-40z0x

# kubectl describe secret/default-token-40z0x
Name:        default-token-40z0x
Namespace:    default
Labels:        <none>
Annotations:    kubernetes.io/service-account.name=default
        kubernetes.io/service-account.uid=90de59ad-9120-11e6-a0a6-00163e1625a9

Type:    kubernetes.io/service-account-token

Data
====
ca.crt:        1220 bytes
namespace:    7 bytes
token:        {Token data}

mount到Pod中/var/run/secrets/kubernetes.io/serviceaccount路径下的default-token-40z0x volume包含三个文件:

  • ca.crt:CA的公钥证书
  • namspace文件:里面的内容为:”default”
  • token:用在Pod访问APIServer时候的身份验证。

理论上,使用这些信息Pod可以成功访问APIServer,我们来测试一下。注意在Pod的世界中,APIServer也是一个Service,通过kubectl get service可以看到:

# kubectl get services
NAME           CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes     192.168.3.1     <none>        443/TCP    43d

kubernetes这个Service监听的端口是443,也就是说在Pod的视角中,APIServer暴露的仅仅是insecure-port。并且使用”kubernetes”这个名字,我们可以通过kube-dns获得APIServer的ClusterIP。

启动一个基于golang:latest的pod,pod.yaml如下:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-golang
spec:
  replicas: 1
  template:
    metadata:
      labels:
        run: my-golang
    spec:
      containers:
      - name: my-golang
        image: golang:latest
        command: ["tail", "-f", "/var/log/bootstrap.log"]

Pod启动后,docker exec -it container-id /bin/bash切入container,并执行如下命令:

# TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
# curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes:443/version -H "Authorization: Bearer $TOKEN"
Unauthorized

查看API Server的log:

E1125 17:30:22.504059 2743425 handlers.go:54] Unable to authenticate the request due to an error: crypto/rsa: verification error

似乎是验证token失败。这个问题在kubernetes的github issue中也有被提及,目前尚未解决。

不过仔细想了想,如果每个Pod都默认可以访问APIServer,显然也是不安全的,虽然我们可以通过authority和admission control对默认的token访问做出限制,但总感觉不那么“安全”。

我们来试试basic auth方式(这种方式的弊端是API Server运行中,无法在运行时动态更新auth文件,对于auth文件的修改,必须重启APIServer后生效)。

我们首先在APIServer侧为APIServer创建一个basic auth file:

// /srv/kubernetes/basic_auth_file
admin123,admin,admin

basic_auth_file中每一行的格式:password,username,useruid

修改APIServer的启动参数,将basic_auth_file传入并重启apiserver:

KUBE_APISERVER_OPTS=" --insecure-bind-address=10.47.136.60 --insecure-port=8080 --etcd-servers=http://127.0.0.1:4001 --logtostderr=true --service-cluster-ip-range=192.168.3.0/24 --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,SecurityContextDeny,ResourceQuota --service-node-port-range=30000-32767 --advertise-address=10.47.136.60 --basic-auth-file=/srv/kubernetes/basic_auth_file --client-ca-file=/srv/kubernetes/ca.crt --tls-cert-file=/srv/kubernetes/server.cert --tls-private-key-file=/srv/kubernetes/server.key"

我们在Pod中使用basic auth访问API Server:

# curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes:443/version -basic -u admin:admin123
{
  "major": "1",
  "minor": "3",
  "gitVersion": "v1.3.7",
  "gitCommit": "a2cba278cba1f6881bb0a7704d9cac6fca6ed435",
  "gitTreeState": "clean",
  "buildDate": "2016-09-12T23:08:43Z",
  "goVersion": "go1.6.2",
  "compiler": "gc",
  "platform": "linux/amd64"
}

Pod to APIServer authentication成功了。

六、小结

再重申一次:上述配置不是绝对安全的理想配置方案,只是阶段性满足我目前项目需求的一个“有限安全”方案,大家谨慎参考。

到目前为止,我们的“有限安全”也仅仅做到Authentication这一步,至于Authority和Admission Control,目前尚未有相关实践,可能会在后续的文章中做单独说明。

七、参考资料

  • Master <-> Node Communication – http://kubernetes.io/docs/admin/master-node-communication/
  • Authentication – http://kubernetes.io/docs/admin/authentication/
  • Using Authorization Plugins – http://kubernetes.io/docs/admin/authorization/
  • Accessing the API – http://kubernetes.io/docs/admin/accessing-the-api/
  • Managing Service Accounts – http://kubernetes.io/docs/admin/service-accounts-admin/
  • Authenticating Across Clusters with kubeconfig — http://kubernetes.io/docs/user-guide/kubeconfig-file/
  • Service Accounts — https://docs.openshift.com/enterprise/3.1/dev_guide/service_accounts.html
  • 4S: SERVICES ACCOUNT, SECRET, SECURITY CONTEXT AND SECURITY IN KUBERNETES — http://www.sel.zju.edu.cn/?p=588
  • KUBERNETES APISERVER源码分析——API请求的认证过程 – http://www.sel.zju.edu.cn/?p=609
  • Kubernetes安全配置案例 – http://www.cnblogs.com/breg/p/5923604.html

使用Ceph RBD为Kubernetes集群提供存储卷

一旦走上使用Kubernetes的道路,你就会发现这条路并不好走,充满荆棘。即便你使用Kubernetes建立起的集群规模不大,也是需要“五脏俱全”的,否则你根本无法真正将kubernetes用起来,或者说一个半拉子Kubernetes集群很可能无法满足你要支撑的业务需求。在目前我正在从事的一个产品就是这样,光有K8s还不够,考虑到”有状态服务”的需求,我们还需要给Kubernetes配一个后端存储以支持Persistent Volume机制,使得Pod在k8s的不同节点间调度迁移时,具有持久化需求的数据不会被清除,且Pod中Container无论被调度到哪个节点,始终都能挂载到同一个Volume。

Kubernetes支持多种Volume类型,这里选择Ceph RBD(Rados Block Device)。选择Ceph大致有三个原因:

  • Ceph经过多年开发,已经逐渐步入成熟;
  • Ceph在Ubuntu 14.04.x上安装方便(仅通过apt-get即可),并且在未经任何调优(调优需要你对Ceph背后的原理十分熟悉)的情况下,性能可以基本满足我们需求;
  • Ceph同时支持对象存储、块存储和文件系统接口,虽然这里我们可能仅需要块存储。

即便这样,Ceph与K8s的集成过程依旧少不了“趟坑”,接下来我们就详细道来。

一、环境和准备条件

我们依然使用两个阿里云ECS Node,操作系统以及内核版本为:Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)。

Ceph采用当前Ubuntu 14.04源中最新的Ceph LTS版本:JEWEL10.2.3

Kubernetes版本为上次安装时的1.3.7版本。

二、Ceph安装原理

Ceph分布式存储集群由若干组件组成,包括:Ceph Monitor、Ceph OSD和Ceph MDS,其中如果你仅使用对象存储和块存储时,MDS不是必须的(本次我们也不需要安装MDS),仅当你要用到Cephfs时,MDS才是需要安装的。

Ceph的安装模型与k8s有些类似,也是通过一个deploy node远程操作其他Node以create、prepare和activate各个Node上的Ceph组件,官方手册中给出的示意图如下:

img{}

映射到我们实际的环境中,我的安装设计是这样的:

admin-node, deploy-node(ceph-deploy):10.47.136.60  iZ25cn4xxnvZ
mon.node1,(mds.node1): 10.47.136.60  iZ25cn4xxnvZ
osd.0: 10.47.136.60   iZ25cn4xxnvZ
osd.1: 10.46.181.146 iZ25mjza4msZ

实际上就是两个Aliyun ECS节点承担以上多种角色。不过像iZ25cn4xxnvZ这样的host name太反人类,长远考虑还是换成node1、node2这样的简单名字更好。通过编辑各个ECS上的/etc/hostname, /etc/hosts,我们将iZ25cn4xxnvZ换成node1,将iZ25mjza4msZ换成node2:

10.47.136.60 (node1):

# cat /etc/hostname
node1

# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1    localhost.localdomain    localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.47.136.60 admin
10.47.136.60 node1
10.47.136.60 iZ25cn4xxnvZ
10.46.181.146 node2

----------------------------------
10.46.181.146 (node2):

# cat /etc/hostname
node2

# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1    localhost.localdomain    localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.46.181.146  node2
10.46.181.146  iZ25mjza4msZ
10.47.136.60  node1

于是上面的环境设计就变成了:

admin-node, deploy-node(ceph-deploy):node1 10.47.136.60
mon.node1, (mds.node1) : node1  10.47.136.60
osd.0:  node1 10.47.136.60
osd.1:  node2 10.46.181.146

三、Ceph安装步骤

1、安装ceph-deploy

Ceph提供了一键式安装工具ceph-deploy来协助Ceph集群的安装,在deploy node上,我们首先要来安装的就是ceph-deploy,Ubuntu 14.04官方源中的ceph-deploy是1.4.0版本,比较old,我们需要添加Ceph源,安装最新的ceph-deploy:

# wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
OK

# echo deb https://download.ceph.com/debian-jewel/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
deb https://download.ceph.com/debian-jewel/ trusty main

#apt-get update
... ...

# apt-get install ceph-deploy
Reading package lists... Done
Building dependency tree
Reading state information... Done
.... ...
The following NEW packages will be installed:
  ceph-deploy
0 upgraded, 1 newly installed, 0 to remove and 105 not upgraded.
Need to get 96.4 kB of archives.
After this operation, 622 kB of additional disk space will be used.
Get:1 https://download.ceph.com/debian-jewel/ trusty/main ceph-deploy all 1.5.35 [96.4 kB]
Fetched 96.4 kB in 1s (53.2 kB/s)
Selecting previously unselected package ceph-deploy.
(Reading database ... 153022 files and directories currently installed.)
Preparing to unpack .../ceph-deploy_1.5.35_all.deb ...
Unpacking ceph-deploy (1.5.35) ...
Setting up ceph-deploy (1.5.35) ...

注意:ceph-deploy只需要在admin/deploy node上安装即可。

2、前置设置

和安装k8s一样,在ceph-deploy真正执行安装之前,需要确保所有Ceph node都要开启NTP,同时建议在每个node节点上为安装过程创建一个安装账号,即ceph-deploy在ssh登录到每个Node时所用的账号。这个账号有两个约束:

  • 具有sudo权限;
  • 执行sudo命令时,无需输入密码。

我们将这一账号命名为cephd,我们需要在每个ceph node上(包括admin node/deploy node)都建立一个cephd用户,并加入到sudo组中。

以下命令在每个Node上都要执行:

useradd -d /home/cephd -m cephd
passwd cephd

添加sudo权限:
echo "cephd ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephd
sudo chmod 0440 /etc/sudoers.d/cephd

在admin node(deploy node)上,登入cephd账号,创建该账号下deploy node到其他各个Node的ssh免密登录设置,密码留空:

在deploy node上执行:

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cephd/.ssh/id_rsa):
Created directory '/home/cephd/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cephd/.ssh/id_rsa.
Your public key has been saved in /home/cephd/.ssh/id_rsa.pub.
The key fingerprint is:
....

将deploy node的公钥copy到其他节点上去:

$ ssh-copy-id cephd@node1
The authenticity of host 'node1 (10.47.136.60)' can't be established.
ECDSA key fingerprint is d2:69:e2:3a:3e:4c:6b:80:15:30:17:8e:df:3b:62:1f.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
cephd@node1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'cephd@node1'"
and check to make sure that only the key(s) you wanted were added.

同样,执行 ssh-copy-id cephd@node2,完成后,测试一下免密登录。

$ ssh node1
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-70-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Welcome to aliyun Elastic Compute Service!

最后,在Deploy node上创建并编辑~/.ssh/config,这是Ceph官方doc推荐的步骤,这样做的目的是可以避免每次执行ceph-deploy时都要去指定 –username {username} 参数。

//~/.ssh/config
Host node1
   Hostname node1
   User cephd
Host node2
   Hostname node2
   User cephd
3、安装ceph

这个环节参考的是Ceph官方doc手工部署一节

如果之前安装过ceph,可以先执行如下命令以获得一个干净的环境:

ceph-deploy purgedata node1 node2
ceph-deploy forgetkeys
ceph-deploy purge node1 node2

接下来我们就可以来全新安装Ceph了。在deploy node上,建立cephinstall目录,然后进入cephinstall目录执行相关步骤。

我们首先来创建一个ceph cluster,这个环节需要通过执行ceph-deploy new {initial-monitor-node(s)}命令。按照上面的安装设计,我们的ceph monitor node就是node1,因此我们执行下面命令来创建一个名为ceph的ceph cluster:

$ ceph-deploy new node1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy new node1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  func                          : <function new at 0x7f71d2051938>
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f71d19f5710>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  ssh_copykey                   : True
[ceph_deploy.cli][INFO  ]  mon                           : ['node1']
[ceph_deploy.cli][INFO  ]  public_network                : None
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  cluster_network               : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  fsid                          : None
[ceph_deploy.new][DEBUG ] Creating new cluster named ceph
[ceph_deploy.new][INFO  ] making sure passwordless SSH succeeds
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /bin/ip link show
[node1][INFO  ] Running command: sudo /bin/ip addr show
[node1][DEBUG ] IP addresses found: [u'101.201.78.51', u'192.168.16.1', u'10.47.136.60', u'172.16.99.0', u'172.16.99.1']
[ceph_deploy.new][DEBUG ] Resolving host node1
[ceph_deploy.new][DEBUG ] Monitor node1 at 10.47.136.60
[ceph_deploy.new][DEBUG ] Monitor initial members are ['node1']
[ceph_deploy.new][DEBUG ] Monitor addrs are ['10.47.136.60']
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...

new命令执行完后,ceph-deploy会在当前目录下创建一些辅助文件:

# ls
ceph.conf  ceph-deploy-ceph.log  ceph.mon.keyring

$ cat ceph.conf
[global]
fsid = f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
mon_initial_members = node1
mon_host = 10.47.136.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

由于我们仅有两个OSD节点,因此我们在进一步安装之前,需要先对ceph.conf文件做一些配置调整:
修改配置以进行后续安装:

在[global]标签下,添加下面一行:
osd pool default size = 2

ceph.conf保存退出。接下来,我们执行下面命令在node1和node2上安装ceph运行所需的各个binary包:

# ceph-deploy install nod1 node2
.... ...
[node2][INFO  ] Running command: sudo ceph --version
[node2][DEBUG ] ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

这一过程ceph-deploy会SSH登录到各个node上去,执行apt-get update, 并install ceph的各种组件包,这个环节耗时可能会长一些(依网络情况不同而不同),请耐心等待。

4、初始化ceph monitor node

有了ceph启动的各个程序后,我们首先来初始化ceph cluster的monitor node。在deploy node的工作目录cephinstall下,执行:

# ceph-deploy mon create-initial

[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy mon create-initial
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : create-initial
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f0f7ea2fe60>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x7f0f7ee93de8>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  keyrings                      : None
[ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts node1
[ceph_deploy.mon][DEBUG ] detecting platform for host node1...
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
....

[iZ25cn4xxnvZ][INFO  ] Running command: ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.iZ25cn4xxnvZ.asok mon_status
[ceph_deploy.mon][INFO  ] mon.iZ25cn4xxnvZ monitor has reached quorum!
[ceph_deploy.mon][INFO  ] all initial monitors are running and have formed quorum
[ceph_deploy.mon][INFO  ] Running gatherkeys...
[ceph_deploy.gatherkeys][INFO  ] Storing keys in temp directory /tmp/tmpP_SmXX
[iZ25cn4xxnvZ][DEBUG ] connected to host: iZ25cn4xxnvZ
[iZ25cn4xxnvZ][DEBUG ] detect platform information from remote host
[iZ25cn4xxnvZ][DEBUG ] detect machine type
[iZ25cn4xxnvZ][DEBUG ] find the location of an executable
[iZ25cn4xxnvZ][INFO  ] Running command: /sbin/initctl version
[iZ25cn4xxnvZ][DEBUG ] get remote short hostname
[iZ25cn4xxnvZ][DEBUG ] fetch remote file
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.iZ25cn4xxnvZ.asok mon_status
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.admin osd allow * mds allow * mon allow *
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-mds mon allow profile bootstrap-mds
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-osd mon allow profile bootstrap-osd
[iZ25cn4xxnvZ][INFO  ] Running command: /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-iZ25cn4xxnvZ/keyring auth get-or-create client.bootstrap-rgw mon allow profile bootstrap-rgw
... ...
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.client.admin.keyring
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-mds.keyring
[ceph_deploy.gatherkeys][INFO  ] keyring 'ceph.mon.keyring' already exists
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-osd.keyring
[ceph_deploy.gatherkeys][INFO  ] Storing ceph.bootstrap-rgw.keyring
[ceph_deploy.gatherkeys][INFO  ] Destroy temp directory /tmp/tmpP_SmXX

这一过程很顺利。命令执行完成后我们能看到一些变化:

在当前目录下,出现了若干*.keyring,这是Ceph组件间进行安全访问时所需要的:

# ls -l
total 216
-rw------- 1 root root     71 Nov  3 17:24 ceph.bootstrap-mds.keyring
-rw------- 1 root root     71 Nov  3 17:25 ceph.bootstrap-osd.keyring
-rw------- 1 root root     71 Nov  3 17:25 ceph.bootstrap-rgw.keyring
-rw------- 1 root root     63 Nov  3 17:24 ceph.client.admin.keyring
-rw-r--r-- 1 root root    242 Nov  3 16:40 ceph.conf
-rw-r--r-- 1 root root 192336 Nov  3 17:25 ceph-deploy-ceph.log
-rw------- 1 root root     73 Nov  3 16:28 ceph.mon.keyring
-rw-r--r-- 1 root root   1645 Oct 16  2015 release.asc

在node1(monitor node)上,我们看到ceph-mon已经运行起来了:

cephd@node1:~/cephinstall$ ps -ef|grep ceph
ceph     32326     1  0 14:19 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i node1 -f --setuser ceph --setgroup ceph

如果要手工停止ceph-mon,可以使用stop ceph-mon-all 命令。

5、prepare ceph OSD node

至此,ceph-mon组件程序已经成功启动了,剩下的只有OSD这一关了。启动OSD node分为两步:prepare 和 activate。OSD node是真正存储数据的节点,我们需要为ceph-osd提供独立存储空间,一般是一个独立的disk。但我们环境不具备这个条件,于是在本地盘上创建了个目录,提供给OSD。

在deploy node上执行:

ssh node1
sudo mkdir /var/local/osd0
exit

ssh node2
sudo mkdir /var/local/osd1
exit

接下来,我们就可以执行prepare操作了,prepare操作会在上述的两个osd0和osd1目录下创建一些后续activate激活以及osd运行时所需要的文件:

cephd@node1:~/cephinstall$ ceph-deploy osd prepare node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy osd prepare node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  disk                          : [('node1', '/var/local/osd0', None), ('node2', '/var/local/osd1', None)]
[ceph_deploy.cli][INFO  ]  dmcrypt                       : False
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  bluestore                     : None
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir               : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f072603e8c0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  fs_type                       : xfs
[ceph_deploy.cli][INFO  ]  func                          : <function osd at 0x7f0726492d70>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  zap_disk                      : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node1:/var/local/osd0: node2:/var/local/osd1:
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to node1
[node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host node1 disk /var/local/osd0 journal None activate False
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- /var/local/osd0
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[node1][WARNIN] populate_data_path: Preparing osd data dir /var/local/osd0
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/ceph_fsid.782.tmp
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/fsid.782.tmp
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/magic.782.tmp
[node1][INFO  ] checking OSD status...
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[
ceph_deploy.osd][DEBUG ] Host node1 is now ready for osd use.
[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
... ...
[node2][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host node2 is now ready for osd use.

prepare并不会启动ceph osd,那是activate的职责。

6、激活ceph OSD node

接下来,我们来激活各个OSD node:

$ ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1

... ...
[node1][WARNIN] got monmap epoch 1
[node1][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /var/local/osd0/activate.monmap --osd-data /var/local/osd0 --osd-journal /var/local/osd0/journal --osd-uuid 6def4f7f-4f37-43a5-8699-5c6ab608c89c --keyring /var/local/osd0/keyring --setuser ceph --setgroup ceph
[node1][WARNIN] Traceback (most recent call last):
[node1][WARNIN]   File "/usr/sbin/ceph-disk", line 9, in <module>
[node1][WARNIN]     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5011, in run
[node1][WARNIN]     main(sys.argv[1:])
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4962, in main
[node1][WARNIN]     args.func(args)
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3324, in main_activate
[node1][WARNIN]     init=args.mark_init,
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3144, in activate_dir
[node1][WARNIN]     (osd_id, cluster) = activate(path, activate_key_template, init)
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3249, in activate
[node1][WARNIN]     keyring=keyring,
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2742, in mkfs
[node1][WARNIN]     '--setgroup', get_ceph_group(),
[node1][WARNIN]   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2689, in ceph_osd_mkfs
[node1][WARNIN]     raise Error('%s failed : %s' % (str(arguments), error))
[node1][WARNIN] ceph_disk.main.Error: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '0', '--monmap', '/var/local/osd0/activate.monmap', '--osd-data', '/var/local/osd0', '--osd-journal', '/var/local/osd0/journal', '--osd-uuid', '6def4f7f-4f37-43a5-8699-5c6ab608c89c', '--keyring', '/var/local/osd0/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed : 2016-11-04 14:25:40.325009 7fd1aa73f800 -1 filestore(/var/local/osd0) mkfs: write_version_stamp() failed: (13) Permission denied
[node1][WARNIN] 2016-11-04 14:25:40.325032 7fd1aa73f800 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13
[node1][WARNIN] 2016-11-04 14:25:40.325075 7fd1aa73f800 -1  ** ERROR: error creating empty object store in /var/local/osd0: (13) Permission denied
[node1][WARNIN]
[node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd0

激活没能成功,在激活第一个节点时,就输出了如上错误日志。日志的error含义很明显:权限问题。

ceph-deploy尝试在osd node1上以ceph:ceph启动ceph-osd,但/var/local/osd0目录的权限情况如下:

$ ls -l /var/local
drwxr-sr-x 2 root staff 4096 Nov  4 14:25 osd0

osd0被root拥有,以ceph用户启动的ceph-osd程序自然没有权限在/var/local/osd0目录下创建文件并写入数据了。这个问题在ceph官方issue中有很多人提出来,也给出了临时修正方法:

将osd0和osd1的权限赋予ceph:ceph:

node1:
sudo chown -R ceph:ceph /var/local/osd0

node2:
sudo chown -R ceph:ceph /var/local/osd1

修改完权限后,我们再来执行activate:

$ ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy osd activate node1:/var/local/osd0 node2:/var/local/osd1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : activate
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f3c90c678c0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  func                          : <function osd at 0x7f3c910bbd70>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  disk                          : [('node1', '/var/local/osd0', None), ('node2', '/var/local/osd1', None)]
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks node1:/var/local/osd0: node2:/var/local/osd1:
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host node1 disk /var/local/osd0
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd0
[node1][WARNIN] main_activate: path = /var/local/osd0
[node1][WARNIN] activate: Cluster uuid is f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node1][WARNIN] activate: Cluster name is ceph
[node1][WARNIN] activate: OSD uuid is 6def4f7f-4f37-43a5-8699-5c6ab608c89c
[node1][WARNIN] activate: OSD id is 0
[node1][WARNIN] activate: Initializing OSD...
[node1][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/local/osd0/activate.monmap
[node1][WARNIN] got monmap epoch 1
[node1][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /var/local/osd0/activate.monmap --osd-data /var/local/osd0 --osd-journal /var/local/osd0/journal --osd-uuid 6def4f7f-4f37-43a5-8699-5c6ab608c89c --keyring /var/local/osd0/keyring --setuser ceph --setgroup ceph
[node1][WARNIN] activate: Marking with init system upstart
[node1][WARNIN] activate: Authorizing OSD key...
[node1][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.0 -i /var/local/osd0/keyring osd allow * mon allow profile osd
[node1][WARNIN] added key for osd.0
[node1][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd0/active.4616.tmp
[node1][WARNIN] activate: ceph osd.0 data dir is ready at /var/local/osd0
[node1][WARNIN] activate_dir: Creating symlink /var/lib/ceph/osd/ceph-0 -> /var/local/osd0
[node1][WARNIN] start_daemon: Starting ceph osd.0...
[node1][WARNIN] command_check_call: Running command: /sbin/initctl emit --no-wait -- ceph-osd cluster=ceph id=0
[node1][INFO  ] checking OSD status...
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[node1][WARNIN] there is 1 OSD down
[node1][WARNIN] there is 1 OSD out

[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
[node2][DEBUG ] detect platform information from remote host
[node2][DEBUG ] detect machine type
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /sbin/initctl version
[node2][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host node2 disk /var/local/osd1
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v activate --mark-init upstart --mount /var/local/osd1
[node2][WARNIN] main_activate: path = /var/local/osd1
[node2][WARNIN] activate: Cluster uuid is f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
[node2][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node2][WARNIN] activate: Cluster name is ceph
[node2][WARNIN] activate: OSD uuid is 4733f683-0376-4708-86a6-818af987ade2
[node2][WARNIN] allocate_osd_id: Allocating OSD id...
[node2][WARNIN] command: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd create --concise 4733f683-0376-4708-86a6-818af987ade2
[node2][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd1/whoami.27470.tmp
[node2][WARNIN] activate: OSD id is 1
[node2][WARNIN] activate: Initializing OSD...
[node2][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/local/osd1/activate.monmap
[node2][WARNIN] got monmap epoch 1
[node2][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 1 --monmap /var/local/osd1/activate.monmap --osd-data /var/local/osd1 --osd-journal /var/local/osd1/journal --osd-uuid 4733f683-0376-4708-86a6-818af987ade2 --keyring /var/local/osd1/keyring --setuser ceph --setgroup ceph
[node2][WARNIN] activate: Marking with init system upstart
[node2][WARNIN] activate: Authorizing OSD key...
[node2][WARNIN] command_check_call: Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring auth add osd.1 -i /var/local/osd1/keyring osd allow * mon allow profile osd
[node2][WARNIN] added key for osd.1
[node2][WARNIN] command: Running command: /bin/chown -R ceph:ceph /var/local/osd1/active.27470.tmp
[node2][WARNIN] activate: ceph osd.1 data dir is ready at /var/local/osd1
[node2][WARNIN] activate_dir: Creating symlink /var/lib/ceph/osd/ceph-1 -> /var/local/osd1
[node2][WARNIN] start_daemon: Starting ceph osd.1...
[node2][WARNIN] command_check_call: Running command: /sbin/initctl emit --no-wait -- ceph-osd cluster=ceph id=1
[node2][INFO  ] checking OSD status...
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json

没有错误报出!但OSD真的运行起来了吗?我们还需要再确认一下。

我们先通过ceph admin命令将各个.keyring同步到各个Node上,以便可以在各个Node上使用ceph命令连接到monitor:

注意:执行ceph admin前,需要在deploy-node的/etc/hosts中添加:

10.47.136.60 admin

执行ceph admin:

$ ceph-deploy admin admin node1 node2
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephd/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy admin admin node1 node2
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f072ee3b758>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  client                        : ['admin', 'node1', 'node2']
[ceph_deploy.cli][INFO  ]  func                          : <function admin at 0x7f072f6cf5f0>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to admin
[admin][DEBUG ] connection detected need for sudo
[admin][DEBUG ] connected to host: admin
[admin][DEBUG ] detect platform information from remote host
[admin][DEBUG ] detect machine type
[admin][DEBUG ] find the location of an executable
[admin][INFO  ] Running command: sudo /sbin/initctl version
[admin][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node1
[node1][DEBUG ] connection detected need for sudo
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: sudo /sbin/initctl version
[node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node2
[node2][DEBUG ] connection detected need for sudo
[node2][DEBUG ] connected to host: node2
[node2][DEBUG ] detect platform information from remote host
[node2][DEBUG ] detect machine type
[node2][DEBUG ] find the location of an executable
[node2][INFO  ] Running command: sudo /sbin/initctl version
[node2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf

$sudo chmod +r /etc/ceph/ceph.client.admin.keyring

接下来,查看一下ceph集群中的OSD节点状态:

$ ceph osd tree
ID WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.07660 root default
-2 0.03830     host node1
 0 0.03830         osd.0            down        0          1.00000
-3 0.03830     host iZ25mjza4msZ
 1 0.03830         osd.1            down        0          1.00000

果不其然,两个osd节点均处于down状态,一个也没有启动起来。问题在哪?

我们来查看一下node1上的日志:/var/log/ceph/ceph-osd.0.log:

016-11-04 15:33:17.088971 7f568d6db800  0 pidfile_write: ignore empty --pid-file
2016-11-04 15:33:17.102052 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) backend generic (magic 0xef53)
2016-11-04 15:33:17.102071 7f568d6db800 -1 filestore(/var/lib/ceph/osd/ceph-0) WARNING: max attr value size (1024) is smaller than osd_max_object_name_len (2048).  Your backend filesystem appears to not support attrs large enough to handle the configured max rados name size.  You may get unexpected ENAMETOOLONG errors on rados operations or buggy behavior
2016-11-04 15:33:17.102410 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-04 15:33:17.102425 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-11-04 15:33:17.102445 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported
2016-11-04 15:33:17.119261 7f568d6db800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-11-04 15:33:17.127630 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) limited size xattrs
2016-11-04 15:33:17.128125 7f568d6db800  1 leveldb: Recovering log #38
2016-11-04 15:33:17.136595 7f568d6db800  1 leveldb: Delete type=3 #37
2016-11-04 15:33:17.136656 7f568d6db800  1 leveldb: Delete type=0 #38
2016-11-04 15:33:17.136845 7f568d6db800  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-11-04 15:33:17.137064 7f568d6db800 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2016-11-04 15:33:17.137068 7f568d6db800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2016-11-04 15:33:17.137897 7f568d6db800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2016-11-04 15:33:17.138243 7f568d6db800  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade
2016-11-04 15:33:17.138453 7f568d6db800 -1 osd.0 0 backend (filestore) is unable to support max object name[space] len
2016-11-04 15:33:17.138481 7f568d6db800 -1 osd.0 0    osd max object name len = 2048
2016-11-04 15:33:17.138485 7f568d6db800 -1 osd.0 0    osd max object namespace len = 256
2016-11-04 15:33:17.138488 7f568d6db800 -1 osd.0 0 (36) File name too long
2016-11-04 15:33:17.138895 7f568d6db800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2016-11-04 15:33:17.140041 7f568d6db800 -1  ** ERROR: osd init failed: (36) File name too long

的确发现了错误日志:

2016-11-04 15:33:17.138481 7f568d6db800 -1 osd.0 0    osd max object name len = 2048
2016-11-04 15:33:17.138485 7f568d6db800 -1 osd.0 0    osd max object namespace len = 256
2016-11-04 15:33:17.138488 7f568d6db800 -1 osd.0 0 (36) File name too long
2016-11-04 15:33:17.138895 7f568d6db800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2016-11-04 15:33:17.140041 7f568d6db800 -1  ** ERROR: osd init failed: (36) File name too long

进一步搜索ceph官方文档,发现在文件系统推荐这个doc中有提到,官方不建议采用ext4文件系统作为ceph的后端文件系统,如果采用,那么对于ext4的filesystem,应该在ceph.conf中添加如下配置:

osd max object name len = 256
osd max object namespace len = 64

由于配置已经分发到个个node上,我们需要到各个Node上同步修改:/etc/ceph/ceph.conf,添加上面两行。然后重新activate osd node,这里不赘述。重新激活后,我们来查看ceph osd状态:

$ ceph osd tree
ID WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.07660 root default
-2 0.03830     host node1
 0 0.03830         osd.0              up  1.00000          1.00000
-3 0.03830     host iZ25mjza4msZ
 1 0.03830         osd.1              up  1.00000          1.00000

 $ceph -s
    cluster f5166c78-e3b6-4fef-b9e7-1ecf7382fd93
     health HEALTH_OK
     monmap e1: 1 mons at {node1=10.47.136.60:6789/0}
            election epoch 3, quorum 0 node1
     osdmap e11: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v29: 64 pgs, 1 pools, 0 bytes data, 0 objects
            37834 MB used, 38412 MB / 80374 MB avail
                  64 active+clean

$ !ps
ps -ef|grep ceph
ceph       17139       1  0 16:20 ?        00:00:00 /usr/bin/ceph-osd --cluster=ceph -i 0 -f --setuser ceph --setgroup ceph

可以看到ceph osd节点上的ceph-osd启动正常,cluster 状态为active+clean,至此,Ceph Cluster集群安装ok(我们暂不需要Ceph MDS组件)。

四、创建一个使用Ceph RBD作为后端Volume的Pod

在这一节中,我们就要将Ceph RBD与Kubenetes做集成了。Kubernetes的官方源码的examples/volumes/rbd目录下,就有一个使用cephrbd作为kubernetes pod volume的例子,我们试着将其跑起来。

例子提供了两个pod描述文件:rbd.json和rbd-with-secret.json。由于我们在ceph install时在ceph.conf中使用默认的安全验证协议cephx – The Ceph authentication protocol了:

auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

因此我们将采用rbd-with-secret.json这个pod描述文件来创建例子中的Pod,限于篇幅,这里仅节选json文件中的volumes部分:

//例子中的rbd-with-secret.json

{
    ... ...
        "volumes": [
            {
                "name": "rbdpd",
                "rbd": {
                    "monitors": [
                           "10.16.154.78:6789",
                           "10.16.154.82:6789",
                           "10.16.154.83:6789"
                                 ],
                    "pool": "kube",
                    "image": "foo",
                    "user": "admin",
                    "secretRef": {
                           "name": "ceph-secret"
                                         },
                    "fsType": "ext4",
                    "readOnly": true
                }
            }
        ]
    }
}

volumes部分是和ceph rbd紧密相关的一些信息,各个字段的大致含义如下:

name:volume名字,这个没什么可说的,顾名思义即可。
rbd.monitors:前面提到过ceph集群的monitor组件,这里填写monitor组件的通信信息,集群里有几个monitor就填几个;
rbd.pool:Ceph中的pool记号,它用来给ceph中存储的对象进行逻辑分区用的。默认的pool是”rbd”;
rbd.image:Ceph磁盘块设备映像文件
rbd.user:ceph client访问ceph storage cluster所使用的用户名。ceph有自己的一套user管理系统,user的写法通常是TYPE.ID,比如client.admin(是不是想到对应的文件:ceph.client.admin.keyring)。client是一种type,而admin则是user。一般来说,Type基本都是client。
secret.Ref:引用的k8s secret对象名称。

上面的字段中,有两个字段值我们无法提供:rbd.image和secret.Ref,现在我们就来“填空”。我们在root用户下建立k8s-cephrbd工作目录,我们首先需要使用ceph提供的rbd工具创建Pod要用到image:

# rbd create foo -s 1024

# rbd list
foo

我们在rbd pool中(在上述命令中未指定pool name,默认image建立在rbd pool中)创建一个大小为1024Mi的ceph image foo,rbd list命令的输出告诉我们foo image创建成功。接下来,我们尝试将foo image映射到内核,并格式化该image:

root@node1:~# rbd map foo
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable".
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (6) No such device or address

map操作报错。不过从错误提示信息,我们能找到一些蛛丝马迹:“RBD image feature set mismatch”。ceph新版中在map image时,给image默认加上了许多feature,通过rbd info可以查看到:

# rbd info foo
rbd image 'foo':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.10612ae8944a
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:

可以看到foo image拥有: layering, exclusive-lock, object-map, fast-diff, deep-flatten。不过遗憾的是我的Ubuntu 14.04的3.19内核仅支持其中的layering feature,其他feature概不支持。我们需要手动disable这些features:

# rbd feature disable foo exclusive-lock, object-map, fast-diff, deep-flatten
root@node1:/var/log/ceph# rbd info foo
rbd image 'foo':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.10612ae8944a
    format: 2
    features: layering
    flags:

不过每次这么来disable可是十分麻烦的,一劳永逸的方法是在各个cluster node的/etc/ceph/ceph.conf中加上这样一行配置:

rbd_default_features = 1 #仅是layering对应的bit码所对应的整数值

设置完后,通过下面命令查看配置变化:

# ceph --show-config|grep rbd|grep features
rbd_default_features = 1

关于image features的这个问题,zphj1987的这篇文章中有较为详细的讲解。

我们再来map一下foo这个image:

# rbd map foo
/dev/rbd0

# ls -l /dev/rbd0
brw-rw---- 1 root disk 251, 0 Nov  5 10:33 /dev/rbd0

map后,我们就可以像格式化一个空image那样对其进行格式化了,这里格成ext4文件系统(格式化这一步大可不必,在后续小节中你会看到):

# mkfs.ext4 /dev/rbd0
mke2fs 1.42.9 (4-Feb-2014)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1024 blocks, Stripe width=1024 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

接下来我们来创建ceph-secret这个k8s secret对象,这个secret对象用于k8s volume插件访问ceph集群:

获取client.admin的keyring值,并用base64编码:

# ceph auth get-key client.admin
AQBiKBxYuPXiJRAAsupnTBsURoWzb0k00oM3iQ==

# echo "AQBiKBxYuPXiJRAAsupnTBsURoWzb0k00oM3iQ=="|base64
QVFCaUtCeFl1UFhpSlJBQXN1cG5UQnNVUm9XemIwazAwb00zaVE9PQo=

在k8s-cephrbd下建立ceph-secret.yaml文件,data下的key字段值即为上面得到的编码值:

//ceph-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key: QVFCaUtCeFl1UFhpSlJBQXN1cG5UQnNVUm9XemIwazAwb00zaVE9PQo=

创建ceph-secret:

# kubectl create -f ceph-secret.yaml
secret "ceph-secret" created

# kubectl get secret
NAME                  TYPE                                  DATA      AGE
ceph-secret           Opaque                                1         16s

至此,我们的rbd-with-secret.json全貌如下:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "rbd2"
    },
    "spec": {
        "containers": [
            {
                "name": "rbd-rw",
                "image": "kubernetes/pause",
                "volumeMounts": [
                    {
                        "mountPath": "/mnt/rbd",
                        "name": "rbdpd"
                    }
                ]
            }
        ],
        "volumes": [
            {
                "name": "rbdpd",
                "rbd": {
                    "monitors": [
                        "10.47.136.60:6789"
                                 ],
                    "pool": "rbd",
                    "image": "foo",
                    "user": "admin",
                    "secretRef": {
                        "name": "ceph-secret"
                        },
                    "fsType": "ext4",
                    "readOnly": true
                }
            }
        ]
    }
}

基于该Pod描述文件,创建使用cephrbd作为后端存储的pod:

# kubectl create -f rbd-with-secret.json
pod "rbd2" created

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
rbd2                        1/1       Running   0          16s

# rbd showmapped
id pool image snap device
0  rbd  foo   -    /dev/rbd0

# mount
... ...
/dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-foo type ext4 (rw)

在我的环境中,pod实际被调度到了另外一个k8s node上运行了:

pod被调度到另外一个 node2 上:

# docker ps
CONTAINER ID        IMAGE                                          COMMAND                  CREATED             STATUS              PORTS                    NAMES
32f92243f911        kubernetes/pause                               "/pause"                 2 minutes ago       Up 2 minutes                                 k8s_rbd-rw.c1dc309e_rbd2_default_6b6541b9-a306-11e6-ba01-00163e1625a9_a6bb1b20

#docker inspect 32f92243f911
... ...
"Mounts": [
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/volumes/kubernetes.io~secret/default-token-40z0x",
                "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                "Mode": "ro",
                "RW": false,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/etc-hosts",
                "Destination": "/etc/hosts",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/containers/rbd-rw/a6bb1b20",
                "Destination": "/dev/termination-log",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Source": "/var/lib/kubelet/pods/6b6541b9-a306-11e6-ba01-00163e1625a9/volumes/kubernetes.io~rbd/rbdpd",
                "Destination": "/mnt/rbd",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
... ...

五、Kubernetes Persistent Volume和Persistent Volume Claim

上面一小节讲解了Kubernetes volume与Ceph RBD的结合,但是k8s volume还不能完全满足实际生产过程对持久化存储的需求,因为k8s volume的lifetime和pod的生命周期相同,一旦pod被delete,那么volume中的数据就不复存在了。于是k8s又推出了Persistent Volume(PV)和Persistent Volume Claim(PVC)组合,故名思意:即便挂载其的pod被delete了,PV依旧存在,PV上的数据依旧存在。

由于有了之前的“铺垫”,这里仅仅给出使用PV和PVC的步骤:

1、创建disk image

$ rbd create ceph-image -s 128 #考虑后续format快捷,这里只用了128M,仅适用于Demo哦。

# rbd create ceph-image -s 128
# rbd info rbd/ceph-image
rbd image 'ceph-image':
    size 128 MB in 32 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.37202ae8944a
    format: 2
    features: layering
    flags:

如果这里不先创建一个ceph-image,后续Pod启动时,会出现如下的一些错误,比如pod始终处于ContainerCreating状态:

# kubectl get pod
NAME                        READY     STATUS              RESTARTS   AGE
ceph-pod1                   0/1       ContainerCreating   0          13s

如果出现这种错误情况,可以查看/var/log/upstart/kubelet.log,你也许能看到如下错误信息:

I1107 06:02:27.500247   22037 operation_executor.go:768] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/01d049c6-9430-11e6-ba01-00163e1625a9-default-token-40z0x" (spec.Name: "default-token-40z0x") pod "01d049c6-9430-11e6-ba01-00163e1625a9" (UID: "01d049c6-9430-11e6-ba01-00163e1625a9").
I1107 06:03:08.499628   22037 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/rbd/ea848a49-a46b-11e6-ba01-00163e1625a9-ceph-pv" (spec.Name: "ceph-pv") to pod "ea848a49-a46b-11e6-ba01-00163e1625a9" (UID: "ea848a49-a46b-11e6-ba01-00163e1625a9").
E1107 06:03:09.532348   22037 disk_manager.go:56] failed to attach disk
E1107 06:03:09.532402   22037 rbd.go:228] rbd: failed to setup mount /var/lib/kubelet/pods/ea848a49-a46b-11e6-ba01-00163e1625a9/volumes/kubernetes.io~rbd/ceph-pv rbd: map failed exit status 2 rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (2) No such file or directory
2、创建PV

我们直接复用之前创建的ceph-secret对象,PV的描述文件ceph-pv.yaml如下:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  rbd:
    monitors:
      - 10.47.136.60:6789
    pool: rbd
    image: ceph-image
    user: admin
    secretRef:
      name: ceph-secret
    fsType: ext4
    readOnly: false
  persistentVolumeReclaimPolicy: Recycle

执行创建操作:

# kubectl create -f ceph-pv.yaml
persistentvolume "ceph-pv" created

# kubectl get pv
NAME      CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM     REASON    AGE
ceph-pv   1Gi        RWO           Recycle         Available                       7s
3、创建PVC

pvc是Pod对Pv的请求,将请求做成一种资源,便于管理以及pod复用。我们用到的pvc描述文件ceph-pvc.yaml如下:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

执行创建操作:

# kubectl create -f ceph-pvc.yaml
persistentvolumeclaim "ceph-claim" created

# kubectl get pvc
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
ceph-claim   Bound     ceph-pv   1Gi        RWO           12s

4、创建挂载ceph RBD的pod

pod描述文件ceph-pod1.yaml如下:

apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod1
spec:
  containers:
  - name: ceph-busybox1
    image: busybox
    command: ["sleep", "600000"]
    volumeMounts:
    - name: ceph-vol1
      mountPath: /usr/share/busybox
      readOnly: false
  volumes:
  - name: ceph-vol1
    persistentVolumeClaim:
      claimName: ceph-claim

创建pod操作:

# kubectl create -f ceph-pod1.yaml
pod "ceph-pod1" created

# kubectl get pod
NAME                        READY     STATUS              RESTARTS   AGE
ceph-pod1                   0/1       ContainerCreating   0          13s

Pod还处于ContainerCreating状态。pod的创建,尤其是挂载pv的Pod的创建需要一小段时间,耐心等待一下,我们可以查看一下/var/log/upstart/kubelet.log:

I1107 11:44:38.768541   22037 mount_linux.go:272] `fsck` error fsck from util-linux 2.20.1

fsck.ext2: Bad magic number in super-block while trying to open /dev/rbd1
/dev/rbd1:
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

E1107 11:44:38.774080   22037 mount_linux.go:110] Mount failed: exit status 32
Mounting arguments: /dev/rbd1 /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-ceph-image ext4 [defaults]
Output: mount: wrong fs type, bad option, bad superblock on /dev/rbd1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

I1107 11:44:38.839148   22037 mount_linux.go:292] Disk "/dev/rbd1" appears to be unformatted, attempting to format as type: "ext4" with options: [-E lazy_itable_init=0,lazy_journal_init=0 -F /dev/rbd1]
I1107 11:44:39.152689   22037 mount_linux.go:297] Disk successfully formatted (mkfs): ext4 - /dev/rbd1 /var/lib/kubelet/plugins/kubernetes.io/rbd/rbd/rbd-image-ceph-image
I1107 11:44:39.220223   22037 operation_executor.go:768] MountVolume.SetUp succeeded for volume "kubernetes.io/rbd/811a57ee-a49c-11e6-ba01-00163e1625a9-ceph-pv" (spec.Name: "ceph-pv") pod "811a57ee-a49c-11e6-ba01-00163e1625a9" (UID: "811a57ee-a49c-11e6-ba01-00163e1625a9").

可以看到,k8s通过fsck发现这个image是一个空image,没有fs在里面,于是默认采用ext4为其格式化,成功后,再行挂载。等待一会后,我们看到ceph-pod1成功run起来了:

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod1                   1/1       Running   0          4m

# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS               NAMES
f50bb8c31b0f        busybox                                               "sleep 600000"           4 hours ago         Up 4 hours                              k8s_ceph-busybox1.c0c0379f_ceph-pod1_default_811a57ee-a49c-11e6-ba01-00163e1625a9_9d910a29

# docker exec 574b8069e548 df -h
Filesystem                Size      Used Available Use% Mounted on
none                     39.2G     20.9G     16.3G  56% /
tmpfs                     1.9G         0      1.9G   0% /dev
tmpfs                     1.9G         0      1.9G   0% /sys/fs/cgroup
/dev/vda1                39.2G     20.9G     16.3G  56% /dev/termination-log
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/resolv.conf
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/hostname
/dev/vda1                39.2G     20.9G     16.3G  56% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/rbd1               120.0M      1.5M    109.5M   1% /usr/share/busybox
tmpfs                     1.9G     12.0K      1.9G   0% /var/run/secrets/kubernetes.io/serviceaccount
tmpfs                     1.9G         0      1.9G   0% /proc/kcore
tmpfs                     1.9G         0      1.9G   0% /proc/timer_list
tmpfs                     1.9G         0      1.9G   0% /proc/timer_stats
tmpfs                     1.9G         0      1.9G   0% /proc/sched_debug

六、简单测试

这一节我们要对cephrbd作为k8s PV的效用做一个简单测试。测试步骤:

1) 在container中,向挂载的cephrbd写入数据;
2) 删除ceph-pod1
3) 重新创建ceph-pod1,查看数据是否还存在。

我们首先通过touch 、vi等命令向ceph-pod1挂载的cephrbd volume写入数据:我们通过容器f50bb8c31b0f 创建/usr/share/busybox/hello-ceph.txt,并向文件写入”hello ceph”一行字符串并保存。

# docker exec -it f50bb8c31b0f touch /usr/share/busybox/hello-ceph.txt
# docker exec -it f50bb8c31b0f vi /usr/share/busybox/hello-ceph.txt
# docker exec -it f50bb8c31b0f cat /usr/share/busybox/hello-ceph.txt
hello ceph

接下来删除ceph-pod1:

# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod1                   1/1       Running   0          4h

# kubectl delete pod/ceph-pod1
pod "ceph-pod1" deleted

# kubectl get pod
NAME                        READY     STATUS        RESTARTS   AGE
ceph-pod1                   1/1       Terminating   0          4h

# kubectl get pv,pvc
NAME         CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                REASON    AGE
pv/ceph-pv   1Gi        RWO           Recycle         Bound     default/ceph-claim             4h
NAME             STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
pvc/ceph-claim   Bound     ceph-pv   1Gi        RWO           4h

可以看到ceph-pod1的删除需要一段时间,这段时间pod一直处于“ Terminating”状态。同时,我们看到pod的删除并没有影响到pv和pvc object,它们依旧存在。

最后,我们再次来创建一下一个使用同一个pvc的pod,为了避免“不必要”的麻烦,我们建立一个名为ceph-pod2.yaml的描述文件:

apiVersion: v1
kind: Pod
metadata:
  name: ceph-pod2
spec:
  containers:
  - name: ceph-busybox2
    image: busybox
    command: ["sleep", "600000"]
    volumeMounts:
    - name: ceph-vol2
      mountPath: /usr/share/busybox
      readOnly: false
  volumes:
  - name: ceph-vol2
    persistentVolumeClaim:
      claimName: ceph-claim

创建ceph-pod2:

# kubectl create -f ceph-pod2.yaml
pod "ceph-pod2" created

root@node1:~/k8stest/k8s-cephrbd# kubectl get pod
NAME                        READY     STATUS    RESTARTS   AGE
ceph-pod2                   1/1       Running   0          14s

root@node1:~/k8stest/k8s-cephrbd# docker ps
CONTAINER ID        IMAGE                                                 COMMAND                  CREATED             STATUS              PORTS               NAMES
574b8069e548        busybox                                               "sleep 600000"           11 seconds ago      Up 10 seconds                           k8s_ceph-busybox2.c5e637a1_ceph-pod2_default_f4aeebd6-a4c3-11e6-ba01-00163e1625a9_fc94c0fe

查看数据是否依旧存在:

# docker exec -it 574b8069e548 cat /usr/share/busybox/hello-ceph.txt
hello ceph

数据完好无损的被ceph-pod2读取到了!

七、小结

至此,对k8s与ceph的集成仅仅才是一个开端,更多的feature和坑等待挖掘。近期发现文章越写越长,原因么?自己赶脚是因为目标系统越来越大,越来越复杂。深入K8s的过程,就是继续给自己挖坑的过程^_^。

我,不是在填坑的路上,就是在坑里:)。

BTW,列一下参考资料:
1、Ceph官方文档
2、OpenShift中的K8s与Ceph RBD集成的文档;
3、Kubernetes官方文档Persistent volumes部分
4、zphj1987博主的这篇文章




这里是Tony Bai的个人Blog,欢迎访问、订阅和留言!订阅Feed请点击上面图片

如果您觉得这里的文章对您有帮助,请扫描上方二维码进行捐赠,加油后的Tony Bai将会为您呈现更多精彩的文章,谢谢!

如果您希望通过微信捐赠,请用微信客户端扫描下方赞赏码:


如果您希望通过比特币或以太币捐赠,可以扫描下方二维码:

比特币:


以太币:


如果您喜欢通过微信App浏览本站内容,可以扫描下方二维码,订阅本站官方微信订阅号“iamtonybai”;点击二维码,可直达本人官方微博主页^_^:



本站Powered by Digital Ocean VPS。

选择Digital Ocean VPS主机,即可获得10美元现金充值,可免费使用两个月哟!

著名主机提供商Linode 10$优惠码:linode10,在这里注册即可免费获得。

阿里云推荐码:1WFZ0V立享9折!

View Tony Bai's profile on LinkedIn


文章

评论

  • 正在加载...

分类

标签

归档











更多