GlusterFS | Tony Bai

Articles tagged GlusterFS

Building High-Performance Object Storage with minio, Part 1: The Prototype

March 16, 2020

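I recently joined a project that needs to store large volumes of unstructured data: images, short videos, audio and the like. I looked first in the Go community for an open source project that could meet this need, and that is how minio came into view.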

Figure: minio logo

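I first came across minio three years ago and even downloaded it to play with (and study) for a while, but at that time it was far less mature than it is today (my requirements were simple then, so I chose weedfs, which I knew better). Today minio has attracted wide attention on GitHub and has plenty of stars (20k+). It is used not only in the Go community but widely in other language communities as well. Speaking loosely: in object storage, minio has much the same air of being the default choice that kafka (from the Java stack) has in message queues :).

At the GopherChina 2019 conference, an engineer from Tantan gave a talk on the practice of a MINIO-based object storage solution at Tantan. Whether Tantan currently runs minio in production is unknown to me, but the talk is one more sign of minio's influence in object storage.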

Figure: a Tantan engineer presenting their minio practice at GopherChina 2019

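minio comes from a team with many years of experience developing network file systems. Its founding team came from the original GlusterFS team, and the design of minio, the product of their second startup, draws heavily on the lessons learned from GlusterFS:

- Simple deployment: a single binary is all you need, and it runs on a wide range of platforms (thanks to Go).
- Massive capacity: storage can be expanded zone by zone (existing zones are unaffected), and a single object can be up to 5TB.
- Compatible with the Amazon S3 API, with developer needs and experience taken seriously.
- Low redundancy with high tolerance of disk failure: the standard, and maximum, redundancy factor is 2 (storing a 1MB object occupies 2MB of disk). Data can still be read when any n/2 disks are lost, where n is the number of disks in an erasure coding set. Recovery works per object, not per storage volume.
- Excellent read and write performance.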
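The redundancy arithmetic in the list above can be sketched in a few lines. This is only an illustration under the assumption stated in the list (half of the drives in a set hold data shards, half hold parity shards); the function name `erasure_footprint` and its return keys are made up for this example and are not part of minio.

```python
def erasure_footprint(object_bytes, drives_in_set):
    """Raw space and failure tolerance for one object in one erasure coding set.

    Assumes n/2 data shards and n/2 parity shards, as described in the list above.
    """
    if drives_in_set < 4 or drives_in_set % 2:
        raise ValueError("expected an even number of drives, at least 4")
    data_shards = drives_in_set // 2
    parity_shards = drives_in_set - data_shards
    # Each shard holds 1/data_shards of the object, rounded up to whole bytes.
    shard_bytes = -(-object_bytes // data_shards)
    return {
        "raw_bytes": shard_bytes * drives_in_set,
        "redundancy_factor": drives_in_set / data_shards,
        "readable_with_failed_drives": parity_shards,
    }


# A 1MB object on a 4-drive set: 2MB on disk, still readable with 2 drives lost.
profile = erasure_footprint(1024 * 1024, 4)
```

The same factor of 2 holds for any valid set size under this assumption; only the number of drives that may fail grows with the set.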

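Figure: benchmark data from the minio technical white paper

Given these strengths, I plan to build the project's object storage for unstructured data on minio. This article covers the prototype design and the setup of an initial minio verification environment.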

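I. Prototype design

Object storage designs built on minio for unstructured data all look much alike. The diagram below is a simple prototype designed around our requirements:

Figure: the prototype design

- Using minio's distributed mode, we combine multiple disks on multiple hosts into one logical storage pool, and minio server instances running on different hosts provide highly available object storage.
- Data is written to minio through a standalone upload service, which talks to the minio cluster using the SDK minio provides.
- Buckets are created with minio's mc tool and their policy is set to "download", so external users can fetch object data from minio directly. Apart from the load balancer (lb) there is no other layer in between.
- A job or scheduled task uses mc to maintain the data in minio, for example periodically deleting data older than 7 days (if the default expiry is set to 7 days).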

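II. minio server startup modes

minio supports several server startup modes:

Figure: minio server startup modes

In standalone mode, all the disks that minio server manages are local to the host. This mode is generally used only for verification and learning in lab and test environments. Standalone mode further divides into non-erasure code mode and erasure code mode.

In non-erasure code mode, minio server is started with a single local directory argument, for example: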

$minio server data

Endpoint:  http://10.10.126.88:9000  http://127.0.0.1:9000
AccessKey: minioadmin
SecretKey: minioadmin

Browser Access:
   http://10.10.126.88:9000  http://127.0.0.1:9000           

Command-line Access: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.10.126.88:9000 minioadmin minioadmin

... ...

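In this mode minio stores each object directly under data, with no replica and no erasure coding. Both the service instance and the disk are single points of failure, there is no high availability guarantee, and a damaged disk means lost data.

Still with a single minio server, erasure code mode means passing several local disk arguments to the instance. As soon as it sees more than one disk argument, minio server enables erasure code mode automatically. Erasure coding places requirements on the number of disks, and the instance fails to start if they are not met: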

$minio server data1 data2
ERROR Invalid command line arguments: Incorrect number of endpoints provided [data1 data2]
      > Please provide an even number of endpoints greater or equal to 4
      HINT:
        For more information, please refer to https://docs.min.io/docs/minio-erasure-code-quickstart-guide

With erasure coding enabled, minio server requires at least 4 endpoints (in standalone mode, directories on local disks). It automatically divides the drives it is given into one or more erasure coding sets, and each set may contain 4, 6, 8, 10, 12, 14 or 16 drives. minio server works out the number of sets and the number of drives per set from the number of drives passed in. In the example below we pass four endpoints (drives) to minio server:

$minio server data1 data2 data3 data4

Formatting 1 zone, 1 set(s), 4 drives per set.
WARNING: Host local has more than 2 drives of set. A host failure will result in data becoming unavailable.
Status:         4 Online, 0 Offline.
Endpoint:  http://10.10.126.88:9000  http://127.0.0.1:9000
AccessKey: minioadmin
SecretKey: minioadmin

Browser Access:
   http://10.10.126.88:9000  http://127.0.0.1:9000           

Command-line Access: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.10.126.88:9000 minioadmin minioadmin

... ...

The log shows that minio server put these drives into a single erasure coding set. It also printed the line WARNING: Host local has more than 2 drives of set. A host failure will result in data becoming unavailable. In other words, more than two drives of this erasure coding set are on the local host, so if the host goes down the data becomes unavailable. (Each set has 4 drives, and under erasure coding a set can lose at most 4/2 = 2 disks.)

Next, a piece of "syntactic sugar" for starting minio server, the ellipsis syntax:

$minio server data{1...18}

Formatting 1 zone, 3 set(s), 6 drives per set.
WARNING: Host local has more than 3 drives of set. A host failure will result in data becoming unavailable.
WARNING: Host local has more than 3 drives of set. A host failure will result in data becoming unavailable.
WARNING: Host local has more than 3 drives of set. A host failure will result in data becoming unavailable.
Status:         18 Online, 0 Offline.
Endpoint:  http://10.10.126.88:9000  http://127.0.0.1:9000
AccessKey: minioadmin
SecretKey: minioadmin

Browser Access:
   http://10.10.126.88:9000  http://127.0.0.1:9000           

Command-line Access: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.10.126.88:9000 minioadmin minioadmin

... ...

minio server data{1...18} is equivalent to minio server data1 data2 data3 data4 data5 data6 data7 data8 data9 data10 data11 data12 data13 data14 data15 data16 data17 data18; minio server expands the ellipsis itself. Given 18 drives, minio server created 3 erasure coding sets of 6 drives each. It also printed one WARNING per set: more than three drives of each set are on the same host.
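The set layouts seen so far (4 drives in 1 set of 4, 18 drives in 3 sets of 6) are consistent with a simple rule: take the largest allowed set size that divides the drive count evenly. The sketch below encodes that rule only; it is an assumption inferred from the outputs in this article, not minio's actual algorithm, which may weigh other factors. `plan_sets` is a hypothetical helper name.

```python
from typing import Tuple

# Set sizes the article lists as allowed, largest first.
VALID_SET_SIZES = (16, 14, 12, 10, 8, 6, 4)


def plan_sets(total_drives: int) -> Tuple[int, int]:
    """Return (number_of_sets, drives_per_set) for a given drive count.

    Simplified rule: the largest valid set size that divides total_drives.
    """
    for size in VALID_SET_SIZES:
        if total_drives % size == 0:
            return total_drives // size, size
    raise ValueError("drive count cannot be split into valid erasure sets")


layout_4 = plan_sets(4)    # matches "1 set(s), 4 drives per set"
layout_18 = plan_sets(18)  # matches "3 set(s), 6 drives per set"
```

Each set then tolerates the loss of half its drives, which is why the warnings above mention 2 drives for a set of 4 and 3 drives for a set of 6.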

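Distributed mode resolves these warnings. As the name suggests, in distributed mode the minio server instances and the drives they manage are spread over several hosts. This removes the single minio server instance as a point of failure and distributes data across different disks on different hosts, which gives high availability and better overall disaster tolerance. Because it handles disks on multiple hosts, distributed mode enables erasure coding sets by default.

In distributed mode, the remote endpoints following minio server are written as http URLs: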

export MINIO_ACCESS_KEY=<ACCESS_KEY>
export MINIO_SECRET_KEY=<SECRET_KEY>
$minio server http://host{1...4}:9000/minio/data{1...4}

The minio server command above corresponds to 4 hosts, each running one minio server instance, with every instance managing 16 drives (local and remote). It is equivalent to:

$minio server http://host1:9000/minio/data1 http://host1:9000/minio/data2 http://host1:9000/minio/data3 http://host1:9000/minio/data4 http://host2:9000/minio/data1 http://host2:9000/minio/data2 http://host2:9000/minio/data3 http://host2:9000/minio/data4 http://host3:9000/minio/data1 http://host3:9000/minio/data2 http://host3:9000/minio/data3 http://host3:9000/minio/data4 http://host4:9000/minio/data1 http://host4:9000/minio/data2 http://host4:9000/minio/data3 http://host4:9000/minio/data4

minio again divides these drives automatically into erasure coding sets. Each endpoint is written in the form http://address/disk-drive-path. Note that this command must be run on each of host1, host2, host3 and host4.
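To make the ellipsis expansion concrete, here is a small sketch that expands `{a...b}` ranges the way the equivalent command lines above are written out. It is an illustration of the notation, not minio's own parser; the `expand` function and its regular expression are assumptions made for this example.

```python
import itertools
import re

# Matches one ellipsis range such as {1...4}.
ELLIPSIS = re.compile(r"\{(\d+)\.\.\.(\d+)\}")


def expand(pattern):
    """Expand every {a...b} range in pattern; the rightmost range varies fastest."""
    ranges = [range(int(lo), int(hi) + 1) for lo, hi in ELLIPSIS.findall(pattern)]
    template = ELLIPSIS.sub("{}", pattern)
    return [template.format(*combo) for combo in itertools.product(*ranges)]


local = expand("data{1...18}")
remote = expand("http://host{1...4}:9000/minio/data{1...4}")
```

With these inputs `local` holds the 18 directory names data1 through data18, and `remote` holds the 16 endpoints in the same host-major order as the long command shown above.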

minio also has the concept of a zone, as in this example:

$minio server data{1...8} data{9...16}

Formatting 1 zone, 1 set(s), 8 drives per set.
WARNING: Host local has more than 4 drives of set. A host failure will result in data becoming unavailable.
Formatting 2 zone, 1 set(s), 8 drives per set.
WARNING: Host local has more than 4 drives of set. A host failure will result in data becoming unavailable.
Status:         16 Online, 0 Offline.
Endpoint:  http://10.10.126.88:9000  http://127.0.0.1:9000
AccessKey: minioadmin
SecretKey: minioadmin

Browser Access:
   http://10.10.126.88:9000  http://127.0.0.1:9000           

Command-line Access: https://docs.min.io/docs/minio-client-quickstart-guide
   $ mc config host add myminio http://10.10.126.88:9000 minioadmin minioadmin

... ...

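We passed two groups of arguments in ellipsis syntax on the command line. minio treats each group as a zone, so with two groups it created two zones. Inside each zone it created one erasure coding set of 8 drives. For an incoming write request, minio server first looks for the zone with more free space and then picks the set and drives within that zone.

Without the ellipsis syntax, minio server puts all the drives passed to it into a single zone.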

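III. Setting up and configuring the prototype verification environment

1. Deploying a distributed minio cluster on a single machine

Our verification environment uses the smallest distributed minio layout: a single machine, one zone, one erasure coding set, 4 drives. The deployment looks like this:

Figure: a distributed minio cluster on a single machine

We did not use the ellipsis syntax, since it does not lend itself to simulating a cluster on one machine. The following script starts the minio cluster: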

# cat startup_minio.sh
#!/bin/bash

export MINIO_ACCESS_KEY="minio"
export MINIO_SECRET_KEY="minio123"

for i in {01..04}; do
    # Redirect stdout and stderr to the per-port log, then run in the background.
    nohup minio server --address ":90${i}" http://127.0.0.1:9001/root/minio-install/data1 http://127.0.0.1:9002/root/minio-install/data2  http://127.0.0.1:9003/root/minio-install/data3 http://127.0.0.1:9004/root/minio-install/data4 > "/root/minio-install/90${i}.log" 2>&1 &
done

Start the minio cluster and check its status:

# bash startup_minio.sh

# ps -ef|grep minio

root      1218     1 11 21:58 pts/5    00:00:01 minio server --address :9001 http://127.0.0.1:9001/root/minio-install/data1 http://127.0.0.1:9002/root/minio-install/data2 http://127.0.0.1:9003/root/minio-install/data3 http://127.0.0.1:9004/root/minio-install/data4
root      1219     1 11 21:58 pts/5    00:00:01 minio server --address :9002 http://127.0.0.1:9001/root/minio-install/data1 http://127.0.0.1:9002/root/minio-install/data2 http://127.0.0.1:9003/root/minio-install/data3 http://127.0.0.1:9004/root/minio-install/data4
root      1220     1  3 21:58 pts/5    00:00:00 minio server --address :9003 http://127.0.0.1:9001/root/minio-install/data1 http://127.0.0.1:9002/root/minio-install/data2 http://127.0.0.1:9003/root/minio-install/data3 http://127.0.0.1:9004/root/minio-install/data4
root      1221     1 11 21:58 pts/5    00:00:01 minio server --address :9004 http://127.0.0.1:9001/root/minio-install/data1 http://127.0.0.1:9002/root/minio-install/data2 http://127.0.0.1:9003/root/minio-install/data3 http://127.0.0.1:9004/root/minio-install/data4

root@instance-cspzrq3u:~/minio-install# ls
9001.log  9002.log  9003.log  9004.log  data1  data2  data3  data4  startup_minio.sh
root@instance-cspzrq3u:~/minio-install# tail -100f 9001.log

Formatting 1 zone, 1 set(s), 4 drives per set.
Attempting encryption of all config, IAM users and policies on MinIO backend
Status:         4 Online, 0 Offline.
Endpoint:  http://192.168.16.4:9001  http://172.17.0.1:9001  http://172.18.0.1:9001  http://127.0.0.1:9001       

Browser Access:
   http://192.168.16.4:9001  http://172.17.0.1:9001  http://172.18.0.1:9001  http://127.0.0.1:9001       

.... ...

2. Configuring and managing with mc

minio provides the mc command-line tool for managing minio server. First we create an mc configuration for the local minio server (:9001):

# mc config host add myminio http://localhost:9001 minio minio123
Added `myminio` successfully.

Here we used mc to add a so-called host pointing at the minio server created above (:9001). In effect the command writes the following configuration into ~/.mc/config.json:

# cat ~/.mc/config.json
{
    "version": "9",
    "hosts": {
        "myminio": {
            "url": "http://localhost:9001",
            "accessKey": "minio",
            "secretKey": "minio123",
            "api": "s3v4",
            "lookup": "auto"
        }
    }
}

Next we use mc to add three buckets to the minio cluster:

root@instance-cspzrq3u:~# mc mb myminio/image
Bucket created successfully `myminio/image`.
root@instance-cspzrq3u:~# mc mb myminio/video
Bucket created successfully `myminio/video`.
root@instance-cspzrq3u:~# mc mb myminio/audio
Bucket created successfully `myminio/audio`.
root@instance-cspzrq3u:~# mc ls myminio
[2020-03-16 15:19:55 CST]      0B audio/
[2020-03-16 15:19:48 CST]      0B image/
[2020-03-16 15:19:52 CST]      0B video/

A newly created bucket has the access policy none by default, meaning no external access:

root@instance-cspzrq3u:~# mc policy get myminio/image
Access permission for `myminio/image` is `none`

Per our design, these three buckets need to be readable from outside. Taking the image bucket as an example:

root@instance-cspzrq3u:~# mc policy set download myminio/image
Access permission for `myminio/image` is set to `download`
root@instance-cspzrq3u:~# mc policy get myminio/image
Access permission for `myminio/image` is `download`
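For readers used to S3-style bucket policies, the sketch below builds the kind of policy document that an anonymous read-only setting such as download roughly corresponds to: anyone may locate and list the bucket and read its objects. The exact statements that mc generates are an assumption here, and `download_policy` is a hypothetical helper, so treat this as an approximation rather than the tool's output.

```python
import json


def download_policy(bucket):
    """Build an S3-style policy that lets anonymous users read a bucket.

    Assumption: this mirrors what an anonymous 'download' setting grants.
    """
    anyone = {"AWS": ["*"]}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": anyone,
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": ["arn:aws:s3:::" + bucket],
            },
            {
                "Effect": "Allow",
                "Principal": anyone,
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::" + bucket + "/*"],
            },
        ],
    }


policy_json = json.dumps(download_policy("image"), indent=2)
```

Note that no write action appears in the policy, which matches the design above: uploads go through the upload service, and only reads are public.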

3. Load balancer setup

We put an nginx in front of the minio cluster. Below is the nginx configuration file created for minio (/etc/nginx/conf.d/minio.conf):

# /etc/nginx/conf.d/minio.conf

 upstream minio_cluster {
    server localhost:9001;
    server localhost:9002;
    server localhost:9003;
    server localhost:9004;
 }

server {
 listen 9000;
 server_name myminio.tonybai.com;

 # To allow special characters in headers
 ignore_invalid_headers off;
 # Allow any size file to be uploaded.
 # Set to a value such as 1000m; to restrict file size to a specific value
 client_max_body_size 0;
 # To disable buffering
 proxy_buffering off;

location / {

   proxy_set_header X-Real-IP $remote_addr;
   proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
   proxy_set_header X-Forwarded-Proto $scheme;
   proxy_set_header Host $http_host;

   proxy_connect_timeout 300;
   # Default is HTTP/1, keepalive is only enabled in HTTP/1.1
   proxy_http_version 1.1;
   proxy_set_header Connection "";
   chunked_transfer_encoding off;

   proxy_pass http://minio_cluster;
}

location /image/ {
   proxy_set_header X-Real-IP $remote_addr;
   proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
   proxy_set_header X-Forwarded-Proto $scheme;
   proxy_set_header Host $http_host;

   proxy_connect_timeout 300;
   # Default is HTTP/1, keepalive is only enabled in HTTP/1.1
   proxy_http_version 1.1;
   proxy_set_header Connection "";
   chunked_transfer_encoding off;
   client_max_body_size 1000m;
   proxy_buffering off;

   proxy_pass http://minio_cluster;
 }
}

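Reload nginx (nginx -s reload).

Open http://myminio.tonybai.com:9000/ in a browser. After logging in you will see the following page:

Figure: the minio web UI in a browser

Select the "image" bucket on the left and click the "+" button at the bottom right to upload an image, gopher-daily-logo.png. After uploading, log out, then access the image at http://myminio.tonybai.com:9000/image/gopher-daily-logo.png. You can also download it with wget: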

$wget -c http://myminio.tonybai.com:9000/image/gopher-daily-logo.png
--2020-03-16 15:40:20--  http://myminio.tonybai.com:9000/image/gopher-daily-logo.png
Resolving myminio.tonybai.com (myminio.tonybai.com)... 106.12.69.83
Connecting to myminio.tonybai.com (myminio.tonybai.com)|106.12.69.83|:9000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59736 (58K) [image/png]
Saving to: 'gopher-daily-logo.png'

gopher-daily-logo.png        100%[============================================>]  58.34K   253KB/s    in 0.2s

2020-03-16 15:40:20 (253 KB/s) - 'gopher-daily-logo.png' saved [59736/59736]

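4. Purging objects

In our requirements the objects in a bucket live for 7 days. A scheduler or a job can purge expired objects with mc, for example by running the following command every 5 minutes:

$mc rm --recursive --force --older-than 7d myminio/image/

This command recursively deletes objects in the image bucket that were created more than 7 days ago. rm supports combinations of conditions; see the mc rm manual for details.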
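The selection rule behind such a purge job is simple: compare each object's timestamp with a cutoff of now minus 7 days. The sketch below shows that rule on an in-memory listing. It does not call minio or mc; the object names, timestamps and the `expired_keys` helper are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def expired_keys(objects, now, max_age=timedelta(days=7)):
    """Return the keys whose last-modified time is older than now - max_age.

    objects is an iterable of (key, last_modified) pairs.
    """
    cutoff = now - max_age
    return [key for key, last_modified in objects if last_modified < cutoff]


# Hypothetical listing of the image bucket.
now = datetime(2020, 3, 16, 15, 0, tzinfo=timezone.utc)
listing = [
    ("old.png", now - timedelta(days=8)),
    ("edge.png", now - timedelta(days=7)),
    ("new.png", now - timedelta(hours=3)),
]
to_delete = expired_keys(listing, now)
```

An object that is exactly 7 days old is kept under this rule, because the comparison is strict; a real job should pick whichever boundary behavior the retention requirement calls for.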

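IV. Summary

That completes the first step of building high-performance object storage with minio: the prototype is up and running. As I use minio more and understand it better, I expect to have more to share.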


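An Introduction to Using weed-fs

August 22, 2015

weed-fs, full name Seaweed-fs, is a simple, highly available distributed file system implemented in golang. The system has two goals:

- store billions of files
- serve the files fast

weed-fs began as an implementation of Facebook's Haystack paper; Haystack was designed to optimize image storage and retrieval inside Facebook. The weed-fs author later added a number of features on that base, producing today's weed-fs.

This article does not analyze the weed-fs source in depth. It introduces how to use weed-fs from a black-box perspective and explores its features, strengths and weaknesses.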

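I. The weed-fs cluster

The topology of a weed-fs cluster is made up of DataCenter, Rack and Machine (also called Node). The earliest versions of weed-fs could apparently describe the whole cluster topology in an XML configuration file, for which the project provided a sample.

In the current version, however, the help text marks that configuration file as "Deprecating!":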

$weed master -help
…
-conf="/etc/weedfs/weedfs.conf": Deprecating! xml configuration file
…

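In weed-fs 0.70 the cluster topology is maintained by the masters, which assemble it in real time from the master-to-master and volume-to-master connections.

weed-fs itself runs in one of two modes: Master or Volume. The masters maintain the cluster and guarantee strong consistency, using the raft protocol among themselves. Volume servers are the instances that actually manage and store data. Data reliability is provided by the replication mechanism of weed-fs.

weed-fs offers several replication strategies (a rack here is a logical concept):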

000 no replication, just one copy
001 replicate once on the same rack
010 replicate once on a different rack in the same data center
100 replicate once on a different data center
200 replicate twice on two other different data center
110 replicate once on a different rack, and once on a different data center

Choosing a more reliable strategy costs some performance; this is always a trade-off.
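Reading the strategy table above, each of the three digits counts extra copies at one level: the first in other data centers, the second on other racks in the same data center, the third on the same rack. A minimal sketch of that reading follows; `parse_replication` and its field names are hypothetical, chosen only for this illustration.

```python
def parse_replication(code):
    """Decode a three-digit replication strategy such as "100" or "110"."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("replication code must be three digits, e.g. '010'")
    other_dc, other_rack, same_rack = (int(c) for c in code)
    return {
        "other_data_centers": other_dc,
        "other_racks": other_rack,
        "same_rack": same_rack,
        # The original copy plus every extra replica.
        "total_copies": 1 + other_dc + other_rack + same_rack,
    }


strategy = parse_replication("100")
```

The total copy count is a quick way to see the cost side of the trade-off: 000 stores each file once, while 200 and 110 both store it three times.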

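More details, together with scaling and data migration, are covered one by one below.

II. Starting a weed-fs cluster

For the experiments we define the following weed-fs cluster topology:

Three masters:
    master1 – localhost:9333
    master2 – localhost:9334
    master3 – localhost:9335

Replication strategy: 100 (one extra copy in a different data center)

Three volume servers:
    volume1 – localhost:8081 dc1
    volume2 – localhost:8082 dc1
    volume3 – localhost:8083 dc2

Start the masters first, in the order master1, master2, master3: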

master1:

$ weed -v=3 master -port=9333 -mdir=./m1 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:17 07606 file_util.go:20] Folder ./m1 Permission: -rwxrwxr-x
I0820 14:37:17 07606 topology.go:86] Using default configurations.
I0820 14:37:17 07606 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:17 07606 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9333
I0820 14:37:17 07606 raft_server.go:50] Starting RaftServer with IP:localhost:9333:
I0820 14:37:17 07606 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9334/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9334/cluster/join: dial tcp 127.0.0.1:9334: connection refused
I0820 14:37:17 07606 raft_server.go:134] Attempting to connect to: http://localhost:9335/cluster/join
I0820 14:37:17 07606 raft_server.go:139] Post returned error: Post http://localhost:9335/cluster/join: dial tcp 127.0.0.1:9335: connection refused
I0820 14:37:17 07606 raft_server.go:78] No existing server found. Starting as leader in the new cluster.
I0820 14:37:17 07606 master_server.go:93] [ localhost:9333 ] I am the leader!

I0820 14:37:52 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[]
I0820 14:37:52 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9334","connectionString":"http://localhost:9334"}
I0820 14:37:52 07606 raft_server_handlers.go:27] join command from Name localhost:9334 Connection http://localhost:9334

I0820 14:38:02 07606 raft_server_handlers.go:16] Processing incoming join. Current Leader localhost:9333 Self localhost:9333 Peers map[localhost:9334:0xc20800f730]
I0820 14:38:02 07606 raft_server_handlers.go:20] Command:{"name":"localhost:9335","connectionString":"http://localhost:9335"}
I0820 14:38:02 07606 raft_server_handlers.go:27] join command from Name localhost:9335 Connection http://localhost:9335

master2:

$ weed -v=3 master -port=9334 -mdir=./m2 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:37:52 07616 file_util.go:20] Folder ./m2 Permission: -rwxrwxr-x
I0820 14:37:52 07616 topology.go:86] Using default configurations.
I0820 14:37:52 07616 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:37:52 07616 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9334
I0820 14:37:52 07616 raft_server.go:50] Starting RaftServer with IP:localhost:9334:
I0820 14:37:52 07616 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:37:52 07616 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:37:52 07616 raft_server.go:179] Post returned status: 200

master3:

$ weed -v=3 master -port=9335 -mdir=./m3 -peers=localhost:9333,localhost:9334,localhost:9335 -defaultReplication=100
I0820 14:38:02 07626 file_util.go:20] Folder ./m3 Permission: -rwxrwxr-x
I0820 14:38:02 07626 topology.go:86] Using default configurations.
I0820 14:38:02 07626 master_server.go:59] Volume Size Limit is 30000 MB
I0820 14:38:02 07626 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9335
I0820 14:38:02 07626 raft_server.go:50] Starting RaftServer with IP:localhost:9335:
I0820 14:38:02 07626 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335
I0820 14:38:02 07626 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0820 14:38:03 07626 raft_server.go:179] Post returned status: 200

When master1 starts it finds that the other two peer masters are not up yet, so it elects itself leader. master2 and master3 then start and join the master cluster led by master1.

Next we start the volume servers:

volume1:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
I0820 14:44:29 07642 file_util.go:20] Folder ./v1 Permission: -rwxrwxr-x
I0820 14:44:29 07642 store.go:225] Store started on dir: ./v1 with 0 volumes max 7
I0820 14:44:29 07642 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8081
I0820 14:44:29 07642 volume_server.go:70] Volume server bootstraps with master localhost:9333
I0820 14:44:29 07642 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335"]}
I0820 14:44:29 07642 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9333 localhost:9333], lastNode:3

The volume servers all start in much the same way, so the logs of volume2 and volume3 are not listed in full.

volume2:

$weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9334 -dataCenter=dc1

volume3:

$weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2

Once the three volume servers are up, the leader master (9333) logs the following:

I0820 14:44:29 07606 node.go:208] topo adds child dc1
I0820 14:44:29 07606 node.go:208] topo:dc1 adds child DefaultRack
I0820 14:44:29 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8081
I0820 14:47:09 07606 node.go:208] topo:dc1:DefaultRack adds child 127.0.0.1:8082
I0820 14:47:21 07606 node.go:208] topo adds child dc2
I0820 14:47:21 07606 node.go:208] topo:dc2 adds child DefaultRack
I0820 14:47:21 07606 node.go:208] topo:dc2:DefaultRack adds child 127.0.0.1:8083

The whole weed-fs cluster is now running. After its first start a master creates some directories and files under -mdir:

$ ls m1
conf log snapshot

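A volume server, by contrast, does nothing under -dir at startup; it creates the .idx and .dat files the first time data is written.

III. Basic operations: storing, fetching and deleting files

Create a file hello.txt containing "hello weed-fs!" to test the basic operations of weed-fs. weed-fs exposes an HTTP REST API, which makes its basic functions easy to use (the client here is curl).

1. Store

We store hello.txt in weed-fs through the submit API provided by the master: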

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,01fc4a422c","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,01fc4a422c","size":39}

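The master returns a line of JSON, in which:

fid is a comma-separated string. According to the documentation in the repository it consists of a volume id, a uint64 key and a cookie code. The 6 before the comma is the volume id, and 01fc4a422c is the key and cookie concatenated. fid is the unique ID of hello.txt within the cluster, and it is needed to view, fetch or delete the file later.

fileUrl is one address (not the only one) at which the file can be accessed in weed-fs. Here it is 127.0.0.1:8082/6,01fc4a422c, which shows that weed-fs stored a copy of hello.txt on volume server 2.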
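The fid layout can be unpacked with a few lines of code. The sketch below assumes the cookie is a 32-bit value written as the last 8 hex digits, with the key taking whatever hex digits come before it. That split is an assumption, but it agrees with the volume server log in the delete-consistency test later in this article, where fid 6,04dce94a72 is logged as Id:4 and Cookie:3706276466. `parse_fid` is a hypothetical helper.

```python
def parse_fid(fid):
    """Split a fid like "6,01fc4a422c" into (volume_id, key, cookie).

    Assumes the trailing 8 hex digits are a 32-bit cookie and the rest is the key.
    """
    volume, rest = fid.split(",", 1)
    if len(rest) <= 8:
        raise ValueError("fid is too short to hold a key and a cookie")
    key_hex, cookie_hex = rest[:-8], rest[-8:]
    return int(volume), int(key_hex, 16), int(cookie_hex, 16)


first = parse_fid("6,01fc4a422c")    # the fid returned by the submit call
second = parse_fid("6,04dce94a72")   # the fid used in the consistency test
```

The volume id tells a client which volume server to ask, while the cookie acts as a random component that makes fids hard to guess.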

The store operation triggered the creation of physical volumes. The -dir directories of the volume servers have changed and now hold many .idx and .dat files:

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx

v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx

This creation is driven by the master leader:

I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc1:DefaultRack:127.0.0.1:8081
I0820 15:06:02 07606 volume_growth.go:204] Created Volume 3 on topo:dc2:DefaultRack:127.0.0.1:8083

File sizes show that hello.txt was stored in the volume with id 6 (6.dat and 6.idx) under v2 and v3:

v2:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 Aug 20 15:06 6.dat
-rw-r--r-- 1 tonybai tonybai 16 Aug 20 15:06 6.idx

The 6.dat files in v2 and v3 are identical, and so are the 6.idx files (this matters a great deal for data migration later).

2. Fetch

As noted above, the master returned the fid 6,01fc4a422c and the fileUrl 127.0.0.1:8082/6,01fc4a422c.

With this fileUrl we can fetch the contents of hello.txt:

$ curl http://127.0.0.1:8082/6,01fc4a422c
hello weed-fs!

Under our replication strategy hello.txt should also be stored under v3. Switching to the volume server on 8083 should return hello.txt as well:

$ curl http://127.0.0.1:8083/6,01fc4a422c
hello weed-fs!

If we ask volume1 (8081), we should get no data:

$ curl http://127.0.0.1:8081/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.

This looks like a redirect. Let's add curl's follow-redirect option and try again:

$ curl -L http://127.0.0.1:8081/6,01fc4a422c
hello weed-fs!

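That returns the data too. The volume1 log shows that volume1 can look up the correct address of hello.txt and answers with a redirect, so curl fetches the data from the right machine.

What happens if we fetch hello.txt through a master?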

$ curl -L http://127.0.0.1:9335/6,01fc4a422c
hello weed-fs!

The master likewise returns a redirect address and curl gets the data from a volume node. Let's see what the master's redirect looks like:

$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8082/6,01fc4a422c">Moved Permanently</a>.
$ curl http://127.0.0.1:9335/6,01fc4a422c
<a href="http://127.0.0.1:8083/6,01fc4a422c">Moved Permanently</a>.

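The master balances load automatically, returning 8082 and 8083 in round-robin fashion. Before 0.70, a non-leader master could not return the correct result and only the leader master could; version 0.70 fixed this.

3. Delete

Delete hello.txt directly through its fileUrl: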

$ curl -X DELETE http://127.0.0.1:8082/6,01fc4a422c
{"size":39}

After the delete succeeds, let's get hello.txt again:

$ curl -i http://127.0.0.1:8082/6,01fc4a422c
HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:28 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

$ curl -i -L http://127.0.0.1:9335/6,01fc4a422c
HTTP/1.1 301 Moved Permanently
Content-Length: 69
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Aug 2015 08:13:56 GMT
Location: http://127.0.0.1:8082/6,01fc4a422c

HTTP/1.1 404 Not Found
Date: Thu, 20 Aug 2015 08:13:56 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

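hello.txt can no longer be fetched either directly from a volume server or indirectly through a master, so it has been deleted.

The data files under the volume servers do not shrink after the delete, though. That is simply how weed-fs works: the holes left by deleted data have to be cleaned up by hand, by compacting (vacuuming) the data files manually: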

$ curl "http://localhost:9335/vol/vacuum"
{"Topology":{"DataCenters":[{"Free":8,"Id":"dc1","Max":14,"Racks":[{"DataNodes":[{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8081","Url":"127.0.0.1:8081","Volumes":3},{"Free":4,"Max":7,"PublicUrl":"127.0.0.1:8082","Url":"127.0.0.1:8082","Volumes":3}],"Free":8,"Id":"DefaultRack","Max":14}]},{"Free":1,"Id":"dc2","Max":7,"Racks":[{"DataNodes":[{"Free":1,"Max":7,"PublicUrl":"127.0.0.1:8083","Url":"127.0.0.1:8083","Volumes":6}],"Free":1,"Id":"DefaultRack","Max":7}]}],"Free":9,"Max":21,"layouts":[{"collection":"","replication":"100","ttl":"","writables":[1,2,3,4,5,6]}]},"Version":"0.70 beta"}

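After compaction, the files under v1, v2 and v3 really are smaller.

IV. Consistency

In distributed systems consistency is a perennial problem. weed-fs supports replication, so consistency across replicas has to be guaranteed.

In theory weed-fs follows a "strong consistency" strategy:

- When storing a file, success is returned only after every replica has been stored; if any replica fails, the store operation fails.
- When deleting a file, success is returned only after every replica has been deleted; if any replica fails, the delete operation fails.

Let's check whether weed-fs delivers on both points:

1. Consistency of stores

First stop volume3 (that is, dc2). With replication strategy 100, storing hello.txt in weed-fs then gives: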

$ curl -F file=@hello.txt http://localhost:9333/submit
{"error":"Cannot grow volume group! Not enough data node found!"}

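Under strategy 100 the master has to pick a volume in dc2 for the replica of hello.txt, but every machine in dc2 is down, so there is nowhere to store it. The master decides the operation cannot proceed and returns failure. This meets the consistency requirement for stores.

2. Consistency of deletes

Bring dc2 back and store hello.txt: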

$ curl -F file=@hello.txt http://localhost:9333/submit
{"fid":"6,04dce94a72","fileName":"hello.txt","fileUrl":"127.0.0.1:8082/6,04dce94a72","size":39}

Stop dc2 again, then try to delete hello.txt (through the master):

$ curl -L -X DELETE http://127.0.0.1:9333/6,04dce94a72
{"error":"Deletion Failed."}

The response says the deletion failed, yet the log on 8082 suggests that 8082 has already deleted hello.txt:

I0820 17:32:20 07653 volume_server_handlers_write.go:53] deleting Cookie:3706276466, Id:4, Size:0, DataSize:0, Name: , Mime:

Let's fetch hello.txt from 8082 again:

$ curl http://127.0.0.1:8082/6,04dce94a72

Nothing is returned.

The 8082 log says:

I0820 17:33:24 07653 volume_server_handlers_read.go:53] read error: File Entry Not Found. Needle 70 Memory 0 /6,04dce94a72

hello.txt really has been deleted!

Now restart dc2 (8083) and try to fetch hello.txt from 8083:

$ curl http://127.0.0.1:8083/6,04dce94a72
hello weed-fs!

hello.txt still exists on 8083 and can be read.

Now try fetching hello.txt through the master:

$ curl -L http://127.0.0.1:9333/6,04dce94a72
$ curl -L http://127.0.0.1:9333/6,04dce94a72
hello weed-fs!

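Sometimes the contents of hello.txt come back and sometimes they don't. This is clearly the master's automatic load balancing at work: when it redirects to 8082, curl gets nothing; when it redirects to 8083, we get the contents of hello.txt.

So for now weed-fs deletes do not guarantee strong consistency. Several issues on the weed-fs GitHub repository (#172, #179, #182) concern this problem. With large data volumes (TB or PB scale), the worst effect of this inconsistency is a storage leak: space stays occupied and cannot be reclaimed, and volumes gradually fill up one after another. Let's hope a fix follows.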
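The behavior observed in this section can be reproduced with a toy model: a store that refuses up front when any replica is unreachable, and a delete that removes the file wherever it can and has no rollback. This is an in-memory illustration of the failure mode, not weed-fs code; the `Replica` class and both functions are invented for the example.

```python
class Replica:
    """An in-memory stand-in for one volume server."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.files = {}


def store(replicas, fid, data):
    """All-or-nothing store: fail before writing if any replica is offline."""
    if not all(r.online for r in replicas):
        return False
    for r in replicas:
        r.files[fid] = data
    return True


def delete(replicas, fid):
    """Best-effort delete without rollback, as observed in the test above."""
    ok = True
    for r in replicas:
        if r.online:
            r.files.pop(fid, None)
        else:
            ok = False
    return ok


v2, v3 = Replica("8082"), Replica("8083")
stored = store([v2, v3], "6,04dce94a72", b"hello weed-fs!")
v3.online = False                              # stop dc2
deleted = delete([v2, v3], "6,04dce94a72")     # reports failure
v3.online = True                               # restart dc2
```

After this sequence the delete has reported failure, 8082 no longer has the file and 8083 still does, which is exactly the divergence seen with curl. Avoiding it would take either a check before any replica is touched, as the store path does, or a way to undo or retry the partial delete.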

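V. Directory support

weed-fs can also manage files in directories like a traditional file system, and store, fetch and delete files by path. Directory support is implemented by a separate server, the filer server. To get directory support you must start one (or several) filer servers, and all operations must go through a filer server.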

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100
I0820 22:09:40 08238 file_util.go:20] Folder ./f1 Permission: -rwxrwxr-x
I0820 22:09:40 08238 filer.go:88] Start Seaweed Filer 0.70 beta at port 8888

1. Store

$curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

2. Fetch

$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!

3. List the files in a directory

$ curl "http://localhost:8888/foo/?pretty=y"
{
"Directory": "/foo/",
"Files": [
    {
      "name": "hello.txt",
      "fid": "6,067281a126"
    }
],
"Subdirectories": null
}

4. Delete

$ curl -X DELETE http://localhost:8888/foo/hello.txt
{"error":""}

Try to fetch hello.txt again:

$curl http://localhost:8888/foo/hello.txt
Nothing is returned; hello.txt has been deleted.

5. Multiple filer servers

A single weed filer server is a single point of failure, so let's start a second one.

$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100
I0821 13:47:52 08973 file_util.go:20] Folder ./f2 Permission: -rwxrwxr-x
I0821 13:47:52 08973 filer.go:88] Start Seaweed Filer 0.70 beta at port 8889

Do the two filer nodes coordinate with each other? Let's test: store a file through 8888, then fetch it from 8889:

$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}
$ curl http://localhost:8888/foo/hello.txt
hello weed-fs!
$ curl http://localhost:8889/foo/hello.txt
(empty)

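The test shows that the two work independently with no connection between them; they do not share the index from a file's full path to its fid. By default every filer server works in standalone mode.

weed-fs officially provides a clustering scheme for filers: use redis or Cassandra as a backend to share the full-path-to-fid index among filer nodes.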
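The difference between the two setups comes down to where the path-to-fid index lives. In the sketch below a plain dictionary stands in for that index: each standalone filer owns a private one, while clustered filers are handed the same one, which plays the role of the shared redis or Cassandra backend. The `Filer` class is a toy model and not the weed-fs implementation.

```python
class Filer:
    """Toy filer: maps a file's full path to its fid."""

    def __init__(self, index=None):
        # A private index models standalone mode; a shared one models a backend store.
        self.index = {} if index is None else index

    def put(self, path, fid):
        self.index[path] = fid

    def get(self, path):
        return self.index.get(path)


# Standalone: each filer has its own index, so 8889 cannot see 8888's file.
f8888, f8889 = Filer(), Filer()
f8888.put("/foo/hello.txt", "6,067281a126")
standalone_lookup = f8889.get("/foo/hello.txt")

# Shared backend: both filers consult the same index.
backend = {}
g8888, g8889 = Filer(backend), Filer(backend)
g8888.put("/foo/hello.txt", "6,067281a126")
shared_lookup = g8889.get("/foo/hello.txt")
```

The file data itself is on the volume servers in both cases; only the lookup from path to fid is missing on the second standalone filer.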

We start a redis-server (2.8.21) listening on the default port 6379, then restart the two filer server nodes with these commands:

$ weed filer -port=8888 -dir=./f1 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379
$ weed filer -port=8889 -dir=./f2 -master=localhost:9333 -defaultReplicaPlacement=100 -redis.server=localhost:6379

Repeat the test steps above:
$ curl -F "filename=@hello.txt" "http://localhost:8888/foo/"
{"name":"hello.txt","size":39}

$ curl http://localhost:8889/foo/hello.txt
hello weed-fs!

The file stored through 8888 can be fetched from 8889.

Now delete the file:
$ curl -X DELETE http://localhost:8889/foo/hello.txt
{"error":"Invalid fileId "}

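An error is reported, but the file has in fact been deleted! This may be a small bug (#183).

The filers are now clustered, but the redis behind them is still a single point. For high reliability redis obviously needs to be clustered too.

VI. Collection

A collection, as the name suggests, is a set; in weed-fs it is a set of physical volumes. We did not specify a collection when storing files earlier, so weed-fs used the default (empty) collection. What happens if we do specify one?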

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=picture"
{"fid":"7,0c4f5dc90f","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/7,0c4f5dc90f","size":39}

$ ls v1 v2 v3
v1:
3.dat 3.idx 4.dat 4.idx 5.dat 5.idx picture_7.dat picture_7.idx
v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx
v3:
1.dat 1.idx 2.dat 2.idx 3.dat 3.idx 4.dat 4.idx 5.dat 5.idx 6.dat 6.idx picture_7.dat picture_7.idx

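Each volume server created .idx and .dat files under its -dir with the collection name as prefix. In this example hello.txt was assigned to the volume servers on 8081 and 8083, so each of them created picture_7.dat and picture_7.idx. Files prefixed with picture hold only data stored with collection=picture; other data goes either into the default collection or into collections with other names.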
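The naming pattern visible in the directory listings can be written down as a tiny helper: volumes in the default collection are named by id alone, and volumes in a named collection get the collection name and an underscore as prefix. This only restates what the listings show; `volume_files` is a hypothetical function, not part of weed-fs.

```python
def volume_files(volume_id, collection=""):
    """Return the (.dat, .idx) file names for a volume, as seen in the listings."""
    base = "{}_{}".format(collection, volume_id) if collection else str(volume_id)
    return base + ".dat", base + ".idx"


default_files = volume_files(6)
picture_files = volume_files(7, "picture")
```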

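A collection is a bit like naming drive volumes on Windows, say calling C: the "system disk", D: the "program disk" and E: the "data disk". Here picture_7.dat and picture_7.idx on the volume servers together form the volume named picture. If there were also a video collection, it might consist of video_8.dat and video_8.idx on the volume servers.

However, weed volume defaults to -max="7", so in this test environment each volume server creates at most 7 physical volumes (seven .idx/.dat pairs) under -dir. What happens if I now try to create a video volume?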

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"error":"Cannot grow volume group! Not enough data node found!"}

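The request fails, reporting that the volume group cannot be grown any further. At this point you need to restart each volume server with a larger -max value, say 100.

For example: $weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9335 -dataCenter=dc2 -max=100

After the restart we create the video collection again: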

$ curl -F file=@hello.txt "http://localhost:9333/submit?collection=video"
{"fid":"11,0ee98ca54d","fileName":"hello.txt","fileUrl":"127.0.0.1:8083/11,0ee98ca54d","size":39}

$ ls v1 v2 v3
v1:
3.dat 4.dat 5.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_9.dat
3.idx 4.idx 5.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_9.idx

v2:
1.dat 1.idx 2.dat 2.idx 6.dat 6.idx video_8.dat video_8.idx

v3:
1.dat 2.dat 3.dat 4.dat 5.dat 6.dat picture_7.dat video_10.dat video_11.dat video_12.dat video_13.dat video_8.dat video_9.dat
1.idx 2.idx 3.idx 4.idx 5.idx 6.idx picture_7.idx video_10.idx video_11.idx video_12.idx video_13.idx video_8.idx video_9.idx

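The volume servers of each data center were allocated 6 volumes at once as storage for the video collection.

VII. Scaling

For a distributed system scaling is unavoidable and also a very common operation.

1. Scaling up

weed-fs handles growth well. Let's go role by role.

[master]
The masters use the raft protocol, so adding a master is one of the most basic cluster operations: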

$weed -v=3 master -port=9336 -mdir=./m4 -peers=localhost:9333,localhost:9334,localhost:9335,localhost:9336 -defaultReplication=100
I0821 15:45:47 12398 file_util.go:20] Folder ./m4 Permission: -rwxrwxr-x
I0821 15:45:47 12398 topology.go:86] Using default configurations.
I0821 15:45:47 12398 master_server.go:59] Volume Size Limit is 30000 MB
I0821 15:45:47 12398 master.go:69] Start Seaweed Master 0.70 beta at 0.0.0.0:9336
I0821 15:45:47 12398 raft_server.go:50] Starting RaftServer with IP:localhost:9336:
I0821 15:45:47 12398 raft_server.go:74] Joining cluster: localhost:9333,localhost:9334,localhost:9335,localhost:9336
I0821 15:45:48 12398 raft_server.go:134] Attempting to connect to: http://localhost:9333/cluster/join
I0821 15:45:49 12398 raft_server.go:179] Post returned status: 200

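Once the new master node starts, it automatically joins, via the raft protocol, the master cluster whose leader is 9333.

[volume]

As with masters, adding a volume server is simple: volume servers are managed by the master and have no connection with one another, so all you need to do is start a new one: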

$ weed -v=3 volume -port=8084 -dir=./v4 -mserver=localhost:9335 -dataCenter=dc2
I0821 15:48:21 12412 file_util.go:20] Folder ./v4 Permission: -rwxrwxr-x
I0821 15:48:21 12412 store.go:225] Store started on dir: ./v4 with 0 volumes max 7
I0821 15:48:21 12412 volume.go:136] Start Seaweed volume server 0.70 beta at 0.0.0.0:8084
I0821 15:48:21 12412 volume_server.go:70] Volume server bootstraps with master localhost:9335
I0821 15:48:22 12412 list_masters.go:18] list masters result :
I0821 15:48:22 12412 list_masters.go:18] list masters result :{"IsLeader":true,"Leader":"localhost:9333","Peers":["localhost:9334","localhost:9335","localhost:9336"]}
I0821 15:48:22 12412 store.go:65] current master nodes is nodes:[localhost:9334 localhost:9335 localhost:9336 localhost:9333 localhost:9333], lastNode:4
I0821 15:48:22 12412 volume_server.go:82] Volume Server Connected with master at localhost:9333

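Once the new volume server node starts, it likewise joins the cluster automatically, and from then on the master will store data on it.

[filer]

As mentioned earlier, filers can be added or removed freely in both standalone and distributed mode, so I won't repeat that here.

2. Scaling down

Scaling down masters is extremely simple: just shut down the node in question. If that master is the leader, the other masters detect that the leader has shut down and automatically elect a new one. During the election, however, the whole cluster briefly stops serving requests until a leader has been chosen.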
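
One thing worth keeping in mind when shutting masters down: raft can only elect a leader while a majority of the configured peers is still alive. The sketch below is plain raft arithmetic, not weed-fs code, and it assumes the peer list (the -peers value) stays unchanged after a node is shut down; whether weed-fs 0.70 shrinks its peer set is something I have not verified.

```python
def quorum(masters: int) -> int:
    """Smallest number of live masters that still forms a raft majority."""
    return masters // 2 + 1


def tolerated_failures(masters: int) -> int:
    """How many masters can be down while a leader can still be elected."""
    return masters - quorum(masters)


for n in (1, 3, 4, 5):
    print(f"{n} masters: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

With the four masters used above (9333 to 9336), a majority is three, so under this assumption only one of them can be down at any time.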

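For a filer in standalone mode, scaling is meaningless. In distributed mode, filer nodes are scaled down the same way as master nodes: just shut them down.

The only troublesome part is the volume nodes. Because the data lives on them, we cannot simply stop a volume server; we have to consider whether, and how, data can be migrated under each replication policy. That is what the next section covers in detail.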

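VIII. Data migration

Let's now look at how to migrate volume data in weed-fs.

1. Data migration under the 000 replication policy

To make testing easier, I simplified the test environment (one master + 3 volume servers):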

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=000

volume:

$ weed -v=3 volume -port=8081 -dir=./v1 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8082 -dir=./v2 -mserver=localhost:9333 -dataCenter=dc1
$ weed -v=3 volume -port=8083 -dir=./v3 -mserver=localhost:9333 -dataCenter=dc1

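As before, v1, v2 and v3 are empty right after startup; volumes are not created until the first piece of data is stored. The 000 policy means no replicas: a file you store exists as only a single copy in weed-fs.

Let's upload a file: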

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"1,01655ab58e","fileName":"hello1.txt","fileUrl":"127.0.0.1:8081/1,01655ab58e","size":40}

$ ll v1 v2 v3

v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:31 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:31 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 7.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 7.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 2.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:31 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:31 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:31 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:31 5.idx

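As you can see, hello1.txt was stored under v1, and the different physical volumes are spread across different nodes (since no replication is needed).

In this case (000), to migrate v1's data to v2 or v3, just stop v1, mv the files under v1 into v2 or v3, and restart volume server 2 or volume server 3.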
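
The mv step can be scripted. Below is a rough Python sketch of it; `move_volume_files` is a hypothetical helper of mine, not a weed-fs tool, and it assumes the volume servers involved have already been stopped by hand. Under 000 each volume id should live on exactly one server, so a name clash in the target directory is treated as an error rather than silently overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def move_volume_files(src_dir, dst_dir):
    """Move every .dat/.idx file from src_dir into dst_dir; never overwrite."""
    src, dst = Path(src_dir), Path(dst_dir)
    files = sorted(p for p in src.iterdir() if p.suffix in (".dat", ".idx"))
    clashes = [p.name for p in files if (dst / p.name).exists()]
    if clashes:
        raise FileExistsError(f"already present in {dst}: {clashes}")
    for p in files:
        shutil.move(str(p), str(dst / p.name))
    return [p.name for p in files]


# Small demo on throwaway directories standing in for ./v1 and ./v2.
with tempfile.TemporaryDirectory() as root:
    v1, v2 = Path(root, "v1"), Path(root, "v2")
    v1.mkdir()
    v2.mkdir()
    for name in ("1.dat", "1.idx", "4.dat", "4.idx"):
        (v1 / name).write_bytes(b"")
    for name in ("2.dat", "2.idx"):
        (v2 / name).write_bytes(b"")
    moved = move_volume_files(v1, v2)
    remaining = sorted(p.name for p in v2.iterdir())

print(moved)
print(remaining)
```

After the move, the target volume server still has to be restarted so that it picks up the new volumes.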

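2. Data migration under the 001 replication policy

Under the 001 replication policy, weed-fs keeps one replica of each file within the same rack. We reuse the environment above, but weed-fs has to be stopped and the files under the directories cleared before restarting; don't forget -defaultReplication=001.

We store three files in a row: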

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"2,01ea84980d","fileName":"hello1.txt","fileUrl":"127.0.0.1:8082/2,01ea84980d","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"1,027883baa8","fileName":"hello2.txt","fileUrl":"127.0.0.1:8083/1,027883baa8","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"6,03220f577e","fileName":"hello3.txt","fileUrl":"127.0.0.1:8081/6,03220f577e","size":40}

The three files were stored in vol2, vol1 and vol6 respectively. Let's look at the files under v1, v2 and v3:

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx

v2:
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai 8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai 0 8 21 21:56 5.idx

v3:
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:00 1.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:00 1.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 21:56 2.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 21:56 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 3.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 4.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 21:56 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 21:56 5.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:02 6.dat
-rw-r--r-- 1 tonybai tonybai 16 8 21 22:02 6.idx
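
The volume ids quoted for the three uploads are simply the number before the comma in each fid. A small Python sketch of how a fid can be taken apart follows; the split of the second part into a hex file key plus a trailing 8-hex-digit cookie follows the fid layout described in the SeaweedFS documentation, and the names `Fid` and `parse_fid` are mine, for illustration only.

```python
from typing import NamedTuple


class Fid(NamedTuple):
    volume_id: int  # which volume (N.dat / N.idx) holds the file
    file_key: int   # key of the file inside that volume
    cookie: int     # random value that makes fids hard to guess


def parse_fid(fid: str) -> Fid:
    """Split a fid such as "2,01ea84980d" into its three parts."""
    volume, _, rest = fid.partition(",")
    if not volume or len(rest) <= 8:
        raise ValueError(f"not a valid fid: {fid!r}")
    # Assumption: the last 8 hex digits are the cookie, the rest is the key.
    return Fid(int(volume), int(rest[:-8], 16), int(rest[-8:], 16))


for fid in ("2,01ea84980d", "1,027883baa8", "6,03220f577e"):
    print(fid, "->", parse_fid(fid))
```

For the three fids returned in this test, the parsed volume ids match the .dat files that grew to 104 bytes in the listing.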

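Suppose we now want to shut down v3 and migrate its data to the other volume servers. We have three options:

1) Don't migrate
2) mv all files under v3 into v2 or v1
3) Copy all files under v3 over v1 and then over v2

Let's analyze the consequences of each option in turn:

1) Don't migrate

Under the 001 policy every piece of data has two copies, so whatever is in v3 also exists somewhere in v1+v2. Even without migrating, v1+v2 still hold one copy of the data. You can test this after shutting down volume3: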

$ curl -L "http://localhost:9333/2,01ea84980d"
hello weed-fs1!
$ curl -L "http://localhost:9333/1,027883baa8"
hello weed-fs2!
$ curl -L "http://localhost:9333/6,03220f577e"
hello weed-fs3!

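You can get each file several times and you will get the correct result every time. The drawback is equally obvious, though: the existing data no longer has a second copy.

2) mv all files under v3 into v2 or v1

Still under the 001 policy, what happens if v3's data is mv'ed into v2 or v1? Take moving v3 to v1 as the example:

- For volume ids that both v1 and v3 have, such as 1, the two servers' 1.idx and 1.dat are identical; that is what the 001 policy dictates. After the migration, though, the system is left with 1 copy of that data instead of 2.
- For volumes that v1 has and v3 does not, nothing needs to be said.
- For volumes that v1 lacks and v3 has, the files become v1's data once moved.

So this approach still falls short.

3) Copy all files under v3 over v1 and v2

Given the above, only this kind of migration guarantees that no data is lost and that every volume still has the 2 copies the 001 policy calls for. This is the correct approach.

Let's test it:

   - Stop volume3;
   - Stop volume1, copy the files under v3 into v1, start volume1;
   - Stop volume2, copy the files under v3 into v2, start volume2.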

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8081/6,03220f577e">Moved Permanently</a>.

$ curl "http://localhost:9333/6,03220f577e"
<a href="http://127.0.0.1:8082/6,03220f577e">Moved Permanently</a>.

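As you can see, the master returned redirect addresses on both 8081 and 8082, which shows that the data migrated from 8083 to 8082 has taken effect as well.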
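
The three options can also be compared on paper by counting copies per volume id. The Python sketch below is a toy model built only from the 001 listing in this post (which volume ids sit on which server); it does not talk to weed-fs, and all function names are my own.

```python
from collections import Counter

# Volume ids per server, read off the 001 listing in this post.
before = {"v1": {1, 3, 4, 6}, "v2": {2, 5}, "v3": {1, 2, 3, 4, 5, 6}}


def copies(layout):
    """Number of servers holding each volume id."""
    counts = Counter()
    for vids in layout.values():
        counts.update(vids)
    return dict(sorted(counts.items()))


def drop_v3(layout):
    """Option 1: shut v3 down without migrating anything."""
    return {server: set(vids) for server, vids in layout.items() if server != "v3"}


def move_v3_to_v1(layout):
    """Option 2: mv everything from v3 into v1."""
    out = drop_v3(layout)
    out["v1"] |= layout["v3"]
    return out


def copy_v3_to_both(layout):
    """Option 3: copy everything from v3 into v1 and into v2."""
    out = drop_v3(layout)
    for server in out:
        out[server] |= layout["v3"]
    return out


for option in (drop_v3, move_v3_to_v1, copy_v3_to_both):
    print(option.__name__, copies(option(before)))
```

In this model only the third option leaves every volume id on two servers; the second one restores two copies for volumes 2 and 5 but leaves 1, 3, 4 and 6 with a single copy.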

3. Data migration under the 100 replication policy

The test environment changes slightly:

master:

$ weed -v=3 master -port=9333 -mdir=./m1 -defaultReplication=100

volume:

As before, we upload three files:

$ curl -F filename=@hello1.txt "http://localhost:9333/submit"
{"fid":"4,01d937dd30","fileName":"hello1.txt","fileUrl":"127.0.0.1:8083/4,01d937dd30","size":40}

$ curl -F filename=@hello2.txt "http://localhost:9333/submit"
{"fid":"2,025efbef14","fileName":"hello2.txt","fileUrl":"127.0.0.1:8082/2,025efbef14","size":40}

$ curl -F filename=@hello3.txt "http://localhost:9333/submit"
{"fid":"2,03be936488","fileName":"hello3.txt","fileUrl":"127.0.0.1:8082/2,03be936488","size":40}

$ ll v1 v2 v3
v1:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx

v2:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

v3:
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 1.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 1.idx
-rw-r--r-- 1 tonybai tonybai 200 8 21 22:59 2.dat
-rw-r--r-- 1 tonybai tonybai   32 8 21 22:59 2.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 3.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 3.idx
-rw-r--r-- 1 tonybai tonybai 104 8 21 22:58 4.dat
-rw-r--r-- 1 tonybai tonybai   16 8 21 22:58 4.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 5.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 5.idx
-rw-r--r-- 1 tonybai tonybai    8 8 21 22:58 6.dat
-rw-r--r-- 1 tonybai tonybai    0 8 21 22:58 6.idx

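Because the 100 policy keeps one copy in each of two different DataCenters, data should not be migrated between data centers, and migration within a single data center comes back to the "000" case.

Other policies can be analyzed the same way, so I won't go on at length here.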
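
When analyzing other policies it helps to spell out what the three digits mean. The Python sketch below decodes a replication code. The readings of 000, 001 and 100 match what was used in this post; the meaning of the middle digit (replicas on other racks in the same data center) comes from the SeaweedFS replication documentation rather than from anything tested here, and `describe_replication` is a name I made up.

```python
import re


def describe_replication(code: str) -> dict:
    """Decode a three-digit replication code such as "001" or "100"."""
    if re.fullmatch(r"[0-9]{3}", code) is None:
        raise ValueError(f"expected three digits, got {code!r}")
    other_dc, other_rack, same_rack = (int(ch) for ch in code)
    return {
        "other_data_centers": other_dc,  # replicas in other data centers
        "other_racks": other_rack,       # replicas on other racks, same data center
        "same_rack": same_rack,          # replicas on other servers, same rack
        "total_copies": 1 + other_dc + other_rack + same_rack,
    }


for code in ("000", "001", "100"):
    print(code, describe_replication(code))
```

The total_copies value tells you how many copies of each volume must survive a migration for the policy to still hold.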

IX. Benchmark

On an HP ProLiant DL380 G4 (Intel(R) Xeon(TM) CPU 3.60GHz, 4 cores, 6 GB RAM, non-SSD disks), I ran the benchmark test:

$ weed benchmark -server=localhost:9333

This is SeaweedFS version 0.70 beta linux amd64

------------ Writing Benchmark ----------
Concurrency Level:      16
Time taken for tests:   831.583 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106794545 bytes
Requests per second:    1260.94 [#/sec]
Transfer rate:          1299.75 [Kbytes/sec]

Connection Times (ms)
min avg max std
Total: 2.2 12.5 1118.4 9.3

Percentage of the requests served within a certain time (ms)
   50%     11.4 ms
   66%     13.3 ms
   75%     14.8 ms
   80%     15.9 ms
   90%     19.2 ms
   95%     22.6 ms
   98%     27.4 ms
   99%     31.2 ms
100%    1118.4 ms

------------ Randomly Reading Benchmark ----------
Concurrency Level:      16
Time taken for tests:   151.480 seconds
Complete requests:      1048576
Failed requests:        0
Total transferred:      1106791113 bytes
Requests per second:    6922.22 [#/sec]
Transfer rate:          7135.28 [Kbytes/sec]

Connection Times (ms)
min avg max std
Total: 0.1 2.2 116.7 3.9

Percentage of the requests served within a certain time (ms)
   50%      1.6 ms
   66%      2.1 ms
   75%      2.5 ms
   80%      2.8 ms
   90%      3.7 ms
   95%      4.8 ms
   98%      7.4 ms
   99%     11.1 ms
100%    116.7 ms

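This seems somewhat worse than what the weed-fs author measured on a Mac laptop (SSD). Admittedly, the policy used here was 100, and this server was also running other programs. Even so, it feels as though weed-fs still has considerable room for optimization.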
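
As a sanity check, the headline numbers can be recomputed from the raw figures in the report. The Python sketch below uses only values printed in the benchmark output; `summarize` is a throwaway helper of mine, and small differences in the last digit are expected because the reported times are rounded.

```python
def summarize(requests: int, seconds: float, transferred_bytes: int) -> dict:
    """Derive throughput figures from the raw counters of one benchmark phase."""
    return {
        "req_per_sec": requests / seconds,
        "kbytes_per_sec": transferred_bytes / seconds / 1024,
        "bytes_per_request": transferred_bytes / requests,
    }


write = summarize(1048576, 831.583, 1106794545)
read = summarize(1048576, 151.480, 1106791113)
speedup = read["req_per_sec"] / write["req_per_sec"]

print("write:", write)
print("read: ", read)
print(f"random reads are about {speedup:.2f}x faster than writes")
```

The recomputed rates agree with the report to within rounding, and each request carries roughly 1 KB, so this run measures small-file performance.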

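On the official site, the author briefly compares weed-fs with other distributed file systems such as Ceph and hdfs, emphasizing weed-fs's advantages over them.

X. Miscellaneous

weed-fs uses google glog, so log levels and log redirection are all configured exactly as in glog.

weed-fs provides a backup command for backing up a volume server's data on the same machine.

weed-fs does not provide an official client package, but the wiki lists a number of third-party client packages (in various languages). Judging by the Go client packages, none seems particularly ideal yet.

weed-fs does not yet have a web console; it can only be operated from the command line.

When using weed-fs, don't forget to raise the open files limit; otherwise the volume server may crash.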
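
To see what the open files limit currently is, you can run ulimit -n in the shell, or check it from a script. Here is a small Python sketch using the standard resource module (Unix only). The threshold of 10240 is an arbitrary number I picked for illustration, not a value recommended by weed-fs.

```python
import resource  # standard library, available on Unix-like systems only


def open_files_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(minimum: int) -> bool:
    """True if the soft limit is unlimited or at least `minimum`."""
    soft, _hard = open_files_limit()
    return soft == resource.RLIM_INFINITY or soft >= minimum


soft, hard = open_files_limit()
print(f"soft={soft} hard={hard}")
if not has_headroom(10240):  # 10240 is a hypothetical threshold
    print("open files limit looks low; consider raising it before starting weed volume")
```

The limit is inherited by child processes, so it has to be raised in the shell or service definition that starts the volume server.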

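XI. Summary

weed-fs offers a new option for anyone looking for an open-source distributed file system. It is a particularly good fit for storing large numbers of small images, since weed-fs itself is based on the haystack paper, which is about optimizing photo storage. weed-fs is also genuinely simple to use: you can set up a distributed system in minutes, deployment is easy, and almost no configuration is needed. Its biggest problem at present seems to be the lack of heavyweight use cases, and it still has quite a few shortcomings of its own. Still, I hope this article helps more people get to know weed-fs, use it, and help improve it.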