Gotools | Tony Bai

Posts tagged gotools

Data Race Detection and Data Race Patterns in Go

June 21, 2022
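Permalink: https://tonybai.com/2022/06/21/data-race-detection-and-pattern-in-go

Uber, the ride-hailing company that pulled out of the Chinese market early on, was an early adopter of Go and is a heavy user of the Go stack. Uber's internal Go repository holds more than 50 million lines of Go code and 2,100 independent services written in Go, a scale that probably puts it in the top three Go deployments worldwide.

Uber does not just use Go; it also regularly publishes its lessons learned. The Uber Engineering blog carries these high-quality Go articles, which any gopher who wants to go deeper should read more than once.

The blog recently published two articles on data races in concurrent Go code: "Dynamic Data Race Detection in Go Code" and "Data Race Patterns in Go". Both derive from the preprint "A Study of Real-World Data Races in Golang" that Uber engineers posted on arXiv.

A personal aside: I admire engineers who are as comfortable writing papers as they are shipping production code. That is the goal I set for my own team.

This post walks through the two condensed blog articles; I hope we all get something out of them.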

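I. Go's built-in data race detector

Concurrent programs are hard to write and harder to debug. Concurrency is a breeding ground for bugs. Even though Go builds concurrency into the language and provides CSP-based primitives (goroutines, channels, and select), practice shows that real-world Go programs have no fewer concurrency problems for it. "No silver bullet" holds once again.

The Go core team recognized this early and added a race detector to the Go toolchain in Go 1.1. When you pass -race to a go command, the detector finds places where concurrent accesses to the same variable, at least one of which is a write, can lead to concurrency bugs. The Go standard library benefited as well: the race detector helped find 42 data races in it.

The race detector is built on ThreadSanitizer (TSan), a tool developed by a team at Google (which has a whole family of sanitizers: AddressSanitizer, LeakSanitizer, MemorySanitizer, and so on). The first version of TSan was released in 2009, and its detection algorithm derived from the veteran tool Valgrind. TSan helped the Chromium team find nearly 200 potential concurrency problems, but the first version had one big drawback: it was slow.

On the strength of those results the team decided to rewrite TSan, which produced v2. Compared with v1, v2 brings several major changes:

Code is instrumented at compile time;
The runtime library was reimplemented and built into the compilers (LLVM and GCC);
Besides data races, it can also detect deadlocks, incorrect lock releases, and similar problems;
v2 is about 20 times faster than v1;
Go is supported.

So how does TSan v2 actually work? Read on.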

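II. How ThreadSanitizer v2 works

According to the description of the v2 algorithm on the ThreadSanitizer wiki, ThreadSanitizer consists of two parts: instrumentation and a runtime library.

1. Instrumentation

The first part works with the compiler to inject code into the program at compile time. Where is code injected, and what code? TSan tracks every memory access in the program, so it injects code at every memory access, with the following exceptions:

Memory accesses that provably cannot race

For example: reads of global constants, and accesses inside a function to memory that has been proven not to escape to the heap;

Redundant accesses: a read that happens before a write to the same memory location
… …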

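What code is injected? Take a function foo that writes to memory through a pointer p. A call to __tsan_write4 is injected before the write to address p, and __tsan_func_entry and __tsan_func_exit are injected at the entry and exit of foo. A memory read that needs instrumentation gets __tsan_read4 instead, and atomic memory operations are instrumented with __tsan_atomic…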
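The original listing is not reproduced in this text, so here is a minimal sketch of the idea in Go. The hooks tsanFuncEntry, tsanFuncExit, and tsanWrite4 are hypothetical stand-ins written for this illustration; they only borrow the names of the real runtime entry points and merely log the order in which they are called.

```go
package main

import "fmt"

// trace records the order of the hypothetical hooks. A real TSan runtime
// would update shadow memory here instead of appending to a slice.
var trace []string

// The three hooks below are stand-ins invented for this sketch; they are
// not the real __tsan_* runtime functions.
func tsanFuncEntry(name string) { trace = append(trace, "func_entry "+name) }
func tsanFuncExit()             { trace = append(trace, "func_exit") }
func tsanWrite4(p *int32)       { trace = append(trace, "write4") }

// foo is what the programmer wrote: a single 4-byte store.
func foo(p *int32) {
	*p = 42
}

// fooInstrumented is roughly what the compiler would emit for foo.
func fooInstrumented(p *int32) {
	tsanFuncEntry("foo")
	defer tsanFuncExit()
	tsanWrite4(p) // report the 4-byte write before performing it
	*p = 42
}

func main() {
	var x int32
	fooInstrumented(&x)
	fmt.Println(x, trace)
}
```

The point is only the placement of the hooks: one call per memory access, bracketed by entry and exit notifications that let the runtime reconstruct stack traces.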

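2. The TSan runtime library

Once instrumentation is done at compile time and a Go program has been built with TSan, the TSan runtime library does the actual race detection while the program runs. How does it detect a data race?

TSan's detection relies on a concept called the Shadow Cell. A Shadow Cell is an 8-byte memory unit that represents one read or write event on some memory address; that is, every read or write of a block of memory produces a Shadow Cell. As the record of a memory access event, a Shadow Cell stores the information related to that event.

Each Shadow Cell records the thread ID, the clock time, the position (offset) and length of the access, and whether the access was a write. For every 8 bytes of application memory, TSan keeps a group of N Shadow Cells.

N can be 2, 4, or 8. The choice of N directly affects both TSan's overhead and the "precision" of race detection.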

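3. The detection algorithm

With instrumentation in place and Shadow Cells recording memory access events, what logic does TSan use to detect a data race? Let's follow an example that Google's Dmitry Vyukov gave in a talk.

Take N=8 (eight Shadow Cells track and check one 8-byte block of application memory). Initially, assume there have been no reads or writes of that 8-byte block.

Now thread T1 writes the first two bytes of the block. The write produces the first Shadow Cell.

A word on the Pos field of the Shadow Cell. Pos describes the starting offset and length of the access within the 8-byte unit; here 0:2 means the access starts at the first byte and is 2 bytes long. At this point the Shadow Cell window holds a single cell, so no race is possible.

Next, thread T2 reads the last four bytes of the block. The read produces the second Shadow Cell.

The bytes touched by this read do not overlap those of the first Shadow Cell, so no data race is possible.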

Then thread T3 reads the first four bytes of the block. The read produces the third Shadow Cell.

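T1 and T3 access overlapping regions of the block, and T1's access is a write, so a data race is possible. TSan's race detection algorithm is essentially a state machine that runs on every memory access. Its logic is simple: walk all cells in the Shadow Cell window for that memory, compare the newest cell against each existing one, and issue a warning if a race exists.

In this example T1's write and T3's read overlap. If the clock E1 of Shadow Cell 1 does not happen-before the clock E3 of Shadow Cell 3, there is a data race. How happens-before is decided can be seen in the TSan implementation: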

https://code.woboq.org/gcc/libsanitizer/tsan/tsan_rtl.cc.html

static inline bool HappensBefore(Shadow old, ThreadState *thr) {
    return thr->clock.get(old.TidWithIgnore()) >= old.epoch();
}

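In this example the group of Shadow Cells for one 8-byte block has N=8 entries, but memory accesses are high-frequency events, so the window fills up quickly. Where does a new Shadow Cell go then? TSan evicts a random old Shadow Cell and writes the new one in its place. This is why, as noted earlier, the choice of N affects TSan's detection precision to some degree.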
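The walk-through above can be condensed into a small Go model. This is a simplified sketch, not TSan's implementation: the cell and thread types, the map-based vector clock, and the access function are assumptions made for illustration, and eviction from a full window is left out.

```go
package main

import "fmt"

// cell is a simplified Shadow Cell: who accessed which bytes, when, and how.
type cell struct {
	tid     int
	epoch   uint64
	off     int // starting offset inside the 8-byte block
	size    int // number of bytes accessed
	isWrite bool
}

// thread carries a vector clock: for each thread ID, the latest epoch of
// that thread which is known to happen-before the current point.
type thread struct {
	tid   int
	clock map[int]uint64
}

func overlaps(a, b cell) bool {
	return a.off < b.off+b.size && b.off < a.off+a.size
}

// happensBefore mirrors the one-line check quoted above.
func happensBefore(old cell, thr *thread) bool {
	return thr.clock[old.tid] >= old.epoch
}

// access compares cur with every cell in the window, then records it.
// It reports whether cur races with an earlier access.
func access(window []cell, thr *thread, cur cell) ([]cell, bool) {
	race := false
	for _, old := range window {
		if old.tid == cur.tid || !overlaps(old, cur) {
			continue // same thread or disjoint bytes
		}
		if !old.isWrite && !cur.isWrite {
			continue // two reads never race
		}
		if !happensBefore(old, thr) {
			race = true
		}
	}
	return append(window, cur), race
}

func main() {
	var window []cell
	var race bool

	t1 := &thread{tid: 1, clock: map[int]uint64{1: 1}}
	window, race = access(window, t1, cell{tid: 1, epoch: 1, off: 0, size: 2, isWrite: true})
	fmt.Println("T1 write 0:2, race:", race)

	t2 := &thread{tid: 2, clock: map[int]uint64{2: 1}}
	window, race = access(window, t2, cell{tid: 2, epoch: 1, off: 4, size: 4})
	fmt.Println("T2 read 4:4, race:", race)

	// T3 has never synchronized with T1, so its clock knows nothing of T1.
	t3 := &thread{tid: 3, clock: map[int]uint64{3: 1}}
	_, race = access(window, t3, cell{tid: 3, epoch: 1, off: 0, size: 4})
	fmt.Println("T3 read 0:4, race:", race)
}
```

If T3 had synchronized with T1 first (so that its clock held epoch 1 for T1), the same access would be ordered after the write and would not be reported.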

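With a basic understanding of how TSan v2 detects races, let's return to Uber's articles and see when Uber deploys race detection.

III. When to deploy a dynamic Go data race detector

As the brief description of TSan above suggests, race detection with -race has a considerable impact on runtime performance and overhead. The official Go document "Data Race Detector" states that, compared with a normal build, a program built with -race uses 5-10x the memory and takes 2-20x as long to run. Yet the race detector can only detect races while the program is running. Gophers are therefore cautious about using -race, especially in production. In their 2013 article "Introducing the Go Race Detector", Dmitry Vyukov and Andrew Gerrand say plainly that keeping the race detector on all the time in production is impractical. They recommend two ways to use it: enable it when running tests, especially integration and load tests; or enable it in production, but only on one of many service instances, and how much traffic you send to that instance is up to you.

What does Uber do internally? As mentioned, Uber has a monorepo with more than 50 million lines of code and more than 100,000 unit tests. Uber ran into two problems in deciding when to run the race detector:

Because -race detection is non-deterministic, running the race detector on every PR works poorly.

For example, a PR may contain a data race that the detector misses when it runs; a later PR with no data race of its own may then be flagged because of the race in the earlier PR. That can block the later PR from merging and hurt the productivity of the developers involved.

At the same time, finding every data race in the existing 50 million-plus lines of code is itself impossible.

The race detector's overhead affects SLAs (as I understand it, Uber's internal CI pipeline has time-based SLAs, a commitment to developers, and running race detection on every PR might not finish in time) and raises hardware costs.

Uber's answer to both problems is "post-facto detection": at regular intervals, take a snapshot of the repository and run all unit tests with -race enabled. Nothing new here, it seems; many companies probably do the same.

When a data race is found, a report goes to the responsible developer. Uber's engineers put some work into this: they use the race report to identify the author most likely to have introduced the bug and send the report to that person.

One number is worth noting: without race detection, the p95 time to run all unit tests at Uber is 25 minutes; with race detection enabled, it roughly quadruples to about 100 minutes.

Uber's engineers ran this experiment in mid-2021. In the process they identified the main code patterns that produce data races, and they may later build static analysis tools around these patterns to help developers catch data races earlier and more effectively. Let's look at these patterns next.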

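IV. Common data race patterns

Uber's engineers summarize seven categories of data race patterns. Let's go through them one by one.

1. Closures are to blame

Go supports closures natively. In Go, a closure is a function literal. A closure can refer to variables defined in its surrounding function. Those variables are then shared between the surrounding function and the function literal, and they survive as long as they are accessible.

What you may not have noticed is that a Go closure always captures variables of its surrounding function by reference. Unlike languages such as C++, Go gives you no choice between capture by value and capture by reference. Capture by reference means that once the closure runs in a new goroutine, the two goroutines' accesses to the captured variable are quite likely to race. And, as it happens, closures are commonly used as the function a goroutine executes.

Uber's article gives three examples of races caused by this indiscriminate capture by reference: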

Example 1

In the first example, each loop iteration starts a new goroutine from a closure. All of these goroutines capture the outer loop variable job, so multiple goroutines contend for job.
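Uber's listing is not reproduced here. The following is a minimal sketch of the pattern and one common fix; the function names and the int64 job type are invented for illustration. Note that since Go 1.22 each loop iteration gets its own copy of the loop variable, so processRacy only exhibits the race under the older semantics that were in effect when this article was written.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processRacy shows the pattern: before Go 1.22 every goroutine reads the
// single shared variable job while the loop keeps overwriting it.
func processRacy(jobs []int64, handle func(int64)) {
	var wg sync.WaitGroup
	for _, job := range jobs {
		wg.Add(1)
		go func() {
			defer wg.Done()
			handle(job) // captured by reference
		}()
	}
	wg.Wait()
}

// processFixed passes job as an argument, so each goroutine works on its
// own copy regardless of the Go version.
func processFixed(jobs []int64, handle func(int64)) {
	var wg sync.WaitGroup
	for _, job := range jobs {
		wg.Add(1)
		go func(j int64) {
			defer wg.Done()
			handle(j)
		}(job)
	}
	wg.Wait()
}

// sumJobs adds up all jobs through processFixed.
func sumJobs(jobs []int64) int64 {
	var sum int64
	processFixed(jobs, func(j int64) { atomic.AddInt64(&sum, j) })
	return sum
}

func main() {
	fmt.Println(sumJobs([]int64{1, 2, 3}))
}
```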

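Example 2

In example 2, the combination of the closure and variable scoping means that the err variable used in the new goroutine is the very err returned by the outer call to Foo. err thus becomes the point of contention between the two goroutines.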
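A sketch of how this can happen, under the assumption that the goroutine assigns to err with = rather than :=. The helpers foo, bar, and baz are hypothetical stand-ins for the calls in Uber's example.

```go
package main

import (
	"errors"
	"fmt"
)

// foo, bar and baz are hypothetical stand-ins for real work.
func foo() (int, error) { return 1, nil }
func bar() (int, error) { return 2, errors.New("bar failed") }
func baz() (int, error) { return 3, nil }

// racy shows the pattern: the goroutine assigns to the outer err with "=",
// while the calling goroutine assigns to the same variable.
func racy() error {
	x, err := foo()
	if err != nil {
		return err
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		var y int
		y, err = bar() // writes the OUTER err
		_ = x + y
	}()
	_, err = baz() // concurrent write to the same err
	<-done
	return err
}

// fixed gives the goroutine its own err via ":=" and hands the result
// back over a channel instead of through the shared variable.
func fixed() (bazErr, barErr error) {
	x, err := foo()
	if err != nil {
		return err, nil
	}
	errc := make(chan error, 1)
	go func() {
		y, err := bar() // goroutine-local err
		_ = x + y
		errc <- err
	}()
	_, err = baz()
	return err, <-errc
}

func main() {
	bazErr, barErr := fixed()
	fmt.Println(bazErr, barErr)
}
```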

Example 3

In example 3, the named return value result is captured by the closure that the new goroutine runs, so the two goroutines race on result.
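A minimal sketch of the named-return case, with invented function names. The key detail is that "return 20" in a function with a named result is an assignment to result, so it is a write that can race with the goroutine's read.

```go
package main

import "fmt"

// racy shows the pattern: the goroutine reads the named result while
// "return 20" writes to it.
func racy(observe func(int)) (result int) {
	result = 10
	go func() {
		observe(result) // read of the named result
	}()
	return 20 // equivalent to result = 20, a concurrent write
}

// fixed hands the goroutine a private snapshot instead.
func fixed(observe func(int)) (result int) {
	result = 10
	snapshot := result
	go func() {
		observe(snapshot) // no shared variable involved
	}()
	return 20
}

// run calls fixed and reports both the returned and the observed value.
func run() (returned, observed int) {
	seen := make(chan int, 1)
	returned = fixed(func(v int) { seen <- v })
	return returned, <-seen
}

func main() {
	fmt.Println(run())
}
```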

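2. Slices are to blame

The slice is a built-in composite type in Go. Unlike a traditional array, a slice can grow dynamically, and passing one copies only the "slice descriptor", a small fixed cost, which is why slices are used so widely in Go. But with that flexibility the slice is also one of the types with the most pitfalls in Go; use it carefully, because mistakes come easily.

In Uber's example of a data race on a slice variable, the developer synchronizes access to the captured slice variable myResults with a mutex, but later, when creating the new goroutine, passes the slice as an argument without holding the mutex. The example code looks slightly off, though: the myResults value passed in does not appear to be used for anything.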
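Here is a sketch of that situation, reconstructed from the description above rather than copied from Uber's post; lookup and the function names are hypothetical. Arguments of a go statement are evaluated in the calling goroutine, so reading the slice header there is concurrent with the appends performed by goroutines started earlier.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// lookup is a hypothetical stand-in for per-item work.
func lookup(id string) string { return strings.ToUpper(id) }

// processAllRacy shows the pattern: appends are guarded by mu, but the
// slice is also read, unguarded, when it is passed as an argument.
func processAllRacy(ids []string) []string {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var results []string
	safeAppend := func(res string) {
		mu.Lock()
		results = append(results, res)
		mu.Unlock()
	}
	for _, id := range ids {
		wg.Add(1)
		go func(id string, _ []string) {
			defer wg.Done()
			safeAppend(lookup(id))
		}(id, results) // unguarded read of the slice header
	}
	wg.Wait()
	return results
}

// processAll drops the unused argument; every access to results is now
// either under mu or after wg.Wait.
func processAll(ids []string) []string {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var results []string
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			res := lookup(id)
			mu.Lock()
			results = append(results, res)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	sort.Strings(results) // goroutine order is arbitrary
	return results
}

func main() {
	fmt.Println(processAll([]string{"b", "a", "c"}))
}
```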

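3. Maps are to blame

The map is another of Go's most-used built-in composite types; for Go beginners, the problems caused by maps are probably second only to those caused by slices. Go maps are not goroutine-safe, and Go forbids concurrent reads and writes of a map variable. But as the built-in hash table type, the map is used extremely widely in Go programs.

Uber's example is a concurrent read and write of a map. Unlike with slices, though, Go's map implementation has built-in detection of concurrent access: even without -race, the runtime aborts the program with an unrecoverable fatal error as soon as it notices one.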
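A minimal sketch of the usual remedy, guarding the map with a sync.Mutex. The collect and failB names are invented for illustration; removing the Lock/Unlock pair turns this into the racy pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// collect runs work for every id concurrently and records each outcome
// in a map. The mutex serializes the map writes.
func collect(ids []string, work func(string) error) map[string]error {
	var mu sync.Mutex
	var wg sync.WaitGroup
	errs := make(map[string]error)
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			err := work(id)
			mu.Lock()
			errs[id] = err // without mu this is a concurrent map write
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return errs
}

// failB is a hypothetical job that fails only for id "b".
func failB(id string) error {
	if id == "b" {
		return fmt.Errorf("job %s failed", id)
	}
	return nil
}

func main() {
	errs := collect([]string{"a", "b", "c"}, failB)
	fmt.Println(len(errs), errs["b"])
}
```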

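4. Trouble caused by passing by value by mistake

Go encourages pass-by-value semantics because it simplifies escape analysis and gives variables a better chance of being allocated on the stack, which reduces GC pressure. But some types must not be passed by value, such as the sync.Mutex in Uber's example.

sync.Mutex is usable at its zero value; a Mutex instance needs no initialization before use. But a Mutex carries internal state.

Passing it by value copies that state, so the copy no longer synchronizes data access across goroutines, as happens to the Mutex variable m in the example.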
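A minimal sketch of the safe form, with an invented counter type: the mutex travels by pointer, so every goroutine locks the same instance. The racy form would take the mutex (or the struct containing it) by value; go vet's copylocks check flags such signatures.

```go
package main

import (
	"fmt"
	"sync"
)

// counter keeps the mutex next to the data it protects.
type counter struct {
	mu sync.Mutex
	n  int
}

// incr takes *counter so all callers share one mutex. Declaring the
// parameter as a plain counter (or passing a sync.Mutex by value) would
// give each call its own copy of the lock and protect nothing.
func incr(c *counter, wg *sync.WaitGroup) {
	defer wg.Done()
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// countTo increments a shared counter from n goroutines.
func countTo(n int) int {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go incr(&c, &wg)
	}
	wg.Wait()
	return c.n
}

func main() {
	fmt.Println(countTo(100))
}
```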

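5. Misusing message passing (channels) together with shared memory

Go adopts the CSP concurrency model, with channels as the communication mechanism between goroutines. CSP is a higher-level model than shared memory, but in practice channels are also very easy to misuse when CSP is not well understood.

The problem in Uber's example is that the goroutine started by Start may block on the send to f.ch. Once ctx is cancelled, Wait returns, and no goroutine is left to receive from f.ch, so the new goroutine started by Start may block forever on the line "f.ch <- 1".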
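One way to avoid the blocked sender, sketched under my own assumptions rather than taken from Uber's post: give the channel a buffer of one and send the result through it, so the worker can always complete its send even if nobody waits any more. The future, result, start, and wait names are invented for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errCancelled = errors.New("future: cancelled")

type result struct {
	val int
	err error
}

type future struct {
	ch chan result
}

// start runs f in a new goroutine. The channel has capacity 1, so the
// single send below never blocks, whether or not anyone calls wait.
func start(f func() (int, error)) *future {
	fut := &future{ch: make(chan result, 1)}
	go func() {
		v, err := f()
		fut.ch <- result{v, err}
	}()
	return fut
}

// wait returns the result, or errCancelled if ctx ends first. The result
// travels through the channel, so no field is shared with the worker.
func (f *future) wait(ctx context.Context) (int, error) {
	select {
	case r := <-f.ch:
		return r.val, r.err
	case <-ctx.Done():
		return 0, errCancelled
	}
}

// completedWait waits for a future that finishes normally.
func completedWait() (int, error) {
	return start(func() (int, error) { return 7, nil }).wait(context.Background())
}

// cancelledWait cancels before the worker is allowed to finish.
func cancelledWait() error {
	release := make(chan struct{})
	fut := start(func() (int, error) { <-release; return 0, nil })
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := fut.wait(ctx)
	close(release) // the worker can still finish: its send fits the buffer
	return err
}

func main() {
	fmt.Println(completedWait())
	fmt.Println(cancelledWait())
}
```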

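Problems like this are subtle and hard to spot by eye without careful analysis.

6. Data races from misusing sync.WaitGroup

sync.WaitGroup is the mechanism Go programs commonly use to wait for a group of goroutines to finish. Add and Done adjust its internal counter, and Wait blocks until the counter reaches 0. Misusing WaitGroup as in Uber's example, however, leads to a data race.

The example calls wg.Add(1) inside the function the goroutine runs instead of, as is correct, before the goroutine is created. This creates a race on the WaitGroup's internal counter: depending on goroutine scheduling, Add(1) may not have been called yet when Wait runs, so Wait returns early.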
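A minimal sketch of the correct ordering, with invented helper names. Moving wg.Add(1) to the first line inside the goroutine would reproduce the misuse described above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runAll runs every task in its own goroutine and waits for all of them.
func runAll(tasks []func()) {
	var wg sync.WaitGroup
	for _, task := range tasks {
		wg.Add(1) // counted before the goroutine exists, so Wait cannot miss it
		go func(t func()) {
			defer wg.Done()
			t()
		}(task)
	}
	wg.Wait()
}

// countDone runs n trivial tasks and reports how many completed before
// runAll returned.
func countDone(n int) int64 {
	var done int64
	tasks := make([]func(), n)
	for i := range tasks {
		tasks[i] = func() { atomic.AddInt64(&done, 1) }
	}
	runAll(tasks)
	return done
}

func main() {
	fmt.Println(countDone(50))
}
```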

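In a second example, the order in which deferred functions run at function return causes two goroutines to race on the variable locationErr.

When the main goroutine checks whether locationErr is nil, doCleanup in the other goroutine may or may not have run yet.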
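Deferred calls run in last-in, first-out order, so the fix is to defer wg.Done() first, which makes it run last. The sketch below assumes a cleanup function that rewrites the error through a pointer; locate, find, and cleanup are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// locate runs find in a goroutine and lets cleanup post-process the error.
func locate(find func() error, cleanup func(*error)) error {
	var wg sync.WaitGroup
	var locationErr error
	wg.Add(1)
	go func() {
		// Deferred first, so it runs LAST: Wait is released only after
		// cleanup has finished writing locationErr. Swapping the two
		// defers reproduces the race.
		defer wg.Done()
		defer cleanup(&locationErr)
		locationErr = find()
	}()
	wg.Wait()
	return locationErr
}

// locateDemo wraps a failing lookup and returns the final error text.
func locateDemo() string {
	err := locate(
		func() error { return errors.New("not found") },
		func(e *error) {
			if *e != nil {
				*e = fmt.Errorf("locate: %w", *e)
			}
		},
	)
	return err.Error()
}

func main() {
	fmt.Println(locateDemo())
}
```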

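7. Parallel table-driven tests can cause data races

Go has a built-in unit testing framework that supports parallel tests (testing.T.Parallel()). Parallel tests, however, make data races very easy to introduce. The original article gives no example, so this one is left for you to think through.

V. Summary

Before Uber published these two articles, other material had already classified the code patterns behind data races. The following two resources are worth reading alongside them.

"Data Race Detector": https://go.dev/doc/articles/race_detector
The patterns in "ThreadSanitizer Popular Data Races": https://github.com/google/sanitizers/wiki/ThreadSanitizerPopularDataRaces

The release notes for the just-released Go 1.19beta1 mention that -race has been upgraded to TSan v3: race detection is 1.5x-2x faster than in the previous version, memory overhead is halved, and there is no longer an upper limit on the number of goroutines.

Note: to use -race, Go must have cgo enabled.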

// runtime/race.go

//go:nosplit
func raceinit() (gctx, pctx uintptr) {
    // cgo is required to initialize libc, which is used by race runtime
    if !iscgo {
        throw("raceinit: race build must use cgo")
    }
    ... ...
}

VI. References

“Finding races and memory errors with compiler instrumentation” – http://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=kcc.pdf
《Race detection and more with ThreadSanitizer 2》 – https://lwn.net/Articles/598486/
《Google ThreadSanitizer — 排查多线程问题data race的大杀器》- https://zhuanlan.zhihu.com/p/139000777
《Introducing the Go Race Detector》- https://go.dev/blog/race-detector
ThreadSanitizer Algorithm V2 – https://github.com/google/sanitizers/wiki/ThreadSanitizerAlgorithm
paper: FastTrack: Efficient and Precise Dynamic Race Detection – https://users.soe.ucsc.edu/~cormac/papers/pldi09.pdf
paper: Eraser: A Dynamic Data Race Detector for Multithreaded Programs – https://homes.cs.washington.edu/~tom/pubs/eraser.pdf

Debugging, Profiling, and Optimizing Go Programs

August 25, 2015
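Brad Fitzpatrick gave a talk at YAPC Asia 2015 (Yet Another Perl Conference) titled "Go Debugging, Profiling, and Optimization". To my mind the most valuable part is the demo Brad ran live, showing how to debug, profile, and optimize a Go program. Brad uploaded the demo to his personal repo on github.com, but for some reason the code in the repo does not quite match the description in the repo's talk.md (by the way, I have not watched the video). So here I retrace the demo following Brad's approach (all the demo code is available for download).

I. Test environment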

$uname -a
Linux pc-tony 3.13.0-61-generic #100~precise1-Ubuntu SMP Wed Jul 29 12:06:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Note: on Darwin or Windows the profile results may differ greatly from those shown here (possibly with entirely different output and hotspots).

$go version
go version go1.5 linux/amd64

$ go env
GOARCH="amd64"
GOBIN="/home1/tonybai/.bin/go15/bin"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home1/tonybai/proj/GoProjects"
GORACE=""
GOROOT="/home1/tonybai/.bin/go15"
GOTOOLDIR="/home1/tonybai/.bin/go15/pkg/tool/linux_amd64"
GO15VENDOREXPERIMENT="1"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

The code is based on Brad's github.com/bradfitz/talk-yapc-asia-2015.

II. The program to optimize (step0)

The program to optimize, that is, the original program, lives in step0:

//go-debug-profile-optimization/step0/demo.go

package main

import (
    "fmt"
    "log"
    "net/http"
    "regexp"
)

var visitors int

func handleHi(w http.ResponseWriter, r *http.Request) {
    if match, _ := regexp.MatchString(`^\w*$`, r.FormValue("color")); !match {
        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
        return
    }
    visitors++
    w.Header().Set("Content-Type", "text/html; charset=utf-8")
    w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
        "'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitors) + "!"))
}

func main() {
    log.Printf("Starting on port 8080")
    http.HandleFunc("/hi", handleHi)
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}

$go run demo.go
2015/08/25 09:42:35 Starting on port 8080

In a browser, open: http://localhost:8080/hi

If all goes well, the page shows:

Welcome!

You are visitor number 1!

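III. Adding test code

talk.md describes tests, but the demo in Brad's repo contains no test code at all (commit 2427d0faa12ed1fb05f1e6a1e69307c11259c2b2).

So, following the author's intent, I added demo_test.go, which tests handleHi with TestHandleHi_Recorder and TestHandleHi_TestServer: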

//go-debug-profile-optimization/step0/demo_test.go
package main

import (
    "bufio"
    "io/ioutil" // needed by ioutil.ReadAll in TestHandleHi_TestServer
    "net/http"
    "net/http/httptest"
    "strings"
    "testing"
)

func TestHandleHi_Recorder(t *testing.T) {
    rw := httptest.NewRecorder()
    handleHi(rw, req(t, "GET / HTTP/1.0\r\n\r\n"))
    if !strings.Contains(rw.Body.String(), "visitor number") {
        t.Errorf("Unexpected output: %s", rw.Body)
    }
}

func req(t *testing.T, v string) *http.Request {
    req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(v)))
    if err != nil {
        t.Fatal(err)
    }
    return req
}

func TestHandleHi_TestServer(t *testing.T) {
    ts := httptest.NewServer(http.HandlerFunc(handleHi))
    defer ts.Close()
    res, err := http.Get(ts.URL)
    if err != nil {
        t.Error(err)
        return
    }
    if g, w := res.Header.Get("Content-Type"), "text/html; charset=utf-8"; g != w {
        t.Errorf("Content-Type = %q; want %q", g, w)
    }
    defer res.Body.Close() // close the body even if reading fails
    slurp, err := ioutil.ReadAll(res.Body)
    if err != nil {
        t.Error(err)
        return
    }
    t.Logf("Got: %s", slurp)
}

$ go test -v
=== RUN   TestHandleHi_Recorder
--- PASS: TestHandleHi_Recorder (0.00s)
=== RUN   TestHandleHi_TestServer
--- PASS: TestHandleHi_TestServer (0.00s)
    demo_test.go:45: Got: <h1 style='color: '>Welcome!</h1>You are visitor number 2!
PASS
ok     _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step0    0.007s

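The tests pass!

That concludes step0.

IV. Race Detector

Concurrent design lets a program make better use of the multiple cores of modern processors. But it also makes it easy to introduce races, which cause serious bugs. In a Go program, a race occurs when multiple goroutines access shared data concurrently without synchronization and at least one of them writes. The go tool ships with race analysis. Before profiling and optimizing the step0 demo, we first make sure it has no races.

To use the tool, add the -race flag to go test. In the step0 directory: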

$ go test -v -race
=== RUN   TestHandleHi_Recorder
--- PASS: TestHandleHi_Recorder (0.00s)
=== RUN   TestHandleHi_TestServer
--- PASS: TestHandleHi_TestServer (0.00s)
    demo_test.go:45: Got: <h1 style='color: '>Welcome!</h1>You are visitor number 2!
PASS
ok     _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step0    1.012s

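-race detects races by analyzing the program at run time. It produces no false positives, but it can miss races that really exist. Next we change the test code so the test runs concurrently:

Add a test function to demo_test.go in step1 (copied from step0):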

//go-debug-profile-optimization/step1/demo_test.go
… …
func TestHandleHi_TestServer_Parallel(t *testing.T) {
    ts := httptest.NewServer(http.HandlerFunc(handleHi))
    defer ts.Close()
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            res, err := http.Get(ts.URL)
            if err != nil {
                t.Error(err)
                return
            }
            if g, w := res.Header.Get("Content-Type"), "text/html; charset=utf-8"; g != w {
                t.Errorf("Content-Type = %q; want %q", g, w)
            }
            defer res.Body.Close() // close the body even if reading fails
            slurp, err := ioutil.ReadAll(res.Body)
            if err != nil {
                t.Error(err)
                return
            }
            t.Logf("Got: %s", slurp)
        }()
    }
    wg.Wait()
}
… …

Run the tests with the race detector:

$ go test -v -race
=== RUN   TestHandleHi_Recorder
--- PASS: TestHandleHi_Recorder (0.00s)
=== RUN   TestHandleHi_TestServer
--- PASS: TestHandleHi_TestServer (0.00s)
    demo_test.go:46: Got: <h1 style='color: '>Welcome!</h1>You are visitor number 2!
=== RUN   TestHandleHi_TestServer_Parallel
==================
WARNING: DATA RACE
Read by goroutine 22:
_/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step1.handleHi()
      /home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step1/demo.go:17 +0xf5
net/http.HandlerFunc.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1422 +0x47
net/http/httptest.(*waitGroupHandler).ServeHTTP()
      /tmp/workdir/go/src/net/http/httptest/server.go:200 +0xfe
net/http.serverHandler.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1862 +0x206
net/http.(*conn).serve()
      /tmp/workdir/go/src/net/http/server.go:1361 +0x117c

Previous write by goroutine 25:
_/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step1.handleHi()
      /home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step1/demo.go:17 +0x111
net/http.HandlerFunc.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1422 +0x47
net/http/httptest.(*waitGroupHandler).ServeHTTP()
      /tmp/workdir/go/src/net/http/httptest/server.go:200 +0xfe
net/http.serverHandler.ServeHTTP()
      /tmp/workdir/go/src/net/http/server.go:1862 +0x206
net/http.(*conn).serve()
      /tmp/workdir/go/src/net/http/server.go:1361 +0x117c

Goroutine 22 (running) created at:
net/http.(*Server).Serve()
      /tmp/workdir/go/src/net/http/server.go:1910 +0x464

Goroutine 25 (running) created at:
net/http.(*Server).Serve()
      /tmp/workdir/go/src/net/http/server.go:1910 +0x464
==================
--- PASS: TestHandleHi_TestServer_Parallel (0.00s)
    demo_test.go:71: Got: <h1 style='color: '>Welcome!</h1>You are visitor number 3!
    demo_test.go:71: Got: <h1 style='color: '>Welcome!</h1>You are visitor number 4!
PASS
Found 1 data race(s)
exit status 66
FAIL    _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step1    1.023s

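The tool reports that line 17 of demo.go:
visitors++
is a potential race condition.

visitors is accessed by multiple goroutines without synchronization.

Now that the race has been found, it needs fixing. There are several options:

1. Use a channel
2. Use a Mutex
3. Use atomic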
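For comparison, here is a rough sketch of what the channel option could look like. It is not part of Brad's demo: newVisitorNumbers and sumVisits are invented names, and the counting goroutine is assumed to live for the whole life of the server. A single goroutine owns the counter and hands out numbers, so no other goroutine ever touches it.

```go
package main

import (
	"fmt"
	"sync"
)

// newVisitorNumbers starts a goroutine that owns the counter and sends
// 1, 2, 3, ... to whoever receives next.
func newVisitorNumbers() <-chan int {
	ch := make(chan int)
	go func() {
		for n := 1; ; n++ {
			ch <- n
		}
	}()
	return ch
}

// sumVisits simulates n concurrent requests, each taking one visitor
// number, and returns the sum of the numbers handed out.
func sumVisits(n int) int {
	numbers := newVisitorNumbers()
	results := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = <-numbers // each goroutine writes its own slot
		}(i)
	}
	wg.Wait()
	sum := 0
	for _, v := range results {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumVisits(10))
}
```

For a single counter this costs a channel operation per request, which is heavier than an atomic add.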

Brad used atomic:

//go-debug-profile-optimization/step1/demo.go
… …
var visitors int64 // must be accessed atomically

func handleHi(w http.ResponseWriter, r *http.Request) {
    if match, _ := regexp.MatchString(`^\w*$`, r.FormValue("color")); !match {
        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
        return
    }
    visitNum := atomic.AddInt64(&visitors, 1)
    w.Header().Set("Content-Type", "text/html; charset=utf-8")
    w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
        "'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
}
… …

Running the tests once more confirms it: the race condition is gone!

That concludes step1.

V. CPU Profiling

CPU profiling needs benchmark data. go test supports benchmarks, so all we need is to write a Benchmark function:

//go-debug-profile-optimization/step2/demo_test.go
… …
func BenchmarkHi(b *testing.B) {
    b.ReportAllocs()

    req, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET / HTTP/1.0\r\n\r\n")))
    if err != nil {
        b.Fatal(err)
    }

    for i := 0; i < b.N; i++ {
        rw := httptest.NewRecorder()
        handleHi(rw, req)
    }
}
… …

$ go test -v -run=^$ -bench=.
PASS
BenchmarkHi-4 100000 14808 ns/op 4961 B/op 81 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step2 1.648s

Start CPU profiling:

$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -cpuprofile=prof.cpu
PASS
BenchmarkHi-4 200000 14679 ns/op 4961 B/op 81 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step2 3.096s

After the benchmark run, two new files appear in the step2 directory, prof.cpu and step2.test. Both serve as input to go tool pprof:
$ls
demo.go demo_test.go prof.cpu step2.test*

Use the go profile viewer:

$ go tool pprof step2.test prof.cpu
Entering interactive mode (type "help" for commands)
(pprof) top
1830ms of 3560ms total (51.40%)
Dropped 53 nodes (cum <= 17.80ms)
Showing top 10 nodes out of 133 (cum >= 1290ms)
      flat flat%   sum%        cum   cum%
     480ms 13.48% 13.48%      980ms 27.53% runtime.growslice
     360ms 10.11% 23.60%      700ms 19.66% runtime.mallocgc
     170ms 4.78% 28.37%      170ms 4.78% runtime.heapBitsSetType
     170ms 4.78% 33.15%      200ms 5.62% runtime.scanblock
     120ms 3.37% 36.52%     1100ms 30.90% regexp.makeOnePass.func2
     120ms 3.37% 39.89%      550ms 15.45% runtime.newarray
     110ms 3.09% 42.98%      300ms 8.43% runtime.makeslice
     110ms 3.09% 46.07%      220ms 6.18% runtime.mapassign1
     100ms 2.81% 48.88%      100ms 2.81% runtime.futex
      90ms 2.53% 51.40%     1290ms 36.24% regexp.makeOnePass

(pprof) top --cum
0.18s of 3.56s total ( 5.06%)
Dropped 53 nodes (cum <= 0.02s)
Showing top 10 nodes out of 133 (cum >= 1.29s)
      flat flat%   sum%        cum   cum%
         0     0%     0%      3.26s 91.57% runtime.goexit
     0.02s 0.56% 0.56%      2.87s 80.62% BenchmarkHi
         0     0% 0.56%      2.87s 80.62% testing.(*B).launch
         0     0% 0.56%      2.87s 80.62% testing.(*B).runN
     0.03s 0.84% 1.40%      2.80s 78.65% step2.handleHi
     0.01s 0.28% 1.69%      2.46s 69.10% regexp.MatchString
         0     0% 1.69%      2.24s 62.92% regexp.Compile
         0     0% 1.69%      2.24s 62.92% regexp.compile
     0.03s 0.84% 2.53%      1.56s 43.82% regexp.compileOnePass
     0.09s 2.53% 5.06%      1.29s 36.24% regexp.makeOnePass

(pprof) list handleHi
Total: 3.56s
ROUTINE ======================== handleHi in go-debug-profile-optimization/step2/demo.go
      30ms      2.80s (flat, cum) 78.65% of Total
         .          .      9:)
         .          .     10:
         .          .     11:var visitors int64 // must be accessed atomically
         .          .     12:
         .          .     13:func handleHi(w http.ResponseWriter, r *http.Request) {
         .      2.47s     14:    if match, _ := regexp.MatchString(`^\w*$`, r.FormValue("color")); !match {
         .          .     15:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     16:        return
         .          .     17:    }
      10ms       20ms     18:    visitNum := atomic.AddInt64(&visitors, 1)
      10ms       90ms     19:    w.Header().Set("Content-Type", "text/html; charset=utf-8")
      10ms       20ms     20:    w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
         .      200ms     21:        "'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
         .          .     22:}
         .          .     23:
         .          .     24:func main() {
         .          .     25:    log.Printf("Starting on port 8080")
         .          .     26:    http.HandleFunc("/hi", handleHi)
(pprof)

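Judging by top --cum, handleHi consumes a large share of the CPU, and within handleHi, MatchString takes the longest.

VI. First optimization

MatchString turned out to be expensive. The optimization: compile the regular expression only once (step3):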

// go-debug-profile-optimization/step3/demo.go

… …
var visitors int64 // must be accessed atomically

var rxOptionalID = regexp.MustCompile(`^\w*$`) // compiled once; same pattern as in step0

func handleHi(w http.ResponseWriter, r *http.Request) {
    if !rxOptionalID.MatchString(r.FormValue("color")) {
        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
        return
    }

    visitNum := atomic.AddInt64(&visitors, 1)
    w.Header().Set("Content-Type", "text/html; charset=utf-8")
    w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
        "'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
}
… …

Run the benchmark:

$ go test -bench=.
PASS
BenchmarkHi-4 1000000 1678 ns/op 720 B/op 9 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step3 1.710s

Compare with the earlier benchmark result from step2:

$ go test -v -run=^$ -bench=.
PASS
BenchmarkHi-4 100000 14808 ns/op 4961 B/op 81 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step2 1.648s

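Total run time is about the same, but the optimized benchmark ran 1,000,000 iterations versus 100,000 before. Per operation, the time fell from 14808 ns to 1678 ns, roughly a 9x speedup.

Now the CPU profile: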

$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=3s -cpuprofile=prof.cpu
PASS
BenchmarkHi-4 3000000 1640 ns/op 720 B/op 9 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step3 6.540s

$ go tool pprof step3.test prof.cpu
Entering interactive mode (type "help" for commands)
(pprof) top --cum 30
2.74s of 8.07s total (33.95%)
Dropped 72 nodes (cum <= 0.04s)
Showing top 30 nodes out of 103 (cum >= 0.56s)
      flat flat%   sum%        cum   cum%
         0     0%     0%      7.17s 88.85% runtime.goexit
     0.05s 0.62% 0.62%      6.21s 76.95% step3.BenchmarkHi
         0     0% 0.62%      6.21s 76.95% testing.(*B).launch
         0     0% 0.62%      6.21s 76.95% testing.(*B).runN
     0.06s 0.74% 1.36%      4.96s 61.46% step3.handleHi
     1.15s 14.25% 15.61%      2.35s 29.12% runtime.mallocgc
     0.02s 0.25% 15.86%      1.63s 20.20% runtime.systemstack
         0     0% 15.86%      1.53s 18.96% net/http.Header.Set
     0.06s 0.74% 16.60%      1.53s 18.96% net/textproto.MIMEHeader.Set
     0.09s 1.12% 17.72%      1.22s 15.12% runtime.newobject
     0.05s 0.62% 18.34%      1.09s 13.51% fmt.Sprint
     0.20s 2.48% 20.82%         1s 12.39% runtime.mapassign1
         0     0% 20.82%      0.81s 10.04% runtime.mcall
     0.01s 0.12% 20.94%      0.79s 9.79% runtime.schedule
     0.05s 0.62% 21.56%      0.76s 9.42% regexp.(*Regexp).MatchString
     0.09s 1.12% 22.68%      0.71s 8.80% regexp.(*Regexp).doExecute
     0.01s 0.12% 22.80%      0.71s 8.80% runtime.concatstring5
     0.20s 2.48% 25.28%      0.70s 8.67% runtime.concatstrings
         0     0% 25.28%      0.69s 8.55% runtime.gosweepone
     0.05s 0.62% 25.90%      0.69s 8.55% runtime.mSpan_Sweep
         0     0% 25.90%      0.68s 8.43% runtime.bgsweep
     0.04s   0.5% 26.39%      0.68s 8.43% runtime.newarray
     0.01s 0.12% 26.52%      0.67s 8.30% runtime.goschedImpl
     0.01s 0.12% 26.64%      0.65s 8.05% runtime.gosched_m
         0     0% 26.64%      0.65s 8.05% runtime.gosweepone.func1
     0.01s 0.12% 26.77%      0.65s 8.05% runtime.sweepone
     0.28s 3.47% 30.24%      0.62s 7.68% runtime.makemap
     0.17s 2.11% 32.34%      0.59s 7.31% runtime.heapBitsSweepSpan
     0.02s 0.25% 32.59%      0.58s 7.19% fmt.(*pp).doPrint
     0.11s 1.36% 33.95%      0.56s 6.94% fmt.(*pp).printArg

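handleHi's share of CPU time has dropped somewhat (from 78.65% to 61.46% cumulative).

VII. Mem Profiling

Run the benchmark in the step3 directory to collect memory allocation data: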

$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=prof.mem
PASS
BenchmarkHi-4 2000000 1657 ns/op 720 B/op 9 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step3 5.002s

Analyze memory with pprof:

$ go tool pprof --alloc_space step3.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top
2065.91MB of 2067.41MB total (99.93%)
Dropped 14 nodes (cum <= 10.34MB)
      flat flat%   sum%        cum   cum%
1076.35MB 52.06% 52.06% 1076.35MB 52.06% net/textproto.MIMEHeader.Set
535.54MB 25.90% 77.97% 2066.91MB   100% step3.BenchmarkHi
406.52MB 19.66% 97.63% 1531.37MB 74.07% step3.handleHi
   47.50MB 2.30% 99.93%    48.50MB 2.35% fmt.Sprint
         0     0% 99.93% 1076.35MB 52.06% net/http.Header.Set
         0     0% 99.93% 2066.91MB   100% runtime.goexit
         0     0% 99.93% 2066.91MB   100% testing.(*B).launch
         0     0% 99.93% 2066.91MB   100% testing.(*B).runN

(pprof) top -cum
2065.91MB of 2067.41MB total (99.93%)
Dropped 14 nodes (cum <= 10.34MB)
      flat flat%   sum%        cum   cum%
535.54MB 25.90% 25.90% 2066.91MB   100% step3.BenchmarkHi
         0     0% 25.90% 2066.91MB   100% runtime.goexit
         0     0% 25.90% 2066.91MB   100% testing.(*B).launch
         0     0% 25.90% 2066.91MB   100% testing.(*B).runN
406.52MB 19.66% 45.57% 1531.37MB 74.07% step3.handleHi
         0     0% 45.57% 1076.35MB 52.06% net/http.Header.Set
1076.35MB 52.06% 97.63% 1076.35MB 52.06% net/textproto.MIMEHeader.Set
   47.50MB 2.30% 99.93%    48.50MB 2.35% fmt.Sprint

(pprof) list handleHi
Total: 2.02GB
     ROUTINE =========step3.handleHi in step3/demo.go
406.52MB     1.50GB (flat, cum) 74.07% of Total
         .          .     17:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     18:        return
         .          .     19:    }
         .          .     20:
         .          .     21:    visitNum := atomic.AddInt64(&visitors, 1)
         .     1.05GB     22:    w.Header().Set("Content-Type", "text/html; charset=utf-8")
         .          .     23:    w.Write([]byte("<h1 style='color: " + r.FormValue("color") +
406.52MB   455.02MB     24:        "'>Welcome!</h1>You are visitor number " + fmt.Sprint(visitNum) + "!"))
         .          .     25:}
         .          .     26:
         .          .     27:func main() {
         .          .     28:    log.Printf("Starting on port 8080")
         .          .     29:    http.HandleFunc("/hi", handleHi)
(pprof)

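Line 22 and the w.Write statement on lines 23-24 of handleHi account for most of the allocated memory.

VIII. Second optimization

The second optimization:
1. Remove the w.Header().Set line
2. Replace w.Write with fmt.Fprintf

The code for the second optimization is in the step4 directory: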

// go-debug-profile-optimization/step4/demo.go
… …
func handleHi(w http.ResponseWriter, r *http.Request) {
    if !rxOptionalID.MatchString(r.FormValue("color")) {
        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
        return
    }

    visitNum := atomic.AddInt64(&visitors, 1)
    fmt.Fprintf(w, "<html><h1 stype='color: \"%s\"'>Welcome!</h1>You are visitor number %d!", r.FormValue("color"), visitNum)
}
… …

Run pprof again:

$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=prof.mem
PASS
BenchmarkHi-4 2000000 1428 ns/op 304 B/op 6 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step4 4.343s

$ go tool pprof --alloc_space step4.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top
868.06MB of 868.56MB total (99.94%)
Dropped 5 nodes (cum <= 4.34MB)
      flat flat%   sum%        cum   cum%
559.54MB 64.42% 64.42%   868.06MB 99.94% step4.BenchmarkHi
219.52MB 25.27% 89.70%   219.52MB 25.27% bytes.makeSlice
      89MB 10.25% 99.94%   308.52MB 35.52% step4.handleHi
         0     0% 99.94%   219.52MB 25.27% bytes.(*Buffer).Write
         0     0% 99.94%   219.52MB 25.27% bytes.(*Buffer).grow
         0     0% 99.94%   219.52MB 25.27% fmt.Fprintf
         0     0% 99.94%   219.52MB 25.27% net/http/httptest.(*ResponseRecorder).Write
         0     0% 99.94%   868.06MB 99.94% runtime.goexit
         0     0% 99.94%   868.06MB 99.94% testing.(*B).launch
         0     0% 99.94%   868.06MB 99.94% testing.(*B).runN
(pprof) top --cum
868.06MB of 868.56MB total (99.94%)
Dropped 5 nodes (cum <= 4.34MB)
      flat flat%   sum%        cum   cum%
559.54MB 64.42% 64.42%   868.06MB 99.94% step4.BenchmarkHi
         0     0% 64.42%   868.06MB 99.94% runtime.goexit
         0     0% 64.42%   868.06MB 99.94% testing.(*B).launch
         0     0% 64.42%   868.06MB 99.94% testing.(*B).runN
      89MB 10.25% 74.67%   308.52MB 35.52% step4.handleHi
         0     0% 74.67%   219.52MB 25.27% bytes.(*Buffer).Write
         0     0% 74.67%   219.52MB 25.27% bytes.(*Buffer).grow
219.52MB 25.27% 99.94%   219.52MB 25.27% bytes.makeSlice
         0     0% 99.94%   219.52MB 25.27% fmt.Fprintf
         0     0% 99.94%   219.52MB 25.27% net/http/httptest.(*ResponseRecorder).Write
(pprof) list handleHi
Total: 868.56MB
ROUTINE ============ step4.handleHi in step4/demo.go
      89MB   308.52MB (flat, cum) 35.52% of Total
         .          .     17:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     18:        return
         .          .     19:    }
         .          .     20:
         .          .     21:    visitNum := atomic.AddInt64(&visitors, 1)
      89MB   308.52MB     22:    fmt.Fprintf(w, "<html><h1 stype='color: \"%s\"'>Welcome!</h1>You are visitor number %d!", r.FormValue("color"), visitNum)
         .          .     23:}
         .          .     24:
         .          .     25:func main() {
         .          .     26:    log.Printf("Starting on port 8080")
         .          .     27:    http.HandleFunc("/hi", handleHi)
(pprof)

Memory allocation has dropped sharply.

IX. Benchcmp

golang.org/x/tools contains a tool, benchcmp, that compares the results of two benchmark runs.

github.com/golang/tools is a mirror of golang.org/x/tools. To install benchcmp:

1. go get -u github.com/golang/tools
2. mkdir -p $GOPATH/src/golang.org/x
3. mv $GOPATH/src/github.com/golang/tools $GOPATH/src/golang.org/x
4. go install golang.org/x/tools/cmd/benchcmp

We run the following commands in step2, step3, and step4 respectively:

go-debug-profile-optimization/step2$ go test -bench=. -memprofile=prof.mem | tee mem.2
PASS
BenchmarkHi-4 100000 14786 ns/op 4961 B/op 81 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step2 1.644s

go-debug-profile-optimization/step3$ go test -bench=. -memprofile=prof.mem | tee mem.3
PASS
BenchmarkHi-4 1000000 1662 ns/op 720 B/op 9 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step3 1.694s

go-debug-profile-optimization/step4$ go test -bench=. -memprofile=prof.mem | tee mem.4
PASS
BenchmarkHi-4 1000000 1428 ns/op 304 B/op 6 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step4 1.456s

Compare the results with benchcmp (benchcmp old new):

$ benchcmp step3/mem.3 step4/mem.4
benchmark old ns/op new ns/op delta
BenchmarkHi-4 1662 1428 -14.08%

benchmark old allocs new allocs delta
BenchmarkHi-4 9 6 -33.33%

benchmark old bytes new bytes delta
BenchmarkHi-4 720 304 -57.78%

$ benchcmp step2/mem.2 step4/mem.4
benchmark old ns/op new ns/op delta
BenchmarkHi-4 14786 1428 -90.34%

benchmark old allocs new allocs delta
BenchmarkHi-4 81 6 -92.59%

benchmark old bytes new bytes delta
BenchmarkHi-4 4961 304 -93.87%

After the optimizations, memory allocation drops sharply, and the time spent in GC falls along with it.
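The delta column in the benchcmp output is the relative change from the old value to the new one, (new - old) / old. The following is a minimal sketch of that arithmetic, not benchcmp's actual source; the helper name delta is made up for illustration. It reproduces the figures from the step3 vs. step4 comparison above.

```go
package main

import "fmt"

// delta returns the relative change from old to new, in percent.
// A negative value means the new measurement is smaller (better).
func delta(old, new float64) float64 {
	return (new - old) / old * 100
}

func main() {
	fmt.Printf("ns/op:  %.2f%%\n", delta(1662, 1428)) // prints "ns/op:  -14.08%"
	fmt.Printf("allocs: %.2f%%\n", delta(9, 6))       // prints "allocs: -33.33%"
	fmt.Printf("bytes:  %.2f%%\n", delta(720, 304))   // prints "bytes:  -57.78%"
}
```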

10. Where does the memory come from

In BenchmarkHi, we clear the memory held by the recorder after each handleHi call:

//step5/demo_test.go
… …
func BenchmarkHi(b *testing.B) {
    b.ReportAllocs()

    req, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET / HTTP/1.0\r\n\r\n")))
    if err != nil {
        b.Fatal(err)
    }

    for i := 0; i < b.N; i++ {
        rw := httptest.NewRecorder()
        handleHi(rw, req)
        reset(rw)
    }
}

func reset(rw *httptest.ResponseRecorder) {
    m := rw.HeaderMap
    for k := range m {
        delete(m, k)
    }
    body := rw.Body
    body.Reset()
    *rw = httptest.ResponseRecorder{
        Body:      body,
        HeaderMap: m,
    }
}

… …
$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=prof.mem
PASS
BenchmarkHi-4 2000000 1518 ns/op 304 B/op 6 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step5 4.577s

$ go tool pprof --alloc_space step5.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top --cum 10
290.52MB of 291.52MB total (99.66%)
Dropped 14 nodes (cum <= 1.46MB)
      flat flat%   sum%        cum   cum%
         0     0%     0%   291.02MB 99.83% runtime.goexit
179.01MB 61.41% 61.41%   290.52MB 99.66% step5.BenchmarkHi
         0     0% 61.41%   290.52MB 99.66% testing.(*B).launch
         0     0% 61.41%   290.52MB 99.66% testing.(*B).runN
   26.50MB 9.09% 70.50%   111.51MB 38.25% step5.handleHi
         0     0% 70.50%    85.01MB 29.16% bytes.(*Buffer).Write
         0     0% 70.50%    85.01MB 29.16% bytes.(*Buffer).grow
   85.01MB 29.16% 99.66%    85.01MB 29.16% bytes.makeSlice
         0     0% 99.66%    85.01MB 29.16% fmt.Fprintf
         0     0% 99.66%    85.01MB 29.16% net/http/httptest.(*ResponseRecorder).Write
(pprof) list handleHi
Total: 291.52MB
ROUTINE ======================== _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step5.handleHi in /home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step5/demo.go
   26.50MB   111.51MB (flat, cum) 38.25% of Total
         .          .     17:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     18:        return
         .          .     19:    }
         .          .     20:
         .          .     21:    visitNum := atomic.AddInt64(&visitors, 1)
   26.50MB   111.51MB     22:    fmt.Fprintf(w, "<html><h1 stype='color: \"%s\"'>Welcome!</h1>You are visitor number %d!", r.FormValue("color"), visitNum)
         .          .     23:}
         .          .     24:
         .          .     25:func main() {
         .          .     26:    log.Printf("Starting on port 8080")
         .          .     27:    http.HandleFunc("/hi", handleHi)
(pprof)

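The cumulative allocation of handleHi drops from about 300MB to 111MB. Where does the remaining memory come from? In the list handleHi output, the fmt.Fprintf line accounts for all 111.51MB.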

Let's look at this line of code:
fmt.Fprintf(w, "<h1 style='color: %s'>Welcome!</h1>You are visitor number %d!",
r.FormValue("color"), num)

The documentation for fmt.Fprintf:

$ go doc fmt.Fprintf
func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error)

Fprintf formats according to a format specifier and writes to w. It returns
the number of bytes written and any write error encountered.

As a reminder, here is how much memory some Go types occupy at runtime:

A Go interface is 2 words of memory: (type, pointer).
A Go string is 2 words of memory: (base pointer, length)
A Go slice is 3 words of memory: (base pointer, length, capacity)

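Each time fmt.Fprintf is called, the arguments are passed to the function by value, so the program has to set up an empty interface (two words, that is 16 bytes on a 64-bit platform) for every variadic argument and initialize that interface value with the argument passed in. This is why the cumulative allocation on this line is so high.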
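One way to observe this cost is testing.AllocsPerRun, which reports the average number of heap allocations per call. The sketch below compares formatting an integer through fmt.Fprintf with appending it to a reused byte slice through strconv.AppendInt. The helper names are made up for illustration, io.Discard assumes Go 1.16 or later, and the exact count for the fmt.Fprintf path depends on the Go version and on compiler optimizations, so no expected output is given.

```go
package main

import (
	"fmt"
	"io"
	"strconv"
	"testing"
)

// allocsFprintf measures allocations when n is passed to fmt.Fprintf,
// where it has to be converted to an interface{} value first.
func allocsFprintf(n int64) float64 {
	return testing.AllocsPerRun(1000, func() {
		fmt.Fprintf(io.Discard, "%d", n)
	})
}

// allocsAppend measures allocations when n is appended to a reused
// byte slice; no interface conversion is involved.
func allocsAppend(n int64) float64 {
	buf := make([]byte, 0, 32) // large enough for any int64 in base 10
	return testing.AllocsPerRun(1000, func() {
		buf = strconv.AppendInt(buf[:0], n, 10)
	})
}

func main() {
	n := int64(1234567890)
	fmt.Printf("fmt.Fprintf:       %.0f allocs/op\n", allocsFprintf(n))
	fmt.Printf("strconv.AppendInt: %.0f allocs/op\n", allocsAppend(n))
}
```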

11. Eliminating all memory allocations

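The following optimization is probably unnecessary in practice, but if this code ever does become a bottleneck, it can be done like this: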

//go-debug-profile-optimization/step6/demo.go
… …
var bufPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

    visitNum := atomic.AddInt64(&visitors, 1)
    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()
    buf.WriteString("<h1 style='color: ")
    buf.WriteString(r.FormValue("color"))
    buf.WriteString("'>Welcome!</h1>You are visitor number ")
    b := strconv.AppendInt(buf.Bytes(), int64(visitNum), 10)
    b = append(b, '!')
    w.Write(b)
}
… …

$ go test -v -run=^$ -bench=^BenchmarkHi$ -benchtime=2s -memprofile=prof.mem
PASS
BenchmarkHi-4 5000000 780 ns/op 192 B/op 3 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step6 4.709s

$ go tool pprof --alloc_space step6.test prof.mem
Entering interactive mode (type "help" for commands)
(pprof) top --cum 10
1.07GB of 1.07GB total ( 100%)
Dropped 5 nodes (cum <= 0.01GB)
      flat flat%   sum%        cum   cum%
    1.07GB   100%   100%     1.07GB   100% step6.BenchmarkHi
         0     0%   100%     1.07GB   100% runtime.goexit
         0     0%   100%     1.07GB   100% testing.(*B).launch
         0     0%   100%     1.07GB   100% testing.(*B).runN

$ go test -bench=. -memprofile=prof.mem | tee mem.6
PASS
BenchmarkHi-4 2000000 790 ns/op 192 B/op 3 allocs/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step6 2.401s

$ benchcmp step5/mem.5 step6/mem.6
benchmark old ns/op new ns/op delta
BenchmarkHi-4 1513 790 -47.79%

benchmark old allocs new allocs delta
BenchmarkHi-4 6 3 -50.00%

benchmark old bytes new bytes delta
BenchmarkHi-4 304 192 -36.84%

handleHi no longer appears in the top list. The benchcmp results also show another substantial drop in memory allocation.

12. Contention optimization

Write a parallel benchmark test for handleHi:

//go-debug-profile-optimization/step7/demo_test.go
… …
func BenchmarkHiParallel(b *testing.B) {
    r, err := http.ReadRequest(bufio.NewReader(strings.NewReader("GET / HTTP/1.0\r\n\r\n")))
    if err != nil {
        b.Fatal(err)
    }

    b.RunParallel(func(pb *testing.PB) {
        rw := httptest.NewRecorder()
        for pb.Next() {
            handleHi(rw, r)
            reset(rw)
        }
    })
}
… …

Run the test and analyze the results:

$ go test -bench=Parallel -blockprofile=prof.block
PASS
BenchmarkHiParallel-4 5000000 305 ns/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step7 1.947s

$ go tool pprof step7.test prof.block
Entering interactive mode (type "help" for commands)
(pprof) top --cum 10
3.68s of 3.72s total (98.82%)
Dropped 29 nodes (cum <= 0.02s)
Showing top 10 nodes out of 20 (cum >= 1.84s)
      flat flat%   sum%        cum   cum%
         0     0%     0%      3.72s   100% runtime.goexit
     1.84s 49.46% 49.46%      1.84s 49.46% runtime.chanrecv1
         0     0% 49.46%      1.84s 49.46% main.main
         0     0% 49.46%      1.84s 49.46% runtime.main
         0     0% 49.46%      1.84s 49.46% testing.(*M).Run
         0     0% 49.46%      1.84s 49.43% testing.(*B).run
         0     0% 49.46%      1.84s 49.43% testing.RunBenchmarks
         0     0% 49.46%      1.84s 49.36% step7.BenchmarkHiParallel
     1.84s 49.36% 98.82%      1.84s 49.36% sync.(*WaitGroup).Wait
         0     0% 98.82%      1.84s 49.36% testing.(*B).RunParallel
(pprof) list BenchmarkHiParallel
Total: 3.72s
ROUTINE ====== step7.BenchmarkHiParallel in step7/demo_test.go
         0      1.84s (flat, cum) 49.36% of Total
         .          .    113:        rw := httptest.NewRecorder()
         .          .    114:        for pb.Next() {
         .          .    115:            handleHi(rw, r)
         .          .    116:            reset(rw)
         .          .    117:        }
         .      1.84s    118:    })
         .          .    119:}
ROUTINE ==== step7.BenchmarkHiParallel.func1 in step7/demo_test.go
         0    43.02ms (flat, cum) 1.16% of Total
         .          .    110:    }
         .          .    111:
         .          .    112:    b.RunParallel(func(pb *testing.PB) {
         .          .    113:        rw := httptest.NewRecorder()
         .          .    114:        for pb.Next() {
         .    43.02ms    115:            handleHi(rw, r)
         .          .    116:            reset(rw)
         .          .    117:        }
         .          .    118:    })
         .          .    119:}
(pprof) list handleHi
Total: 3.72s
ROUTINE =====step7.handleHi in step7/demo.go
         0    43.02ms (flat, cum) 1.16% of Total
         .          .     18:        return new(bytes.Buffer)
         .          .     19:    },
         .          .     20:}
         .          .     21:
         .          .     22:func handleHi(w http.ResponseWriter, r *http.Request) {
         .    43.01ms     23:    if !rxOptionalID.MatchString(r.FormValue("color")) {
         .          .     24:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     25:        return
         .          .     26:    }
         .          .     27:
         .          .     28:    visitNum := atomic.AddInt64(&visitors, 1)
         .     2.50us     29:    buf := bufPool.Get().(*bytes.Buffer)
         .          .     30:    defer bufPool.Put(buf)
         .          .     31:    buf.Reset()
         .          .     32:    buf.WriteString("<h1 style='color: ")
         .          .     33:    buf.WriteString(r.FormValue("color"))
         .          .     34:    buf.WriteString("'>Welcome!</h1>You are visitor number ")
(pprof)

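The MatchString call in handleHi is the hot spot: it accounts for nearly all of the blocking time recorded inside handleHi (43.01ms of 43.02ms).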

The optimization (step8):

//go-debug-profile-optimization/step8/demo.go
… …
var colorRxPool = sync.Pool{
    New: func() interface{} { return regexp.MustCompile(`\w*$`) },
}

func handleHi(w http.ResponseWriter, r *http.Request) {
    if !colorRxPool.Get().(*regexp.Regexp).MatchString(r.FormValue("color")) {
        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
        return
    }
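For reference, a sync.Pool only reuses objects that have been handed back with Put; when the pool is empty, Get falls back to New. The sketch below shows the usual Get/Put pairing for a pooled regexp. It is an illustration, not the article's step8 code: the function name validColor is made up, and the anchored pattern `^\w*$` is an assumption chosen so that the example actually rejects some input.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// colorRxPool hands out compiled regexps. New runs only when the
// pool has no object available to reuse.
var colorRxPool = sync.Pool{
	New: func() interface{} { return regexp.MustCompile(`^\w*$`) },
}

// validColor reports whether color consists only of word characters.
func validColor(color string) bool {
	rx := colorRxPool.Get().(*regexp.Regexp)
	defer colorRxPool.Put(rx) // hand the regexp back so later calls can reuse it
	return rx.MatchString(color)
}

func main() {
	fmt.Println(validColor("red"))     // prints "true"
	fmt.Println(validColor("not ok!")) // prints "false"
}
```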

Run the test and analyze the results:

$ go test -bench=Parallel -blockprofile=prof.block
PASS
BenchmarkHiParallel-4 100000 19190 ns/op
ok _/home1/tonybai/proj/opensource/github/experiments/go-debug-profile-optimization/step8 2.219s

$ go tool pprof step8.test prof.block
Entering interactive mode (type "help" for commands)
(pprof) top --cum 10
4.22s of 4.23s total (99.69%)
Dropped 28 nodes (cum <= 0.02s)
Showing top 10 nodes out of 12 (cum >= 2.11s)
      flat flat%   sum%        cum   cum%
         0     0%     0%      4.23s   100% runtime.goexit
     2.11s 49.90% 49.90%      2.11s 49.90% runtime.chanrecv1
         0     0% 49.90%      2.11s 49.89% main.main
         0     0% 49.90%      2.11s 49.89% runtime.main
         0     0% 49.90%      2.11s 49.89% testing.(*M).Run
         0     0% 49.90%      2.11s 49.86% testing.(*B).run
         0     0% 49.90%      2.11s 49.86% testing.RunBenchmarks
         0     0% 49.90%      2.11s 49.79% step8.BenchmarkHiParallel
     2.11s 49.79% 99.69%      2.11s 49.79% sync.(*WaitGroup).Wait
         0     0% 99.69%      2.11s 49.79% testing.(*B).RunParallel
(pprof) list BenchmarkHiParallel
Total: 4.23s
ROUTINE ======step8.BenchmarkHiParallel in step8/demo_test.go
         0      2.11s (flat, cum) 49.79% of Total
         .          .    113:        rw := httptest.NewRecorder()
         .          .    114:        for pb.Next() {
         .          .    115:            handleHi(rw, r)
         .          .    116:            reset(rw)
         .          .    117:        }
         .      2.11s    118:    })
         .          .    119:}
ROUTINE ======step8.BenchmarkHiParallel.func1 in step8/demo_test.go
         0    11.68ms (flat, cum) 0.28% of Total
         .          .    110:    }
         .          .    111:
         .          .    112:    b.RunParallel(func(pb *testing.PB) {
         .          .    113:        rw := httptest.NewRecorder()
         .          .    114:        for pb.Next() {
         .    11.68ms    115:            handleHi(rw, r)
         .          .    116:            reset(rw)
         .          .    117:        }
         .          .    118:    })
         .          .    119:}
(pprof) list handleHi
Total: 4.23s
ROUTINE ======step8.handleHi in step8/demo.go
         0    11.68ms (flat, cum) 0.28% of Total
         .          .     21:var colorRxPool = sync.Pool{
         .          .     22:    New: func() interface{} { return regexp.MustCompile(`\w*$`) },
         .          .     23:}
         .          .     24:
         .          .     25:func handleHi(w http.ResponseWriter, r *http.Request) {
         .     5.66ms     26:    if !colorRxPool.Get().(*regexp.Regexp).MatchString(r.FormValue("color")) {
         .          .     27:        http.Error(w, "Optional color is invalid", http.StatusBadRequest)
         .          .     28:        return
         .          .     29:    }
         .          .     30:
         .          .     31:    visitNum := atomic.AddInt64(&visitors, 1)
         .     6.02ms     32:    buf := bufPool.Get().(*bytes.Buffer)
         .          .     33:    defer bufPool.Put(buf)
         .          .     34:    buf.Reset()
         .          .     35:    buf.WriteString("<h1 style='color: ")
         .          .     36:    buf.WriteString(r.FormValue("color"))
         .          .     37:    buf.WriteString("'>Welcome!</h1>You are visitor number ")
(pprof)