内存 - Tony Bai

标签内存下的文章

Go开发者必知：五大缓存策略详解与选型指南

四月 28, 2025
0 条评论

本文永久链接 – https://tonybai.com/2025/04/28/five-cache-strategies

大家好，我是Tony Bai。

在构建高性能、高可用的后端服务时，缓存几乎是绕不开的话题。无论是为了加速数据访问，还是为了减轻数据库等主数据源的压力，缓存都扮演着至关重要的角色。对于我们 Go 开发者来说，选择并正确地实施缓存策略，是提升应用性能的关键技能之一。

目前业界主流的缓存策略有多种，每种都有其独特的适用场景和优缺点。今天，我们就来探讨其中五种最常见也是最核心的缓存策略：Cache-Aside、Read-Through、Write-Through、Write-Behind (Write-Back) 和Write-Around，并结合Go语言的特点和示例（使用内存缓存和SQLite），帮助大家在实际项目中做出明智的选择。

0. 准备工作：示例代码环境与结构

为了清晰地演示这些策略，本文的示例代码采用了模块化的结构，将共享的模型、缓存接口、数据库接口以及每种策略的实现分别放在不同的包中。我们将使用Go语言，配合一个简单的内存缓存（带 TTL 功能）和一个 SQLite 数据库作为持久化存储。

示例项目的结构如下：

$tree -F ./go-cache-strategy
./go-cache-strategy
├── go.mod
├── go.sum
├── internal/
│   ├── cache/
│   │   └── cache.go
│   ├── database/
│   │   └── database.go
│   └── models/
│       └── models.go
├── main.go
└── strategy/
    ├── cacheaside/
    │   └── cacheaside.go
    ├── readthrough/
    │   └── readthrough.go
    ├── writearound/
    │   └── writearound.go
    ├── writebehind/
    │   └── writebehind.go
    └── writethrough/
        └── writethrough.go

其中核心组件包括：

internal/models: 定义共享数据结构 (如 User, LogEntry)。
internal/cache: 定义 Cache 接口及 InMemoryCache 实现。
internal/database: 定义 Database 接口及 SQLite DB 实现。
strategy/xxx: 每个子目录包含一种缓存策略的核心实现逻辑。

注意： 文中仅展示各策略的核心实现代码片段。完整的、可运行的示例项目代码在Github上，大家可以通过文末链接访问。

接下来，我们将详细介绍五种缓存策略及其Go实现片段。

1. Cache-Aside (旁路缓存/懒加载Lazy Loading)

这是最常用、也最经典的缓存策略。核心思想是：应用程序自己负责维护缓存。

工作流程：

应用需要读取数据时，先检查缓存中是否存在。
缓存命中 (Hit): 如果存在，直接从缓存返回数据。
缓存未命中 (Miss): 如果不存在，应用从主数据源（如数据库）读取数据。
读取成功后，应用将数据写入缓存（设置合理的过期时间）。
最后，应用将数据返回给调用方。

Go示例 (核心实现 – strategy/cacheaside/cacheaside.go):

package cacheaside

import (
    "context"
    "fmt"
    "log"
    "time"

    "cachestrategysdemo/internal/cache"
    "cachestrategysdemo/internal/database"
    "cachestrategysdemo/internal/models"
)

const userCacheKeyPrefix = "user:" // Example prefix

// GetUser retrieves user info using Cache-Aside strategy.
func GetUser(ctx context.Context, userID string, db database.Database, memCache cache.Cache, ttl time.Duration) (*models.User, error) {
    cacheKey := userCacheKeyPrefix + userID

    // 1. Check cache first
    if cachedVal, found := memCache.Get(cacheKey); found {
        if user, ok := cachedVal.(*models.User); ok {
            log.Println("[Cache-Aside] Cache Hit for user:", userID)
            return user, nil
        }
        memCache.Delete(cacheKey) // Remove bad data
    }

    // 2. Cache Miss
    log.Println("[Cache-Aside] Cache Miss for user:", userID)

    // 3. Fetch from Database
    user, err := db.GetUser(ctx, userID)
    if err != nil {
        return nil, fmt.Errorf("failed to get user from DB: %w", err)
    }
    if user == nil {
        return nil, nil // Not found
    }

    // 4. Store data into cache
    memCache.Set(cacheKey, user, ttl)
    log.Println("[Cache-Aside] User stored in cache:", userID)

    // 5. Return data
    return user, nil
}

优点:
* 实现相对简单直观。
* 对读密集型应用效果好，缓存命中时速度快。
* 缓存挂掉不影响应用读取主数据源（只是性能下降）。

缺点:
* 首次请求（冷启动）或缓存过期后，会有一次缓存未命中，延迟较高。
* 存在数据不一致的风险：需要额外的缓存失效策略。
* 应用代码与缓存逻辑耦合。

使用场景: 读多写少，能容忍短暂数据不一致的场景。

2. Read-Through (穿透读缓存)

核心思想：应用程序将缓存视为主要数据源，只与缓存交互。缓存内部负责在未命中时从主数据源加载数据。

工作流程：

应用向缓存请求数据。
缓存检查数据是否存在。
缓存命中: 直接返回数据。
缓存未命中: 缓存自己负责从主数据源加载数据。
加载成功后，缓存将数据存入自身，并返回给应用。

Go 示例 (模拟实现 – strategy/readthrough/readthrough.go):

Read-Through 通常依赖缓存库自身特性。这里我们通过封装 Cache 接口模拟其行为。

package readthrough

import (
    "context"
    "fmt"
    "log"
    "time"

    "cachestrategysdemo/internal/cache"
    "cachestrategysdemo/internal/database"
)

// LoaderFunc defines the function signature for loading data on cache miss.
type LoaderFunc func(ctx context.Context, key string) (interface{}, error)

// Cache wraps a cache instance to provide Read-Through logic.
type Cache struct {
    cache      cache.Cache // Use the cache interface
    loaderFunc LoaderFunc
    ttl        time.Duration
}

// New creates a new ReadThrough cache wrapper.
func New(cache cache.Cache, loaderFunc LoaderFunc, ttl time.Duration) *Cache {
    return &Cache{cache: cache, loaderFunc: loaderFunc, ttl: ttl}
}

// Get retrieves data, using the loader on cache miss.
func (rtc *Cache) Get(ctx context.Context, key string) (interface{}, error) {
    // 1 & 2: Check cache
    if cachedVal, found := rtc.cache.Get(key); found {
        log.Println("[Read-Through] Cache Hit for:", key)
        return cachedVal, nil
    }

    // 4: Cache Miss - Cache calls loader
    log.Println("[Read-Through] Cache Miss for:", key)
    loadedVal, err := rtc.loaderFunc(ctx, key) // Loader fetches from DB
    if err != nil {
        return nil, fmt.Errorf("loader function failed for key %s: %w", key, err)
    }
    if loadedVal == nil {
        return nil, nil // Not found from loader
    }

    // 5: Store loaded data into cache & return
    rtc.cache.Set(key, loadedVal, rtc.ttl)
    log.Println("[Read-Through] Loaded and stored in cache:", key)
    return loadedVal, nil
}

// Example UserLoader function (needs access to DB instance and key prefix)
func NewUserLoader(db database.Database, keyPrefix string) LoaderFunc {
    return func(ctx context.Context, cacheKey string) (interface{}, error) {
        userID := cacheKey[len(keyPrefix):] // Extract ID
        // log.Println("[Read-Through Loader] Loading user from DB:", userID)
        return db.GetUser(ctx, userID)
    }
}

优点:
* 应用代码逻辑更简洁，将数据加载逻辑从应用中解耦出来。
* 代码更易于维护和测试（可以单独测试 Loader）。

缺点:
* 强依赖缓存库或服务是否提供此功能，或需要自行封装。
* 首次请求延迟仍然存在。
* 数据不一致问题依然存在。

使用场景: 读密集型，希望简化应用代码，使用的缓存系统支持此特性或愿意自行封装。

3. Write-Through (穿透写缓存)

核心思想：数据一致性优先！应用程序更新数据时，同时写入缓存和主数据源，并且两者都成功后才算操作完成。

工作流程：

应用发起写请求（新增或更新）。
应用先将数据写入主数据源（或缓存，顺序可选）。
如果第一步成功，应用再将数据写入另一个存储（缓存或主数据源）。
第二步写入成功（或至少尝试写入）后，操作完成，向调用方返回成功。
通常以主数据源写入成功为准，缓存写入失败一般只记录日志。

Go 示例 (核心实现 – strategy/writethrough/writethrough.go):

package writethrough

import (
    "context"
    "fmt"
    "log"
    "time"

    "cachestrategysdemo/internal/cache"
    "cachestrategysdemo/internal/database"
    "cachestrategysdemo/internal/models"
)

const userCacheKeyPrefix = "user:" // Example prefix

// UpdateUser updates user info using Write-Through strategy.
func UpdateUser(ctx context.Context, user *models.User, db database.Database, memCache cache.Cache, ttl time.Duration) error {
    cacheKey := userCacheKeyPrefix + user.ID

    // Decision: Write to DB first for stronger consistency guarantee.
    log.Println("[Write-Through] Writing to database first for user:", user.ID)
    err := db.UpdateUser(ctx, user)
    if err != nil {
        // DB write failed, do not proceed to cache write
        return fmt.Errorf("failed to write to database: %w", err)
    }
    log.Println("[Write-Through] Successfully wrote to database for user:", user.ID)

    // Now write to cache (best effort after successful DB write).
    log.Println("[Write-Through] Writing to cache for user:", user.ID)
    memCache.Set(cacheKey, user, ttl)
    // If strict consistency cache+db is needed, distributed transaction is required (complex).
    // For simplicity, assume cache write is best-effort. Log potential errors.

    return nil
}

优点:
* 数据一致性相对较高。
* 读取时（若命中）能获取较新数据。

缺点:
* 写入延迟较高。
* 实现需考虑失败处理（特别是DB成功后缓存失败的情况）。
* 缓存可能成为写入瓶颈。

使用场景: 对数据一致性要求较高，可接受一定的写延迟。

4. Write-Behind / Write-Back (回写 / 后写缓存)

核心思想：写入性能优先！应用程序只将数据写入缓存，缓存立即返回成功。缓存随后异步地、批量地将数据写入主数据源。

工作流程：

应用发起写请求。
应用将数据写入缓存。
缓存立即向应用返回成功。
缓存将此写操作放入一个队列或缓冲区。
一个独立的后台任务在稍后将队列中的数据批量写入主数据源。

Go 示例 (核心实现 – strategy/writebehind/writebehind.go):

package writebehind

import (
    "context"
    "fmt"
    "log"
    "sync"
    "time"

    "cachestrategysdemo/internal/cache"
    "cachestrategysdemo/internal/database"
    "cachestrategysdemo/internal/models"
)

// Config holds configuration for the Write-Behind strategy.
type Config struct {
    Cache     cache.Cache
    DB        database.Database
    KeyPrefix string
    TTL       time.Duration
    QueueSize int
    BatchSize int
    Interval  time.Duration
}

// Strategy holds the state for the Write-Behind implementation.
type Strategy struct {
    // ... (fields: cache, db, updateQueue, wg, stopOnce, cancelCtx/Func, dbWriteMutex, config fields) ...
    // Fields defined in the full code example provided previously
    cache       cache.Cache
    db          database.Database
    updateQueue chan *models.User
    wg          sync.WaitGroup
    stopOnce    sync.Once
    cancelCtx   context.Context
    cancelFunc  context.CancelFunc
    dbWriteMutex sync.Mutex // Simple lock for batch DB writes
    keyPrefix   string
    ttl         time.Duration
    batchSize   int
    interval    time.Duration
}

// New creates and starts a new Write-Behind strategy instance.
// (Implementation details in full code example - initializes struct, starts worker)
func New(cfg Config) *Strategy {
    // ... (Initialization code as provided previously) ...
    // For brevity, showing only the function signature here.
    // It sets defaults, creates the context/channel, and starts the worker goroutine.
    // Returns the *Strategy instance.
    // ... Full implementation in GitHub Repo ...
    panic("Full implementation required from GitHub Repo") // Placeholder
}

// UpdateUser queues a user update using Write-Behind strategy.
func (s *Strategy) UpdateUser(ctx context.Context, user *models.User) error {
    cacheKey := s.keyPrefix + user.ID
    s.cache.Set(cacheKey, user, s.ttl) // Write to cache immediately

    // Add to async queue
    select {
    case s.updateQueue <- user:
        return nil // Return success to the client immediately
    default:
        // Queue is full! Handle backpressure.
        log.Printf("[Write-Behind] Error: Update queue is full. Dropping update for user: %s\n", user.ID)
        return fmt.Errorf("update queue overflow for user %s", user.ID)
    }
}

// dbWriterWorker processes the queue (Implementation details in full code example)
func (s *Strategy) dbWriterWorker() {
    // ... (Worker loop logic: select on queue, ticker, context cancellation) ...
    // ... (Calls flushBatchToDB) ...
    // ... Full implementation in GitHub Repo ...
}

// flushBatchToDB writes a batch to the database (Implementation details in full code example)
func (s *Strategy) flushBatchToDB(ctx context.Context, batch []*models.User) {
    // ... (Handles batch write logic using s.db.BulkUpdateUsers) ...
    // ... Full implementation in GitHub Repo ...
}

// Stop gracefully shuts down the Write-Behind worker.
// (Implementation details in full code example - signals context, waits for WaitGroup)
func (s *Strategy) Stop() {
    // ... (Stop logic using stopOnce, cancelFunc, wg.Wait) ...
    // ... Full implementation in GitHub Repo ...
}

优点:
* 写入性能极高。
* 降低主数据源压力。

缺点:
* 数据丢失风险。
* 最终一致性。
* 实现复杂度高。

使用场景: 对写性能要求极高，写操作非常频繁，能容忍数据丢失风险和最终一致性。

5. Write-Around (绕写缓存)

核心思想：写操作直接绕过缓存，只写入主数据源。读操作时才将数据写入缓存（通常结合 Cache-Aside）。

工作流程：

写路径: 应用发起写请求，直接将数据写入主数据源。
读路径 (通常是Cache-Aside): 应用需要读取数据时，先检查缓存。如果未命中，则从主数据源读取，然后将数据存入缓存，最后返回。

Go 示例 (核心实现 – strategy/writearound/writearound.go):

package writearound

import (
    "context"
    "fmt"
    "log"
    "time"

    "cachestrategysdemo/internal/cache"
    "cachestrategysdemo/internal/database"
    "cachestrategysdemo/internal/models"
)

const logCacheKeyPrefix = "log:" // Example prefix for logs

// WriteLog writes log entry directly to DB, bypassing cache.
func WriteLog(ctx context.Context, entry *models.LogEntry, db database.Database) error {
    // 1. Write directly to DB
    log.Printf("[Write-Around Write] Writing log directly to DB (ID: %s)\n", entry.ID)
    err := db.InsertLogEntry(ctx, entry) // Use the appropriate DB method
    if err != nil {
        return fmt.Errorf("failed to write log to DB: %w", err)
    }
    return nil
}

// GetLog retrieves log entry, using Cache-Aside for reading.
func GetLog(ctx context.Context, logID string, db database.Database, memCache cache.Cache, ttl time.Duration) (*models.LogEntry, error) {
    cacheKey := logCacheKeyPrefix + logID

    // 1. Check cache (Cache-Aside read path)
    if cachedVal, found := memCache.Get(cacheKey); found {
        if entry, ok := cachedVal.(*models.LogEntry); ok {
            log.Println("[Write-Around Read] Cache Hit for log:", logID)
            return entry, nil
        }
        memCache.Delete(cacheKey)
    }

    // 2. Cache Miss
    log.Println("[Write-Around Read] Cache Miss for log:", logID)

    // 3. Fetch from Database
    entry, err := db.GetLogByID(ctx, logID) // Use the appropriate DB method
    if err != nil { return nil, fmt.Errorf("failed to get log from DB: %w", err) }
    if entry == nil { return nil, nil /* Not found */ }

    // 4. Store data into cache
    memCache.Set(cacheKey, entry, ttl)
    log.Println("[Write-Around Read] Log stored in cache:", logID)

    // 5. Return data
    return entry, nil
}

优点:
* 避免缓存污染。
* 写性能好。

缺点:
* 首次读取延迟高。
* 可能存在数据不一致（读路径上的 Cache-Aside 固有）。

使用场景: 写密集型，且写入的数据不太可能在短期内被频繁读取的场景。

总结与选型

没有银弹！ 选择哪种缓存策略，最终取决于你的具体业务场景对性能、数据一致性、可靠性和实现复杂度的权衡。

本文涉及的完整可运行示例代码已托管至GitHub，你可以通过这个链接访问。

希望这篇详解能帮助你在 Go 项目中更自信地选择和使用缓存策略。你最常用哪种缓存策略？在 Go 中实现时遇到过哪些坑？欢迎在评论区分享交流！

>注：本文代码由AI生成，可编译运行，但仅用于演示和辅助文章理解，切勿用于生产！

原「Gopher部落」已重装升级为「Go & AI 精进营」知识星球，快来加入星球，开启你的技术跃迁之旅吧！

我们致力于打造一个高品质的 Go 语言深度学习 与 AI 应用探索 平台。在这里，你将获得：

体系化 Go 核心进阶内容: 深入「Go原理课」、「Go进阶课」、「Go避坑课」等独家深度专栏，夯实你的 Go 内功。
前沿 Go+AI 实战赋能: 紧跟时代步伐，学习「Go+AI应用实战」、「Agent开发实战课」，掌握 AI 时代新技能。
星主 Tony Bai 亲自答疑: 遇到难题？星主第一时间为你深度解析，扫清学习障碍。
高活跃 Gopher 交流圈: 与众多优秀 Gopher 分享心得、讨论技术，碰撞思想火花。
独家资源与内容首发: 技术文章、课程更新、精选资源，第一时间触达。

衷心希望「Go & AI 精进营」能成为你学习、进步、交流的港湾。让我们在此相聚，享受技术精进的快乐！欢迎你的加入！

img{512x368}

商务合作方式：撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。如有需求，请扫描下方公众号二维码，与我私信联系。

惊！Go在十亿次循环和百万任务中表现不如Java，究竟为何？

十二月 2, 2024
2 条评论

本文永久链接 – https://tonybai.com/2024/12/02/why-go-sucks

编程语言比较的话题总是能吸引程序员的眼球！

近期外网的两篇编程语言对比的文章在国内程序员圈里引起热议。一篇是由Ben Dicken (@BenjDicken) 做的语言性能测试，对比了十多种主流语言在执行10亿次循环(一个双层循环：1万 * 10 万)的速度；另一篇则是一个名为hez2010的开发者做的内存开销测试，对比了多种语言在处理百万任务时的内存开销。

下面是这两项测试的结果示意图：

10亿循环测试结果

百万任务内存开销测试结果

我们看到：在这两项测试中，Go的表现不仅远不及NonGC的C/Rust，甚至还落后于Java，尤其是在内存开销测试中，Go的内存使用显著高于以“吃内存”著称的Java。这一结果让许多开发者感到意外，因为Go通常被认为是轻量级的语言，然而实际的测试结果却揭示了其在高并发场景下的“内存效率不足”。

那么究竟为何在这两项测试中，Go的表现都不及预期呢？在这篇文章中，我将探讨可能的原因，以供大家参考。

我们先从十亿次循环测试开始。

1. 循环测试跑的慢，都因编译器优化还不够

下面是作者给出的Go测试程序：

// why-go-sucks/billion-loops/go/code.go 

package main

import (
    "fmt"
    "math/rand"
    "os"
    "strconv"
)

func main() {
    input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
    if e != nil {
        panic(e)
    }
    u := int32(input)
    r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
    var a [10000]int32                  // Array of 10k elements initialized to 0
    for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
        for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
            a[i] = a[i] + j%u // Simple sum
        }
        a[i] += r // Add a random value to each element in array
    }
    fmt.Println(a[r]) // Print out a single element from the array
}

这段代码通过命令行参数获取一个整数，然后生成一个随机数，接着通过两层循环对一个数组的每个元素进行累加，最终输出该数组中以随机数为下标对应的数组元素的值。

我们再来看一下”竞争对手”的测试代码。C测试代码如下：

// why-go-sucks/billion-loops/c/code.c

#include "stdio.h"
#include "stdlib.h"
#include "stdint.h"

int main (int argc, char** argv) {
  int u = atoi(argv[1]);               // Get an input number from the command line
  int r = rand() % 10000;              // Get a random integer 0 <= r < 10k
  int32_t a[10000] = {0};              // Array of 10k elements initialized to 0
  for (int i = 0; i < 10000; i++) {    // 10k outer loop iterations
    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
    }
    a[i] += r;                         // Add a random value to each element in array
  }
  printf("%d\n", a[r]);                // Print out a single element from the array
}

下面是Java的测试代码：

// why-go-sucks/billion-loops/java/code.java

package jvm;

import java.util.Random;

public class code {

    public static void main(String[] args) {
        var u = Integer.parseInt(args[0]); // Get an input number from the command line
        var r = new Random().nextInt(10000); // Get a random number 0 <= r < 10k
        var a = new int[10000]; // Array of 10k elements initialized to 0
        for (var i = 0; i < 10000; i++) { // 10k outer loop iterations
            for (var j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
                a[i] = a[i] + j % u; // Simple sum
            }
            a[i] += r; // Add a random value to each element in array
        }
        System.out.println(a[r]); // Print out a single element from the array
    }
}

你可能不熟悉C或Java，但从代码的形式上来看，C、Java与Go的代码确实处于“同等条件”。这不仅意味着它们在相同的硬件和软件环境中运行，更包括它们采用了相同的计算逻辑和算法，以及一致的输入参数处理等方面的相似性。

为了确认一下原作者的测试结果，我在一台阿里云ECS上(amd64，8c32g，CentOS 7.9)对上面三个程序进行了测试(使用time命令测量计算耗时)，得到一个基线结果。我的环境下，C、Java和Go的编译器版本如下：

$go version
go version go1.23.0 linux/amd64

$java -version
openjdk version "17.0.9" 2023-10-17 LTS
OpenJDK Runtime Environment Zulu17.46+19-CA (build 17.0.9+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.46+19-CA (build 17.0.9+8-LTS, mixed mode, sharing)

$gcc -v
使用内建 specs。
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
目标：x86_64-redhat-linux
配置为：../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
线程模型：posix
gcc 版本 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

测试步骤与结果如下：

Go代码测试：

$cd why-go-sucks/billion-loops/go
$go build -o code code.go
$time ./code 10
456953

real    0m3.766s
user    0m3.767s
sys 0m0.007s

C代码测试：

$cd why-go-sucks/billion-loops/c
$gcc -O3 -std=c99 -o code code.c
$time ./code 10
459383

real    0m3.005s
user    0m3.005s
sys 0m0.000s

Java代码测试：

$javac -d . code.java
$time java -cp . jvm.code 10
456181

real    0m3.105s
user    0m3.092s
sys 0m0.027s

从测试结果看到(基于real时间)：采用-O3优化的C代码最快，Java落后一个身位，而Go则比C慢了25%，比Java慢了21%。

注：time命令的输出结果通常包含三个主要部分：real、user和sys。real是从命令开始执行到结束所经过的实际时间（墙钟时间），我们依次指标为准。user是程序在用户模式下执行所消耗的CPU时间。sys则是程序在内核模式下执行所消耗的CPU时间（系统调用）。如果总时间（real）略低于用户时间（user），这表明程序可能在某些时刻被调度或等待，而不是持续占用CPU。这种情况可能是由于输入输出操作、等待资源等原因。如果real时间显著小于user时间，这种情况通常发生在并发程序中，其中多个线程或进程在不同的时间段执行，导致总的用户CPU时间远大于实际的墙钟时间。sys时间保持较低，说明系统调用的频率较低，程序主要是执行计算而非进行大量的系统交互。

这时作为Gopher的你可能会说：原作者编写的Go测试代码不够优化，我们能优化到比C还快！

大家都知道原代码是不够优化的，随意改改计算逻辑就能带来大幅提升。但我们不能忘了“同等条件”这个前提。你采用的优化方法，其他语言（C、Java）也可以采用。

那么，在不改变“同等条件”的前提下，我们还能优化点啥呢？本着能提升一点是一点的思路，我们尝试从下面几个点优化一下，看看效果：

去除不必要的if判断
使用更快的rand实现
关闭边界检查
避免逃逸

下面是修改之后的代码：

// why-go-sucks/billion-loops/go/code_optimize.go 

package main

import (
    "fmt"
    "math/rand"
    "os"
    "strconv"
)

func main() {
    input, _ := strconv.Atoi(os.Args[1]) // Get an input number from the command line
    u := int32(input)
    r := int32(rand.Uint32() % 10000)   // Use Uint32 for faster random number generation
    var a [10000]int32                  // Array of 10k elements initialized to 0
    for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
        for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
            a[i] = a[i] + j%u // Simple sum
        }
        a[i] += r // Add a random value to each element in array
    }
    z := a[r]
    fmt.Println(z) // Print out a single element from the array
}

我们编译并运行一下测试：

$cd why-go-sucks/billion-loops/go
$go build -o code_optimize -gcflags '-B' code_optimize.go
$time ./code_optimize 10
459443

real    0m3.761s
user    0m3.759s
sys 0m0.011s

对比一下最初的测试结果，这些“所谓的优化”没有什么卵用，优化前你估计也能猜测到这个结果，因为除了关闭边界检查，其他优化都没有处于循环执行的热路径之上。

注：rand.Uint32() % 10000的确要比rand.Intn(10000)快，我自己的benchmark结果是快约1倍。

那Go程序究竟慢在哪里呢？在“同等条件”下，我能想到的只能是Go编译器后端在代码优化方面优化做的还不够，相较于GCC、Java等老牌编译器还有明显差距。

比如说，原先的代码中在内层循环中频繁访问a[i]，导致数组访问的读写操作较多（从内存加载a[i]，更新值后写回）。GCC和Java编译器在后端很可能做了这样的优化：将数组元素累积到一个临时变量中，并在外层循环结束后写回数组，这样做可以减少内层循环中的内存读写操作，充分利用CPU缓存和寄存器，加速数据处理。

注：数组从内存或缓存读，而一个临时变量很大可能是从寄存器读，那读取速度相差还是很大的。

如果我们手工在Go中实施这一优化，看看能达到什么效果呢？我们改一下最初版本的Go代码(code.go)，新代码如下：

// why-go-sucks/billion-loops/go/code_local_var.go 

package main

import (
    "fmt"
    "math/rand"
    "os"
    "strconv"
)

func main() {
    input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
    if e != nil {
        panic(e)
    }
    u := int32(input)
    r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
    var a [10000]int32                  // Array of 10k elements initialized to 0
    for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
        temp := a[i]
        for j := int32(0); j < 100000; j++ { // 100k inner loop iterations, per outer loop iteration
            temp += j % u // Simple sum
        }
        temp += r // Add a random value to each element in array
        a[i] = temp
    }
    fmt.Println(a[r]) // Print out a single element from the array
}

编译并运行测试：

$go build -o code_local_var code_local_var.go
$time ./code_local_var 10
459169

real    0m3.017s
user    0m3.017s
sys 0m0.007s

我们看到，测试结果直接就比Java略好一些了。显然Go编译器没有做这种优化，从code.go的汇编也大致可以看出来：

使用lensm生成的汇编与go源码对应关系

而Java显然做了这类优化，我们在原Java代码版本上按上述优化逻辑修改了一下：

// why-go-sucks/billion-loops/java/code_local_var.java

package jvm;

import java.util.Random;

public class code {

    public static void main(String[] args) {
        var u = Integer.parseInt(args[0]); // 获取命令行输入的整数
        var r = new Random().nextInt(10000); // 生成随机数 0 <= r < 10000
        var a = new int[10000]; // 定义长度为10000的数组a

        for (var i = 0; i < 10000; i++) { // 10k外层循环迭代
            var temp = a[i]; // 使用临时变量存储 a[i] 的值
            for (var j = 0; j < 100000; j++) { // 100k内层循环迭代，每次外层循环迭代
                temp += j % u; // 更新临时变量的值
            }
            a[i] = temp + r; // 将临时变量的值加上 r 并写回数组
        }
        System.out.println(a[r]); // 输出 a[r] 的值
    }
}

但从运行这个“优化”后的程序的结果来看，其对java代码的提升幅度几乎可以忽略不计：

$time java -cp . jvm.code 10
450375

real    0m3.043s
user    0m3.028s
sys 0m0.027s

这也直接证明了即便采用的是原版java代码，java编译器也会生成带有抽取局部变量这种优化的可执行代码，java程序员无需手工进行此类优化。

像这种编译器优化，还有不少，比如大家比较熟悉的循环展开(Loop Unrolling)也可以提升Go程序的性能：

// why-go-sucks/billion-loops/go/code_loop_unrolling.go

package main

import (
    "fmt"
    "math/rand"
    "os"
    "strconv"
)

func main() {
    input, e := strconv.Atoi(os.Args[1]) // Get an input number from the command line
    if e != nil {
        panic(e)
    }
    u := int32(input)
    r := int32(rand.Intn(10000))        // Get a random number 0 <= r < 10k
    var a [10000]int32                  // Array of 10k elements initialized to 0
    for i := int32(0); i < 10000; i++ { // 10k outer loop iterations
        var sum int32
        // Unroll inner loop in chunks of 4 for optimization
        for j := int32(0); j < 100000; j += 4 {
            sum += j % u
            sum += (j + 1) % u
            sum += (j + 2) % u
            sum += (j + 3) % u
        }
        a[i] = sum + r // Add the accumulated sum and random value
    }

    fmt.Println(a[r]) // Print out a single element from the array
}

运行这个Go测试程序，性能如下：

$go build -o code_loop_unrolling code_loop_unrolling.go
$time ./code_loop_unrolling 10
458908

real    0m2.937s
user    0m2.940s
sys 0m0.002s

循环展开可以增加指令级并行性，因为展开后的代码块中可以有更多的独立指令，比如示例中的计算j % u、(j+1) % u、(j+2) % u和(j+3) % u，这些计算操作是独立的，可以并行执行，打破了依赖链，从而更好地利用处理器的并行流水线。而原版Go代码中，每次迭代都会根据前一次迭代的结果更新a[i]，形成一个依赖链，这种顺序依赖性迫使处理器只能按顺序执行这些指令，导致流水线停顿。

不过其他语言也可以做同样的手工优化，比如我们对C代码做同样的优化(why-go-sucks/billion-loops/c/code_loop_unrolling.c)，c测试程序的性能可以提升至2.7s水平，这也证明了初版C程序即便在-O3的情况下编译器也没有自动为其做这个优化：

$time ./code_loop_unrolling 10
459383

real    0m2.723s
user    0m2.722s
sys 0m0.001s

到这里我们就不再针对这个10亿次循环的性能问题做进一步展开了，从上面的探索得到的初步结论就是Go编译器优化做的还不到位所致，期待后续Go团队能在编译器优化方面投入更多精力，争取早日追上GCC/Clang、Java这些成熟的编译器优化水平。

下面我们再来看Go在百万任务场景下内存开销大的“问题”。

2. 内存占用高，问题出在Goroutine实现原理

我们先来看第二个问题的测试代码：

package main

import (
    "fmt"
    "os"
    "strconv"
    "sync"
    "time"
)

func main() {
    numRoutines := 100000
    if len(os.Args) > 1 {
        n, err := strconv.Atoi(os.Args[1])
        if err == nil {
            numRoutines = n
        }
    }

    var wg sync.WaitGroup
    for i := 0; i < numRoutines; i++ {
        wg.Add(1)
        go func() {
            time.Sleep(10 * time.Second)
            wg.Done()
        }()
    }
    wg.Wait()
}

这个代码其实就是根据传入的task数量启动等同数量的goroutine，然后每个goroutine模拟工作负载sleep 10s，这等效于百万长连接的场景，只有连接，但没有收发消息。

相对于上一个问题，这个问题更好解释一些。

Go使用的groutine是一种有栈协程，文章中使用的是每个task一个goroutine的模型，且维护百万任务一段时间，这会真实创建百万个goroutine（G数据结构），并为其分配栈空间(2k起步)，这样你可以算一算，不考虑其他结构的占用，仅每个goroutine的栈空间所需的内存都是极其可观的：

mem = 1000000 * 2000 Bytes = 2000000000 Bytes = 2G Bytes

所以启动100w goroutine，保底就2GB内存出去了，这与原作者测试的结果十分契合(原文是2.5GB多)。并且，内存还会随着goroutine数量增长而线性增加。

那么如何能减少内存使用呢？如果采用每个task一个goroutine的模型，这个内存占用很难省去，除非将来Go团队对goroutine实现做大修。

如果task是网络通信相关的，可以使用类似gnet这样的直接基于epoll建构的框架，其主要的节省在于不会启动那么多goroutine，而是通过一个goroutine池来处理数据，每个池中的goroutine负责一批网络连接或网络请求。

在一些Gopher的印象中，Goroutine一旦分配就不回收，这会使他们会误认为一旦分配了100w goroutine，这2.5G内存空间将始终被占用，真实情况是这样么？我们用一个示例程序验证一下就好了：

// why-go-sucks/million-tasks/million-tasks.go

package main

import (
    "fmt"
    "log"
    "os"
    "os/signal"
    "runtime"
    "sync"
    "syscall"
    "time"
)

// 打印当前内存使用情况和相关信息
func printMemoryUsage() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    // 获取当前 goroutine 数量
    numGoroutines := runtime.NumGoroutine()

    // 获取当前线程数量
    numThreads := runtime.NumCPU() // Go runtime 不直接提供线程数量，但可以通过 NumCPU 获取逻辑处理器数量

    fmt.Printf("======>\n")
    fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
    fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
    fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
    fmt.Printf("\tNumGC = %v", m.NumGC)
    fmt.Printf("\tNumGoroutines = %v", numGoroutines)
    fmt.Printf("\tNumThreads = %v\n", numThreads)
    fmt.Printf("<======\n\n")
}

// 将字节转换为 MB
func bToMb(b uint64) uint64 {
    return b / 1024 / 1024
}

func main() {
    const signal1Goroutines = 900000
    const signal2Goroutines = 90000
    const signal3Goroutines = 10000

    // 用于接收退出信号
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    // 控制 goroutine 的退出
    signal1Chan := make(chan struct{})
    signal2Chan := make(chan struct{})
    signal3Chan := make(chan struct{})

    var wg sync.WaitGroup
    ticker := time.NewTicker(5 * time.Second)
    go func() {
        for range ticker.C {
            printMemoryUsage()
        }
    }()

    // 等待退出信号
    go func() {
        count := 0
        for {
            <-sigChan
            count++
            if count == 1 {
                log.Println("收到第一类goroutine退出信号")
                close(signal1Chan) // 关闭 signal1Chan，通知第一类 goroutine 退出
                continue
            }
            if count == 2 {
                log.Println("收到第二类goroutine退出信号")
                close(signal2Chan) // 关闭 signal2Chan，通知第二类 goroutine 退出
                continue
            }
            log.Println("收到第三类goroutine退出信号")
            close(signal3Chan) // 关闭 signal3Chan，通知第三类 goroutine 退出
            return
        }
    }()

    // 启动第一类 goroutine（在收到 signal1 时退出）
    log.Println("开始启动第一类goroutine...")
    for i := 0; i < signal1Goroutines; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // 模拟工作
            for {
                select {
                case <-signal1Chan:
                    return
                default:
                    time.Sleep(10 * time.Second) // 模拟一些工作
                }
            }
        }(i)
    }
    log.Println("启动第一类goroutine(900000) ok")

    time.Sleep(time.Second * 5)

    // 启动第二类 goroutine（在收到 signal2 时退出）
    log.Println("开始启动第二类goroutine...")
    for i := 0; i < signal2Goroutines; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // 模拟工作
            for {
                select {
                case <-signal2Chan:
                    return
                default:
                    time.Sleep(10 * time.Second) // 模拟一些工作
                }
            }
        }(i)
    }
    log.Println("启动第二类goroutine(90000) ok")

    time.Sleep(time.Second * 5)

    // 启动第三类goroutine（随程序退出而退出）
    log.Println("开始启动第三类goroutine...")
    for i := 0; i < signal3Goroutines; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // 模拟工作
            for {
                select {
                case <-signal3Chan:
                    return
                default:
                    time.Sleep(10 * time.Second) // 模拟一些工作
                }
            }
        }(i)
    }
    log.Println("启动第三类goroutine(90000) ok")

    // 等待所有 goroutine 完成
    wg.Wait()
    fmt.Println("所有 goroutine 已退出，程序结束")
}

这个程序我就不详细解释了。大致分三类goroutine，第一类90w个，在我发送第一个ctrl+c信号后退出，第二类9w个，在我发送第二个ctrl+c信号后退出，最后一类1w个，随着程序退出而退出。

在我的执行环境下编译和执行一下这个程序，并结合runtime输出以及使用top -p pid的方式查看其内存占用：

$go build million-tasks.go
$./million-tasks 

2024/12/01 22:07:03 开始启动第一类goroutine...
2024/12/01 22:07:05 启动第一类goroutine(900000) ok
======>
Alloc = 511 MiB TotalAlloc = 602 MiB    Sys = 2311 MiB  NumGC = 9   NumGoroutines = 900004  NumThreads = 8
<======

2024/12/01 22:07:10 开始启动第二类goroutine...
2024/12/01 22:07:11 启动第二类goroutine(90000) ok
======>
Alloc = 577 MiB TotalAlloc = 668 MiB    Sys = 2553 MiB  NumGC = 9   NumGoroutines = 990004  NumThreads = 8
<======

2024/12/01 22:07:16 开始启动第三类goroutine...
2024/12/01 22:07:16 启动第三类goroutine(90000) ok
======>
Alloc = 597 MiB TotalAlloc = 688 MiB    Sys = 2593 MiB  NumGC = 9   NumGoroutines = 1000004 NumThreads = 8
<======

======>
Alloc = 600 MiB TotalAlloc = 690 MiB    Sys = 2597 MiB  NumGC = 9   NumGoroutines = 1000004 NumThreads = 8
<======
... ...

======>
Alloc = 536 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 10  NumGoroutines = 1000004 NumThreads = 8
<======

100w goroutine全部创建ok后，我们查看一下top输出：

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5800 root      20   0 3875556   2.5g    988 S  54.0  8.2   0:30.92 million-tasks

我们看到RES为2.5g，和我们预期的一致！

接下来，我们停掉第一批90w个goroutine，看RES是否会下降，何时会下降！

输入ctrl+c，停掉第一批90w goroutine：

^C2024/12/01 22:10:15 收到第一类goroutine退出信号
======>
Alloc = 536 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 10  NumGoroutines = 723198  NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 10  NumGoroutines = 723198  NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 10  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 536 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 10  NumGoroutines = 100004  NumThreads = 8
<======
... ...

但同时刻的top显示RES并没有变化：

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5800 root      20   0 3875812   2.5g    988 S   0.0  8.2   0:56.38 million-tasks

等待两个GC间隔的时间后(大约4分)，Goroutine的栈空间被释放：

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 12  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 12  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 12  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 12  NumGoroutines = 100004  NumThreads = 8
<======

top显示RES从2.5g下降为大概700多MB（RES的单位是KB）：

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5800 root      20   0 3875812 764136    992 S   0.0  2.4   1:01.87 million-tasks

接下来，我们再停掉第二批9w goroutine：

^C2024/12/01 22:16:21 收到第二类goroutine退出信号
======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 13  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 13  NumGoroutines = 100004  NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 13  NumGoroutines = 10004   NumThreads = 8
<======

======>
Alloc = 465 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 13  NumGoroutines = 10004   NumThreads = 8
<======

此时，top值也没立即改变：

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5800 root      20   0 3875812 764136    992 S   0.0  2.4   1:05.99 million-tasks

大约等待一个GC间隔(2分钟)后，top中RES下降：

======>
Alloc = 458 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 14  NumGoroutines = 10004   NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 14  NumGoroutines = 10004   NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 14  NumGoroutines = 10004   NumThreads = 8
<======

RES变为不到700M：

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5800 root      20   0 3875812 699156    992 S   0.0  2.2   1:06.75 million-tasks

第三次按下ctrl+c，程序退出：

^C2024/12/01 22:18:46 收到第三类goroutine退出信号
======>
Alloc = 458 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 14  NumGoroutines = 10003   NumThreads = 8
<======

======>
Alloc = 458 MiB TotalAlloc = 695 MiB    Sys = 2606 MiB  NumGC = 14  NumGoroutines = 10003   NumThreads = 8
<======

所有 goroutine 已退出，程序结束

我们看到Go是会回收goroutine占用的内存空间的，并且归还给OS，只是这种归还比较lazy。尤其是，第二次停止goroutine前，go程序剩下10w goroutine，按理论来讲需占用大约200MB的空间，实际上却是700多MB；第二次停止goroutine后，goroutine数量降为1w，理论占用应该在20MB，但实际却是600多MB，我们看到go运行时这种lazy归还OS内存的行为可能也是“故意为之”，是为了避免反复从OS申请和归还内存。

3. 小结

本文主要探讨了Go语言在十亿次循环和百万任务的测试中的表现令人意外地逊色于Java和C语言的原因。我认为Go在循环执行中的慢速表现，主要是其编译器优化不足，影响了执行效率。而在内存开销方面，Go的Goroutine实现是使得内存使用量大幅增加的“罪魁祸首”，这是由于每个Goroutine初始都会分配固定大小的栈空间。

通过本文的探讨，我的主要目的是希望大家不要以讹传讹，而是要搞清楚背后的真正原因，并正视Go在某些方面的不足，以及其当前在某些应用上下文中的局限性。同时，也希望Go开发团队在编译器优化方面进行更多投入，以提升Go在高性能计算领域的竞争力。

本文涉及的源码可以在这里下载。

4. 参考资料

Billion nested loop iterations – https://benjdd.com/languages/
How Much Memory Do You Need in 2024 to Run 1 Million Concurrent Tasks? – https://hez2010.github.io/async-runtimes-benchmarks-2024/

Gopher部落知识星球在2024年将继续致力于打造一个高品质的Go语言学习和交流平台。我们将继续提供优质的Go技术文章首发和阅读体验。同时，我们也会加强代码质量和最佳实践的分享，包括如何编写简洁、可读、可测试的Go代码。此外，我们还会加强星友之间的交流和互动。欢迎大家踊跃提问，分享心得，讨论技术。我会在第一时间进行解答和交流。我衷心希望Gopher部落可以成为大家学习、进步、交流的港湾。让我相聚在Gopher部落，享受coding的快乐! 欢迎大家踊跃加入！

img{512x368}

著名云主机服务厂商DigitalOcean发布最新的主机计划，入门级Droplet配置升级为：1 core CPU、1G内存、25G高速SSD，价格5$/月。有使用DigitalOcean需求的朋友，可以打开这个链接地址：https://m.do.co/c/bff6eed92687 开启你的DO主机之路。

Gopher Daily(Gopher每日新闻) – https://gopherdaily.tonybai.com

我的联系方式：