Golang | Tony Bai

标签 Golang 下的文章

Gopher Daily改版了

八月 6, 2023
0 条评论

本文永久链接 – https://tonybai.com/2023/08/06/gopherdaily-revamped

已经记不得GopherDaily是何时创建的了，翻了一下GopherDaily项目的commit history，才发现我的这个个人项目是2019年9月创建的，最初内容组织很粗糙，但我的编辑制作的热情很高，基本能坚持每日一发，甚至节假日也不停刊：

该项目的初衷就是为广大Gopher带来新鲜度较高的Go语言技术资料。项目创建以来得到了很多Gopher的支持，甚至经常收到催刊邮件/私信以及主动report订阅列表问题的情况。

不过近一年多，订阅GopherDaily的Gopher可能会发现：GopherDaily已经做不到“Daily”了！究其原因还是个人精力有限，每刊编辑都要花费很多时间。但个人又不想暂停该项目，怎么办呢？近段时间我就在着手思考提升GopherDaily制作效率的问题。

一个可行的方案就是“半自动化”！在这次从“纯人工”到“半自动化”的过程中，顺便对GopherDaily做了一次“改版”。

在这篇文章中，我就来说说结合大语言模型和Go技术栈实现GopherDaily制作的“半自动化”以及GopherDaily“改版”的历程。

1. “半自动化”的制作流程

当前的GopherDaily每刊的制作过程十分费时费力，下面是图示的制作过程：

这里面所有步骤都是人工处理，且收集资料、阅读摘要以及选优最为耗时。

那么这些环节中哪些可以自动化呢？收集、摘要、翻译、生成与发布都可以自动化，只有“选优”需要人工干预，下面是改进后的“半自动化”流程：

我们看到整个过程分为三个阶段：

第一阶段(stage1)：自动化的收集资料，并生成第二阶段的输入issue-20230805-stage1.json(以2023年8月5日为例)。
第二阶段(stage2)：对输入的issue-20230805-stage1.json中的资料进行选优，删掉不适合或质量不高的资料，当然也可以手工加入一些自动化收集阶段未找到的优秀资料；然后基于选优后的内容生成issue-20230805-stage2.json，作为第三阶段的输入。
第三阶段(stage3)：这一阶段也都是自动化的，程序基于第二阶段的输出issue-20230805-stage2.json中内容，逐条生成摘要，并将文章标题和摘要翻译为中文，最后生成两个文件：issue-20230805.html和issue-20230805.md，前者将被发布到邮件列表和gopherdaily github page上，而后者则会被上传到传统的GopherDaily归档项目中。

我个人的目标是将改进后的整个“半自动化”过程缩短在半小时以内，从试运行效果来看，基本达成！

下面我就来简要聊聊各个自动化步骤是如何实现的。

2. Go技术资料自动收集

GopherDaily制作效率提升的一个大前提就是可以将最耗时的“资料收集”环节自动化了！而要做到这一点，下面两方面不可或缺：

资料源集合
针对资料源的最新文章的感知和拉取

2.1 资料源的来源

资料源从哪里来呢？答案是以往的GopherDaily issues中！四年来积累了数千篇文章的URL，从这些issue中提取URL并按URL中域名/域名+一级路径的出现次数做个排序，得到GopherDaily改版后的初始资料源集合。虽然这个方案并不完美，但至少可以满足改版后的初始需求，后续还可以对资料源做渐进的手工优化。

提取文本中URL的方法有很多种，常用的一种方法是使用正则表达式，下面是一个从markdown或txt文件中提取url并输出的例子：

// extract-url/main.go

package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "regexp"
)

func main() {
    var allURLs []string

    err := filepath.Walk("/Users/tonybai/blog/gitee.com/gopherdaily", func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }

        if info.IsDir() {
            return nil
        }

        if filepath.Ext(path) != ".txt" && filepath.Ext(path) != ".md" {
            return nil
        }

        file, err := os.Open(path)
        if err != nil {
            return err
        }
        defer file.Close()

        scanner := bufio.NewScanner(file)
        urlRegex := regexp.MustCompile(`https?://[^\s]+`)

        for scanner.Scan() {
            urls := urlRegex.FindAllString(scanner.Text(), -1)
            allURLs = append(allURLs, urls...)
        }

        return scanner.Err()
    })

    if err != nil {
        fmt.Println(err)
        return
    }

    for _, url := range allURLs {
        fmt.Printf("%s\n", url)
    }
    fmt.Println(len(allURLs))
}

我将提取并分析后得到的URL放入一个临时文件中，因为仅提取URL还不够，要做为资料源，我们需要的是对应站点的feed地址。那么如何提取出站点的feed地址呢？我们看下面这个例子：

// extract_rss/main.go

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "regexp"
)

var (
    rss  = regexp.MustCompile(`<link[^>]*type="application/rss\+xml"[^>]*href="([^"]+)"`)
    atom = regexp.MustCompile(`<link[^>]*type="application/atom\+xml"[^>]*href="([^"]+)"`)
)

func main() {
    var sites = []string{
        "http://research.swtch.com",
        "https://tonybai.com",
        "https://benhoyt.com/writings",
    }

    for _, url := range sites {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("Error fetching URL:", err)
            continue
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("Error reading response body:", err)
            continue
        }

        matches := rss.FindAllStringSubmatch(string(body), -1)
        if len(matches) == 0 {
            matches = atom.FindAllStringSubmatch(string(body), -1)
            if len(matches) == 0 {
                continue
            }
        }

        fmt.Printf("\"%s\" -> rss: \"%s\"\n", url, matches[0][1])
    }
}

执行上述程序，我们得到如下结果：

"http://research.swtch.com" -> rss: "http://research.swtch.com/feed.atom"
"https://tonybai.com" -> rss: "https://tonybai.com/feed/"
"https://benhoyt.com/writings" -> rss: "/writings/rss.xml"

我们看到不同站点的rss地址值着实不同，有些是完整的url地址，有些则是相对于主站点url的路径，这个还需要进一步判断与处理，但这里就不赘述了。

我们将提取和处理后的feed地址放入feeds.toml中作为资料源集合。每天开始制作Gopher Daily时，就从读取这个文件中的资料源开始。

2.2 感知和拉取资料源的更新

有了资料源集合后，我们接下来要做的就是定期感知和拉取资料源的最新更新（暂定24小时以内的），再说白点就是拉取资料源的feed数据，解析内容，得到资料源的最新文章信息。针对feed拉取与解析，Go社区有现成的工具，比如gofeed就是其中功能较为齐全且表现稳定的一个。

下面是使用Gofeed抓取feed地址并获取文章信息的例子：

// gofeed/main.go

package main

import (
    "fmt"

    "github.com/mmcdole/gofeed"
)

func main() {

    var feeds = []string{
        "https://research.swtch.com/feed.atom",
        "https://tonybai.com/feed/",
        "https://benhoyt.com/writings/rss.xml",
    }

    fp := gofeed.NewParser()
    for _, feed := range feeds {
        feedInfo, err := fp.ParseURL(feed)
        if err != nil {
            fmt.Printf("parse feed [%s] error: %s\n", feed, err.Error())
            continue
        }
        fmt.Printf("The info of feed url: %s\n", feed)
        for _, item := range feedInfo.Items {
            fmt.Printf("\t title: %s\n", item.Title)
            fmt.Printf("\t link: %s\n", item.Link)
            fmt.Printf("\t published: %s\n", item.Published)
        }
        fmt.Println("")
    }
}

该程序分别解析三个feed地址，并分别输出得到的文章信息，包括标题、url和发布时间。运行上述程序我们将得到如下结果：

$go run main.go
The info of feed url: https://research.swtch.com/feed.atom
     title: Coroutines for Go
     link: http://research.swtch.com/coro
     published: 2023-07-17T14:00:00-04:00
     title: Storing Data in Control Flow
     link: http://research.swtch.com/pcdata
     published: 2023-07-11T14:00:00-04:00
     title: Opting In to Transparent Telemetry
     link: http://research.swtch.com/telemetry-opt-in
     published: 2023-02-24T08:59:00-05:00
     title: Use Cases for Transparent Telemetry
     link: http://research.swtch.com/telemetry-uses
     published: 2023-02-08T08:00:03-05:00
     title: The Design of Transparent Telemetry
     link: http://research.swtch.com/telemetry-design
     published: 2023-02-08T08:00:02-05:00
     title: Transparent Telemetry for Open-Source Projects
     link: http://research.swtch.com/telemetry-intro
     published: 2023-02-08T08:00:01-05:00
     title: Transparent Telemetry
     link: http://research.swtch.com/telemetry
     published: 2023-02-08T08:00:00-05:00
     title: The Magic of Sampling, and its Limitations
     link: http://research.swtch.com/sample
     published: 2023-02-04T12:00:00-05:00
     title: Go’s Version Control History
     link: http://research.swtch.com/govcs
     published: 2022-02-14T10:00:00-05:00
     title: What NPM Should Do Today To Stop A New Colors Attack Tomorrow
     link: http://research.swtch.com/npm-colors
     published: 2022-01-10T11:45:00-05:00
     title: Our Software Dependency Problem
     link: http://research.swtch.com/deps
     published: 2019-01-23T11:00:00-05:00
     title: What is Software Engineering?
     link: http://research.swtch.com/vgo-eng
     published: 2018-05-30T10:00:00-04:00
     title: Go and Dogma
     link: http://research.swtch.com/dogma
     published: 2017-01-09T09:00:00-05:00
     title: A Tour of Acme
     link: http://research.swtch.com/acme
     published: 2012-09-17T11:00:00-04:00
     title: Minimal Boolean Formulas
     link: http://research.swtch.com/boolean
     published: 2011-05-18T00:00:00-04:00
     title: Zip Files All The Way Down
     link: http://research.swtch.com/zip
     published: 2010-03-18T00:00:00-04:00
     title: UTF-8: Bits, Bytes, and Benefits
     link: http://research.swtch.com/utf8
     published: 2010-03-05T00:00:00-05:00
     title: Computing History at Bell Labs
     link: http://research.swtch.com/bell-labs
     published: 2008-04-09T00:00:00-04:00
     title: Using Uninitialized Memory for Fun and Profit
     link: http://research.swtch.com/sparse
     published: 2008-03-14T00:00:00-04:00
     title: Play Tic-Tac-Toe with Knuth
     link: http://research.swtch.com/tictactoe
     published: 2008-01-25T00:00:00-05:00
     title: Crabs, the bitmap terror!
     link: http://research.swtch.com/crabs
     published: 2008-01-09T00:00:00-05:00

The info of feed url: https://tonybai.com/feed/
     title: Go语言开发者的Apache Arrow使用指南：读写Parquet文件
     link: https://tonybai.com/2023/07/31/a-guide-of-using-apache-arrow-for-gopher-part6/
     published: Mon, 31 Jul 2023 13:07:28 +0000
     title: Go语言开发者的Apache Arrow使用指南：扩展compute包
     link: https://tonybai.com/2023/07/22/a-guide-of-using-apache-arrow-for-gopher-part5/
     published: Sat, 22 Jul 2023 13:58:57 +0000
     title: 使用testify包辅助Go测试指南
     link: https://tonybai.com/2023/07/16/the-guide-of-go-testing-with-testify-package/
     published: Sun, 16 Jul 2023 07:09:56 +0000
     title: Go语言开发者的Apache Arrow使用指南：数据操作
     link: https://tonybai.com/2023/07/13/a-guide-of-using-apache-arrow-for-gopher-part4/
     published: Thu, 13 Jul 2023 14:41:25 +0000
     title: Go语言开发者的Apache Arrow使用指南：高级数据结构
     link: https://tonybai.com/2023/07/08/a-guide-of-using-apache-arrow-for-gopher-part3/
     published: Sat, 08 Jul 2023 15:27:54 +0000
     title: Apache Arrow：驱动列式分析性能和连接性的提升[译]
     link: https://tonybai.com/2023/07/01/arrow-columnar-analytics/
     published: Sat, 01 Jul 2023 14:42:29 +0000
     title: Go语言开发者的Apache Arrow使用指南：内存管理
     link: https://tonybai.com/2023/06/30/a-guide-of-using-apache-arrow-for-gopher-part2/
     published: Fri, 30 Jun 2023 14:00:59 +0000
     title: Go语言开发者的Apache Arrow使用指南：数据类型
     link: https://tonybai.com/2023/06/25/a-guide-of-using-apache-arrow-for-gopher-part1/
     published: Sat, 24 Jun 2023 20:43:38 +0000
     title: Go语言包设计指南
     link: https://tonybai.com/2023/06/18/go-package-design-guide/
     published: Sun, 18 Jun 2023 15:03:41 +0000
     title: Go GC：了解便利背后的开销
     link: https://tonybai.com/2023/06/13/understand-go-gc-overhead-behind-the-convenience/
     published: Tue, 13 Jun 2023 14:00:16 +0000

The info of feed url: https://benhoyt.com/writings/rss.xml
     title: The proposal to enhance Go's HTTP router
     link: https://benhoyt.com/writings/go-servemux-enhancements/
     published: Mon, 31 Jul 2023 08:00:00 +1200
     title: Scripting with Go: a 400-line Git client that can create a repo and push itself to GitHub
     link: https://benhoyt.com/writings/gogit/
     published: Sat, 29 Jul 2023 16:30:00 +1200
     title: Names should be as short as possible while still being clear
     link: https://benhoyt.com/writings/short-names/
     published: Mon, 03 Jul 2023 21:00:00 +1200
     title: Lookup Tables (Forth Dimensions XIX.3)
     link: https://benhoyt.com/writings/forth-lookup-tables/
     published: Sat, 01 Jul 2023 22:10:00 +1200
     title: For Python packages, file structure != API
     link: https://benhoyt.com/writings/python-api-file-structure/
     published: Fri, 30 Jun 2023 22:50:00 +1200
     title: Designing Pythonic library APIs
     link: https://benhoyt.com/writings/python-api-design/
     published: Sun, 18 Jun 2023 21:00:00 +1200
     title: From Go on EC2 to Fly.io: +fun, −$9/mo
     link: https://benhoyt.com/writings/flyio/
     published: Mon, 27 Feb 2023 10:00:00 +1300
     title: Code coverage for your AWK programs
     link: https://benhoyt.com/writings/goawk-coverage/
     published: Sat, 10 Dec 2022 13:41:00 +1300
     title: I/O is no longer the bottleneck
     link: https://benhoyt.com/writings/io-is-no-longer-the-bottleneck/
     published: Sat, 26 Nov 2022 22:20:00 +1300
     title: microPledge: our startup that (we wish) competed with Kickstarter
     link: https://benhoyt.com/writings/micropledge/
     published: Mon, 14 Nov 2022 20:00:00 +1200
     title: Rob Pike's simple C regex matcher in Go
     link: https://benhoyt.com/writings/rob-pike-regex/
     published: Fri, 12 Aug 2022 14:00:00 +1200
     title: Tools I use to build my website
     link: https://benhoyt.com/writings/tools-i-use-to-build-my-website/
     published: Tue, 02 Aug 2022 19:00:00 +1200
     title: Modernizing AWK, a 45-year old language, by adding CSV support
     link: https://benhoyt.com/writings/goawk-csv/
     published: Tue, 10 May 2022 09:30:00 +1200
     title: Prig: like AWK, but uses Go for "scripting"
     link: https://benhoyt.com/writings/prig/
     published: Sun, 27 Feb 2022 18:20:00 +0100
     title: Go performance from version 1.2 to 1.18
     link: https://benhoyt.com/writings/go-version-performance/
     published: Fri, 4 Feb 2022 09:30:00 +1300
     title: Optimizing GoAWK with a bytecode compiler and virtual machine
     link: https://benhoyt.com/writings/goawk-compiler-vm/
     published: Thu, 3 Feb 2022 22:25:00 +1300
     title: AWKGo, an AWK-to-Go compiler
     link: https://benhoyt.com/writings/awkgo/
     published: Mon, 22 Nov 2021 00:10:00 +1300
     title: Improving the code from the official Go RESTful API tutorial
     link: https://benhoyt.com/writings/web-service-stdlib/
     published: Wed, 17 Nov 2021 07:00:00 +1300
     title: Simple Lists: a tiny to-do list app written the old-school way (server-side Go, no JS)
     link: https://benhoyt.com/writings/simple-lists/
     published: Mon, 4 Oct 2021 07:30:00 +1300
     title: Structural pattern matching in Python 3.10
     link: https://benhoyt.com/writings/python-pattern-matching/
     published: Mon, 20 Sep 2021 19:30:00 +1200
     title: Mugo, a toy compiler for a subset of Go that can compile itself
     link: https://benhoyt.com/writings/mugo/
     published: Mon, 12 Apr 2021 20:30:00 +1300
     title: How to implement a hash table (in C)
     link: https://benhoyt.com/writings/hash-table-in-c/
     published: Fri, 26 Mar 2021 20:30:00 +1300
     title: Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust
     link: https://benhoyt.com/writings/count-words/
     published: Mon, 15 Mar 2021 20:30:00 +1300
     title: The small web is beautiful
     link: https://benhoyt.com/writings/the-small-web-is-beautiful/
     published: Tue, 2 Mar 2021 06:50:00 +1300
     title: Coming in Go 1.16: ReadDir and DirEntry
     link: https://benhoyt.com/writings/go-readdir/
     published: Fri, 29 Jan 2021 10:00:00 +1300
     title: Fuzzing in Go
     link: https://lwn.net/Articles/829242/
     published: Tue, 25 Aug 2020 08:00:00 +1200
     title: Searching code with Sourcegraph
     link: https://lwn.net/Articles/828748/
     published: Mon, 17 Aug 2020 08:00:00 +1200
     title: Different approaches to HTTP routing in Go
     link: https://benhoyt.com/writings/go-routing/
     published: Fri, 31 Jul 2020 08:00:00 +1200
     title: Go filesystems and file embedding
     link: https://lwn.net/Articles/827215/
     published: Fri, 31 Jul 2020 00:00:00 +1200
     title: The sad, slow-motion death of Do Not Track
     link: https://lwn.net/Articles/826575/
     published: Wed, 22 Jul 2020 11:00:00 +1200
     title: What's new in Lua 5.4
     link: https://lwn.net/Articles/826134/
     published: Wed, 15 Jul 2020 11:00:00 +1200
     title: Hugo: a static-site generator
     link: https://lwn.net/Articles/825507/
     published: Wed, 8 Jul 2020 11:00:00 +1200
     title: Generics for Go
     link: https://lwn.net/Articles/824716/
     published: Wed, 1 Jul 2020 11:00:00 +1200
     title: More alternatives to Google Analytics
     link: https://lwn.net/Articles/824294/
     published: Wed, 24 Jun 2020 11:00:00 +1200
     title: Lightweight Google Analytics alternatives
     link: https://lwn.net/Articles/822568/
     published: Wed, 17 Jun 2020 11:00:00 +1200
     title: An intro to Go for non-Go developers
     link: https://benhoyt.com/writings/go-intro/
     published: Wed, 10 Jun 2020 23:38:00 +1200
     title: ZZT in Go (using a Pascal-to-Go converter)
     link: https://benhoyt.com/writings/zzt-in-go/
     published: Fri, 29 May 2020 17:25:00 +1200
     title: Testing in Go: philosophy and tools
     link: https://lwn.net/Articles/821358/
     published: Wed, 27 May 2020 12:00:00 +1200
     title: The state of the AWK
     link: https://lwn.net/Articles/820829/
     published: Wed, 20 May 2020 12:00:00 +1200
     title: What's coming in Go 1.15
     link: https://lwn.net/Articles/820217/
     published: Wed, 13 May 2020 12:00:00 +1200
     title: Don't try to sanitize input. Escape output.
     link: https://benhoyt.com/writings/dont-sanitize-do-escape/
     published: Thu, 27 Feb 2020 19:27:00 +1200
     title: SEO for Software Engineers
     link: https://benhoyt.com/writings/seo-for-software-engineers/
     published: Thu, 20 Feb 2020 12:00:00 +1200

注：gofeed抓取的item.Description是文章的摘要。但这个摘要不一定可以真实反映文章内容的概要，很多就是文章内容的前N个字而已。

Gopher Daily半自动化改造的另外一个技术课题是对拉取的文章做自动摘要与标题摘要的翻译，下面我们继续来看一下这个课题如何攻破。

注：目前微信公众号的优质文章尚未实现自动拉取，还需手工选优。

3. 自动摘要与翻译

对一段文本提取摘要和翻译均属于自然语言处理(NLP)范畴，说实话，Go在这个范畴中并不活跃，很难找到像样的开源算法实现或工具可直接使用。我的解决方案是借助云平台供应商的NLP API来做，这里我用的是微软Azure的相关API。

在使用现成的API之前，我们需要抓取特定url上的html页面并提取出要进行摘要的文本。

3.1 提取html中的原始文本

我们通过http.Get可以获取到一个文章URL上的html页面的所有内容，但如何提取出主要文本以供后续提取摘要使用呢？每个站点上的html内容都包含了很多额外内容，比如header、footer、分栏、边栏、导航栏等，这些内容对摘要的生成具有一定影响。我们最好能将这些额外内容剔除掉。但html的解析还是十分复杂的，我的解决方案是将html转换为markdown后再提交给摘要API。

html-to-markdown是一款不错的转换工具，它最吸引我的是可以删除原HTML中的一些tag，并自定义一些rule。下面的例子就是用html-to-markdown获取文章原始本文的例子：

// get-original-text/main.go

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"

    md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
    s, err := getOriginText("http://research.swtch.com/coro")
    if err != nil {
        panic(err)
    }
    fmt.Println(s)
}

func getOriginText(url string) (string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, _ := ioutil.ReadAll(resp.Body)

    converter := md.NewConverter("", true, nil).Remove("header",
        "footer", "aside", "table", "nav") //"table" is used to store code

    markdown, err := converter.ConvertString(string(body))
    if err != nil {
        return "", err
    }
    return markdown, nil
}

在这个例子中，我们删除了header、footer、边栏、导航栏等，尽可能的保留主要文本。针对这个例子我就不执行了，大家可以自行执行并查看执行结果。

3.2 提取摘要

我们通过微软Azure提供的摘要提取API进行摘要提取。微软Azure的这个API提供的免费额度，足够我这边制作Gopher Daily使用了。

注：要使用微软Azure提供的各类免费API，需要先注册Azure的账户。目前摘要提取API仅在North Europe, East US, UK South三个region提供，创建API服务时别选错Region了。我这里用的是East US。

注：Azure控制台较为难用，大家要有心理准备:)。

微软这个摘要API十分复杂，下面给出一个用curl调用API的示例。

摘要提取API的使用分为两步。第一步是请求对原始文本进行摘要处理，比如：

$curl -i -X POST https://gopherdaily-summarization.cognitiveservices.azure.com/language/analyze-text/jobs?api-version=2022-10-01-preview \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your_api_key" \
-d \
'
{
  "displayName": "Document Abstractive Summarization Task Example",
  "analysisInput": {
    "documents": [
      {
        "id": "1",
        "language": "en",
        "text": "At Microsoft, we have been on a quest to advance AI beyond existing techniques, by taking a more holistic, human-centric approach to learning and understanding. As Chief Technology Officer of Azure AI services, I have been working with a team of amazing scientists and engineers to turn this quest into a reality. In my role, I enjoy a unique perspective in viewing the relationship among three attributes of human cognition: monolingual text (X), audio or visual sensory signals, (Y) and multilingual (Z). At the intersection of all three, there’s magic—what we call XYZ-code as illustrated in Figure 1—a joint representation to create more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision: cross-domain transfer learning, spanning modalities and languages. The goal is to have pre-trained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today. Over the past five years, we have achieved human performance on benchmarks in conversational speech recognition, machine translation, conversational question answering, machine reading comprehension, and image captioning. These five breakthroughs provided us with strong signals toward our more ambitious aspiration to produce a leap in AI capabilities, achieving multi-sensory and multilingual learning that is closer in line with how humans learn and understand. I believe the joint XYZ-code is a foundational component of this aspiration, if grounded with external knowledge sources in the downstream AI tasks."
      }
    ]
  },
  "tasks": [
    {
      "kind": "AbstractiveSummarization",
      "taskName": "Document Abstractive Summarization Task 1",
      "parameters": {
        "sentenceCount": 1
      }
    }
  ]
}
'

请求成功后，我们将得到一段应答，应答中包含类似operation-location的一段地址：

Operation-Location:[https://gopherdaily-summarization.cognitiveservices.azure.com/language/analyze-text/jobs/66e7e3a1-697c-4fad-864c-d84c647682b4?api-version=2022-10-01-preview]

这段地址就是第二步的请求地址，第二步是从这个地址获取摘要后的本文：

$curl -X GET https://gopherdaily-summarization.cognitiveservices.azure.com/language/analyze-text/jobs/66e7e3a1-697c-4fad-864c-d84c647682b4\?api-version\=2022-10-01-preview \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your_api_key"
{"jobId":"66e7e3a1-697c-4fad-864c-d84c647682b4","lastUpdatedDateTime":"2023-07-27T11:09:45Z","createdDateTime":"2023-07-27T11:09:44Z","expirationDateTime":"2023-07-28T11:09:44Z","status":"succeeded","errors":[],"displayName":"Document Abstractive Summarization Task Example","tasks":{"completed":1,"failed":0,"inProgress":0,"total":1,"items":[{"kind":"AbstractiveSummarizationLROResults","taskName":"Document Abstractive Summarization Task 1","lastUpdateDateTime":"2023-07-27T11:09:45.8892126Z","status":"succeeded","results":{"documents":[{"summaries":[{"text":"Microsoft has been working to advance AI beyond existing techniques by taking a more holistic, human-centric approach to learning and understanding, and the Chief Technology Officer of Azure AI services, who enjoys a unique perspective in viewing the relationship among three attributes of human cognition: monolingual text, audio or visual sensory signals, and multilingual, has created XYZ-code, a joint representation to create more powerful AI that can speak, hear, see, and understand humans better.","contexts":[{"offset":0,"length":1619}]}],"id":"1","warnings":[]}],"errors":[],"modelVersion":"latest"}}]}}%

大家可以根据请求和应答的JSON结构，结合一些json-to-struct工具自行实现Azure摘要API的Go代码。

3.3 翻译

Azure的翻译API相对于摘要API要简单的多。

下面是使用curl演示翻译API的示例：

$curl -X POST "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=zh" \
     -H "Ocp-Apim-Subscription-Key:your_api_key" \
     -H "Ocp-Apim-Subscription-Region:westcentralus" \
     -H "Content-Type: application/json" \
     -d "[{'Text':'Hello, what is your name?'}]"

[{"detectedLanguage":{"language":"en","score":1.0},"translations":[{"text":"你好，你叫什么名字？","to":"zh-Hans"}]}]%

大家可以根据请求和应答的JSON结构，结合一些json-to-struct工具自行实现Azure翻译API的Go代码。

对于源文章是中文的，我们可以无需调用该API进行翻译，下面是一个判断字符串是否为中文的函数：

func isChinese(s string) bool {
    for _, r := range s {
        if unicode.Is(unicode.Scripts["Han"], r) {
            return true
        }
    }
    return false
}

4. 页面样式设计与html生成

这次Gopher Daily改版，我为Gopher Daily提供了Web版和邮件列表版，但页面设计是我最不擅长的。好在，和四年前相比，IT技术又有了进一步的发展，以ChatGPT为代表的大语言模型如雨后春笋般层出不穷，我可以借助大模型的帮助来为我设计和实现一个简单的html页面了。下图就是这次改版后的第一版页面：

整个页面分为四大部分：Go、云原生(与Go关系紧密，程序员相关，架构相关的内容也放在这部分)、AI(当今流行)以及热门工具与项目(目前主要是github trending中每天Go项目的top列表中的内容)。

每一部分每个条目都包含文章标题、文章链接和文章的摘要，摘要的增加可以帮助大家更好的预览文章内容。

html和markdown的生成都是基于Go的template技术，template也是借助claude.ai设计与实现的，这里就不赘述了。

5. 服务器选型

以前的Gopher Daily仅是在github上的一个开源项目，大家通过watch来订阅。此外，Basten Gao维护着一个第三方的邮件列表，在此也对Basten Gao对Gopher Daily的长期支持表示感谢。

如今改版后，我原生提供了Gopher Daily的Web版，我需要为Gopher Daily选择服务器。

简单起见，我选用了github page来承载Gopher Daily的Web版。

至于邮件列表的订阅、取消订阅，我则是开发了一个小小的服务，跑在Digital Ocean的VPS上。

在选择反向代理web服务器时，我放弃了nginx，选择了同样Go技术栈实现的Caddy。Caddy最大好处就是易上手，且默认自动支持HTTPS，我无需自行用工具向免费证书机构(如 Let’s Encrypt或ZeroSSL)去申请和维护证书。

6 小结

这次改版后的Gopher Daily应得上那句话：“麻雀虽小，五脏俱全”：我为此开发了三个工具，一个服务。

当然Gopher Daily还在持续优化，后续也会根据Gopher们的反馈作适当调整。

摘要和翻译目前使用Azure API，后续可能会改造为使用类ChatGPT的API。

此外，知识星球Gopher部落的星友们依然拥有“先睹为快”的权益。

本文示例代码可以在这里下载。

Gopher Daily网页版 – https://gopherdaily.tonybai.com
Gopher Daily邮件列表订阅 – https://gopherdaily.tonybai.com/subscribe
Gopher Daily项目归档(markdown版本) – https://github.com/bigwhite/gopherdaily

“Gopher部落”知识星球旨在打造一个精品Go学习和进阶社群！高品质首发Go技术文章，“三天”首发阅读权，每年两期Go语言发展现状分析，每天提前1小时阅读到新鲜的Gopher日报，网课、技术专栏、图书内容前瞻，六小时内必答保证等满足你关于Go语言生态的所有需求！2023年，Gopher部落将进一步聚焦于如何编写雅、地道、可读、可测试的Go代码，关注代码质量并深入理解Go核心技术，并继续加强与星友的互动。欢迎大家加入！

img{512x368}

著名云主机服务厂商DigitalOcean发布最新的主机计划，入门级Droplet配置升级为：1 core CPU、1G内存、25G高速SSD，价格5$/月。有使用DigitalOcean需求的朋友，可以打开这个链接地址：https://m.do.co/c/bff6eed92687 开启你的DO主机之路。

Gopher Daily(Gopher每日新闻) – https://gopherdaily.tonybai.com

我的联系方式：

微博(暂不可用)：https://weibo.com/bigwhite20xx
微博2：https://weibo.com/u/6484441286
博客：tonybai.com
github: https://github.com/bigwhite
Gopher Daily归档 – https://github.com/bigwhite/gopherdaily

商务合作方式：撰稿、出书、培训、在线课程、合伙创业、咨询、广告合作。

Go语言开发者的Apache Arrow使用指南：读写Parquet文件

七月 31, 2023
0 条评论

本文永久链接 – https://tonybai.com/2023/07/31/a-guide-of-using-apache-arrow-for-gopher-part6

Apache Arrow是一种开放的、与语言无关的列式内存格式，在本系列文章的前几篇中，我们都聚焦于内存表示与内存操作。

但对于一个数据库系统或大数据分析平台来说，数据不能也无法一直放在内存中，虽说目前内存很大也足够便宜了，但其易失性也决定了我们在特定时刻还是要将数据序列化后存储到磁盘或一些低成本的存储服务上(比如AWS的S3等)。

那么将Arrow序列化成什么存储格式呢？CSV、JSON？显然这些格式都不是为最大限度提高空间效率以及数据检索能力而设计的。在数据分析领域，Apache Parquet是与Arrow相似的一种开放的、面向列的数据存储格式，它被设计用于高效的数据编码和检索并最大限度提高空间效率。

和Arrow是一种内存格式不同，Parquet是一种数据文件格式。此外，Arrow和Parquet在设计上也做出了各自的一些取舍。Arrow旨在由矢量化计算内核对数据进行操作，提供对任何数组索引的 O(1) 随机访问查找能力；而Parquet为了最大限度提高空间效率，采用了可变长度编码方案和块压缩来大幅减小数据大小，这些技术都是以丧失高性能随机存取查找为代价的。

Parquet也是Apache的顶级项目，大多数实现了Arrow的编程语言也都提供了支持Arrow格式与Parquet文件相互转换的库实现，Go也不例外。在本文中，我们就来粗浅看一下如何使用Go实现Parquet文件的读写，即Arrow和Parquet的相互转换。

注：关于Parquet文件的详细格式(也蛮复杂)，我可能会在后续文章中说明。

1. Parquet简介

如果不先说一说Parquet文件格式，后面的内容理解起来会略有困难的。下面是一个Parquet文件的结构示意图：

图来自https://www.uber.com/blog/cost-efficiency-big-data

我们看到Parquet格式的文件被分为多个row group，每个row group由每一列的列块(column chunk)组成。考虑到磁盘存储的特点，每个列块又分为若干个页。这个列块中的诸多同构类型的列值可以在编码和压缩后存储在各个页中。下面是Parquet官方文档中Parquet文件中数据存储的具体示意图：

我们看到Parquet按row group顺序向后排列，每个row group中column chunk也是依column次序向后排列的。

注：关于上图中repetion level和definition level这样的高级概念，不会成为理解本文内容的障碍，我们将留到后续文章中系统说明。

2. Arrow Table <-> Parquet

有了上面Parquet文件格式的初步知识后，接下来我们就来看看如何使用Go在Arrow和Parquet之间进行转换。

在《高级数据结构》一文中，我们学习了Arrow Table和Record Batch两种高级结构。接下来我们就来看看如何将Table或Record与Parquet进行转换。一旦像Table、Record Batch这样的高级结构的转换搞定了，那Arrow中的那些简单数据类型)也就不在话下了。况且在实际项目中，我们面对更多的也是Arrow的高级数据结构(Table或Record)与Parquet的转换。

我们先来看看Table。

2.1 Table -> Parquet

通过在《高级数据结构》一文，我们知道了Arrow Table的每一列本质上就是Schema+Chunked Array，这和Parquet的文件格式具有较高的适配度。

Arrow Go的parquet实现提供对了Table的良好支持，我们通过一个WriteTable函数就可以将内存中的Arrow Table持久化为Parquet格式的文件，我们来看看下面这个示例：

// flat_table_to_parquet.go

package main

import (
    "os"

    "github.com/apache/arrow/go/v13/arrow"
    "github.com/apache/arrow/go/v13/arrow/array"
    "github.com/apache/arrow/go/v13/arrow/memory"
    "github.com/apache/arrow/go/v13/parquet/pqarrow"
)

func main() {
    schema := arrow.NewSchema(
        []arrow.Field{
            {Name: "col1", Type: arrow.PrimitiveTypes.Int32},
            {Name: "col2", Type: arrow.PrimitiveTypes.Float64},
            {Name: "col3", Type: arrow.BinaryTypes.String},
        },
        nil,
    )

    col1 := func() *arrow.Column {
        chunk := func() *arrow.Chunked {
            ib := array.NewInt32Builder(memory.DefaultAllocator)
            defer ib.Release()

            ib.AppendValues([]int32{1, 2, 3}, nil)
            i1 := ib.NewInt32Array()
            defer i1.Release()

            ib.AppendValues([]int32{4, 5, 6, 7, 8, 9, 10}, nil)
            i2 := ib.NewInt32Array()
            defer i2.Release()

            c := arrow.NewChunked(
                arrow.PrimitiveTypes.Int32,
                []arrow.Array{i1, i2},
            )
            return c
        }()
        defer chunk.Release()

        return arrow.NewColumn(schema.Field(0), chunk)
    }()
    defer col1.Release()

    col2 := func() *arrow.Column {
        chunk := func() *arrow.Chunked {
            fb := array.NewFloat64Builder(memory.DefaultAllocator)
            defer fb.Release()

            fb.AppendValues([]float64{1.1, 2.2, 3.3, 4.4, 5.5}, nil)
            f1 := fb.NewFloat64Array()
            defer f1.Release()

            fb.AppendValues([]float64{6.6, 7.7}, nil)
            f2 := fb.NewFloat64Array()
            defer f2.Release()

            fb.AppendValues([]float64{8.8, 9.9, 10.0}, nil)
            f3 := fb.NewFloat64Array()
            defer f3.Release()

            c := arrow.NewChunked(
                arrow.PrimitiveTypes.Float64,
                []arrow.Array{f1, f2, f3},
            )
            return c
        }()
        defer chunk.Release()

        return arrow.NewColumn(schema.Field(1), chunk)
    }()
    defer col2.Release()

    col3 := func() *arrow.Column {
        chunk := func() *arrow.Chunked {
            sb := array.NewStringBuilder(memory.DefaultAllocator)
            defer sb.Release()

            sb.AppendValues([]string{"s1", "s2"}, nil)
            s1 := sb.NewStringArray()
            defer s1.Release()

            sb.AppendValues([]string{"s3", "s4"}, nil)
            s2 := sb.NewStringArray()
            defer s2.Release()

            sb.AppendValues([]string{"s5", "s6", "s7", "s8", "s9", "s10"}, nil)
            s3 := sb.NewStringArray()
            defer s3.Release()

            c := arrow.NewChunked(
                arrow.BinaryTypes.String,
                []arrow.Array{s1, s2, s3},
            )
            return c
        }()
        defer chunk.Release()

        return arrow.NewColumn(schema.Field(2), chunk)
    }()
    defer col3.Release()

    var tbl arrow.Table
    tbl = array.NewTable(schema, []arrow.Column{*col1, *col2, *col3}, -1)
    defer tbl.Release()

    f, err := os.Create("flat_table.parquet")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    err = pqarrow.WriteTable(tbl, f, 1024, nil, pqarrow.DefaultWriterProps())
    if err != nil {
        panic(err)
    }
}

我们基于arrow的Builder模式以及NewTable创建了一个拥有三个列的Table(该table的创建例子来自于《高级数据结构》一文)。有了table后，我们直接调用pqarrow的WriteTable函数即可将table写成parquet格式的文件。

我们来运行一下上述代码：

$go run flat_table_to_parquet.go

执行完上面命令后，当前目录下会出现一个flat_table.parquet的文件！

我们如何查看该文件内容来验证写入的数据是否与table一致呢？arrow go的parquet实现提供了一个parquet_reader的工具可以帮助我们做到这点，你可以执行如下命令安装这个工具：

$go install github.com/apache/arrow/go/v13/parquet/cmd/parquet_reader@latest

之后我们就可以执行下面命令查看我们刚刚生成的flat_table.parquet文件的内容了：

$parquet_reader flat_table.parquet
File name: flat_table.parquet
Version: v2.6
Created By: parquet-go version 13.0.0-SNAPSHOT
Num Rows: 10
Number of RowGroups: 1
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: col1 (INT32/INT_32)
Column 1: col2 (DOUBLE)
Column 2: col3 (BYTE_ARRAY/UTF8)
--- Row Group: 0  ---
--- Total Bytes: 396  ---
--- Rows: 10  ---
Column 0
 Values: 10, Min: 1, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 111, Compressed Size: 111
Column 1
 Values: 10, Min: 1.1, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 169, Compressed Size: 169
Column 2
 Values: 10, Min: [115 49], Max: [115 57], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 116, Compressed Size: 116
--- Values ---
col1              |col2              |col3              |
1                 |1.100000          |s1                |
2                 |2.200000          |s2                |
3                 |3.300000          |s3                |
4                 |4.400000          |s4                |
5                 |5.500000          |s5                |
6                 |6.600000          |s6                |
7                 |7.700000          |s7                |
8                 |8.800000          |s8                |
9                 |9.900000          |s9                |
10                |10.000000         |s10               |

parquet_reader列出了parquet文件的meta数据和每个row group中的column列的值，从输出来看，与我们arrow table的数据是一致的。

我们再回头看一下WriteTable函数，它的原型如下：

func WriteTable(tbl arrow.Table, w io.Writer, chunkSize int64,
                props *parquet.WriterProperties, arrprops ArrowWriterProperties) error

这里说一下WriteTable的前三个参数，第一个是通过NewTable得到的arrow table结构，第二个参数也容易理解，就是一个可写的文件描述符，我们通过os.Create可以轻松拿到，第三个参数为chunkSize，这个chunkSize是什么呢？会对parquet文件的写入结果有影响么？其实这个chunkSize就是每个row group中的行数。同时parquet通过该chunkSize也可以计算出arrow table转parquet文件后有几个row group。

我们示例中的chunkSize值为1024，因此整个parquet文件只有一个row group。下面我们将其值改为5，再来看看输出的parquet文件内容：

$parquet_reader flat_table.parquet
File name: flat_table.parquet
Version: v2.6
Created By: parquet-go version 13.0.0-SNAPSHOT
Num Rows: 10
Number of RowGroups: 2
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: col1 (INT32/INT_32)
Column 1: col2 (DOUBLE)
Column 2: col3 (BYTE_ARRAY/UTF8)
--- Row Group: 0  ---
--- Total Bytes: 288  ---
--- Rows: 5  ---
Column 0
 Values: 5, Min: 1, Max: 5, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 86, Compressed Size: 86
Column 1
 Values: 5, Min: 1.1, Max: 5.5, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 122, Compressed Size: 122
Column 2
 Values: 5, Min: [115 49], Max: [115 53], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 80, Compressed Size: 80
--- Values ---
col1              |col2              |col3              |
1                 |1.100000          |s1                |
2                 |2.200000          |s2                |
3                 |3.300000          |s3                |
4                 |4.400000          |s4                |
5                 |5.500000          |s5                |

--- Row Group: 1  ---
--- Total Bytes: 290  ---
--- Rows: 5  ---
Column 0
 Values: 5, Min: 6, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 86, Compressed Size: 86
Column 1
 Values: 5, Min: 6.6, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 122, Compressed Size: 122
Column 2
 Values: 5, Min: [115 49 48], Max: [115 57], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 82, Compressed Size: 82
--- Values ---
col1              |col2              |col3              |
6                 |6.600000          |s6                |
7                 |7.700000          |s7                |
8                 |8.800000          |s8                |
9                 |9.900000          |s9                |
10                |10.000000         |s10               |

当chunkSize值为5后，parquet文件的row group变成了2，然后parquet_reader工具会按照两个row group的格式分别输出它们的meta信息和列值信息。

接下来，我们再来看一下如何从生成的parquet文件中读取数据并转换为arrow table。

2.2 Table <- Parquet

和WriteTable函数对应，arrow提供了ReadTable函数读取parquet文件并转换为内存中的arrow table，下面是代码示例：

// flat_table_from_parquet.go
func main() {
    f, err := os.Open("flat_table.parquet")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    tbl, err := pqarrow.ReadTable(context.Background(), f, parquet.NewReaderProperties(memory.DefaultAllocator),
        pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
    if err != nil {
        panic(err)
    }

    dumpTable(tbl)
}

func dumpTable(tbl arrow.Table) {
    s := tbl.Schema()
    fmt.Println(s)
    fmt.Println("------")

    fmt.Println("the count of table columns=", tbl.NumCols())
    fmt.Println("the count of table rows=", tbl.NumRows())
    fmt.Println("------")

    for i := 0; i < int(tbl.NumCols()); i++ {
        col := tbl.Column(i)
        fmt.Printf("arrays in column(%s):\n", col.Name())
        chunk := col.Data()
        for _, arr := range chunk.Chunks() {
            fmt.Println(arr)
        }
        fmt.Println("------")
    }
}

我们看到ReadTable使用起来非常简单，由于parquet文件中包含meta信息，我们调用ReadTable时，一些参数使用默认值或零值即可。

我们运行一下上述代码：

$go run flat_table_from_parquet.go
schema:
  fields: 3
    - col1: type=int32
      metadata: ["PARQUET:field_id": "-1"]
    - col2: type=float64
      metadata: ["PARQUET:field_id": "-1"]
    - col3: type=utf8
      metadata: ["PARQUET:field_id": "-1"]
------
the count of table columns= 3
the count of table rows= 10
------
arrays in column(col1):
[1 2 3 4 5 6 7 8 9 10]
------
arrays in column(col2):
[1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 10]
------
arrays in column(col3):
["s1" "s2" "s3" "s4" "s5" "s6" "s7" "s8" "s9" "s10"]
------

2.3 Table -> Parquet(压缩)

前面提到，Parquet文件格式的设计充分考虑了空间利用效率，再加上其是面向列存储的格式，Parquet支持列数据的压缩存储，并支持为不同列选择不同的压缩算法。

前面示例中调用的WriteTable在默认情况下是不对列进行压缩的，这从parquet_reader读取到的列的元信息中也可以看到(比如下面的Compression: UNCOMPRESSED)：

Column 0
 Values: 10, Min: 1, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 111, Compressed Size: 111

我们在WriteTable时也可以通过parquet.WriterProperties参数来为每个列指定压缩算法，比如下面示例：

// flat_table_to_parquet_compressed.go

var tbl arrow.Table
tbl = array.NewTable(schema, []arrow.Column{*col1, *col2, *col3}, -1)
defer tbl.Release()

f, err := os.Create("flat_table_compressed.parquet")
if err != nil {
    panic(err)
}
defer f.Close()

wp := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy),
    parquet.WithCompressionFor("col1", compress.Codecs.Brotli))
err = pqarrow.WriteTable(tbl, f, 1024, wp, pqarrow.DefaultWriterProps())
if err != nil {
    panic(err)
}

在这段代码中，我们通过parquet.NewWriterProperties构建了新的WriterProperties，这个新的Properties默认所有列使用Snappy压缩，针对col1列使用Brotli算法压缩。我们将压缩后的数据写入flat_table_compressed.parquet文件。使用go run运行flat_table_to_parquet_compressed.go，然后使用parquet_reader查看文件flat_table_compressed.parquet得到如下结果：

$go run flat_table_to_parquet_compressed.go
$parquet_reader flat_table_compressed.parquet
File name: flat_table_compressed.parquet
Version: v2.6
Created By: parquet-go version 13.0.0-SNAPSHOT
Num Rows: 10
Number of RowGroups: 1
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: col1 (INT32/INT_32)
Column 1: col2 (DOUBLE)
Column 2: col3 (BYTE_ARRAY/UTF8)
--- Row Group: 0  ---
--- Total Bytes: 352  ---
--- Rows: 10  ---
Column 0
 Values: 10, Min: 1, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: BROTLI, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 111, Compressed Size: 98
Column 1
 Values: 10, Min: 1.1, Max: 10, Null Values: 0, Distinct Values: 0
 Compression: SNAPPY, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 168, Compressed Size: 148
Column 2
 Values: 10, Min: [115 49], Max: [115 57], Null Values: 0, Distinct Values: 0
 Compression: SNAPPY, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 116, Compressed Size: 106
--- Values ---
col1              |col2              |col3              |
1                 |1.100000          |s1                |
2                 |2.200000          |s2                |
3                 |3.300000          |s3                |
4                 |4.400000          |s4                |
5                 |5.500000          |s5                |
6                 |6.600000          |s6                |
7                 |7.700000          |s7                |
8                 |8.800000          |s8                |
9                 |9.900000          |s9                |
10                |10.000000         |s10               |

从parquet_reader的输出，我们可以看到：各个Column的Compression信息不再是UNCOMPRESSED了，并且三个列在经过压缩后的Size与未压缩对比都有一定的减小：

Column 0:
    Compression: BROTLI, Uncompressed Size: 111, Compressed Size: 98
Column 1:
    Compression: SNAPPY, Uncompressed Size: 168, Compressed Size: 148
Column 2:
    Compression: SNAPPY, Uncompressed Size: 116, Compressed Size: 106

从文件大小对比也能体现出压缩算法的作用：

-rw-r--r--   1 tonybai  staff   786  7 22 08:06 flat_table.parquet
-rw-r--r--   1 tonybai  staff   742  7 20 13:19 flat_table_compressed.parquet

Go的parquet实现支持多种压缩算法：

// github.com/apache/arrow/go/parquet/compress/compress.go

var Codecs = struct {
    Uncompressed Compression
    Snappy       Compression
    Gzip         Compression
    // LZO is unsupported in this library since LZO license is incompatible with Apache License
    Lzo    Compression
    Brotli Compression
    // LZ4 unsupported in this library due to problematic issues between the Hadoop LZ4 spec vs regular lz4
    // see: http://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAAri41v24xuA8MGHLDvgSnE+7AAgOhiEukemW_oPNHMvfMmrWw@mail.gmail.com%3E
    Lz4  Compression
    Zstd Compression
}{
    Uncompressed: Compression(parquet.CompressionCodec_UNCOMPRESSED),
    Snappy:       Compression(parquet.CompressionCodec_SNAPPY),
    Gzip:         Compression(parquet.CompressionCodec_GZIP),
    Lzo:          Compression(parquet.CompressionCodec_LZO),
    Brotli:       Compression(parquet.CompressionCodec_BROTLI),
    Lz4:          Compression(parquet.CompressionCodec_LZ4),
    Zstd:         Compression(parquet.CompressionCodec_ZSTD),
}

你只需要根据你的列的类型选择最适合的压缩算法即可。

2.4 Table <- Parquet(压缩)

接下来，我们来读取这个数据经过压缩的Parquet。读取压缩的Parquet是否需要在ReadTable时传入特殊的Properties呢？答案是不需要！因为Parquet文件中存储了元信息(metadata)，可以帮助ReadTable使用对应的算法解压缩并提取信息：

// flat_table_from_parquet_compressed.go

func main() {
    f, err := os.Open("flat_table_compressed.parquet")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    tbl, err := pqarrow.ReadTable(context.Background(), f, parquet.NewReaderProperties(memory.DefaultAllocator),
        pqarrow.ArrowReadProperties{}, memory.DefaultAllocator)
    if err != nil {
        panic(err)
    }

    dumpTable(tbl)
}

运行这段程序，我们就可以读取压缩后的parquet文件了：

$go run flat_table_from_parquet_compressed.go
schema:
  fields: 3
    - col1: type=int32
      metadata: ["PARQUET:field_id": "-1"]
    - col2: type=float64
      metadata: ["PARQUET:field_id": "-1"]
    - col3: type=utf8
      metadata: ["PARQUET:field_id": "-1"]
------
the count of table columns= 3
the count of table rows= 10
------
arrays in column(col1):
[1 2 3 4 5 6 7 8 9 10]
------
arrays in column(col2):
[1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 10]
------
arrays in column(col3):
["s1" "s2" "s3" "s4" "s5" "s6" "s7" "s8" "s9" "s10"]
------

接下来，我们来看看Arrow中的另外一种高级数据结构Record Batch如何实现与Parquet文件格式的转换。

3. Arrow Record Batch <-> Parquet

注：大家可以先阅读/温习一下《高级数据结构》一文来了解一下Record Batch的概念。

3.1 Record Batch -> Parquet

Arrow Go实现将一个Record Batch作为一个Row group来对应。下面的程序向Parquet文件中写入了三个record，我们来看一下：

// flat_record_to_parquet.go

func main() {
    var records []arrow.Record
    schema := arrow.NewSchema(
        []arrow.Field{
            {Name: "archer", Type: arrow.BinaryTypes.String},
            {Name: "location", Type: arrow.BinaryTypes.String},
            {Name: "year", Type: arrow.PrimitiveTypes.Int16},
        },
        nil,
    )

    rb := array.NewRecordBuilder(memory.DefaultAllocator, schema)
    defer rb.Release()

    for i := 0; i < 3; i++ {
        postfix := strconv.Itoa(i)
        rb.Field(0).(*array.StringBuilder).AppendValues([]string{"tony" + postfix, "amy" + postfix, "jim" + postfix}, nil)
        rb.Field(1).(*array.StringBuilder).AppendValues([]string{"beijing" + postfix, "shanghai" + postfix, "chengdu" + postfix}, nil)
        rb.Field(2).(*array.Int16Builder).AppendValues([]int16{1992 + int16(i), 1993 + int16(i), 1994 + int16(i)}, nil)
        rec := rb.NewRecord()
        records = append(records, rec)
    }

    // write to parquet
    f, err := os.Create("flat_record.parquet")
    if err != nil {
        panic(err)
    }

    props := parquet.NewWriterProperties()
    writer, err := pqarrow.NewFileWriter(schema, f, props,
        pqarrow.DefaultWriterProps())
    if err != nil {
        panic(err)
    }
    defer writer.Close()

    for _, rec := range records {
        if err := writer.Write(rec); err != nil {
            panic(err)
        }
        rec.Release()
    }
}

和调用WriteTable完成table到parquet文件的写入不同，这里我们创建了一个FileWriter，通过FileWriter将构建出的Record Batch逐个写入。运行上述代码生成flat_record.parquet文件并使用parquet_reader展示该文件的内容：

$go run flat_record_to_parquet.go
$parquet_reader flat_record.parquet
File name: flat_record.parquet
Version: v2.6
Created By: parquet-go version 13.0.0-SNAPSHOT
Num Rows: 9
Number of RowGroups: 3
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: archer (BYTE_ARRAY/UTF8)
Column 1: location (BYTE_ARRAY/UTF8)
Column 2: year (INT32/INT_16)
--- Row Group: 0  ---
--- Total Bytes: 255  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 48], Max: [116 111 110 121 48], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 79
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 48], Max: [115 104 97 110 103 104 97 105 48], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 99
Column 2
 Values: 3, Min: 1992, Max: 1994, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 77
--- Values ---
archer            |location          |year              |
tony0             |beijing0          |1992              |
amy0              |shanghai0         |1993              |
jim0              |chengdu0          |1994              |

--- Row Group: 1  ---
--- Total Bytes: 255  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 49], Max: [116 111 110 121 49], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 79
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 49], Max: [115 104 97 110 103 104 97 105 49], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 99
Column 2
 Values: 3, Min: 1993, Max: 1995, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 77
--- Values ---
archer            |location          |year              |
tony1             |beijing1          |1993              |
amy1              |shanghai1         |1994              |
jim1              |chengdu1          |1995              |

--- Row Group: 2  ---
--- Total Bytes: 255  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 50], Max: [116 111 110 121 50], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 79
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 50], Max: [115 104 97 110 103 104 97 105 50], Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 99
Column 2
 Values: 3, Min: 1994, Max: 1996, Null Values: 0, Distinct Values: 0
 Compression: UNCOMPRESSED, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 77
--- Values ---
archer            |location          |year              |
tony2             |beijing2          |1994              |
amy2              |shanghai2         |1995              |
jim2              |chengdu2          |1996              |

我们看到parquet_reader分别输出了三个row group的元数据和列值，每个row group与我们写入的一个record对应。

那读取这样的parquet文件与ReadTable有何不同呢？我们继续往下看。

3.2 Record Batch <- Parquet

下面是用于读取

// flat_record_from_parquet.go
func main() {
    f, err := os.Open("flat_record.parquet")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    rdr, err := file.NewParquetReader(f)
    if err != nil {
        panic(err)
    }
    defer rdr.Close()

    arrRdr, err := pqarrow.NewFileReader(rdr,
        pqarrow.ArrowReadProperties{
            BatchSize: 3,
        }, memory.DefaultAllocator)
    if err != nil {
        panic(err)
    }

    s, _ := arrRdr.Schema()
    fmt.Println(*s)

    rr, err := arrRdr.GetRecordReader(context.Background(), nil, nil)
    if err != nil {
        panic(err)
    }

    for {
        rec, err := rr.Read()
        if err != nil && err != io.EOF {
            panic(err)
        }
        if err == io.EOF {
            break
        }
        fmt.Println(rec)
    }
}

我们看到相对于将parquet转换为table，将parquet转换为record略为复杂一些，这里的一个关键是在调用NewFileReader时传入的ArrowReadProperties中的BatchSize字段，要想正确读取出record，这个BatchSize需适当填写。这个BatchSize会告诉Reader 每个读取的Record Batch的长度，也就是row数量。这里传入的是3，即3个row为一个Recordd batch。

下面是运行上述程序的结果：

$go run flat_record_from_parquet.go
{[{archer 0x26ccc00 false {[PARQUET:field_id] [-1]}} {location 0x26ccc00 false {[PARQUET:field_id] [-1]}} {year 0x26ccc00 false {[PARQUET:field_id] [-1]}}] map[archer:[0] location:[1] year:[2]] {[] []} 0}
record:
  schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
  rows: 3
  col[0][archer]: ["tony0" "amy0" "jim0"]
  col[1][location]: ["beijing0" "shanghai0" "chengdu0"]
  col[2][year]: [1992 1993 1994]

record:
  schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
  rows: 3
  col[0][archer]: ["tony1" "amy1" "jim1"]
  col[1][location]: ["beijing1" "shanghai1" "chengdu1"]
  col[2][year]: [1993 1994 1995]

record:
  schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
  rows: 3
  col[0][archer]: ["tony2" "amy2" "jim2"]
  col[1][location]: ["beijing2" "shanghai2" "chengdu2"]
  col[2][year]: [1994 1995 1996]

我们看到：每3行被作为一个record读取出来了。如果将BatchSize改为5，则输出如下：

$go run flat_record_from_parquet.go
{[{archer 0x26ccc00 false {[PARQUET:field_id] [-1]}} {location 0x26ccc00 false {[PARQUET:field_id] [-1]}} {year 0x26ccc00 false {[PARQUET:field_id] [-1]}}] map[archer:[0] location:[1] year:[2]] {[] []} 0}
record:
  schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
  rows: 5
  col[0][archer]: ["tony0" "amy0" "jim0" "tony1" "amy1"]
  col[1][location]: ["beijing0" "shanghai0" "chengdu0" "beijing1" "shanghai1"]
  col[2][year]: [1992 1993 1994 1993 1994]

record:
  schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
  rows: 4
  col[0][archer]: ["jim1" "tony2" "amy2" "jim2"]
  col[1][location]: ["chengdu1" "beijing2" "shanghai2" "chengdu2"]
  col[2][year]: [1995 1994 1995 1996]

这次：前5行作为一个record，后4行作为另外一个record。

当然，我们也可以使用flat_table_from_parquet.go中的代码来读取flat_record.parquet(将读取文件名改为flat_record.parquet)，只不过由于将parquet数据转换为了table，其输出内容将变为：

$go run flat_table_from_parquet.go
schema:
  fields: 3
    - archer: type=utf8
        metadata: ["PARQUET:field_id": "-1"]
    - location: type=utf8
          metadata: ["PARQUET:field_id": "-1"]
    - year: type=int16
      metadata: ["PARQUET:field_id": "-1"]
------
the count of table columns= 3
the count of table rows= 9
------
arrays in column(archer):
["tony0" "amy0" "jim0" "tony1" "amy1" "jim1" "tony2" "amy2" "jim2"]
------
arrays in column(location):
["beijing0" "shanghai0" "chengdu0" "beijing1" "shanghai1" "chengdu1" "beijing2" "shanghai2" "chengdu2"]
------
arrays in column(year):
[1992 1993 1994 1993 1994 1995 1994 1995 1996]
------

3.3 Record Batch -> Parquet(压缩)

Recod同样支持压缩写入Parquet，其原理与前面table压缩存储是一致的，都是通过设置WriterProperties来实现的：

// flat_record_to_parquet_compressed.go

func main() {
    ... ...
    f, err := os.Create("flat_record_compressed.parquet")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Zstd),
        parquet.WithCompressionFor("year", compress.Codecs.Brotli))
    writer, err := pqarrow.NewFileWriter(schema, f, props,
        pqarrow.DefaultWriterProps())
    if err != nil {
        panic(err)
    }
    defer writer.Close()

    for _, rec := range records {
        if err := writer.Write(rec); err != nil {
            panic(err)
        }
        rec.Release()
    }
}

不过这次针对arrow.string类型和arrow.int16类型的压缩效果非常“差”：

$parquet_reader flat_record_compressed.parquet
File name: flat_record_compressed.parquet
Version: v2.6
Created By: parquet-go version 13.0.0-SNAPSHOT
Num Rows: 9
Number of RowGroups: 3
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: archer (BYTE_ARRAY/UTF8)
Column 1: location (BYTE_ARRAY/UTF8)
Column 2: year (INT32/INT_16)
--- Row Group: 0  ---
--- Total Bytes: 315  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 48], Max: [116 111 110 121 48], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 105
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 48], Max: [115 104 97 110 103 104 97 105 48], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 125
Column 2
 Values: 3, Min: 1992, Max: 1994, Null Values: 0, Distinct Values: 0
 Compression: BROTLI, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 85
--- Values ---
archer            |location          |year              |
tony0             |beijing0          |1992              |
amy0              |shanghai0         |1993              |
jim0              |chengdu0          |1994              |

--- Row Group: 1  ---
--- Total Bytes: 315  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 49], Max: [116 111 110 121 49], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 105
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 49], Max: [115 104 97 110 103 104 97 105 49], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 125
Column 2
 Values: 3, Min: 1993, Max: 1995, Null Values: 0, Distinct Values: 0
 Compression: BROTLI, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 85
--- Values ---
archer            |location          |year              |
tony1             |beijing1          |1993              |
amy1              |shanghai1         |1994              |
jim1              |chengdu1          |1995              |

--- Row Group: 2  ---
--- Total Bytes: 315  ---
--- Rows: 3  ---
Column 0
 Values: 3, Min: [97 109 121 50], Max: [116 111 110 121 50], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 79, Compressed Size: 105
Column 1
 Values: 3, Min: [98 101 105 106 105 110 103 50], Max: [115 104 97 110 103 104 97 105 50], Null Values: 0, Distinct Values: 0
 Compression: ZSTD, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 99, Compressed Size: 125
Column 2
 Values: 3, Min: 1994, Max: 1996, Null Values: 0, Distinct Values: 0
 Compression: BROTLI, Encodings: RLE_DICTIONARY PLAIN RLE
 Uncompressed Size: 77, Compressed Size: 85
--- Values ---
archer            |location          |year              |
tony2             |beijing2          |1994              |
amy2              |shanghai2         |1995              |
jim2              |chengdu2          |1996              |

越压缩，parquet文件的size越大。当然这个问题不是我们这篇文章的重点，只是提醒大家选择适当的压缩算法十分重要。

3.4 Record Batch <- Parquet(压缩)

和读取table转换后的压缩parquet文件一样，读取record转换后的压缩parquet一样无需特殊设置，使用flat_record_from_parquet.go即可（需要改一下读取的文件名），这里就不赘述了。

4. 小结

本文旨在介绍使用Go进行Arrow和Parquet文件相互转换的基本方法，我们以table和record两种高级数据结构为例，分别介绍了读写parquet文件以及压缩parquet文件的方法。

当然本文中的例子都是“平坦(flat)”的简单例子，parquet文件还支持更复杂的嵌套数据，我们会在后续的深入讲解parquet格式的文章中提及。