Blogger - Tony Bai

简说GLIBC strncpy实现

四月 15, 2009

比较以下两组代码，你认为哪组运行的更快些呢？
Example1：
        int n   = 100;
        int n4 = n >> 2;
        int i   = 0;

int a[100];

        for (i = 0; i < n4 ;i += 4) {
                a[i] = i;
                a[i+1] = i+1;
                a[i+2] = i+2;
                a[i+3] = i+3;
        }

Example2：
       for (i = 0;i < 100;i++) {
             a[i] = i;
       }

其实这个问题在"代码大全2nd"中也有讨论，从"代码大全"中的统计结果来看，一般来说Example1更占有优势。我在solaris上做了测试，在未开优化的情况下：两者运行时间分别为2ms和6ms；在打开-O2优化后，两者均为1ms。这种通过减少循环次数的方法在GLIBC中也有体现，比如说strncpy的实现：

下面是strncpy的GLIBC源码：
char *
x_strncpy (s1, s2, n)
        char *s1;
        const char *s2;
        size_t n;
{
        reg_char c;
        char *s = s1;

–s1;

        if (n >= 4)
        {
                size_t n4 = n >> 2; /* n4 = n / 4， n4表示下面的循环执行的次数*/

                for (;;)
                {
                        c = *s2++;
                        *++s1 = c;
                        if (c == '')
                                break;
                        c = *s2++;
                        *++s1 = c;
                        if (c == '')
                                break;
                        c = *s2++;
                        *++s1 = c;
                        if (c == '')
                                break;
                        c = *s2++;
                        *++s1 = c;
                        if (c == '')
                                break;
                        if (–n4 == 0)
                                goto last_chars; /* 如果n = 10，s2 = "hello world"，则两轮循环后，还有"尾巴"没有copy完，在last_chars处继续处理 */
                }
                n = n – (s1 – s) – 1; /* 还没有copy完n个字节，s2就到达末尾了，跳到zero_fill处继续为s1补零 */
                if (n == 0)
       return s;
                goto zero_fill;
        }

last_chars:
        n &= 3;       /* n = n & 3 结果 n <= 3，n即为上面循环过后"尾巴字符"的数量 */
        if (n == 0)
                return s;
        do
        {
                c = *s2++;
                *++s1 = c;
                if (–n == 0)
                        return s;
        } while (c != '');

zero_fill:
        do
                *++s1 = '';
        while (–n > 0);

return s;
}

相比于strlen的实现，strncpy的实现更易理解。其字面上的逻辑就是每四个字节(n>>2)作为一组，每组逐个字节进行拷贝赋值，其内在目的则是减少循环次数，以获得性能的提升。要想知道为什么减少循环次数能提升性能的话，那就要深入到汇编层面去了，这里不再详述。另外还要一提的是GLIBC中的strncmp，strncat的实现也遵循着与上面同样的逻辑。

GLIBC strlen源代码分析

四月 11, 2009

5 条评论

直接操作C标准库提供的字符串操作函数是有一定风险的，稍有不慎就会导致内存问题。这周用业余时间写了一个小型的安全字符串操作库，但是测试之后才发现自己的实现有很大的性能缺陷。

在Solaris上初步做了一个简单的性能比对，以下是得到的性能数据(以strlen的数据为例)：
当传入的字符串长度为10时，执行100w次：
strlen 执行时间是：32762毫秒
my_strlen执行时间是：491836毫秒

当传入的字符串长度为20时，执行100w次：
strlen 执行时间是：35075毫秒
my_strlen执行时间是：770397毫秒

很显然，标准库中strlen的消耗仅是my_strlen的十分之一不到，且其性能消耗随着字符串长度的增加并未有近线性的增加，而my_strlen则是变化明显。想必大家这时也能猜到my_strlen采用了传统的实现的方式，即采用逐个字节判断是否为''方式，这也与测试出的现象相符。本着刨根问底的精神，我在网上找到了GNU提供的C标准库中strlen实现的源码，要看看GLIBC中strlen究竟采用何种技巧才达到了那么高的性能。说实话在性能优化这方面自己一直还处于比较初级的位置，这也将是自己将来努力的一个方向。

下载了全部GLIBC的代码包，这个包还真不小。在string子目录下找到strlen.c，这就是大多数UNIX平台、Linux平台以及绝大多数GNU软件使用的strlen的实现源码了。这份代码由Torbjorn Granlund(还实现了memcpy)编写，Jim Blandy和Dan Sahlin提供了帮助和注释。包括注释在内，GLIBC的strlen的代码足足有近130行，大致浏览一下，没有怎么看懂，可耐下心来细致阅读，还是有些心得的。下面是strlen源码摘要版，后面我将针对这段代码写一些我的理解：

1 /* Return the length of the null-terminated string STR. Scan for
2    the null terminator quickly by testing four bytes at a time. */
3 size_t strlen (str) const char *str;
4 {
5         const char *char_ptr;
6         const unsigned long int *longword_ptr;
7         unsigned long int longword, magic_bits, himagic, lomagic;
8
9         /* Handle the first few characters by reading one character at a time.
10            Do this until CHAR_PTR is aligned on a longword boundary. */
11
12         for (char_ptr = str; ((unsigned long int) char_ptr
13              & (sizeof (longword) – 1)) != 0;
14              ++char_ptr)
15                 if (*char_ptr == '')
16                         return char_ptr – str;
17
18         /* All these elucidatory comments refer to 4-byte longwords,
19            but the theory applies equally well to 8-byte longwords. */
20
21         longword_ptr = (unsigned long int *) char_ptr;
22
23         himagic = 0x80808080L;
24         lomagic = 0x01010101L;
25
26         if (sizeof (longword) >
27                 abort ();
28
29         /* Instead of the traditional loop which tests each character,
30            we will test a longword at a time. The tricky part is testing
31            if *any of the four* bytes in the longword in question are zero. */
32
33         for (;;)
34         {
35                 longword = *longword_ptr++;
36
37                 if ( ((longword – lomagic) & himagic) != 0)
38                 {
39                         /* Which of the bytes was the zero? If none of them were, it was
40                            a misfire; continue the search. */
41
42                         const char *cp = (const char *) (longword_ptr – 1);
43
44                         if (cp[0] == 0)
45                                 return cp – str;
46                         if (cp[1] == 0)
47                                 return cp – str + 1;
48                         if (cp[2] == 0)
49                                 return cp – str + 2;
50                         if (cp[3] == 0)
51                                 return cp – str + 3;
52                         if (sizeof (longword) > 4)
53                         {
54                                 if (cp[4] == 0)
55                                         return cp – str + 4;
56                                 if (cp[5] == 0)
57                                         return cp – str + 5;
58                                 if (cp[6] == 0)
59                                         return cp – str + 6;
60                                 if (cp[7] == 0)
61                                         return cp – str + 7;
62                         }
63                 }
64         }
65 }

从这段代码开头作者的注释我们大致可以了解到该strlen实现的原理：就是通过每次测试四个字节来代替传统实现中每次测试一个字节的方法。知道这个原理了，那么还需要解决两个难题：
1) C标准库要求有很好的移植性，在绝大部分系统体系结构下都应该能正确运行。那么每次拿出4个字节比较(unsigned long int)，就需要考虑内存对齐问题，传入的字符串的首字符地址可不一定在4对齐的地址上；
2) 如何对四个字节进行测试，找出其中某个字节为全0，这是个技巧问题。

12～21行的代码解决的就是第一个问题：
      for (char_ptr = str; ((unsigned long int) char_ptr
             & (sizeof (longword) – 1)) != 0;
             ++char_ptr)
                if (*char_ptr == '')
                        return char_ptr – str;

/* All these elucidatory comments refer to 4-byte longwords,
but the theory applies equally well to 8-byte longwords. */

longword_ptr = (unsigned long int *) char_ptr;
作者通过一个for-loop找到传入字符串中第一个地址对齐到4的字符的地址，由于该地址已经对齐到4，所以最后一行那个强制转型是安全的。虽然可以通过圆整算式直接得到该对齐地址，但是考虑到这个区间可能存在的''，一个字符一个字符比对也是不可避免的。在很多严格对齐的架构上(比如SUN的SPARC平台)，编译器一般会将字符串地址在编译器就放到对齐的地址上，这样一来，实际执行strlen时for-loop很少能执行一步。

第二个问题作者则是通过一个"带前提"的技巧来解决的。作者设定了两个掩码变量：
himagic = 0x80808080L;
lomagic = 0x01010101L;
并通过一个conditional expression完成了对四字节中全0字节的检测：((longword – lomagic) & himagic) != 0
我们将himagic和lomagic按bit展开：
himagic 1000 0000 1000 0000 1000 0000 1000 0000
lomagic 0000 0001 0000 0001 0000 0001 0000 0001

对于这样的代码，似乎没有什么理论可以遵循，需要在实践中去理解。起初我构造了一个不含全0字节的longword，比如：
longword 1000 0001 1000 0001 1000 0001 1000 0001，然后按照那个条件表达式计算后，居然也满足!=0的条件，是不是作者的逻辑有问题呢？后来转念一想，这种逻辑是有“前提条件”的。回顾一下strlen是做什么的，其输入参数是任意的么？当然不是。输入的字符串中每个字符的值都在[0, 127]的ascii码范围内，也就是说每个字节最高位的bit都是0，这样longword就应该是如下这个样子了：
longword 0xxx xxxx 0xxx xxxx 0xxx xxxx 0xxx xxxx

基于这样的前提我们考虑两种情况：
当longword中没有全0字节时，比如：
longword 0000 0001 0000 0001 0000 0001 0000 0001
这样在做完计算后，值为0，不满足条件。

当longword中有全零字节时，比如：
longword 0000 0000 0000 0001 0000 0001 0000 0001
这样在做完计算后，最高字节最高bit的值肯定为1，满足!=0条件，全0字节被检测出来。也就是说一旦有全0字节，在减去lomagic时势必会产生借位，全0的那个字节在减去lomagic后最高位bit肯定由0变1，这样与himagic一与，肯定不为0，就是这么检测出来的。

这一方法在64位平台依然适用，上面的代码摘要中省略了对64bit平台的特殊处理，为的是使代码逻辑更清晰，更易读。