Kernel | Tony Bai

December 6, 2007

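I did not study computer science. In college I sat in on the CS department's networking course, which used the reprint edition of "Computer Networking: A Top-Down Approach Featuring the Internet". What set this textbook apart was that its authors, James F. Kurose and Keith W. Ross, organized it top-down: it starts at the application layer and works its way down to the physical layer. The writing is also vivid and easy to follow. Unfortunately I did not pay full attention back then, so what remains in my head is mostly basic concepts with little deep understanding, and whenever I hit this kind of problem at work I have to cram. Sure enough, during the nationwide adjustment of SMS service access codes in the early hours of December 1, I ran into exactly such a problem and had to pick up W. Richard Stevens' "TCP/IP Illustrated, Volume 1" again to review how TCP works.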

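Anyone who develops application-layer network programs has a powerful tool at hand: the packet sniffer, or more formally the protocol analyzer. Common and capable ones include tcpdump (WinDump on Windows) and Ethereal. At work, disputes often arise because two application-layer programs disagree on how a protocol field is sent or parsed. In those cases we usually capture the raw packets at the TCP layer with a protocol analyzer and use them as evidence. Some client/server connection problems also have to be analyzed at the TCP layer, and that requires a clear understanding of how TCP basically works.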

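First, let us be clear about one thing: the fields in the TCP header all exist to support the protocol's functions, namely segmentation, retransmission, duplicate detection, reassembly, flow control, and full-duplex operation. The most important fields here are the sequence number and the acknowledgment number, which TCP relies on for retransmission, duplicate detection, reassembly, and full-duplex operation. The sequence number identifies the order of the data the sender transmits and tells the receiver how to reassemble multiple packets in order. After sending a packet, the sender places it in a retransmission queue and starts a timer. If an acknowledgment for the packet arrives, the packet is removed from the queue; if the timer expires before an acknowledgment arrives, the packet is sent again.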
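The retransmission queue described above can be sketched in a few lines of Python. This is only an illustrative model, not how any real stack implements it: the class name `RetransmitQueue`, the fixed timeout, and the explicit `now` argument are assumptions made for the example, and sequence-number wraparound is ignored.

```python
class RetransmitQueue:
    """Toy model of a TCP sender's retransmission queue (hypothetical names)."""

    def __init__(self, timeout):
        self.timeout = timeout      # fixed timer; real TCP adapts it to the RTT
        self.pending = {}           # seq -> (payload, deadline)

    def send(self, seq, payload, now):
        # Keep a copy of the data and start its timer.
        self.pending[seq] = (payload, now + self.timeout)

    def ack(self, ack_no):
        # Cumulative ACK: every segment ending at or before ack_no is confirmed.
        done = [s for s, (p, _) in self.pending.items() if s + len(p) <= ack_no]
        for seq in done:
            del self.pending[seq]

    def due(self, now):
        # Segments whose timer expired must be sent again; restart their timers.
        expired = sorted(s for s, (_, d) in self.pending.items() if d <= now)
        for seq in expired:
            payload, _ = self.pending[seq]
            self.pending[seq] = (payload, now + self.timeout)
        return expired
```

A caller would invoke `send` for each outgoing segment, `ack` whenever an acknowledgment number arrives, and `due` periodically to find segments that need retransmission.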

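TCP is famous for establishing connections with a "three-way handshake". The purpose of the handshake is to set up the connection and lay the groundwork for the data transfer that follows. Because TCP is full-duplex, each side must tell the other its initial sequence number during the handshake.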

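Client --> SYN set, sequence number = J, acknowledgment number = 0 --> Server
Client <-- SYN set, ACK set, sequence number = K, acknowledgment number = J + 1 <-- Server
Client --> ACK set, sequence number = J + 1, acknowledgment number = K + 1 --> Server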

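Once the connection is established, the data the client sends starts at J + 1 and the data the server sends starts at K + 1. Note that during connection setup the client announces J as its initial sequence number and the server announces K, yet after setup each side's data starts at its initial sequence number plus one. This is because the SYN itself consumes one sequence number.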
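The handshake numbering can be checked with a small sketch. The function name and the tuple layout are made up for illustration; the only facts it encodes are the ones stated above, plus the fact that TCP sequence numbers are 32-bit and wrap around.

```python
SEQ_MOD = 2 ** 32   # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as (sender, flags, seq, ack)."""
    syn = ("client", "SYN", client_isn, 0)
    # The server acknowledges J + 1 because the SYN consumed one sequence number.
    syn_ack = ("server", "SYN+ACK", server_isn, (client_isn + 1) % SEQ_MOD)
    # The client's own SYN also consumed one number, so its next seq is J + 1.
    ack = ("client", "ACK", (client_isn + 1) % SEQ_MOD, (server_isn + 1) % SEQ_MOD)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(55554, 22221):
    print(segment)
```

With the illustrative initial sequence numbers J = 55554 and K = 22221, the final ACK carries sequence number 55555 and acknowledgment number 22222.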

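In the data transfer phase, one would expect application data to travel like this (assume the client's final ACK during connection setup had sequence number = 55555 and acknowledgment number = 22222):
Client --> PSH set, ACK set, sequence number = 55555, acknowledgment number = 22222, data length = 11 --> Server
Client <-- ACK set, sequence number = 22222, acknowledgment number = 55566 (= 55555 + 11), data length = 0 <-- Server
Client <-- PSH set, ACK set, sequence number = 22222, acknowledgment number = 55566, data length = 22 <-- Server
Client --> ACK set, sequence number = 55566, acknowledgment number = 22244 (= 22222 + 22), data length = 0 --> Server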

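Note: the PSH flag tells the receiver to deliver the data to the application layer as soon as possible. In my experience with protocol analysis, almost every data packet sent during the data transfer phase has PSH set, and the ACK flag is also set throughout the data transfer phase.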
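The sequence and acknowledgment arithmetic in the exchange above can be replayed with a toy model. `Endpoint` and `replay` are hypothetical names, and the model assumes in-order delivery with no loss. The point it illustrates is that only payload bytes advance the sequence number, so a pure ACK with length 0 leaves the sender's sequence number unchanged.

```python
class Endpoint:
    """Tracks one side's sequence and acknowledgment counters (toy model)."""

    def __init__(self, next_seq, next_ack):
        self.next_seq = next_seq    # sequence number of the next byte we send
        self.next_ack = next_ack    # next byte we expect from the peer

    def send(self, length):
        segment = (self.next_seq, self.next_ack, length)
        self.next_seq += length     # a zero-length ACK consumes nothing
        return segment

    def receive(self, segment):
        seq, _ack, length = segment
        if seq == self.next_ack:    # in-order data: advance what we acknowledge
            self.next_ack += length


def replay():
    client = Endpoint(next_seq=55555, next_ack=22222)
    server = Endpoint(next_seq=22222, next_ack=55555)
    trace = []
    for sender, receiver, length in [(client, server, 11),   # client data
                                     (server, client, 0),    # server pure ACK
                                     (server, client, 22),   # server data
                                     (client, server, 0)]:   # client pure ACK
        segment = sender.send(length)
        receiver.receive(segment)
        trace.append(segment)
    return trace


for seq, ack, length in replay():
    print(f"seq={seq} ack={ack} len={length}")
```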

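But what we actually see during analysis looks like this:
Client --> PSH set, ACK set, sequence number = 55555, acknowledgment number = 22222, data length = 11 --> Server
Client <-- PSH set, ACK set, sequence number = 22222, acknowledgment number = 55566 (= 55555 + 11), data length = 22 <-- Server
Client --> PSH set, ACK set, sequence number = 55566, acknowledgment number = 22244 (= 22222 + 22), data length = 33 --> Server
Client <-- PSH set, ACK set, sequence number = 22244, acknowledgment number = 55599 (= 55566 + 33), data length = 44 <-- Server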

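In other words, the receiver combines the acknowledgment of the previous segment with its own outgoing application data and sends them together. TCP tries to keep the number of small segments on the network low, so "delayed ACK" is normally enabled by default; in typical implementations the delay is up to about 200 ms.

Delayed ACK goes hand in hand with the Nagle algorithm on the sending side, which requires that a TCP connection have at most one unacknowledged small segment outstanding: until that segment is acknowledged, no other small segment may be sent. If every segment were handled this way, the sender would have to wait for the ACK of each segment before sending the next, and transfer efficiency would drop sharply, which is often unacceptable for bulk data.

This is where TCP's sliding window protocol, a flow-control mechanism, comes in. With a sliding window, sending and acknowledging are no longer in lockstep: the sender may transmit n segments in a row, and the receiver may acknowledge those n segments with a single ACK; commonly n = 2. Every OS's TCP stack combines the Nagle algorithm and the sliding window in its implementation. The TCP layer checks the size of the application data several times (usually by comparing it with the MSS) to decide which of the two governs a given transmission. The book "The Linux Network Architecture: Design and Implementation of Network Protocols in the Linux Kernel" describes the design and implementation of the Linux TCP layer, though the most direct approach is of course to read the Linux source code.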
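The Nagle rule described above can be written as a small predicate. This is a simplified textbook version, not the decision logic of any particular kernel: the function name and parameters are assumptions made for the example, and real stacks add further conditions (window size, pending FIN, and so on).

```python
def nagle_can_send(segment_len, mss, unacked_bytes, nodelay=False):
    """Simplified Nagle test: may this segment be transmitted right now?"""
    if nodelay:
        return True             # the TCP_NODELAY socket option disables Nagle
    if segment_len >= mss:
        return True             # full-sized segments are never held back
    # A small segment may go out only if nothing is waiting to be acknowledged.
    return unacked_bytes == 0


print(nagle_can_send(11, 1460, unacked_bytes=0))    # small segment, idle connection
print(nagle_can_send(11, 1460, unacked_bytes=11))   # small segment, data in flight
```

The second call is the case that hurts interactive request/response protocols: the small segment is held until the earlier one is acknowledged, and the peer's delayed ACK may take a while to arrive.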

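Tearing down a TCP connection can be summed up in one sentence: you close your sending direction, and I close mine (because TCP is full-duplex). After one side closes its sending direction it can still receive data from the other side; this state is called a "half-close". In practice half-close is rarely used, and it requires the socket API to support shutdown on the socket (rather than calling close).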
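A half-close is easy to try on the loopback interface. The sketch below uses Python's standard `socket` module; the function name `half_close_demo` and the message contents are made up for the example, and it assumes that binding to `127.0.0.1` is permitted in the environment where it runs.

```python
import socket
import threading


def half_close_demo():
    """Client closes its sending direction, then still reads the server's reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))         # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve():
        conn, _ = listener.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(1024)
                if not data:                # empty read: the client's FIN arrived
                    break
                chunks.append(data)
            # The server's own sending direction is still open.
            conn.sendall(b"got " + b"".join(chunks))

    worker = threading.Thread(target=serve)
    worker.start()

    reply = b""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
        client.shutdown(socket.SHUT_WR)     # half-close: send FIN, keep receiving
        while True:
            data = client.recv(1024)
            if not data:                    # the server closed its side too
                break
            reply += data

    worker.join()
    listener.close()
    return reply


print(half_close_demo())
```

Calling `close` instead of `shutdown(socket.SHUT_WR)` would release the socket, and the client could no longer read the reply.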

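The normal close sequence is driven by FIN segments:
Client --> FIN, ACK --> Server
Client <-- ACK <-- Server
Client <-- FIN, ACK <-- Server
Client --> ACK --> Server
The side that sends the FIN first sends everything remaining in its send buffer, in order, and only then sends the FIN. That is why a FIN is also called an orderly release.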

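The abnormal close sequence is driven by RST segments. A typical example: the client connects to a server port on which no program is listening, so the server's TCP layer immediately replies with an RST telling the client to reset the connection. An RST does not need to be acknowledged. On receiving the RST, the client's TCP layer notifies the application that the peer ended the connection abnormally (the application can only learn that the close was abnormal through the appropriate socket API handling).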
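The "no listener on that port" case can be reproduced locally. In this sketch the function name is made up, and the way it finds an unused port (bind to port 0, read the number, close the socket) is only a convenient assumption: another process could in principle grab the port in between. In Python the RST surfaces to the application as a `ConnectionRefusedError` raised by the connect call.

```python
import socket


def connect_to_closed_port():
    """Try to connect to a loopback port that nobody is listening on."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))        # the OS picks a currently free port
    port = probe.getsockname()[1]
    probe.close()                       # never listened, so the port is closed

    try:
        conn = socket.create_connection(("127.0.0.1", port), timeout=2)
    except ConnectionRefusedError:
        return "refused"                # the peer's TCP answered the SYN with RST
    conn.close()
    return "connected"


print(connect_to_closed_port())
```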

Kernel 'head.S'

After being decompressed, the kernel image starts with another 'startup_32' function, located in '$(linux-2.6.15.3_dir)/arch/i386/kernel/head.S'. This 'head.S' is the second one in the Linux source package and is also called the 'kernel head'. It is exactly what this article describes.

The kernel head continues with higher-level initialization for the first Linux process (process 0). It sets up an execution environment for the kernel's main routine, much as an operating system does before an application starts. There are two entry points for CPUs in this 'head.S'; we only cover the execution path of the boot CPU.

/*
* ! $(linux-2.6.15.3_dir)/arch/i386/kernel/head.S
*/
ENTRY(startup_32)

/*
* ! We still use linear addresses, since
* ! %ds = %es = %fs = %gs = __BOOT_DS
* ! we use the third segment, whose base
* ! address starts from 0x00000000
*/
cld
lgdt boot_gdt_descr - __PAGE_OFFSET
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs

/*
* ! Clear the kernel bss
*/
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi
movl $__bss_stop - __PAGE_OFFSET,%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl

After copying the boot parameters, it prepares to enable paging. Before paging is enabled, some data structures must be set up first, as described in the 'Intel Manual Vol3'.

/*
* ! Initialize the provisional kernel page tables
* ! which are stored starting from pg0, right after
* ! the end of the kernel's uninitialized data segment (bss),
* ! and the provisional page global directory is
* ! contained in the swapper_pg_dir variable.
* !
* ! page_pde_offset = 0x0c00
*/
page_pde_offset = (__PAGE_OFFSET >> 20);

/*
* ! this line indicates the table starts from 'pg0'
*/
movl $(pg0 - __PAGE_OFFSET), %edi

/*
* ! this line tells us 'swapper_pg_dir' is the
* ! page directory start point
*/
movl $(swapper_pg_dir - __PAGE_OFFSET), %edx

/*
* ! There are 1024 entries in 'swapper_pg_dir'
* ! because of the code below:
* ! ENTRY(swapper_pg_dir)
* !     .fill 1024,4,0
* !
* ! The first mapping:
* !     both entry 0 and entry 0x300 (page_pde_offset/4) --> pg0
* !     that is (0x00000000~0x003fffff) and (0xC0000000~0xC03fffff) --> pg0
* ! The second mapping:
* !     both entry 1 and entry 0x301 (page_pde_offset/4+1) --> pg1 (the page following pg0)
* !     that is (0x00400000~0x007fffff) and (0xC0400000~0xC07fffff) --> pg1
* !
* ! The objective of this first phase of paging is to
* ! allow these 8 MB of RAM to be easily addressed
* ! both in real mode and protected mode.
*/
movl $0x007, %eax   /* 0x007 = PRESENT+RW+USER */
10:
leal 0x007(%edi),%ecx   /* Create PDE entry */
movl %ecx,(%edx)   /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx)  /* Store kernel PDE entry */
addl $4,%edx
movl $1024, %ecx
11:
stosl
addl $0x1000,%eax
loop 11b
/* End condition: we must map up to and including INIT_MAP_BEYOND_END */
/* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
cmpl %ebp,%eax
jb 10b
movl %edi,(init_pg_tables_end - __PAGE_OFFSET)

/*
* ! only the boot CPU takes this path
*/
#ifdef CONFIG_SMP
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
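The index arithmetic behind the provisional page tables can be checked outside the kernel. The sketch below is a plain Python model, not kernel code: the helper names `pde_index` and `pte_value` are made up, and it assumes the usual i386 layout of 4 KB pages, 1024 four-byte entries per table, and `__PAGE_OFFSET = 0xC0000000`.

```python
PAGE_OFFSET = 0xC0000000    # assumed value of __PAGE_OFFSET
PAGE_SIZE = 0x1000          # 4 KB pages
ENTRIES_PER_TABLE = 1024
ATTR = 0x007                # PRESENT + RW + USER

# A PDE index is (linear >> 22) and each entry is 4 bytes wide, so the byte
# offset of the kernel's first PDE is (PAGE_OFFSET >> 22) * 4 = PAGE_OFFSET >> 20.
page_pde_offset = PAGE_OFFSET >> 20


def pde_index(linear):
    """Index of the page-directory entry that covers a linear address."""
    return linear >> 22


def pte_value(table, entry):
    """PTE the loop stores: frames are mapped consecutively from physical 0."""
    frame = table * ENTRIES_PER_TABLE + entry
    return frame * PAGE_SIZE + ATTR


print(hex(page_pde_offset), hex(page_pde_offset // 4))
print(hex(pde_index(0x00400000)), hex(pde_index(0xC0400000)))
print(hex(pte_value(0, 0)), hex(pte_value(0, 1023)), hex(pte_value(1, 0)))
```

Each page table holds 1024 entries of 4 KB each, so one page-directory entry covers 4 MB of linear address space. That is why pg0 and pg1 together give the 8 MB mentioned in the listing's comment.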

The provisional kernel page tables have been set up, so we can enable paging now!

/*
  * Enable paging
  */
movl $swapper_pg_dir-__PAGE_OFFSET,%eax

/*
* ! load the table physical address into the %cr3
*/
movl %eax,%cr3  /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax

/*
* ! Enable the paging
*/
movl %eax,%cr0  /* ..and set paging (PG) bit */

/*
* ! A far jump right after paging is enabled
*/
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
/* Set up the stack pointer */
lss stack_start,%esp

There is a far jump instruction here: 'ljmp $__BOOT_CS,$1f'. You may wonder what '$1f' means. '1' is a local symbol. To define a local symbol, write a label of the form 'N:' (where N represents any digit). To refer to the most recent previous definition of that symbol, write 'Nb', using the same digit as when you defined the label. To refer to the next definition of a local label, write 'Nf'. The 'b' stands for "backwards" and the 'f' stands for "forwards".

Now we are in 32-bit protected mode with paging enabled, so we need to redo some of the setup that was done earlier in 16-bit code for real-mode operation.

/*
* ! Set up the interrupt descriptor table
* ! All 256 entries point to
* ! the default interrupt "handler", 'ignore_int'
*/
call setup_idt

….
….

setup_idt:
lea ignore_int,%edx
movl $(__KERNEL_CS << 16),%eax
movw %dx,%ax /* selector = 0x0010 = cs */
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */

/*
* ! idt_table variable is defined
* ! in $(linux-2.6.15.3_dir)/arch/i386/kernel/traps.c
*/
lea idt_table,%edi
mov $256,%ecx
rp_sidt:
movl %eax,(%edi)
movl %edx,4(%edi)
addl $8,%edi
dec %ecx
jne rp_sidt
ret
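The register shuffling in 'setup_idt' builds one 8-byte interrupt gate and copies it into all 256 slots. The Python sketch below mirrors that packing so the bit layout is easier to see. It is only a model: the function name is made up, the handler address `0xC0101234` is a hypothetical value, and the selector `0x0010` is simply the one quoted in the listing's comment rather than the actual value of `__KERNEL_CS`.

```python
import struct


def pack_interrupt_gate(handler, selector, type_attr=0x8E00):
    """Return the (low, high) 32-bit words of an i386 interrupt gate."""
    # %eax: selector in the upper 16 bits, handler offset bits 15..0 below it.
    low = ((selector & 0xFFFF) << 16) | (handler & 0xFFFF)
    # %edx: handler offset bits 31..16 on top, 0x8E00 (present, dpl=0,
    # 32-bit interrupt gate) in the lower 16 bits.
    high = (handler & 0xFFFF0000) | type_attr
    return low, high


low, high = pack_interrupt_gate(handler=0xC0101234, selector=0x0010)
entry = struct.pack("<II", low, high)   # little-endian, as stored in memory
idt = entry * 256                       # every vector gets the same default gate

print(hex(low), hex(high), entry.hex(), len(idt))
```

Splitting the handler address across the two words is what the 'movw %dx,%ax' and 'movw $0x8E00,%dx' pair achieves in the assembly.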

After checking the CPU type, the kernel head prepares to call the kernel's main function, 'start_kernel'.

/*
* ! use new descriptor table in safe place
* ! then reload segment registers after lgdt
*/
lgdt cpu_gdt_descr
lidt idt_descr
ljmp $(__KERNEL_CS),$1f
1: movl $(__KERNEL_DS),%eax # reload all the segment registers
movl %eax,%ss # after changing gdt.

movl $(__USER_DS),%eax # DS/ES contains default USER segment
movl %eax,%ds
movl %eax,%es

xorl %eax,%eax # Clear FS/GS and LDT
movl %eax,%fs
movl %eax,%gs
lldt %ax
cld # gcc2 wants the direction flag cleared at all times

…
…

/*
* ! The boot CPU will jump to execute
* ! $(linux-2.6.15.3_dir)/init/main.c:start_kernel()
* ! and start_kernel() should never return
*/
call start_kernel
L6:
jmp L6 # main should never return here, but
# just in case, we know what happens.

Articles tagged Kernel

A Look Back at the TCP Protocol

Kernel 'head.S'
