[Original] Learning Rate-Matching in High-Speed Serial Communication Protocols

 2010-6-21 09:30  5522 4 4 分类: FPGA/CPLD

Thu Feb 25 2010 08:37:42 GMT+0800 (China Standard Time) Yesterday's result: captured the Gigabit Ethernet /I2/ ordered set.

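        I configured the GXB IP core and the 8b/10b encoder/decoder IP core, then connected the simulation models shipped with these two cores into the TSE simulation environment. This made it possible to monitor TSE's low-level signals without disturbing the existing testbench. I also converted the synchronized 10b codes from the GXB receiver into 8b codes and, once I had established which ordered sets Gigabit Ethernet uses, successfully located the /I2/ ordered set.
        Reading the Stratix II GX handbook yielded an important finding: different serial protocols use different ordered sets for rate matching. Gigabit Ethernet uses /I2/ = /K28.5/D16.2/, whereas PCIe uses the skip ordered set SKP = /K28.5/K28.0/K28.0/K28.0/. That explains why my earlier attempt to find /K28.5/K28.0/ in a long (65 ms) TSE simulation was bound to fail.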
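The difference between the two ordered sets can be illustrated with a small Python sketch. It assumes the decoded 8b stream is available as a list of (is_control, byte) pairs, which is a hypothetical representation and not the output format of the GXB simulation model. The byte values follow the usual 8b/10b naming (K28.5 = 0xBC, D16.2 = 0x50, K28.0 = 0x1C); `find_ordered_set` and the sample stream are made up for illustration.

```python
# Each code group is (is_control, byte value); this layout is an assumption.
K28_5 = (True, 0xBC)
K28_0 = (True, 0x1C)
D16_2 = (False, 0x50)

I2 = [K28_5, D16_2]                       # Gigabit Ethernet /I2/
PCIE_SKP = [K28_5, K28_0, K28_0, K28_0]   # PCIe skip ordered set


def find_ordered_set(stream, pattern):
    """Return every index at which `pattern` starts in `stream`."""
    n = len(pattern)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == pattern]


# Hypothetical capture: one data byte, two /I2/ sets, one more data byte.
stream = [(False, 0x55), K28_5, D16_2, K28_5, D16_2, (False, 0xD5)]

i2_hits = find_ordered_set(stream, I2)
skp_hits = find_ordered_set(stream, PCIE_SKP)
print("I2 at", i2_hits, "SKP at", skp_hits)
```

Searching a Gigabit Ethernet capture for the PCIe pattern returns nothing, which mirrors the failed /K28.5/K28.0/ search described above.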

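From the IEEE documents and related material, my preliminary conclusion is that rate matching is not something the standard requires. Most PHY vendors provide an elastic buffer with rate matching, but an elastic buffer without rate matching does not violate the standard either.
        Related links: 1000BASE-X Physical Coding Sublayer (PCS) and Physical Medium Attachment (PMA) Tutorial; Achieving robust data recovery across multiple data communication standards using multi-byte framing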

Fri Feb 26 2010 08:50:14 GMT+0800 (China Standard Time) Yesterday's result: observed Rate-Matching actually performing data-rate matching.

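        Having captured the /I2/ ordered set, the next step was to confirm whether the receiver can insert /I2/ ordered sets into, or delete them from, the normal data stream in order to match data rates.
        By reconfiguring the clocking scheme on the receive side of the monitor port, I could feed the Tx CMU reference clock (the local clock) and the Rx CRU reference clock (effectively the link partner's clock) from different sources, creating a deliberate clock mismatch in the testbench. By raising and lowering the monitor port's local clock frequency, I saw /I2/ ordered sets being inserted into and deleted from the data stream captured at the monitor port. Because Gigabit Ethernet Rate-Matching adds or removes two 10-bit characters at a time, it can absorb a very large clock frequency offset. This is an observation from simulation; the exact figures still need a theoretical derivation.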
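The insert/delete behaviour can be mimicked with a toy model in Python. This is only a sketch under stated assumptions, not how the GXB rate matcher is implemented: the stream is a list of tokens, drift is tracked in millionths of a character, and an /I2/ is dropped or repeated once two characters of drift have accumulated. The function name, token format, and threshold are all hypothetical.

```python
I2 = "I2"              # one /I2/ ordered set = two 10b characters
PPM_SCALE = 1_000_000  # drift is tracked in millionths of a character


def rate_match(tokens, remote_minus_local_ppm):
    """Toy rate matcher: delete or repeat /I2/ once two characters of drift build up.

    tokens: list containing "I2" or any other value (treated as one data character).
    remote_minus_local_ppm: > 0 means the link partner's clock is faster than ours.
    """
    drift = 0
    out = []
    for tok in tokens:
        width = 2 if tok == I2 else 1
        drift += width * remote_minus_local_ppm
        if tok == I2 and drift >= 2 * PPM_SCALE:
            drift -= 2 * PPM_SCALE      # local side is slow: drop this /I2/
            continue
        out.append(tok)
        if tok == I2 and drift <= -2 * PPM_SCALE:
            out.append(I2)              # local side is fast: repeat this /I2/
            drift += 2 * PPM_SCALE
    return out


# 20 frames of 1000 data characters, each followed by six /I2/ sets (12 idle characters).
stream = (["D"] * 1000 + [I2] * 6) * 20
slow_local = rate_match(stream, +1000)   # exaggerated 1000 ppm offset
fast_local = rate_match(stream, -1000)
print(slow_local.count(I2), fast_local.count(I2), slow_local.count("D"))
```

With this exaggerated offset the model removes or adds ten /I2/ sets out of 120 and leaves all 20000 data characters untouched, which is the property the experiment set out to observe.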
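        I took a few detours along the way. First I tried periodically deleting one positive pulse from the reference clock to create the mismatch, but the rising edge belonging to the deleted pulse was still there, so it had no effect. Then I tried deleting two positive pulses at a time, which made the PLL periodically lose and regain lock and led to data read errors. I also tried introducing a 0.5 ns clock offset in VHDL without success, possibly because of the simulation resolution setting, and finally settled on a 1 ns clock offset.
        Today I will try to add the Rate-Matching that worked on the monitor port to TSE's working data path, and use the testbench's self-checking to verify how Rate-Matching affects data correctness.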

Sun Feb 28 2010 16:12:20 GMT+0800 Results from the last three days: verified the effect of Rate-Matching on data integrity.

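        I built the first test environment for verifying data integrity. It instantiates two GXBs, one for the transmit data path and one for the receive data path, connected in one direction at the 1b serial port to loop the data back; one TSE PCS-only core with a TBI interface, used to check data correctness; and a pair of 8b/10b encoder/decoders that connect the GXB's 8b interface to the PCS's 10b interface. On the receive side, the GXB's local clock frequency can be adjusted so that the Rate-Matcher inserts (local clock faster) or deletes (local clock slower) /I2/ ordered sets. Above the TBI interface, the data checking built into the TSE testbench verifies the integrity of the transmitted and received data. The test passed.
        While adjusting the local clock I used ps as the time unit, which gave ps-level control over the clock offset. This works because the vsim command already specifies ps as the minimum time resolution.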
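To see why ps-level control matters, a period change can be converted to a frequency offset in ppm. The sketch below assumes a hypothetical 125 MHz reference clock (8000 ps period); the post does not state the actual reference frequency, so the printed values are only indicative.

```python
def period_offset_to_ppm(nominal_ps, actual_ps):
    """Frequency offset in ppm of a clock whose period is actual_ps instead of nominal_ps."""
    return (nominal_ps / actual_ps - 1.0) * 1e6


NOMINAL_PS = 8000  # assumed 125 MHz reference clock, not taken from the post

for delta_ps in (1, 500, 1000):
    ppm = period_offset_to_ppm(NOMINAL_PS, NOMINAL_PS + delta_ps)
    print(f"+{delta_ps} ps period -> {ppm:.1f} ppm")
```

Under this assumption, a 1 ps step is already about 125 ppm, the same order as Ethernet's ±100 ppm tolerance, while the 0.5 ns and 1 ns steps tried earlier are offsets of several percent.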
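        From this test one can infer that, given Ethernet's ±100 ppm clock accuracy requirement and with Rate-Matching in use, jumbo frames of roughly 12000 bytes can be supported. With longer frames, no /I2/ ordered set is available to insert or delete for a long time, so the clock offset cannot be compensated; the receiver's Elastic Buffer overflows and data integrity is lost. Longer frames can still be handled by tightening the clock accuracy or enlarging the Elastic Buffer.
        Related link: Jumbo frame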
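The frame-length limit can be estimated with a few lines of Python. This is a back-of-the-envelope sketch, not the derivation used in the test: it assumes the two ends are off by 100 ppm in opposite directions (200 ppm worst case), that one byte occupies one 10b character on the line, and that the usable slack of the Elastic Buffer is a free parameter, since the post does not give the buffer depth. Both function names are hypothetical.

```python
def drift_in_characters(frame_bytes, ppm_each_side=100):
    """Worst-case slip, in 10b characters, accumulated over one frame."""
    worst_case_ppm = 2 * ppm_each_side      # both ends off in opposite directions
    return frame_bytes * worst_case_ppm * 1e-6


def max_frame_bytes(slack_characters, ppm_each_side=100):
    """Longest frame whose slip stays within the given buffer slack."""
    return slack_characters / (2 * ppm_each_side * 1e-6)


for size in (1518, 9000, 12000):
    print(size, "bytes ->", round(drift_in_characters(size), 3), "characters of slip")
print("slack of 2 characters ->", round(max_frame_bytes(2)), "bytes")
```

A 12000-byte frame slips by about 2.4 characters, slightly more than one /I2/ ordered set, so a limit of this order is plausible for a buffer with a couple of characters of slack.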

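        I built the second test environment for verifying data integrity. It instantiates two GXBs, each both transmitting and receiving, connected in both directions at the 1b serial port. One performs an 8b loopback; the other connects at its 8b port, through 8b/10b encoder/decoders, to the 10b interface of the TSE PCS-only core. A TSE PCS-only core with a TBI interface again checks data correctness. Adjusting the local clock frequency on the loopback side creates a clock offset between the two GXBs. Since both GXBs implement a receive data path, /I2/ ordered sets can be seen being deleted (at the slow end) and inserted (at the fast end). The test passed. It shows that for two devices with a clock offset, Rate-matching lets the slow side adapt to the fast side and the fast side adapt to the slow side while preserving data integrity.
        However, the same experiment shows that Rate-Matching can introduce two side effects. First, when data from a fast transmitter arrives at a slow receiver with the minimum inter-frame gap, deleting /I2/ ordered sets makes the receiver see packets at its GMII interface separated by less than the minimum inter-frame gap, which is equivalent to a receive bandwidth above 100%. When the slow loopback device sends the packets back, if its slow transmitter can only send at the standard minimum inter-frame gap (exactly 100%), the transmit bandwidth of the loopback path is lower than the receive bandwidth, and data eventually piles up in the loopback device's MAC layer. So a loopback device whose local clock may be slower than its link partner's would need its MAC's minimum inter-frame gap configured below 96 bits in order to loop data back at more than 100% bandwidth, which does not conform to the standard. Second, because /I2/ = /K28.5/D16.2/ is two 10b characters, each deletion or insertion is coarse. When two /I2/ ordered sets are deleted or inserted in a row, the gap between two adjacent packets jitters by more than two 10b characters, which for 1 Gbit Ethernet means a packet delay jitter of more than 8 ns × 2 = 16 ns. By comparison, PCIe deletes or inserts a single /K28.0/ 10b character, so its granularity is half as large and the impact is smaller. These two side effects may make Rate-Matching unsuitable for the Ethernet physical layer.
        Related link: How does the CTC work according to the register "CC_MATCH_MODE" and the High watermark?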
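The numbers behind the two side effects can be worked out as follows. This is a small arithmetic sketch: the 1.25 Gbaud line rate and the 96-bit minimum inter-frame gap are standard Gigabit Ethernet values, while the helper names are made up for illustration.

```python
LINE_RATE_BAUD = 1.25e9        # 1000BASE-X serial line rate
BITS_PER_CHARACTER = 10        # one 8b/10b code group
CHAR_TIME_NS = BITS_PER_CHARACTER / LINE_RATE_BAUD * 1e9
MIN_IFG_BITS = 96              # minimum inter-frame gap at the MAC, in data bits


def gap_step_ns(characters):
    """Change in packet spacing when this many 10b characters are added or removed."""
    return characters * CHAR_TIME_NS


def ifg_bits_after_deletion(i2_sets_deleted):
    """Remaining gap when /I2/ sets are deleted from a minimum-length gap."""
    # Each deleted /I2/ removes two characters, i.e. 16 data bits of idle.
    return MIN_IFG_BITS - 16 * i2_sets_deleted


print("one /I2/ set:", gap_step_ns(2), "ns")
print("one character:", gap_step_ns(1), "ns")
print("gap after one deletion:", ifg_bits_after_deletion(1), "bits")
```

A single deletion already shrinks a minimum gap from 96 to 80 bits, and each /I2/ moves the packet spacing in 16 ns steps, twice the step of a single-character scheme at the same line rate.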
        Simulation program:

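        Before building the second environment I had also built a similar one. It instantiated two TSE PCS-only cores with GXB, connected in both directions at the 1b serial port, with the reference clock of one PCS adjustable to introduce a clock frequency offset. But because the integrated PCS core has only one clock input port, a clock offset of as little as 1 ps makes the PLL periodically lose lock and causes data errors, so this setup cannot verify whether the Rate-Matcher is working.
        Whether the PCS core with integrated GXB enables the GXB's Rate-Matcher is still a question for Altera. One possibility is that neither Altera's PCS core nor MorethanIP's PCS core implements this function, and that it is not enabled even in Altera's PCS core with integrated GXB, in order to keep the PCS core design generic. The reason is that Rate-Matching can only be enabled in the GXB once the GXB's 8b/10b encoding/decoding is enabled, in which case the GXB takes over the 8b/10b function that would otherwise be implemented in the PCS. To use the GXB's Rate-Matcher, the PCS core would then have to drop its own 8b/10b function, which would make the integrated and non-integrated PCS cores functionally different and complicate switching the IP core's HDL code between usage modes.
        Related links: Elasticity buffer insertion and deletion of framing characters; Does HOTLink II support GMII/GPCS/TBI?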

Mon Mar 01 2010 13:52:36 GMT+0800 (China Standard Time) Today's result: some reference links

Gigabit Ethernet Is Closely Related To Fibre Channel Technology, going back to 1988!
"clock tolerance compensation" on www.TI.com
"Elasticity Buffer" on www.National.com
"Elastic Buffer" on www.Xilinx.com
"clock correction" on www.Xilinx.com
Virtex-5 GTX RocketIO - Data errors with CLK_COR_ADJ_LEN = 1 or 3 during asynchronous operation
"ctc elastic buffer" on www.Latticesemi.com

LatticeECP2/M SERDES

Jitter and Signal Integrity Verification for Synchronous and Asynchronous I/Os at Multiple to 10 GHz/Gbps
Altera at 40 nm: Jitter-, Signal Integrity-, Power-, and Process-Optimized Transceivers
Stratix II GX Errata Sheet: Basic Double-Width Mode Illegal Skip Characters Insertion

Mon Jun 21 2010 09:18:25 GMT+0800 (China Standard Time) Explanation for 10G LAN-WAN speed matching -- another usage of rate-matching in PCS

// 10G LAN line-rate at MAC and PCS layer

lanMAC2PCS_LineRate = 10*10^9;

lanPCS2PMA_LineRate = lanMAC2PCS_LineRate*(66/64) = 10.3125 Gbps;

// 10G LAN effective-bit-rate is equal to 10G LAN line-rate

lanMAC2PCS_BitRate_max = lanMAC2PCS_LineRate = 10.000000 Gbps;

lanPCS2PMA_BitRate_max = lanPCS2PMA_LineRate;

// OC-192 line-rate and effective-bit-rate calculation

oc192_LineRate = 192*51.84*10^6 = 9.95328 Gbps;

oc192_SpeBitRate = oc192_LineRate*(261/270);

oc192_PayloadBitRate = oc192_SpeBitRate*(16640/16704) = 9.58464 Gbps;

// There is no line-rate difference between LAN and WAN at MAC layer.
// Line-rate difference is introduced in PCS layer.

wanMAC2PCS_LineRate = lanMAC2PCS_LineRate;

wanPCS2WIS_LineRate = oc192_PayloadBitRate;

wanWIS2PMA_LineRate = oc192_LineRate;

// 10G WAN effective-bit-rate (theoretical) per OC-192 effective-bit-rate

wanMAC2PCS_BitRate_max = oc192_PayloadBitRate*(64/66) = 9.294196 Gbps;

wanPCS2WIS_BitRate_max = oc192_PayloadBitRate;

// 10G WAN effective-bit-rate (implemented) per IEEE802.3 IFS-Stretching at 10G WAN MAC layer

wanMAC2PCS_BitRate_ifs = lanMAC2PCS_LineRate*(13/14) = 9.285714 Gbps;

wanPCS2WIS_BitRate_ifs = lanPCS2PMA_LineRate*(13/14);
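The same calculation can be checked numerically. The Python sketch below simply re-evaluates the expressions above; the variable names are adapted to Python style and the overhead factors are the ones given in the post.

```python
GBPS = 1e9

lan_mac2pcs_line_rate = 10 * GBPS
lan_pcs2pma_line_rate = lan_mac2pcs_line_rate * 66 / 64        # 64B/66B overhead

oc192_line_rate = 192 * 51.84e6                                 # 192 x 51.84 Mbps
oc192_spe_bit_rate = oc192_line_rate * 261 / 270                # factor from the post
oc192_payload_bit_rate = oc192_spe_bit_rate * 16640 / 16704     # factor from the post

wan_mac2pcs_bit_rate_max = oc192_payload_bit_rate * 64 / 66     # theoretical maximum
wan_mac2pcs_bit_rate_ifs = lan_mac2pcs_line_rate * 13 / 14      # with IFS stretching
wan_pcs2wis_bit_rate_ifs = lan_pcs2pma_line_rate * 13 / 14

for name, rate in [
    ("lanPCS2PMA_LineRate", lan_pcs2pma_line_rate),
    ("oc192_LineRate", oc192_line_rate),
    ("oc192_PayloadBitRate", oc192_payload_bit_rate),
    ("wanMAC2PCS_BitRate_max", wan_mac2pcs_bit_rate_max),
    ("wanMAC2PCS_BitRate_ifs", wan_mac2pcs_bit_rate_ifs),
]:
    print(f"{name:24s} {rate / GBPS:.6f} Gbps")
```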

From the above calculation, we can see that the theoretical maximum effective-bit-rate for 10GBase-W is 9.294196 Gbps. When 10GBase-W is implemented according to the IEEE802.3 spec, the resulting effective-bit-rate is 9.285714 Gbps, which is slightly less than the theoretical maximum.

The point is: any implementation should try to approximate but not exceed the theoretical maximum.

Search for ifsStretch or ipgStretch in IEEE802.3-2005 or IEEE802.3-2008 respectively, and you will find the algorithm for IFS Stretching. The basic algorithm is to insert 1 byte into the IFS for every 13 bytes transmitted, including the original IFS. The ratio 13/14 (=0.9285714) approximates the product (9.95328Gbps/10.00000Gbps)*(261/270)*(16640/16704)*(64/66) (=0.9294196). It is also the best approximation of the form n/(n+1) with integer n that does not exceed this product, because the next candidate, 14/15 (=0.9333333), is already too large.
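The choice of 13/14 can be reproduced with exact rational arithmetic. The sketch below searches for the largest integer n such that n/(n+1) does not exceed the target ratio; `best_stretch_ratio` is a hypothetical helper, not a name from IEEE802.3.

```python
from fractions import Fraction

# Ratio of the OC-192 payload rate after 64B/66B to the 10 Gbps MAC rate.
target = (
    Fraction(995328, 1000000)   # 9.95328 Gbps / 10 Gbps
    * Fraction(261, 270)
    * Fraction(16640, 16704)
    * Fraction(64, 66)
)


def best_stretch_ratio(limit):
    """Largest n with n/(n+1) <= limit (assumes limit is at least 1/2)."""
    n = 1
    while Fraction(n + 1, n + 2) <= limit:
        n += 1
    return n


n = best_stretch_ratio(target)
print(f"target = {float(target):.7f}")
print(f"{n}/{n + 1} = {n / (n + 1):.7f}, {n + 1}/{n + 2} = {(n + 1) / (n + 2):.7f}")
```

The search stops at n = 13, meaning 13 bytes of traffic per stretch byte, because 14/15 would let the MAC exceed what the OC-192 payload can carry.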

So the 9.285714 Gbps effective-bit-rate is both correct and practical, although it is not the most efficient.

On the Tx side, the MAC layer stretches the IFS with ratio 14/13 in WAN mode operation to limit the MAC layer peak bandwidth to 92.85714%, resulting in a 9.285714 Gbps effective-bit-rate. When this stretched byte stream goes through the PCS layer and is encoded with 64B/66B, a series of /I2/ ordered sets is removed to slow the MAC-to-PCS (XGMII) line-rate down to the slower OC-192 payload bit-rate. (Strictly speaking, the 10GBASE-R PCS removes /I/ idle control characters; /I2/ is used here by analogy with Gigabit Ethernet.) The IFS is stretched in the MAC layer so that the minimum IFS is still maintained after /I2/ removal in the PCS layer. IFS stretching in the MAC layer slows down the data rate but maintains the line rate. /I2/ ordered-set removal in the PCS layer slows down the line rate but maintains the data rate.

On the Rx side, the PCS inserts /I2/ ordered sets to accelerate the slower OC-192 payload bit-rate to the faster PCS-to-MAC (XGMII) line-rate, and the effective-bit-rate seen at the MAC layer is still 9.285714 Gbps. (This kind of /I2/ removal/insertion is a common task in the PCS layer. For example, in rate-matching operation, /I2/ ordered sets are removed or inserted to accommodate the slight clock difference between two pieces of network equipment. In the case of 10GBase-W, the clock difference between the MAC layer and the WIS layer is significant, but a similar mechanism is used.) /I2/ ordered-set insertion in the PCS layer accelerates the line rate but maintains the data rate, so the result of IFS stretching can still be seen on the Rx side.