DPDK and GCC Inline Assembly

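DPDK uses GCC inline assembly to implement performance-critical functions such as spinlocks and compare-and-swap (CAS) operations. This article briefly introduces GCC inline assembly syntax and then walks through several DPDK functions implemented with it.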

GCC inline assembly

This section gives only a brief overview of the syntax; see the official GCC documentation for the details.

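The general form of an inline assembly statement is shown below. The parts inside the parentheses are separated by colons. The assembly statements in AssemblerTemplate read their values from InputOperands, and when they finish, the results are written to the variables named in OutputOperands.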

asm asm-qualifiers ( AssemblerTemplate 
                 : OutputOperands 
                 [ : InputOperands
                 [ : Clobbers ] ])

asm
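A GCC keyword that introduces inline assembly. The alternate spelling __asm__ can be used instead, for example when compiling with -ansi or a strict -std option where asm is not available as a keyword.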

asm-qualifiers
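Qualifiers for the asm statement. There are three: volatile (tells GCC not to optimize the statement away or hoist it out of loops), inline (the statement is treated as having the smallest possible size for inlining decisions), and goto (if goto is used, the parentheses must also contain a GotoLabels part).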

AssemblerTemplate
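A string literal containing one or more assembly statements; it may also be empty. GCC does not parse the instructions themselves: it does not know what they do, or even whether they are valid assembly.
The template can refer to output operands, input operands, and goto labels, either by name with %[name] or by position with %0, %1, and so on.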

OutputOperands
Zero or more operands that the assembly statements modify. Each operand has the following form:

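    [ [asmSymbolicName] ] constraint (cvariablename)
    asmSymbolicName: a symbolic name for the operand, referenced in the assembly as %[name].
                     If no name is given, the operand is referenced by its zero-based position.
                     For example, with three output operands, %0 refers to the first, %1 to the second, and %2 to the third.

    constraint: a string constant specifying the constraints. An output constraint must begin with = (write-only) or + (read-write).
    cvariablename: a C variable name (an lvalue); this is the variable that ends up modified.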

InputOperands
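Zero or more variables or expressions that the assembly statements read. The format mirrors that of OutputOperands.

    [ [asmSymbolicName] ] constraint (cexpression)
    asmSymbolicName: a symbolic name for the operand, referenced in the assembly as %[name].
    constraint: a string constant specifying the constraints. An input constraint must not begin with = (write-only) or + (read-write).
    cexpression: a C variable or expression.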
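
The two ways of referring to operands can be illustrated with a minimal sketch, assuming GCC or Clang on x86 (AT&T syntax, so the source operand comes first). The helpers add_named and add_positional are hypothetical and exist only for this illustration; they are not part of DPDK.

```c
#include <stdint.h>

/* Hypothetical helper: returns a + b using the x86 addl instruction.
 * Operands are referenced by name: %[sum] and %[addend]. */
static inline uint32_t add_named(uint32_t a, uint32_t b)
{
    uint32_t sum = a;
    __asm__("addl %[addend], %[sum]"
            : [sum] "+r" (sum)      /* output: read and written */
            : [addend] "r" (b)      /* input: read only */
            : "cc");                /* addl modifies the flags */
    return sum;
}

/* The same operation with positional references: %0 is the first
 * operand listed (sum) and %1 is the second (b). */
static inline uint32_t add_positional(uint32_t a, uint32_t b)
{
    uint32_t sum = a;
    __asm__("addl %1, %0"
            : "+r" (sum)
            : "r" (b)
            : "cc");
    return sum;
}
```

Named operands tend to be easier to maintain, because adding or reordering an operand does not silently change what %0 or %1 refers to.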

Clobbers
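A list of registers that the assembly statements modify in addition to the listed outputs. GCC will not use these registers for the statement's operands, and will not assume they keep their values across it.
Besides register names there are two special clobbers: cc and memory.
cc tells GCC that the assembly modifies the flags register.
memory tells GCC that the assembly may read or write arbitrary memory. GCC therefore writes values cached in registers back to memory before the statement, so that memory holds the correct values, and it does not assume that a value read from memory before the statement is unchanged afterwards; the assembly may have modified it, so it must be reloaded.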
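
The effect of the memory clobber can be seen in isolation with an empty template. The following is a minimal sketch; the names demo_compiler_barrier, demo_publish, demo_data and demo_flag are made up for this example. Note that this constrains only the compiler: it emits no instruction and is not a CPU memory fence.

```c
/* Hypothetical helper: an empty asm statement with a "memory" clobber.
 * No instruction is emitted, but GCC may not keep memory values cached
 * in registers across it or move memory accesses over it. */
static inline void demo_compiler_barrier(void)
{
    __asm__ volatile("" : : : "memory");
}

static int demo_data;
static int demo_flag;

/* Stores the data first and the flag second; the barrier keeps the
 * compiler from swapping or merging the two stores. */
static inline void demo_publish(int value)
{
    demo_data = value;
    demo_compiler_barrier();
    demo_flag = 1;
}
```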

DPDK functions implemented with inline assembly

The following are a few DPDK functions implemented with inline assembly.

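1. Reading the processor time-stamp counter
Reading the time-stamp counter relies on a single instruction, rdtsc, which the Intel manual describes as follows.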

Reads the current value of the processor’s time-stamp counter (a 64-bit MSR) into the EDX:EAX registers. The EDX
register is loaded with the high-order 32 bits of the MSR and the EAX register is loaded with the low-order 32 bits.
(On processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.)

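In other words, rdtsc reads the processor's 64-bit time-stamp counter into EDX:EAX, with the high 32 bits in EDX and the low 32 bits in EAX. On a 64-bit processor, the upper 32 bits of RAX and RDX are cleared, and the lower 32 bits of RDX and RAX hold the high and low halves of the counter respectively.

The DPDK implementation is as follows.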

static inline uint64_t
rte_rdtsc(void)
{
    union {
        uint64_t tsc_64;
        RTE_STD_C11
        struct {
            uint32_t lo_32;
            uint32_t hi_32;
        };
    } tsc;

    asm volatile("rdtsc" :
             /* output operands */
             "=a" (tsc.lo_32),  /* "a" selects the A register; tsc.lo_32 is 32 bits wide, so GCC uses EAX */
             "=d" (tsc.hi_32)); /* likewise, "d" selects EDX */
    return tsc.tsc_64;
}

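As a result, the low 32 bits in EAX are stored in tsc.lo_32 and the high 32 bits in EDX are stored in tsc.hi_32. Because lo_32 comes before hi_32 in the union and x86 is little-endian, reading tsc.tsc_64 yields the full 64-bit counter.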
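
A typical use of such a counter is to time a piece of code. The following is a minimal sketch, assuming GCC or Clang on x86; demo_rdtsc is a hypothetical stand-in for rte_rdtsc that combines the two halves with a shift instead of a union, and demo_measure_loop is likewise made up for this example. Keep in mind that rdtsc is not a serializing instruction, so very short measurements can be skewed by out-of-order execution.

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_rdtsc(): EAX holds the low half and
 * EDX the high half of the counter. */
static inline uint64_t demo_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns the number of TSC ticks spent in a simple loop. */
static inline uint64_t demo_measure_loop(unsigned int iterations)
{
    volatile unsigned int sink = 0;   /* volatile so the loop is kept */
    uint64_t start = demo_rdtsc();
    for (unsigned int i = 0; i < iterations; i++)
        sink += i;
    return demo_rdtsc() - start;
}
```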

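2. Atomic operations
Atomic operations (taking increment as the example) use two pieces of assembly: the lock prefix and the inc instruction. Each is described below.
a. The lock prefix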

Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the
instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the
processor has exclusive use of any shared memory while the signal is asserted.

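The lock prefix guarantees that only one CPU at a time accesses the memory operand of the instruction it is attached to.

b. The inc instruction
inc adds 1 to its destination operand.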

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

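For the increment to be atomic across CPUs, inc must be preceded by the lock prefix.

Now for how DPDK implements atomic operations.
DPDK defines the macro MPLOCKED, which expands to the lock prefix only when there is more than one CPU; with a single CPU, MPLOCKED is empty.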

#if RTE_MAX_LCORE == 1
#define MPLOCKED                        /**< No need to insert MP lock prefix. */
#else
#define MPLOCKED        "lock ; "       /**< Insert MP lock prefix. */
#endif
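
MPLOCKED works because adjacent string literals in C are concatenated at compile time, so the prefix and the instruction form a single template string. A minimal sketch of that mechanism follows; demo_template and demo_has_lock_prefix are hypothetical names used only for this illustration.

```c
#include <string.h>

/* Same definition as the multi-core branch above. */
#define MPLOCKED "lock ; "

/* Adjacent string literals are merged by the compiler, so this array
 * holds the single template string that the asm statement would see. */
static const char demo_template[] = MPLOCKED "incw %[cnt]";

/* Returns 1 if the merged template starts with the lock prefix. */
static inline int demo_has_lock_prefix(void)
{
    return strncmp(demo_template, "lock", 4) == 0;
}
```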

16-bit atomic increment

typedef struct {
    volatile int16_t cnt; /**< An internal counter value. */
} rte_atomic16_t;

static inline void
rte_atomic16_inc(rte_atomic16_t *v)
{
    asm volatile(
            MPLOCKED            /* lock prefix: makes the following instruction atomic */
            "incw %[cnt]"       /* incw adds 1 to cnt; the w suffix means word (2 bytes) */
            : [cnt] "=m" (v->cnt)   /* output: v->cnt is both an input and an output */
            : "m" (v->cnt)          /* input */
            );
}

32-bit atomic increment. The differences from the 16-bit version are the incl instruction and the 32-bit v->cnt.

typedef struct {
    volatile int32_t cnt; /**< An internal counter value. */
} rte_atomic32_t;

static inline void
rte_atomic32_inc(rte_atomic32_t *v)
{
    asm volatile(
            MPLOCKED
            "incl %[cnt]"
            : [cnt] "=m" (v->cnt)   /* output */
            : "m" (v->cnt)          /* input */
            );
}

64-bit atomic increment. The differences from the versions above are the incq instruction (q stands for quadword, 8 bytes) and the 64-bit v->cnt.

typedef struct {
    volatile int64_t cnt;  /**< Internal counter value. */
} rte_atomic64_t;

static inline void
rte_atomic64_inc(rte_atomic64_t *v)
{
    asm volatile(
            MPLOCKED
            "incq %[cnt]"
            : [cnt] "=m" (v->cnt)   /* output */
            : "m" (v->cnt)          /* input */
            );
}
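
The functions above list v->cnt twice, once as an "=m" output and once as an "m" input. A single read-write "+m" operand expresses the same thing. The following is a minimal sketch of that variant, assuming GCC or Clang on x86; demo_atomic32_t and both functions are hypothetical and are not DPDK's code. The second function shows the same increment written with GCC's __atomic builtin for comparison.

```c
#include <stdint.h>

typedef struct {
    volatile int32_t cnt;
} demo_atomic32_t;   /* hypothetical, mirrors rte_atomic32_t */

/* Atomic increment with one read-write memory operand. */
static inline void demo_atomic32_inc(demo_atomic32_t *v)
{
    __asm__ volatile("lock ; incl %[cnt]"
                     : [cnt] "+m" (v->cnt)   /* read and written in place */
                     :
                     : "cc");                /* incl updates the flags */
}

/* The same increment using the compiler builtin instead of asm. */
static inline void demo_atomic32_inc_builtin(demo_atomic32_t *v)
{
    __atomic_fetch_add(&v->cnt, 1, __ATOMIC_SEQ_CST);
}
```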

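3. Compare-and-swap
Compare-and-swap uses three pieces of assembly: lock, cmpxchg, and sete. Each is described below.
a. lock
See the description under atomic operations above. It makes the accompanying instruction atomic, so that only one CPU at a time accesses the memory location.
b. The cmpxchg instruction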

Compares the value in the AL, AX, EAX, or RAX register with the first operand (destination operand). If the two
values are equal, the second operand (source operand) is loaded into the destination operand. Otherwise, the
destination operand is loaded into the AL, AX, EAX or RAX register. RAX register is available only in 64-bit mode.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the
interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the
comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is
written into the destination. (The processor never produces a locked read without also producing a locked write.)

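cmpxchg compares the first operand (the destination) with the A register. If they are equal, it stores the second operand (the source) into the destination and sets ZF to 1; otherwise it loads the destination into the A register and sets ZF to 0. To make the operation atomic across CPUs, the instruction must carry the lock prefix.

The pseudocode for cmpxchg is: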

(* Accumulator = AL, AX, EAX, or RAX depending on whether a byte, word, doubleword, or quadword comparison is being performed *)
TEMP := DEST
IF accumulator = TEMP
    THEN
        ZF := 1;
        DEST := SRC;
    ELSE
        ZF := 0;
        accumulator := TEMP;
        DEST := TEMP;
FI;

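c. The sete instruction
sete sets its operand to 1 if ZF is 1, and to 0 otherwise.

The DPDK implementation is as follows.
dst points to a memory location and exp is the value the caller expects to find there. The function compares the current value at dst with exp; if they are equal, it stores src into dst and returns 1, otherwise it returns 0. Note that GCC emits AT&T syntax by default, so in the code the source operand is written first and the destination second, the reverse of the Intel description above.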

static inline int
rte_atomic32_cmpset(volatile uint32_t *dst, uint32_t exp, uint32_t src)
{
    uint8_t res;

    asm volatile(
            MPLOCKED
            "cmpxchgl %[src], %[dst];"
            "sete %[res];"
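            /* output */
            : [res] "=a" (res),     /* the result, 0 or 1, is written to res */
              [dst] "=m" (*dst)     /* the memory dst points to, which receives src on success */
            /* input */
            : [src] "r" (src),      /* src goes in a general-purpose register of GCC's choice */
              "a" (exp),            /* exp goes in the A register */
              "m" (*dst)            /* the current value at dst is read; the asm refers to it through %[dst] */
            : "memory");            /* no-clobber list; "memory" makes GCC flush registers before the asm and reload from memory afterwards */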
    return res;
}
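
A compare-and-set primitive like this is usually wrapped in a retry loop to build other atomic updates. The following is a minimal sketch, assuming GCC or Clang on x86; demo_cmpset32 is a hypothetical copy of the function above with the lock prefix written out and a "+m" operand for *dst, and demo_add_with_cas is made up for this example.

```c
#include <stdint.h>

/* Hypothetical copy of the compare-and-set above.
 * Returns 1 if *dst equalled exp and has been replaced by src. */
static inline int demo_cmpset32(volatile uint32_t *dst, uint32_t exp, uint32_t src)
{
    uint8_t res;
    __asm__ volatile("lock ; cmpxchgl %[src], %[dst]\n\t"
                     "sete %[res]"
                     : [res] "=a" (res), [dst] "+m" (*dst)
                     : [src] "r" (src), "a" (exp)
                     : "memory", "cc");
    return res;
}

/* Adds delta to *counter with a CAS retry loop and returns the new
 * value. If another thread changes *counter between the read and the
 * cmpxchg, the compare fails and the loop tries again. */
static inline uint32_t demo_add_with_cas(volatile uint32_t *counter, uint32_t delta)
{
    uint32_t old;
    do {
        old = *counter;
    } while (!demo_cmpset32(counter, old, old + delta));
    return old + delta;
}
```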

4. Spinlock implementation
The spinlock uses several instructions, described in turn below.

a. xchg

Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose
registers or a register and a memory location. If a memory operand is referenced, the processor’s locking
protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix description in this chapter for more
information on the locking protocol.)

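The xchg instruction exchanges the contents of its two operands. The operands can be two general-purpose registers, or a register and a memory location.
If one operand is in memory, the processor's locking protocol is applied automatically for the duration of the exchange, so the operation is atomic without a lock prefix.

b. test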

Computes the bit-wise logical AND of first operand (source 1 operand) and the second operand (source 2 operand)
and sets the SF, ZF, and PF status flags according to the result. The result is then discarded.

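test computes the bitwise AND of its two operands and discards the result. If the result is 0 it sets ZF to 1; otherwise it sets ZF to 0.

The pseudocode for test is: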

TEMP := SRC1 AND SRC2;
SF := MSB(TEMP);
IF TEMP = 0
    THEN ZF := 1;
    ELSE ZF := 0;
FI;

c. jz and jnz
jz: jump if ZF is 1
jnz: jump if ZF is 0

d. cmp

Compares the first source operand with the second source operand and sets the status flags in the EFLAGS register
according to the results. The comparison is performed by subtracting the second operand from the first operand
and then setting the status flags in the same manner as the SUB instruction. When an immediate value is used as
an operand, it is sign-extended to the length of the first operand.

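cmp compares two operands by subtracting the second from the first and setting the status flags accordingly, without storing the result. If the operands are equal, ZF is set to 1.

The pseudocode for cmp is: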

temp := SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction*)
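
How test and cmp drive ZF can be observed from C by capturing the flag with sete. The following is a minimal sketch, assuming GCC or Clang on x86; demo_is_zero and demo_is_equal are hypothetical helpers written only for this illustration.

```c
#include <stdint.h>

/* Returns 1 if x is 0: test ANDs x with itself, which sets ZF exactly
 * when x is 0, and sete copies ZF into a byte register. */
static inline int demo_is_zero(uint32_t x)
{
    uint8_t zf;
    __asm__("testl %[x], %[x]\n\t"
            "sete %[zf]"
            : [zf] "=q" (zf)
            : [x] "r" (x)
            : "cc");
    return zf;
}

/* Returns 1 if a equals b: cmp subtracts b from a without storing the
 * result, so ZF is set exactly when the two are equal. */
static inline int demo_is_equal(uint32_t a, uint32_t b)
{
    uint8_t zf;
    __asm__("cmpl %[b], %[a]\n\t"
            "sete %[zf]"
            : [zf] "=q" (zf)
            : [a] "r" (a), [b] "r" (b)
            : "cc");
    return zf;
}
```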

The DPDK implementation is as follows.
The lock state is kept in a volatile variable, locked, which is 1 while the lock is held and 0 otherwise.

typedef struct {
    volatile int locked; /**< lock status 0 = unlocked, 1 = locked */
} rte_spinlock_t;

locked is initialized to 0.

static inline void
rte_spinlock_init(rte_spinlock_t *sl)
{
    sl->locked = 0;
}

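Locking. The assembly contains three labels: 1, 2, and 3.
At label 1, xchg swaps locked with the local variable lv (initially 1), so locked becomes 1 and lv receives the previous value of locked. test then checks whether lv is 0, that is, whether locked was 0 before the exchange. If it was, the lock has been acquired and execution jumps to label 3. If not, another thread had already set locked to 1 and holds the lock, so execution falls through to label 2.

At label 2, the code first executes pause, then reads locked again and uses cmp to check whether it is 0. If it is, the other thread has released the lock, and execution jumps back to label 1 to retry the exchange; otherwise it keeps looping at label 2.

Reaching label 3 means the lock has been acquired, so the function simply returns.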

static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
    int lock_val = 1;
    asm volatile (
            "1:\n"
            "xchg %[locked], %[lv]\n"   //swap locked and lv
            "test %[lv], %[lv]\n"       //AND lv with itself to set ZF
            "jz 3f\n"                   //old value was 0: lock acquired, jump to label 3
            "2:\n"                      //otherwise another thread holds the lock
            "pause\n"                   //spin-wait hint
            "cmpl $0, %[locked]\n"      //compare locked with 0
            "jnz 2b\n"                  //still nonzero: not released yet, keep spinning
            "jmp 1b\n"                  //now 0: released, go back to label 1 and retry the exchange
            "3:\n"
            : [locked] "=m" (sl->locked), [lv] "=q" (lock_val)
            : "[lv]" (lock_val)
            : "memory");
}

Unlocking sets sl->locked back to 0.

static inline void
rte_spinlock_unlock (rte_spinlock_t *sl)
{
    int unlock_val = 0;
    asm volatile (
            "xchg %[locked], %[ulv]\n"   //swap locked and ulv; locked becomes 0, releasing the lock
            : [locked] "=m" (sl->locked), [ulv] "=q" (unlock_val)
            : "[ulv]" (unlock_val)
            : "memory");
}
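
The same xchg idiom also gives a non-blocking attempt to take the lock: exchange once and look at the old value instead of spinning. The following is a minimal sketch, assuming GCC or Clang on x86; demo_spinlock_t and the two functions are hypothetical and use "+m"/"+q" read-write operands rather than the output/input pairs shown above.

```c
typedef struct {
    volatile int locked;   /* 0 = unlocked, 1 = locked */
} demo_spinlock_t;         /* hypothetical, mirrors rte_spinlock_t */

/* Tries once to take the lock.
 * Returns 1 on success and 0 if the lock was already held. */
static inline int demo_spinlock_trylock(demo_spinlock_t *sl)
{
    int lockval = 1;
    __asm__ volatile("xchg %[locked], %[lockval]"
                     : [locked] "+m" (sl->locked), [lockval] "+q" (lockval)
                     :
                     : "memory");
    return lockval == 0;   /* old value 0 means the lock was free */
}

/* Releases the lock by exchanging 0 into it. */
static inline void demo_spinlock_unlock(demo_spinlock_t *sl)
{
    int unlock_val = 0;
    __asm__ volatile("xchg %[locked], %[ulv]"
                     : [locked] "+m" (sl->locked), [ulv] "+q" (unlock_val)
                     :
                     : "memory");
}
```

A caller that must not block can check the return value of demo_spinlock_trylock and do other work when it is 0, instead of spinning.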

References

https://cloud.tencent.com/developer/article/1520799
https://cloud.tencent.com/developer/article/1520798?from=article.detail.1520799
https://www.cnblogs.com/taek/archive/2012/02/05/2338838.html

Last edited: 2021.08.17 21:32:42
