C++, Performance, Memory

The Cost of Padding, Vtables, and Smart Pointers

A deep dive into memory layout and the object model: how struct padding wastes space, what a virtual function really costs, and the hidden costs of smart pointers that few people talk about.

2026-03-31

Item 3 - Struct Padding & Alignment - `sizeof` is not what you think

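The CPU's natural alignment requirement

Every type has an alignment requirement, and for fundamental types it usually equals the size: an N-byte object must start at an address that is a multiple of N. On x86-64, for example, an int (4 bytes) must sit at a multiple of 4 and a double (8 bytes) at a multiple of 8. When a field would otherwise land on a misaligned address, the compiler automatically inserts padding bytes in front of it.

This means sizeof is often larger than the intuitive "sum of all field sizes".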
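
To make the rule concrete, you can ask the compiler for each type's alignment with alignof and compute the padding yourself. This is a minimal sketch that assumes a typical x86-64 ABI; `padding_before` is a hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>

// Bytes of padding needed so that a field with alignment `align`
// can start at or after byte offset `offset`.
constexpr std::size_t padding_before(std::size_t offset, std::size_t align) {
  return (align - offset % align) % align;
}

// Assumption: typical x86-64 values; other platforms may differ.
static_assert(alignof(char) == 1, "char can live at any address");
static_assert(alignof(int) == 4, "int is 4-byte aligned on this ABI");
static_assert(alignof(double) == 8, "double is 8-byte aligned on this ABI");

// A double placed right after a single char needs 7 bytes of padding.
static_assert(padding_before(1, alignof(double)) == 7, "char then double");
// An int placed right after a single char needs 3 bytes of padding.
static_assert(padding_before(1, alignof(int)) == 3, "char then int");
// A field whose offset is already a multiple of its alignment needs none.
static_assert(padding_before(8, alignof(double)) == 0, "already aligned");
```

Everything here is checked at compile time, so a platform with different alignments simply fails to build instead of silently printing other numbers.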

Comparing memory layouts

The same three fields in a different order can give a very different sizeof:

cpp

// ❌ Bad: char a, double b, char c
struct Bad {
  char   a;   // 1 byte
  double b;   // 8 bytes
  char   c;   // 1 byte
};
// sizeof(Bad) = 24

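Memory layout of Bad (each cell = 1 byte):

Addr:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
     [a][  padding (7 bytes)  ][      double b (8 bytes)      ][c][  pad (7)  ]

a takes 1 byte → 7 bytes of padding are inserted so that double b starts at a multiple of 8
c takes 1 byte → the struct's total size must be a multiple of its largest alignment (8), so 7 more bytes are added at the end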

cpp

// ✅ Good: double b, char a, char c
struct Good {
  double b;   // 8 bytes
  char   a;   // 1 byte
  char   c;   // 1 byte
};
// sizeof(Good) = 16

Memory layout of Good:

Addr:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
     [      double b (8 bytes)      ][a][c][ pad (6) ]

b is already aligned → a and c follow back to back → 6 bytes of tail padding → 16 bytes in total

Same fields, different order: the struct shrinks from 24 bytes to 16, a 33% saving.

Verifying with sizeof and offsetof

cpp

#include <cstddef>
#include <cstdio>

struct Bad  { char a; double b; char c; };
struct Good { double b; char a; char c; };

int main() {
  printf("sizeof(Bad)  = %zu\n", sizeof(Bad));   // 24
  printf("sizeof(Good) = %zu\n", sizeof(Good));  // 16

  printf("Bad::a  offset = %zu\n", offsetof(Bad, a));   // 0
  printf("Bad::b  offset = %zu\n", offsetof(Bad, b));   // 8
  printf("Bad::c  offset = %zu\n", offsetof(Bad, c));   // 16

  printf("Good::b offset = %zu\n", offsetof(Good, b));  // 0
  printf("Good::a offset = %zu\n", offsetof(Good, a));  // 8
  printf("Good::c offset = %zu\n", offsetof(Good, c));  // 9
}

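Rule of thumb

Put the larger fields (more precisely, the ones with the stricter alignment) first and the smaller ones last, so the compiler has to insert as little padding as possible.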
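
The effect grows with the number of fields. The sketch below applies the rule to a five-field record; the struct and field names are hypothetical, and the sizes assume a typical x86-64 ABI.

```cpp
#include <cstdint>

// Fields in "declaration as it came to mind" order.
struct Unordered {
  std::uint8_t  tag;    // offset 0,  then 7 bytes of padding
  std::uint64_t id;     // offset 8
  std::uint16_t port;   // offset 16, then 2 bytes of padding
  std::uint32_t flags;  // offset 20
  std::uint8_t  state;  // offset 24, then 7 bytes of tail padding
};

// Same fields, sorted by decreasing alignment.
struct Ordered {
  std::uint64_t id;     // offset 0
  std::uint32_t flags;  // offset 8
  std::uint16_t port;   // offset 12
  std::uint8_t  tag;    // offset 14
  std::uint8_t  state;  // offset 15, no padding anywhere
};

// Assumption: x86-64; both structs hold 16 bytes of actual data.
static_assert(sizeof(Unordered) == 32, "16 bytes of data + 16 bytes of padding");
static_assert(sizeof(Ordered) == 16, "16 bytes of data, no padding");
```

In an array of a million records, that reordering alone is the difference between 32 MB and 16 MB.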

The trade-offs of `#pragma pack(1)`

You can use #pragma pack(1) to force all padding to be removed:

cpp

#pragma pack(push, 1)
struct Packed {
  char   a;   // offset 0
  double b;   // offset 1 (misaligned!)
  char   c;   // offset 9
};
#pragma pack(pop)
// sizeof(Packed) = 10

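It looks like free space, but the cost is real:

x86: misaligned accesses are allowed but can be slower, especially when the value straddles a cache-line boundary and needs two memory accesses
ARM: some ARM cores raise a bus fault on a misaligned access (the program crashes)
Only use it where the exact layout matters, such as network protocols and file formats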
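
When you do need a packed layout, a common way to read a misaligned field portably is to memcpy it into an aligned local variable instead of taking a pointer or reference to it. The sketch below assumes a made-up 7-byte wire header; `WireHeader` and `read_length` are hypothetical names, and byte-order conversion is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct WireHeader {        // hypothetical on-the-wire header
  std::uint8_t  type;      // offset 0
  std::uint32_t length;    // offset 1 (misaligned)
  std::uint16_t checksum;  // offset 5 (misaligned)
};
#pragma pack(pop)

static_assert(sizeof(WireHeader) == 7, "packed: no padding");
static_assert(offsetof(WireHeader, length) == 1, "length sits at a misaligned offset");

// Copy the raw bytes into a properly aligned local before using the value.
// The result is in whatever byte order the buffer holds (no conversion here).
inline std::uint32_t read_length(const unsigned char* buf) {
  std::uint32_t length;
  std::memcpy(&length, buf + offsetof(WireHeader, length), sizeof(length));
  return length;
}
```

Compilers typically turn this fixed-size memcpy into a single load on targets that tolerate misaligned access, so the portable version usually costs nothing extra there.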

Compiler warnings

With the -Wpadded flag, the compiler warns every time it inserts padding, which helps you find structs worth reordering:

bash

g++ -Wpadded main.cpp      # clang++ accepts the same flag
# Clang words the warning like this (GCC's wording differs):
# warning: padding struct 'Bad' with 7 bytes to align 'b'

Quiz

Q3: What is the sizeof of each of the following structs? (Assume x86-64.)

A. struct A { char x; int y; char z; };
B. struct B { int y; char x; char z; };
(a) A=6, B=6
(b) A=12, B=8
(c) A=8, B=8
(d) A=12, B=12

Show Answer

(b) A=12, B=8

struct A: char(1) + pad(3) + int(4) + char(1) + pad(3) = 12. struct B: int(4) + char(1) + char(1) + pad(2) = 8. Putting the int first saves 4 bytes.

Item 4 - Vtable - the real cost of a `virtual` function

What happens when you add one virtual?

cpp

struct NoVirtual {
  int x;
};
// sizeof(NoVirtual) = 4

struct WithVirtual {
  int x;
  virtual void foo() {}
};
// sizeof(WithVirtual) = 16

Adding a single virtual makes sizeof jump from 4 to 16 (on a typical 64-bit ABI). The 16 bytes break down as: vptr (8 bytes) + int (4 bytes) + padding (4 bytes).

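How the vtable works

When a class has a virtual function, the compiler does two things:

It generates a vtable (virtual table) - an array of function pointers, typically stored in a read-only data section
It inserts a vptr (8 bytes on a 64-bit target) into every object, usually at the start, pointing to that class's vtable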

Memory layout:

Object (WithVirtual)           Vtable (.rodata section)
+------------------+           +-------------------+
| vptr (8 bytes) --|---------->| &WithVirtual::foo |
+------------------+           +-------------------+
| int x  (4 bytes) |
+------------------+
| padding (4 bytes)|
+------------------+
  sizeof = 16

With inheritance:
Derived object                 Derived Vtable
+------------------+           +-------------------+
| vptr (8 bytes) --|---------->| &Derived::foo     |  ← overriding version
+------------------+           +-------------------+
| int x  (4 bytes) |
+------------------+
| padding (4 bytes)|
+------------------+
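
The layouts above can be observed from code. This is a minimal sketch with illustrative `Base` and `Derived` types (not taken from a real codebase); the size check assumes a mainstream ABI in which an override adds nothing to the object.

```cpp
#include <type_traits>

struct Base {
  virtual ~Base() = default;
  virtual int id() const { return 1; }
  int x = 0;
};

struct Derived : Base {
  int id() const override { return 2; }  // takes Base::id's slot in Derived's vtable
};

static_assert(std::is_polymorphic<Base>::value, "Base has virtual functions");
// Assumption: overriding adds no per-object storage (still one vptr plus the int).
static_assert(sizeof(Derived) == sizeof(Base), "same object size");

// The static type is Base, but the vptr stored in the object picks the function.
inline int id_through_base(const Base& b) { return b.id(); }
```

Passing a Derived to `id_through_base` returns 2 even though the function only knows about Base, because the object's vptr points at Derived's vtable.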

What a virtual call does

For every virtual call, the CPU has to:

obj->foo();

1. Load obj's vptr                   → 1st memory indirection
2. Index into the vtable             → 2nd memory indirection
3. Call through the function pointer → the actual (indirect) jump

A non-virtual call, by comparison:
obj->bar();                          → direct jump to a fixed address (known at compile time)
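
The three steps above can be written out by hand with plain structs and function pointers. This is only a sketch of the idea with hypothetical names; real compilers generate the table and the hidden pointer for you, and the exact layout is ABI-specific.

```cpp
struct Shape;

// The "vtable": one function-pointer slot per virtual function.
struct ShapeVTable {
  int (*area)(const Shape*);
};

struct Shape {
  const ShapeVTable* vptr;  // what the compiler normally hides inside the object
  int w;
  int h;
};

inline int rect_area(const Shape* s) { return s->w * s->h; }
inline int triangle_area(const Shape* s) { return s->w * s->h / 2; }

// One table per "class", shared by every object of that class.
const ShapeVTable kRectVTable{&rect_area};
const ShapeVTable kTriangleVTable{&triangle_area};

inline int call_area(const Shape* s) {
  const ShapeVTable* table = s->vptr;     // 1. load the vptr from the object
  int (*fn)(const Shape*) = table->area;  // 2. load the function pointer from the table
  return fn(s);                           // 3. indirect call
}
```

A `Shape{&kRectVTable, 4, 6}` yields 24 through `call_area`, while the same fields with `&kTriangleVTable` yield 12: the call site is identical and only the table pointer differs.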

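The real cost: losing the chance to inline

The two extra memory indirections are not that slow by themselves. The real performance killer is that the compiler usually cannot inline a virtual call.

For a non-virtual function the target is known at compile time, so the compiler can paste the function body into the call site (inlining) and remove the call overhead. For a virtual function the target is only known at run time, so unless the compiler can prove the object's dynamic type, this optimization is off the table.

In a hot loop, whether a small function gets inlined can make a 5-10x difference. The cost is not the indirection itself; it is the lost inlining and the optimizations that follow from it.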

Devirtualization: getting the performance back with `final`

cpp

class Base {
public:
  virtual ~Base() = default;   // polymorphic base: give it a virtual destructor
  virtual void process() = 0;
};

class Impl final : public Base {  // ← final: nothing can derive from Impl
public:
  void process() override { /* ... */ }
};

void hot_loop(Impl& impl) {
  for (int i = 0; i < 1000000; ++i) {
    impl.process();  // Impl is final → the call can be devirtualized → and then inlined
  }
}

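final tells the compiler "this class will never be derived from", so a virtual call through an Impl reference or pointer can be turned back into a direct call and becomes eligible for inlining again. LTO (Link-Time Optimization) can achieve a similar effect in some cases.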
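
final can also be placed on a single virtual function instead of the whole class, which leaves the class open for inheritance while pinning down that one call. The sketch below uses hypothetical `Codec` and `FastCodec` types; whether the call is actually devirtualized and inlined is up to the optimizer and is not guaranteed.

```cpp
struct Codec {
  virtual ~Codec() = default;
  virtual int encode(int x) const = 0;
};

struct FastCodec : Codec {
  // final on the function: classes derived from FastCodec cannot override encode().
  int encode(int x) const final { return x * 2 + 1; }
};

// The static type is FastCodec and encode() is final, so the compiler knows
// exactly which function runs and is free to make the call direct.
inline int encode_all(const FastCodec& codec, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += codec.encode(i);
  }
  return sum;
}
```

Calls through a `Codec&` still use normal virtual dispatch, so the benefit only applies where the code holds the concrete type.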

CRTP: static polymorphism as an alternative

cpp

// Curiously Recurring Template Pattern
template <typename Derived>
class Base {
public:
  void interface() {
    static_cast<Derived*>(this)->implementation();  // target is resolved at compile time
  }
};

class Impl : public Base<Impl> {
public:
  void implementation() { /* ... */ }  // no virtual needed, can be inlined
};

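CRTP gives you polymorphic behavior without virtual. All dispatch is resolved at compile time, so there is no run-time dispatch cost. The downsides are heavier syntax and no run-time polymorphism (for example, you cannot put different implementations in one heterogeneous container).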
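
To show how CRTP is used from calling code, here is a small sketch with two implementations and a generic function that works with either one. `ShapeBase`, `Square`, `Rect`, and `doubled_area` are hypothetical names chosen for this example.

```cpp
template <typename Derived>
class ShapeBase {
public:
  // Forwards to the derived class; the target is fixed at compile time.
  double area() const {
    return static_cast<const Derived*>(this)->area_impl();
  }
};

class Square : public ShapeBase<Square> {
public:
  explicit Square(double side) : side_(side) {}
  double area_impl() const { return side_ * side_; }
private:
  double side_;
};

class Rect : public ShapeBase<Rect> {
public:
  Rect(double w, double h) : w_(w), h_(h) {}
  double area_impl() const { return w_ * h_; }
private:
  double w_;
  double h_;
};

// Accepts any CRTP shape; one copy is instantiated per concrete type T.
template <typename T>
double doubled_area(const ShapeBase<T>& shape) {
  return 2.0 * shape.area();
}
```

`Square` and `Rect` share no common base type here (`ShapeBase<Square>` and `ShapeBase<Rect>` are unrelated), which is exactly why they cannot be stored together in one container.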

std::function also relies on type erasure, a mechanism similar to a vtable, and carries a similar overhead. If the callable's type is known at compile time, take it as a template parameter instead of std::function.
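
The two styles look like this side by side. This is a minimal sketch; `apply_erased` and `apply_static` are hypothetical names, and the inlining behavior described in the comments is what optimizers typically do rather than a guarantee.

```cpp
#include <functional>
#include <utility>

// Type-erased: any callable with a matching signature fits, but the call goes
// through an indirect dispatch stored inside std::function.
inline int apply_erased(const std::function<int(int)>& f, int x) {
  return f(x);
}

// Template parameter: the concrete callable type F is part of the function's
// type, so the compiler sees the body of f and can inline it.
template <typename F>
int apply_static(F&& f, int x) {
  return std::forward<F>(f)(x);
}
```

Both return the same result for the same lambda; the difference is that `apply_static` is instantiated separately for each callable type, trading code size for a direct call.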

Quiz

Q4: What is the sizeof of each of the following structs? (x86-64)

A. struct Plain { int x; };
B. struct Virtual { int x; virtual void foo() {} };
(a) 4, 8
(b) 4, 12
(c) 4, 16
(d) 8, 16

Show Answer

(c) 4, 16

Plain holds only an int, so sizeof = 4. Virtual adds a virtual function → the compiler inserts a vptr (8 bytes), plus the int (4 bytes) and padding (4 bytes), so sizeof = 16.

Item 5 - The hidden costs of smart pointers

`unique_ptr`: a true zero-cost abstraction

std::unique_ptr is one of the few zero-cost abstractions in C++ that truly lives up to the name:

cpp

#include <cstdio>
#include <memory>

int main() {
  // Values shown are for a typical 64-bit platform.
  printf("sizeof(int*)              = %zu\n", sizeof(int*));               // 8
  printf("sizeof(unique_ptr<int>)   = %zu\n", sizeof(std::unique_ptr<int>)); // 8
  printf("sizeof(shared_ptr<int>)   = %zu\n", sizeof(std::shared_ptr<int>)); // 16
}

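With the default deleter, unique_ptr has exactly the same sizeof as a raw pointer: 8 bytes. There is no control block and no reference counting, and the compiler can typically optimize it down to the same machine code as a raw pointer. (A stateful deleter, such as a function pointer, makes the object larger.)

Memory layout of unique_ptr<Widget>:

+-----------------+
| raw pointer (8B)|          ← just one pointer, no extra overhead
+-----------------+
  sizeof = 8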
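
The "just one pointer" layout depends on the deleter carrying no state. The sketch below compares three variants; the sizes are not mandated by the C++ standard but are assumed to hold on the mainstream standard libraries (libstdc++, libc++, MSVC). `FileCloser` and the two aliases are hypothetical names.

```cpp
#include <cstdio>
#include <memory>

// Stateless functor deleter: an empty class, so it needs no storage.
struct FileCloser {
  void operator()(std::FILE* f) const {
    if (f) std::fclose(f);
  }
};

using FileHandleFunctor = std::unique_ptr<std::FILE, FileCloser>;
// Function-pointer deleter: the pointer to the function must be stored per object.
using FileHandleFnPtr = std::unique_ptr<std::FILE, int (*)(std::FILE*)>;

// Assumption: these hold on common implementations, not by standard guarantee.
static_assert(sizeof(std::unique_ptr<int>) == sizeof(int*),
              "default deleter: same size as a raw pointer");
static_assert(sizeof(FileHandleFunctor) == sizeof(std::FILE*),
              "empty functor deleter: still one pointer");
static_assert(sizeof(FileHandleFnPtr) > sizeof(std::FILE*),
              "function-pointer deleter: extra storage");
```

So when a custom deleter is needed, an empty functor (or a captureless lambda type) keeps the zero-overhead layout, while a function pointer roughly doubles the size.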

`shared_ptr`: the hidden control block

shared_ptr is a different story. Each shared_ptr takes 16 bytes (two pointers), and behind it sits a hidden control block:

Memory layout of shared_ptr<Widget>:

shared_ptr object itself (16B)   Control Block (on the heap)
+--------------------+           +-------------------+
| ptr to Widget (8B) |           | strong_count (4B) |  ← atomic
+--------------------+           | weak_count   (4B) |  ← atomic
| ptr to control (8B)|---------→| deleter           |
+--------------------+           | allocator         |
  sizeof = 16                    +-------------------+
                                        |
                                        v
                                 +-------------------+
                                 | Widget object     |
                                 +-------------------+

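Every copy of a shared_ptr performs an atomic increment on strong_count, and every destruction performs an atomic decrement.

Why are atomics slow?

An atomic operation is not an ordinary add or subtract. On x86, an atomic increment compiles to a LOCK-prefixed instruction such as LOCK ADD or LOCK XADD, which does the following:

Locks the cache line (other cores cannot access it at the same time)
Goes through the cache coherence protocol (the core must own the line exclusively)
Acts as a memory barrier (prevents reordering across it)

Rough cost comparison (order-of-magnitude figures that vary by CPU):

Plain increment:        ~1 cycle
Atomic (no contention): ~10-20 cycles       ← roughly 10-20x slower
Atomic (contention):    ~100-1000+ cycles   ← 100x+ slower

Even in a single-threaded program the atomic still costs those ~10-20 cycles,
because the LOCK prefix has a fixed cost of its own.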
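
To see where those atomic instructions come from, here is a minimal reference counter in the spirit of a control block's strong count. It is a sketch under stated assumptions, not how any particular standard library implements shared_ptr; the class name `RefCount` and the chosen memory orders are illustrative.

```cpp
#include <atomic>

class RefCount {
public:
  // What copying a shared_ptr conceptually does: one atomic increment.
  void acquire() noexcept {
    count_.fetch_add(1, std::memory_order_relaxed);
  }

  // What destroying a shared_ptr conceptually does: one atomic decrement.
  // Returns true when the caller just dropped the last reference.
  bool release() noexcept {
    return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

  long value() const noexcept {
    return count_.load(std::memory_order_relaxed);
  }

private:
  std::atomic<long> count_{1};  // the creating owner holds the first reference
};
```

On x86 both `fetch_add` and `fetch_sub` become LOCK-prefixed read-modify-write instructions whatever memory order is requested, which is why the cost shows up even when no other thread ever touches the counter.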

Avoiding unnecessary atomics: pass by const reference

cpp

// ❌ Copies the shared_ptr on every call → atomic inc/dec
void process(std::shared_ptr<Widget> w) {
  w->doSomething();
}

// ✅ Pass by const reference → no atomic operations
void process(const std::shared_ptr<Widget>& w) {
  w->doSomething();
}

If the function does not need to extend the object's lifetime (it does not keep its own shared_ptr), pass a const reference and avoid the pointless atomic operations.
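
The difference is visible through use_count(). The sketch below uses a placeholder `Widget` and hypothetical function names; it also shows a third option, taking the object itself, for functions that never touch ownership at all.

```cpp
#include <memory>

struct Widget {
  int value = 42;
};

// By value: the parameter is an additional owner for the duration of the call.
inline long owners_seen_by_value(std::shared_ptr<Widget> w) {
  return w.use_count();
}

// By const reference: no new owner, so no reference-count traffic.
inline long owners_seen_by_ref(const std::shared_ptr<Widget>& w) {
  return w.use_count();
}

// No ownership involved at all: take the object itself.
inline int read_value(const Widget& w) {
  return w.value;
}
```

With a single owner `auto w = std::make_shared<Widget>();`, `owners_seen_by_value(w)` reports 2 while `owners_seen_by_ref(w)` reports 1, and the count is back to 1 after both calls return.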

`make_shared` vs `shared_ptr(new T)`

cpp

// ❌ Two heap allocations
auto p1 = std::shared_ptr<Widget>(new Widget());

// ✅ One heap allocation
auto p2 = std::make_shared<Widget>();

shared_ptr(new T) - two allocations:

Heap allocation 1:          Heap allocation 2:
+-------------------+       +-------------------+
| Widget object     |       | Control Block     |
+-------------------+       +-------------------+

make_shared<T>() - one allocation:

Heap allocation 1 (contiguous memory):
+-------------------+-------------------+
| Control Block     | Widget object     |
+-------------------+-------------------+

Benefits:
1. One fewer heap allocation (heap allocations are expensive, often on the order of 50-100ns)
2. Better cache locality (the control block and the object are adjacent)
3. Exception safety (before C++17, an expression such as f(shared_ptr<T>(new T), g()) could leak if g() threw between the new and the shared_ptr construction)
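
One way to check the allocation counts yourself is to replace the global operator new with a counting version. This is a single-threaded sketch with hypothetical names (`g_allocations`, `allocations_with_new`, `allocations_with_make_shared`); the counts of 2 and 1 assume a mainstream standard library and an unoptimized build, since compilers are allowed to elide allocations.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Global allocation counter (single-threaded sketch, so a plain int is enough).
static int g_allocations = 0;

// Replace the global allocation functions so every heap allocation is counted.
void* operator new(std::size_t size) {
  ++g_allocations;
  if (void* p = std::malloc(size == 0 ? 1 : size)) return p;
  throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

struct Widget {  // placeholder payload type
  int value = 0;
};

// Heap allocations made by shared_ptr<Widget>(new Widget()):
// one for the Widget, one for the control block.
inline int allocations_with_new() {
  const int before = g_allocations;
  auto p = std::shared_ptr<Widget>(new Widget());
  return g_allocations - before;
}

// Heap allocations made by make_shared<Widget>():
// a single block that holds the control block and the Widget.
inline int allocations_with_make_shared() {
  const int before = g_allocations;
  auto p = std::make_shared<Widget>();
  return g_allocations - before;
}
```

Measuring the difference in the counter rather than its absolute value keeps the result independent of any allocations the runtime makes before these functions run.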

Decision tree

Do you need shared ownership?
│
├── Yes → shared_ptr (with make_shared)
│         └── pass it by const reference where possible
│
└── No → unique_ptr ✅
         └── zero cost, as fast as a raw pointer

Quiz

Q5: Which of the following statements about smart pointers is incorrect?

A. unique_ptr has the same sizeof as a raw pointer
B. shared_ptr's reference count uses atomic operations
C. make_shared performs one more heap allocation than shared_ptr(new T)
D. Passing const shared_ptr& avoids the atomic increment

Show Answer

C. It is the other way around: make_shared puts the control block and the object into a single allocation, which is one fewer heap allocation than shared_ptr(new T).

Bonus - Putting it all together

Quiz

The Config struct below has three performance problems. Can you find them all?

struct Config {
  char flag;
  double weight;
  char mode;
  virtual void validate() {}
  std::shared_ptr<Logger> logger;
};

void init(std::shared_ptr<Logger> logger) {
  // ...
}

Show Answer

The three problems:

1. Struct padding: the char, double, char order wastes a lot of padding. Put the double first: double weight; char flag; char mode;
2. Vtable: virtual validate() adds an 8-byte vptr to every Config object. If you do not need polymorphism, remove virtual; if you do, consider CRTP or final.
3. Smart pointer: init() takes shared_ptr<Logger> by value, so every call performs an atomic inc/dec. Change it to const std::shared_ptr<Logger>& logger.
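
Applying all three fixes could look like the sketch below. `Logger` is an empty placeholder, `ConfigBefore` reproduces the quiz struct for comparison, the non-virtual `validate()` body is invented for illustration, and the sizes assume x86-64 with a 16-byte shared_ptr.

```cpp
#include <memory>

struct Logger {};  // placeholder; the real Logger is not shown in the quiz

// The quiz version: vptr + badly ordered fields.
struct ConfigBefore {
  char flag;
  double weight;
  char mode;
  virtual void validate() {}
  std::shared_ptr<Logger> logger;
};

// After the fixes: largest-alignment fields first, no virtual.
struct Config {
  double weight = 0.0;
  std::shared_ptr<Logger> logger;
  char flag = 0;
  char mode = 0;

  // Illustrative non-virtual check; the original body was empty.
  bool validate() const { return weight >= 0.0; }
};

// Assumption: x86-64, sizeof(std::shared_ptr<Logger>) == 16.
static_assert(sizeof(ConfigBefore) == 48, "8 vptr + 8 + 8 + 8 + 16");
static_assert(sizeof(Config) == 32, "8 + 16 + 2, rounded up to 32");

// Takes the shared_ptr by const reference: no reference-count traffic per call.
inline void init(const std::shared_ptr<Logger>& logger) {
  if (logger) {
    // ... use *logger ...
  }
}
```

Under these assumptions each Config shrinks from 48 to 32 bytes, and calling init() leaves the logger's use count untouched.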

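Summary

| Topic | Key point | What to do |
| --- | --- | --- |
| `Struct Padding` | Field order affects sizeof | Put large fields first; check with -Wpadded |
| `Vtable` | The cost of virtual is the lost inlining | Use final or CRTP, or remove unnecessary virtual |
| `Smart Pointer` | shared_ptr's atomics are a hidden cost | Prefer unique_ptr, pass by const ref, use make_shared |

Key takeaway

Understanding memory layout and the object model is what lets you write efficient, CPU-friendly C++.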

