Tuesday, September 23, 2008

Why does PostgreSQL have to run VACUUM?

Without question, update performance is one of PostgreSQL's major weaknesses.

PostgreSQL's weak point in update processing
PostgreSQL uses an append-only architecture: on an update, the old row is deleted inside the database and a new row is appended. The deleted rows are reclaimed by the VACUUM command as reusable space and can then be reused. If updates are frequent, VACUUM must also be run frequently; otherwise dead space keeps accumulating, the table bloats, and performance deteriorates.
VACUUM has to scan all of the table's data, so the larger the table, the longer it takes. Consequently, if a table is both large and frequently updated, no matter how often VACUUM is run it cannot keep up with the dead space produced by the updates.
The problem does not stop there. An update touches not only the table itself but also its indexes: when a column of a PostgreSQL row is updated, every index on that table must be updated as well. A table with many indexes therefore incurs an especially large amount of update work, and its indexes keep bloating. Dead space in indexes is also harder for VACUUM to reclaim and reuse, so for indexes the problem is even more serious than for the table itself.
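As a rough illustration, PostgreSQL's built-in size functions can be used to watch this bloat grow. The sketch below assumes a hypothetical, frequently-updated table named hits:

-- Size of the table's heap alone (hypothetical table "hits")
SELECT pg_size_pretty(pg_relation_size('hits')) AS table_size;
-- Size of the table together with all of its indexes (and TOAST data)
SELECT pg_size_pretty(pg_total_relation_size('hits')) AS total_size;

Running these periodically during a burst of updates makes the accumulation of dead space visible.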

Suppose we have the following simple table for tracking page views: whenever a URL is visited, cnt is incremented. This is a typical case of frequent updates.
CREATE TABLE t1(
    url TEXT PRIMARY KEY,  -- primary key; an index is created automatically
    cnt INTEGER
);
When cnt is updated, the index on the url column is updated as well.
Consider the case where cnt changes but url stays the same: the update to the url index is then unnecessary. An example of such an update is shown below.
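A page visit might be counted with an update like the one below (a minimal sketch; the URL value is just an example). Even though only cnt changes, the behavior described above still produces a whole new row version and a new entry in the index on url:

-- Only cnt changes, yet a new row version and a new url index entry are created
UPDATE t1 SET cnt = cnt + 1 WHERE url = 'http://example.com/';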

Using VACUUM to keep dead space in the table under control
When cnt is updated, the existing row is deleted (more precisely, it becomes invisible from that point on) and a new row is appended. The deleted row is marked as deleted and given a pointer to the new row. When that row is updated again, the new row is easily found by following the pointer from the old one, and a pointer to the newest row is placed there in the same way. Repeated updates thus build up a chain of pointers, called an update chain (UPDATE chain).
The longer the update chain, the longer lookups and updates take, and the effect is only removed once VACUUM runs. This is a weak point of PostgreSQL as it stands today.
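To cut these chains back and reclaim the dead row versions, VACUUM has to be run on the table. A sketch (the n_dead_tup statistics column assumes a reasonably recent PostgreSQL with the statistics collector enabled):

-- Reclaim dead row versions in t1; VERBOSE reports how many were removed
VACUUM VERBOSE t1;
-- Check how many dead row versions have accumulated
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 't1';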

Original source:
The ‘MV’ in MVCC (Multiversion Concurrency Control) stands for Multi Version. This means that multiple versions of the same data will be kept any time that data changes. Oracle does this by rolling old data into an "undo log." PostgreSQL doesn't use an undo log; instead it keeps multiple versions of data in the base tables. This means that there is much less overhead when making updates, and you must occasionally remove the old versions. This is one of the things VACUUM does.

The way PostgreSQL manages these multiple versions is by storing some extra information with every row. This information is used to determine what transactions should be able to see the row. If the row is an old version, there is information that tells PostgreSQL where to find the new version of the row. This information is needed to be able to lock rows during an update.

Consider this scenario: a row is inserted into a table that has a couple indexes, and that transaction commits. Several updates happen on that row. Each update will create a new row in all indexes, even if the index key didn't change. And each update will also leave an old version of the row in the base table, one that has been updated to point to the location of the new version of the row that replaces it. All of the old data will stick around until the vacuum is run on that table. In a busy system, it doesn't take very long for all the old data to translate into a lot of wasted space. And it's very difficult to reclaim that space if it grows to an unacceptable level.

What this means to those who want to keep their PostgreSQL database performing well is that proper vacuuming is critical. This is especially true on any tables that see a heavy update (or insert/delete) load, such as a table used to implement some kind of a queue. Such tables should generally be vacuumed frequently if they are small--more frequently than autovacuum normally would provide. For more moderate loads, autovacuum will often do a good job of keeping dead space to a minimum.
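For a hot table like t1 above, one way to vacuum more aggressively than the global defaults is simply to schedule explicit VACUUMs on it. In PostgreSQL releases that support per-table autovacuum storage parameters (8.4 and later, as far as I know), its thresholds can also be lowered; a sketch with illustrative values:

-- Vacuum just this table, more often than autovacuum would
VACUUM t1;
-- Or lower the autovacuum thresholds for this one table (newer releases)
ALTER TABLE t1 SET (autovacuum_vacuum_scale_factor = 0.01,
                    autovacuum_vacuum_threshold = 50);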


