Sunday, July 27, 2014

PostgreSQL database


We have been deploying a database service at work which is based on PostgreSQL.  Here are some notes I gathered for the installation and basic configuration of the service.  So far, we are quite impressed and satisfied by the stability of this "World's most advanced open source database" (or maybe the quote should be "Oracle wannabe")!

Memory Settings:


Server-Wide Global Settings
Shared_buffers* :
This global setting defines the amount of memory shared across all connections/processes, through which all data going in and out of the DB passes.  Unlike other DBMSs, PostgreSQL should not grab all the physical RAM available, as most reads/writes occur at the OS level and thus depend on letting the OS cache work as usual.   The only exception could be to force WAL writes to bypass the OS cache.
No easy rule exists to get the proper sizing of this buffer.   Too large a value will exacerbate some issues (e.g. "doubly cached data", or "checkpoint I/O spikes" caused by the large quantity of dirty data regularly accumulated in the cache), while too small a value means we won't benefit from PostgreSQL's smart buffer-cache eviction rule (see the clock-sweep algorithm).   But the default value set by initdb is definitely too low, as it is only chosen to guarantee that PostgreSQL starts up, and it must be changed.
We should start with a value of at least 25% of the on-board memory for servers where the OS overhead is small (typically the case for any modern system with 1 GB or more), to make sure we avoid "doubly cached" data.
* updating this global setting requires a database restart 

Wal_Buffers : 
Another shared memory setting, which controls how much memory is used for buffering write-ahead log data.  The default suggests a value of 64 KB, but this is only sufficient as long as no single transaction exceeds that value, and in practice it is recommended to set a higher value (not too high, as this takes up shared memory).   With current modern systems, a general guideline proposes setting this to: wal_buffers = 16MB.

Maintenance_work_mem: 
This is the total memory allocated for housekeeping activities like vacuuming (getting rid of dead records), as well as a few operations requiring more memory than a normal sort's work_mem (e.g. VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY).   There shouldn't be many concurrent operations like these, so this value can be set to 5% of total server RAM (50 MB of maintenance_work_mem per GB of server RAM).

Default_statistics_target:
This controls how much statistics are gathered about each table (ANALYZE statement) so that query optimization produces better execution plans.  Increasing this value only makes sense if it improves some long queries, since it also increases the background overhead of database maintenance.  The default value is now 100; before increasing it globally, it is worthwhile to fine-tune the specific tables/columns where we know the query plan could be improved (e.g. indexes used with the LIKE operator) with the ALTER TABLE ... SET STATISTICS statement.
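As an illustration, a per-column override could look like this (the table and column names are hypothetical):

-- raise the statistics target for a column frequently filtered with LIKE
ALTER TABLE customer ALTER COLUMN last_name SET STATISTICS 500;
-- re-collect statistics so the planner sees the new target
ANALYZE customer;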

Maximum_connections:
This must be estimated generously, as it represents a hard limit and new client connections will simply be refused once it is reached (the value set on a new installation is 100).   Each connection takes a small amount of shared memory, so the memory footprint related to this is quite small compared to the shared_buffers and work_mem values.  However, setting this too high limits the value we can set for the per-client work_mem, so we must balance the two.

Per-Client Connection Settings

Work_mem :
This controls the maximum amount of memory allocated per client connection for operations such as sorting, hash joins, and others.  When the estimated memory footprint of such an operation exceeds this value, PostgreSQL will decide to operate on disk instead of in memory!  To avoid this, one should set a large enough value; however, be careful that this setting applies per sort operation and not per client connection.   If you have many users connecting but fairly simple queries, you want this to be relatively low. If you do lots of intensive processing, like building a data warehouse with few users, you want this to be high (see http://www.depesz.com/2011/07/03/understanding-postgresql-conf-work_mem/).
This non-shared setting can always be adjusted by the client after connecting, but a rule of thumb is to estimate the free RAM after the shared_buffers allocation, divide it by max_connections and take a fraction of that figure (e.g. 1/3, to remain moderate since it can always be increased on the specific client connections where we know we'll need more memory):
work_mem = 1/3 * (free memory - shared_buffers) / max_connections
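For example, on a hypothetical dedicated server with 16 GB of RAM, shared_buffers = 4 GB and max_connections = 100 (all assumed figures), treating the remaining 12 GB as essentially free gives roughly:
work_mem = 1/3 * (16 GB - 4 GB) / 100 ≈ 40 MB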

Effective_cache_size:
This is an estimate of how much memory you expect to be available in the OS and PostgreSQL buffer caches. It has no effect on actual allocation; it is used only by the PostgreSQL query planner to figure out whether the plans under consideration would fit in RAM or not. If it is set too low, indexes may be underutilized. If you have a dedicated PostgreSQL server, then setting this to 50% of your on-board memory would be a good start, assuming shared_buffers is set to less than 50% of total memory.  It must be set consistently with the shared_buffers global setting, as the two cannot exceed the total physical memory of the server (too high a value may result in both the database and OS caches being disrupted by the large number of blocks read to satisfy some query).
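Putting the memory settings together, here is a possible starting postgresql.conf for a hypothetical dedicated server with 16 GB of RAM (all figures are assumptions derived from the rules above and should be validated by benchmarking):

shared_buffers = 4GB              # ~25% of RAM
wal_buffers = 16MB
maintenance_work_mem = 800MB      # ~5% of RAM
default_statistics_target = 100   # keep the default, tune per column first
max_connections = 100
work_mem = 40MB                   # 1/3 * (free RAM - shared_buffers) / max_connections
effective_cache_size = 8GB        # ~50% of RAM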

File Systems:

Tuning:

* Read-ahead 
Linux may read ahead disk blocks of various sizes.   The default read-ahead of 256 sectors (128 KB) is too low for most DB use cases.  Increasing this value benefits all the sequential reads done by PostgreSQL, so it should be set to 4096 sectors (2 MB) or more:
# blockdev --getra /dev/sda
256
# blockdev --setra 4096 /dev/sda  
(to be set for each disk device, and added to the boot script rc.local so it persists across reboots)
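For example (assuming /dev/sda is the data disk; repeat for each device), the line could simply be appended to /etc/rc.local:

# set a 2 MB read-ahead at boot time
blockdev --setra 4096 /dev/sda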

* File access time 
Disable "last access time" (atime) overhead by adding: noatime to the volume mount options in /etc/fstab:   
/dev/sda1 / ext3 noatime, errors=remount-ro 0 1

* Swapping
For more predictable DB performance, we should not let the OS swap inactive pages out when running low on memory (Linux already does its own filesystem caching).  This is done permanently by adding:
vm.swappiness=0 to /etc/sysctl.conf
forcing the OS to shrink the filesystem cache rather than swap.

* Memory overallocation
Disable the Linux behavior of overcommitting memory to processes.  This is done by adding:
vm.overcommit_memory=2 to /etc/sysctl.conf
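A minimal sketch of the corresponding /etc/sysctl.conf entries and how to apply them without a reboot (values as discussed above):

# /etc/sysctl.conf
vm.swappiness = 0
vm.overcommit_memory = 2

# apply the new settings immediately
sudo sysctl -p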

* Write-back Cache:
Write caching happens at three levels: 1) the OS filesystem cache, 2) the disk controller (or RAID card) and 3) the disk drive itself, and to ensure durability of transactions, we must:
A. Force the OS to use fsync call mechanisms to write data securely
B. Use a BBWC (battery-backed write cache) at the controller and monitor its health state, so we can take advantage of a fast non-volatile write cache!
C. Disable the write cache set at the disk drive level (which is always volatile)
Disabling the write cache at the disk drive level:
sudo hdparm -W 0 /dev/sda
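On the PostgreSQL side, point A relies on the durability settings in postgresql.conf; here is a sketch of the relevant entries (these are the usual defaults on Linux as far as I know, shown mostly as a reminder never to turn them off on a production system):

fsync = on                   # flush WAL to disk with fsync-style calls
synchronous_commit = on      # wait for the WAL flush before reporting a commit
wal_sync_method = fdatasync  # platform dependent; fdatasync is the usual Linux default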

* Write cache sizing:
We could change both the dirty_background_ratio and dirty_ratio parameters (kernel parameters driving how aggressively the cache is written out to disk), but it seems that this matters mostly when the filesystem is ext3.
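As an illustration only (the figures below are assumptions, not a recommendation), lowering these ratios in /etc/sysctl.conf reduces the amount of dirty data the kernel is allowed to accumulate before writing it out:

vm.dirty_background_ratio = 5   # start background write-out at 5% of RAM dirty
vm.dirty_ratio = 10             # block writers once 10% of RAM is dirty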

Tablespace definition:
We should define a different tablespace for each of these components:
1) System catalog data
2) Application data
3) Index data
For application table data (stored within files of max 1 GB each inside a tablespace, i.e. a directory), the goal is to separate data from indexes onto different drives, as the two are frequently accessed concurrently.   Furthermore, very large tables (many GBs) could leverage table partitioning, which stores the various chunks of a table in separate tablespaces.   Use temp_tablespaces for the temporary storage needed by some query evaluations (e.g. when sorting is required); this tablespace should be designed for space efficiency rather than speed and is subject to disk fragmentation.
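Here is a minimal sketch of how such tablespaces could be declared (the directory paths reuse the locations proposed further below, the tablespace/table names are hypothetical, and the directories must exist and be owned by the postgres user):

-- separate locations for table data and index data
CREATE TABLESPACE app_data LOCATION '/pgdata-app_tblspc/data';
CREATE TABLESPACE app_index LOCATION '/pgdata-app_tblspc/index';
CREATE TABLESPACE temp_space LOCATION '/pgdata-temp';

-- place a table and its index explicitly
CREATE TABLE customer (id bigint PRIMARY KEY, name text) TABLESPACE app_data;
CREATE INDEX customer_name_idx ON customer (name) TABLESPACE app_index;

-- route temporary files (sorts, temp tables) to the dedicated tablespace, in postgresql.conf:
-- temp_tablespaces = 'temp_space'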

WAL (write-ahead logging):
We should add archiving of these WAL redo log files, unless we work in failover mode.
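A minimal sketch of WAL archiving in postgresql.conf, assuming the /pgarchive location proposed below (the cp command is a simple placeholder; a production setup would use a more robust archiving script):

wal_level = archive          # make sure WAL contains enough information to archive
archive_mode = on
archive_command = 'test ! -f /pgarchive/%f && cp %p /pgarchive/%f'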

Another proposal is to define finer-grained tablespace file systems (partitions) for more flexibility:
/pgarchive          DB archive location containing the archive log files.
/pgbackup           DB backup location containing the physical and logical backups. For logical backups (pg_dump), use EXPORTS as a sub-directory. For physical backups, use the FULL{date} convention for the sub-directory. However, physical backups could be handled with file system snapshots. More on this later.
/pgcluster          PostgreSQL cluster data directory (PGDATA environment variable). This will contain all of the configuration files and directories of a PostgreSQL cluster. Note: the on-line WAL files would be located in /pgcluster/pg_xlog.
/pglog              The location of the server log files.
/pgdata-system      The location of a database's default tablespace. This is to be used at the creation of the database. The database catalog information will be stored here.
/pgdata-temp        The location of the database's default temporary tablespace. This is needed for temporary sort information. Note: the pgsql_tmp directory within the default tablespace will be used if a temporary tablespace is not defined.
/pgdata-app_tblspc  Application tablespaces. For every schema there should be a minimum of two tablespaces: one for tables and one for indexes.
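As a sketch of how these locations could be wired together (same paths as above, assuming the postgres user owns them):

# initialize the cluster in its dedicated mount point
initdb -D /pgcluster

# then in /pgcluster/postgresql.conf:
logging_collector = on
log_directory = '/pglog'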

Minimal Disk Recommendations


There is no general one-size-fits-all solution when it comes to disk layout, as it is very much application specific.  The general recommendation is that the more disks we have, the better we can spread/parallelize the workload across separate spindles, hence increasing throughput.

The current hardware has only two disks (in RAID-1) and cannot be considered a viable long-term solution.   So, presuming the PostgreSQL service will host many concurrent and highly demanding applications in the future, here are the minimal disk requirements we should keep in mind for future improvement.

Disk Layout
1)  Have the OS on a separate disk (e.g. 2 disks in RAID-1)
2)  Have the WAL transaction log on a separate disk (optimally designed for sequential writes)
3)  Have the temp tablespace (used for large user sorts) on a separate disk (optimally designed for random access; here an SSD could be a viable solution)
4)  Have all database data on as many disks as possible, in RAID 1+0
5)  Have all index-related data on separate disks, in RAID 1+0

The principles behind these recommendations are:
  1. Avoid putting the WAL onto the same drive as the OS, since it has a completely different access pattern (WAL writes are purely sequential)
  2. If there will be large sorting operations, the temporary files (temp_tablespaces) should not be stored alongside the DB files (tables + indexes)
  3. Application data is accessed concurrently with index data, so separating the two will boost performance in all cases.


Configuration Scenario
Now, how to configure all these drives is more challenging.

For example, for a 14 Disk scenario, we could have the following configuration per disk:
1) Scenario 1:
Location           #Disks   RAID level   Purpose
/ (root)           2        1            OS
$PGDATA/data       6        1+0          Database data only
$PGDATA/index      2        1+0          Database index only
$PGDATA/pg_xlog    2        1            WAL
Tablespace-temp    2        None         Temporary files

But we could just as well have this simpler configuration:
2) Scenario 2
Location   #Disks   RAID level   Purpose
/ (root)   14       1+0          OS, DB, WAL

Which scenario is going to perform better is impossible to say without benchmarking on the target DB with a realistic data access pattern.
It is only possible to predict that scenario 2 will perform better for workloads with more random access, while scenario 1 will perform better with more sequential access.
However, the exact mix of sequential versus random access is simply not predictable, so optimizing the disk layout in advance is equivalent to guessing!

Martin