Sunday, July 27, 2014

PostgreSQL database


We have been deploying a database service at work based on PostgreSQL.  Here are some notes I gathered during the installation and basic configuration of the service.  So far, we are quite impressed and satisfied by the stability of this "world's most advanced open source database" (or maybe the quote should be "Oracle wannabe")!

Memory Settings:


Server-wide (Global) Settings
shared_buffers* :
This global setting defines the amount of memory shared across all connections/processes, through which all data going in and out of the database passes.  Unlike other DBMSs, PostgreSQL should not grab all the physical RAM available: most reads and writes also go through the OS page cache, so PostgreSQL depends on letting the OS cache do its usual work.  The only exception could be to force WAL writes to bypass the OS cache.
There is no easy rule for sizing this buffer properly.  Too large a value will exacerbate some issues (e.g. "doubly cached data", or "checkpoint I/O spikes" caused by the large quantity of dirty data regularly accumulated in the cache), while too small a value means we won't benefit from PostgreSQL's smart buffer-eviction rule (see the clock-sweep algorithm).  The default value set by initdb is definitely too low, as it is chosen only to guarantee that PostgreSQL starts up, and it must be changed.
We should start with a value of roughly 25% of the on-board memory on a server where the OS overhead is small (typically the case for any modern system with 1 GB or more), which keeps "doubly cached data" under control.
* updating this global setting requires a database restart 
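As an illustration only, on a hypothetical dedicated server with 16 GB of RAM, the corresponding postgresql.conf entry could look like:
shared_buffers = 4GB        # ~25% of 16 GB of RAM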

wal_buffers :
Another shared-memory setting, which controls how much memory is used for buffering write-ahead log data.  The default suggests a value of 64 kB, but this is only sufficient as long as no single transaction exceeds that value; in practice it is recommended to set a higher value (though not too high, since this takes up shared memory).  On current modern systems, a general guideline is: wal_buffers = 16MB.

maintenance_work_mem:
This is the memory allocated for housekeeping activities like vacuuming (getting rid of dead records), as well as for the few operations requiring more memory than a normal sort bounded by work_mem (examples: VACUUM, CREATE INDEX, ALTER TABLE ADD FOREIGN KEY).  There shouldn't be many such operations running concurrently, so this value can be set to about 5% of total server RAM (roughly 50 MB of maintenance_work_mem per GB of server RAM).
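Continuing the hypothetical 16 GB example above, this rule of thumb gives:
maintenance_work_mem = 800MB        # ~5% of 16 GB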

default_statistics_target:
This controls how many statistics are gathered about each table (by the ANALYZE statement) so that the query optimizer produces better execution plans.  Increasing this value only makes sense if it improves some long-running queries, since it also increases the background overhead of database maintenance.  The default value is now 100; before increasing it globally, it is worthwhile to fine-tune the specific tables/columns where we know the query plan could be improved (e.g. indexes used with the LIKE operator) with the ALTER TABLE ... SET STATISTICS statement.
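For example, to raise the statistics target on a single column (the table and column names here are hypothetical):
ALTER TABLE orders ALTER COLUMN customer_name SET STATISTICS 500;
ANALYZE orders;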

max_connections:
This must be estimated generously, as it is a hard limit: new client connections will simply be refused once it is reached (the default on a new installation is 100).  Each connection takes a small amount of shared memory, so the memory footprint of this setting is quite small compared to shared_buffers and work_mem.  However, setting it too high limits the value we can afford for the per-client work_mem setting, so the two must be balanced.

Per-client Connection Settings

work_mem :
This controls the maximum amount of memory allocated per operation (sorting, hash joins, and others) within a client connection.  When PostgreSQL estimates that the memory footprint of such an operation exceeds work_mem, it will operate on disk instead of in memory!  To avoid this, set a large enough value, but be careful: this setting applies per sort operation, not per client connection, and a single query may run several of them.  If you have many users connecting but fairly simple queries, you want this to be relatively low.  If you do lots of intensive processing, like building a data warehouse with few users, you want this to be high (see http://www.depesz.com/2011/07/03/understanding-postgresql-conf-work_mem/).
This non-shared setting can always be adjusted by the client after connecting, but a rule of thumb is to take the RAM left free after the shared_buffers allocation, divide it by max_connections, and take a fraction of that figure (e.g. 1/3, to remain conservative, since it can always be increased on specific client connections where we know more memory is needed):
work_mem = 1/3 * (free memory - shared_buffers) / max_connections
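Applied to the hypothetical 16 GB server used above (shared_buffers = 4GB, max_connections = 100), this gives approximately:
work_mem = (16 GB - 4 GB) / 100 / 3  ≈  40 MB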

effective_cache_size:
This is an estimate of how much memory you expect to be available in the OS cache and PostgreSQL buffer cache combined.  It has no effect on actual allocation; it is used only by the PostgreSQL query planner to figure out whether the plans under consideration would fit in RAM or not.  If it is set too low, indexes may be under-used.  On a dedicated PostgreSQL server, setting this to 50% of the on-board memory is a good start, assuming shared_buffers is set to less than 50% of total memory.  It must be set in accordance with the shared_buffers global setting, as the two together should not be assumed to exceed the total physical memory of the server (too high a value may result in both the database and OS caches being churned by the large number of blocks read to satisfy some queries).
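Again purely as an illustration for the hypothetical 16 GB dedicated server:
effective_cache_size = 8GB        # ~50% of on-board memory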

File Systems:

Tuning:

* Read-ahead 
Linux may read ahead disk blocks of various sizes.  The default read-ahead (256 sectors of 512 bytes, i.e. 128 kB) is too low for most DB use-cases.  Increasing this value will benefit all sequential reads done by PostgreSQL, so it should be set to 4096 sectors (2 MB) or more:
# blockdev --getra /dev/sda
256
# blockdev --setra 4096 /dev/sda  
(to be set for each disk device, and added to a boot script such as rc.local so it persists across reboots)

* File access time 
Disable the "last access time" (atime) bookkeeping overhead by adding noatime to the volume mount options in /etc/fstab:
/dev/sda1 / ext3 noatime,errors=remount-ro 0 1

* Swapping
For more predictable DB performance, the OS should not be allowed to swap inactive pages out to disk when running low on memory (Linux already does its own filesystem caching).  This is done permanently by adding:
vm.swappiness=0 to /etc/sysctl.conf
which forces the OS to shrink the filesystem cache rather than swap.

* Memory over-commitment
Disable the Linux behaviour of over-committing memory to processes.  This is done by adding:
vm.overcommit_memory=2 to /etc/sysctl.conf
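Both sysctl settings above can be applied immediately, without a reboot, by reloading the file:
# sysctl -p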

* Write-back cache:
Write caching happens at three levels: 1) the OS filesystem cache, 2) the disk controller (or RAID card) and 3) the disk drive itself.  To ensure the durability of transactions, we must:
A. Force the OS to use the fsync call mechanism to write data securely
B. Use a BBWC (battery-backed write cache) on the controller and monitor its health state, so we can take advantage of a fast, non-volatile write cache!
C. Disable the write cache at the disk drive level (which is always volatile)
Disabling the write cache at the disk drive:
sudo hdparm -W 0 /dev/sda

* Write cache sizing:
We could change both the dirty_background_ratio and dirty_ratio kernel parameters (which drive how aggressively the OS cache is written out to disk); this seems to matter mostly when the filesystem is ext3.
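As an illustration only (commonly suggested starting values, to be validated on the target system), in /etc/sysctl.conf:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10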

Tablespace definition:
We should define different tablespaces for these components:
    • 1- System catalog data  
    • 2- Application data
    • 3- Index data
Application table data is stored in files (of max 1 GB each) inside a tablespace, i.e. a directory; the goal is to separate data from indexes onto different drives, since the two are frequently accessed concurrently.  Furthermore, very large tables (many GBs) could leverage table partitioning, which stores the various chunks of a table in separate tablespaces.  Use temp_tablespaces for the temporary storage needed by some query evaluations (e.g. when sorting is required); this tablespace should be designed for space efficiency rather than speed, and it is subject to disk fragmentation.
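A minimal sketch of such a layout (the paths, table and index names are hypothetical; the directories must already exist and be owned by the postgres system user):
CREATE TABLESPACE app_data  LOCATION '/pgdata-app_tblspc/data';
CREATE TABLESPACE app_index LOCATION '/pgdata-app_tblspc/index';
CREATE TABLESPACE pgtemp    LOCATION '/pgdata-temp';
CREATE TABLE orders (id bigint PRIMARY KEY, order_date date) TABLESPACE app_data;
CREATE INDEX orders_order_date_idx ON orders (order_date) TABLESPACE app_index;
-- and in postgresql.conf:  temp_tablespaces = 'pgtemp'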

WAL (write-ahead logging):
We should add archiving of the WAL redo log files, unless we run in failover mode.
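A minimal sketch of WAL archiving in postgresql.conf, using the /pgarchiveDB location proposed below as the archive destination (one common form of archive_command; adapt to the target environment):
wal_level = archive
archive_mode = on
archive_command = 'test ! -f /pgarchiveDB/%f && cp %p /pgarchiveDB/%f'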

Another proposal is to define finer-grained tablespace file systems (partitions) for more flexibility:
/pgarchiveDB : Archive location containing the archived log files.
/pgbackupDB : Backup location containing the physical and logical backups. For logical backups (pg_dump), use EXPORTS as a sub-directory. For physical backups, use the FULL{date} convention for the sub-directory. However, physical backups could be handled with file system snapshots. More on this later.
/pgcluster : PostgreSQL cluster data directory (PGDATA environment variable). This will contain all of the configuration files and directories of a PostgreSQL cluster. Note: the on-line WAL files would be located in /pgcluster/pg_xlog.
/pglog : The location of the server log files.
/pgdata-system : The location of a database's default tablespace, to be used at the creation of the database. The database catalog information will be stored here.
/pgdata-temp : The location of the database's default temporary tablespace, needed for temporary sort information. Note: the pgsql_tmp directory within the default tablespace will be used if a temporary tablespace is not defined.
/pgdata-app_tblspc : Application tablespaces. For every schema there should be a minimum of two tablespaces: one for tables and one for indexes.

Minimal Disk Recommendations


There is no general one-size-fits-all solution when it comes to disk layout, as it is very much application specific.  The general recommendation is that the more disks we have, the better we can spread and parallelize the workload across separate spindles, hence increasing throughput.

The current hardware has only two disks (in RAID-1) and cannot be considered a viable long-term solution.  So, presuming the PostgreSQL service will host many concurrent and highly demanding applications in the future, here are the minimal disk requirements we should keep in mind for future improvement.

Disk Layout
1)  Have the OS on a separate disk (e.g. 2 disks in RAID-1)
2)  Have the WAL transaction log on a separate disk (optimally one tuned for sequential writes)
3)  Have the temp tablespace (used for large user sorts) on a separate disk (optimally one tuned for random access; here an SSD could be a viable solution)
4)  Have all database data on as many disks as possible, in RAID 1+0
5)  Have all index-related data on a separate disk set, in RAID 1+0

The principles behind these recommendations are:
  1. Avoid putting the WAL on the same drive as the OS, since it has a completely different access pattern (WAL writes are purely sequential)
  2. If there will be large sorting operations, the temporary files (temp_tablespaces) should not be stored alongside the DB files (tables + indexes)
  3. Application data is accessed concurrently with index data, so separating the two will boost performance in all cases.


Configuration Scenario
Now, how to configure all these drives is the more challenging part.

For example, for a 14-disk scenario, we could have the following configuration:
1) Scenario 1:
Location           #Disks   RAID level   Purpose
/ (root)           2        1            OS
$PGDATA/data       6        1+0          Database data only
$PGDATA/index      2        1+0          Database index only
$PGDATA/pg_xlog    2        1            WAL
Tablespace-temp    2        none         Temporary files

But we could just as well have the simple configuration:
2) Scenario 2:
Location           #Disks   RAID level   Purpose
/ (root)           14       1+0          OS, DB, WAL

Which scenario will perform better is impossible to say without benchmarking the target DB with its real data access pattern.
We can only predict that scenario 2 will perform better on workloads with more random access, while scenario 1 will perform better with more sequential access.
However, the exact mix of sequential versus random access is simply not predictable, so optimizing the disk layout in advance is equivalent to guessing!

Martin

Sunday, June 01, 2014

Understanding Logistic Regression


This post describes logistic regression (LR) from a Machine Learning (ML) point of view.  It is self-contained and requires very little prior knowledge.

First, contrary to what its name suggests, LR performs classification, not regression.  It is used to predict the probability of a two-class (binary) outcome variable based on a number of input features.  It can also be extended to cover the more general multi-class problem.


The problem

In ML, we are interested in estimating (learning) an unknown target function t(x) generating an output y (i.e. y = t(x)), given a number of input variables (or features) represented by a vector x of size d.   The premise is that t is normally so complex and noisy that there is no hope of finding its true analytical form.  Instead we resort to finding a good estimate of it (h(x), often called the hypothesis) that should perform well both in-sample (training) and, more importantly, out-of-sample (prediction)!

The target t(x) is not a valid function in the mathematical sense: the same input will likely produce different outcomes.  Consider any real-life binary outcome we could be interested in predicting: ’approve credit’, ‘team wins’, ‘approve access’, ‘failure within some period’, etc.   Most outcomes in life are not deterministic, and we all accept some level of uncertainty or randomness (a stochastic process).  This calls for a probabilistic approach, which gives us the framework for dealing with it.

Our estimated function h should give us some confidence that the outcome will be positive or negative (y ∈ {0,1}), given a new input x and parametrized by the parameter vector θ.  This is captured by the probability equations below, which follow from the fact that the two outcomes cover all possibilities:

→  p(y = 1 | x; θ)  =  hθ(x)        ∈ [0,1]
→  p(y = 0 | x; θ)  =  1 − hθ(x)                (eq. 1)

In ML, we’re aiming to find hθ from a hypothesis set H that defines the functional forms considered, e.g. a set of simple linear functions, sets of more complex polynomials (2nd, 3rd, 4th degree, each being a different set), sets of complex non-linear forms, or even less tractable forms (e.g. neural networks, support vector machines).

It would be nice if we could use a simple form like hθ(x) = θᵀx, and leverage the derivations done for the regression problem.  This form implies adding an extra feature xi0 equal to 1 for all i, so that the intercept parameter θ0 can be treated like the others during optimization.

But no such polynomial of continuous input variables x produces an output confined to [0,1], which justifies using an alternative representation.  This is where a nonlinear function like the logistic function (sigmoid) comes to the rescue: its output lies in [0,1] for any input in (−∞,+∞).


Sigmoid function:

If we compose a sigmoid function s(u) = 1 / (1 + exp(−u)) with the linear combination u = θᵀx as its input, then our hypothesis produces the needed output values in [0,1]:

hθ(x) = s(θᵀx) = 1 / (1 + exp(−θᵀx))        (eq. 2)

Compacting both y classes into a single expression leads to a convenient way to express the probability at a point (x, y):

p(y | x; θ)  =  [hθ(x)]^y · [1 − hθ(x)]^(1−y),  with hθ(x) = 1 / (1 + exp(−θᵀx))             (eq. 3)

This function returns the probability of each class (y=1,0) given the input vector x (new observation) and parametrized by the unknown parameters θ.



The interpretation

Using the sigmoid function, it turns out we can interpret its output as a genuine probability.  The problem is equivalent to “placing” the sigmoid so that it best separates the input points into the two classes.  In 3-D (two features x1, x2 plus the probability axis), we can imagine a boundary line whose signed distance to any input point (x1, x2) is given by θᵀx (the dot product between the point and the line equation); the estimated probability of the point being class 1 (y = 1) is the height of the sigmoid at that distance, s(θᵀx).  The boundary line is the projection of the centre of the sigmoid (where u = 0 and s(u) = ½), as illustrated in the next figure (ref. http://www.machinelearning.ru/):
[Image: ModelLogit.png]
Figure 1. Sigmoid or logistic function.  The boundary line can be imagined by projecting the p = ½ probability line (in yellow) onto the 2-D plane (x1, x2).

Furthermore, we can easily see how points located far from this boundary have very large |θᵀx| values, resulting in probabilities close to 1 or 0 (θᵀx → +∞ ⇒ h ≈ 1, θᵀx → −∞ ⇒ h ≈ 0), while points near the boundary have a very small distance, leading to no clear class (θᵀx → 0 ⇒ h ≈ ½).

The optimal boundary (a hyperplane in higher dimensions) will be the one where the maximum number of points is correctly classified AND the points are, on average, located farthest from it.

But how do we find this location, in other words the best estimates for the parameters θ?   In regression problems, the exact solution is found analytically by minimizing the SSE error function (sum of squared errors).  A unique solution for θ must exist, since a quadratic form is guaranteed to have a single global minimum (it is convex).  It turns out that the solution found, called the normal equation, also corresponds to the maximum likelihood estimate θMLE when adopting a probabilistic approach.


However, our issue here is that the sigmoid is clearly nonlinear, and so is its associated SSE, potentially resulting in many local minima, which is typical of non-convex optimization problems.  We must abandon the hope of finding a closed-form solution and rather use an iterative numerical method.

But first, let’s find a better-suited function to optimize.  SSE assumes Euclidean distances, but the distance between yi and hi does not make much sense with an exponential function like the sigmoid, which is clearly not linear.

To find an alternative optimization function we turn to a probabilistic approach, which gives us the best estimate of θ, the MLE (Maximum Likelihood Estimate).    The ML idea poses this question: how likely are we to observe the dataset ⅅ given the function parameters θ, expressed as p(ⅅ | θ) or simply L(θ).     This reasoning is a bit counter-intuitive, especially considering that our dataset is the only one we have and the only thing we can observe (but this is part of a broader debate between frequentists and Bayesians, and may involve overfitting issues when the number of observations is limited).

The ML’s goal is to maximize L(θ) in order to find the best-suited parameters θ.   To calculate L(θ), we assume that the observation points are i.i.d. (independent and identically distributed) random variables, which is a reasonable assumption in most scenarios.  Note that no restriction is imposed on the probability distribution of the variables x!

Recalling the probability of a single observation given in (eq. 3), the likelihood function of ⅅ is simply the product of the probabilities of all points (assumed independent), leading to:

L(θ)  =  ∏i=1..n  p(yi | xi; θ)

Taking the log-likelihood converts the product over all points into a summation and simplifies the rest of the derivation:

l(θ)  =  log( L(θ) )  =  Σi=1..n  log( p(yi | xi; θ) )

Now, expressed in terms of the function we wish to learn, hθ(x):

l(θ)  =  Σi=1..n  [ yi log(hθ(xi)) + (1 − yi) log(1 − hθ(xi)) ]                   (eq. 4)


Maximizing the log-likelihood l(θ), or equivalently minimizing its negative, will give us the ML estimate of θ (MLE).  The negative form, divided by the number of observations (n), plays a role similar to SSE for regression, and is therefore interpreted as the error or loss function Ɛ of logistic regression, sometimes called the cross-entropy loss:

Ɛ(θ)  =  −(1/n) Σi=1..n  [ yi log(hθ(xi)) + (1 − yi) log(1 − hθ(xi)) ]          (eq. 5)


Let’s continue by minimizing the loss Ɛ(θ) in order to derive the update rule used by iterative numerical methods like gradient descent (or ascent for finding a maximum).  Note that Ɛ(θ) is just the negative log-likelihood −l(θ) scaled by the constant 1/n, so minimizing one is the same as minimizing the other; we work with −l(θ) below, and the constant factor can be absorbed into the learning rate later.

Taking the partial derivatives of the negative of (4) leads to,

∂(−l)/∂θ0  =  ∂/∂θ0  Σi=1..n  [ −yi log( 1/(1 + exp(−θᵀxi)) )  −  (1 − yi) log( 1 − 1/(1 + exp(−θᵀxi)) ) ]

∂(−l)/∂θ1  =  ∂/∂θ1  Σi=1..n  [ −yi log( 1/(1 + exp(−θᵀxi)) )  −  (1 − yi) log( 1 − 1/(1 + exp(−θᵀxi)) ) ]

∂(−l)/∂θ2  =  …   and the equivalent partial derivatives for θ2, θ3, ..., θd (d+1 equations in total).

Now, to simplify things a bit, we’ll work with a single observation (x, y) and ignore the summation, because the derivative of a sum is the sum of the derivatives.  This idea actually becomes useful when we want to improve gradient descent performance (see stochastic gradient descent below), but for now it just simplifies the derivation.  Re-using the composed function s(u) with u = θᵀx, we exploit the chain rule in the following way:

∂(−l)/∂θ0  =  d/ds{ −y log(s(u)) } · ds(u)/du · ∂u/∂θ0  +  d/ds{ −(1 − y) log(1 − s(u)) } · ds(u)/du · ∂u/∂θ0
 …

Recalling the derivative of the sigmoid, ds(u)/du = s(u)·(1 − s(u)), and obviously ∂u/∂θ0 = ∂(θᵀx)/∂θ0 = x0,

∂(−l)/∂θ0  =  ( −y / s(u) ) · { s(u)·(1 − s(u)) } · x0  +  ( (1 − y) / (1 − s(u)) ) · { s(u)·(1 − s(u)) } · x0
...

Eliminating some terms leads simply to,

∂(−l)/∂θ0  =  { −y·(1 − s(u)) + (1 − y)·s(u) } · x0  =  { s(u) − y } · x0
...

Let’s replace s(u) by our hypothesis hθ(x) to express everything in terms of x, and put back the summation over all points.   After some elimination, the partial derivatives with respect to each θj turn out to be very simple:

∂(−l)/∂θ0  =  Σi=1..n  { hθ(xi) − yi } · xi0

∂(−l)/∂θ1  =  Σi=1..n  { hθ(xi) − yi } · xi1                              (eqs. 6)
 …

where xi0 is the i-th observation of the variable x0, and similarly for xi1, xi2, ..., xid (the d-dimensional input feature space, plus the constant x0 = 1).

Setting these equations to 0 and solving for the parameters θ would give us the best estimate θMLE.  However, as said before, there is no closed-form solution for a non-linear form like hθ(x).

It is useful to look at the equation involving xi0 = 1 (the constant feature): setting it to zero gives Σi=1..n hθ(xi) = Σi=1..n yi, i.e. the sum of predicted probabilities equals the number of points with yi = 1.  In other words, the expected value of our predictor, E[hθ(x)], equals (1/n) Σi=1..n yi.  This intuitive result shows that, at the maximum likelihood estimate, the model’s predictions are unbiased on average.

Now let’s define the simplest form of gradient descent (so-called batch gradient descent) used to solve for all d+1 parameters θ, in simple pseudo-code:

Set initial parameters θ = {0, 0, ..., 0}

Iterate until convergence (or for a fixed number of iterations):
    # move each parameter against its gradient ∂/∂θj, scaled by the learning rate α
    θ0(new) = θ0(old) − α Σi=1..n { hθ(xi) − yi } × xi0
    θ1(new) = θ1(old) − α Σi=1..n { hθ(xi) − yi } × xi1
    …. for all θj(new)

By starting from an initial parameter value (any point) and repeatedly moving against the gradient (the direction of steepest descent), we keep re-assigning the θj values to a new location until we converge.

We refer to it as batch because each iteration goes over the full summation over all points.  It is also advisable to compute the loss value along the way, to monitor the speed of convergence and adjust the learning rate α accordingly.  This rate should neither be too small (very slow convergence) nor too big (the loss may not decrease on every step or, worse, may never converge because the updates overshoot).  A simple heuristic is to start with a small value (e.g. 0.001) and increase it by a factor of about 3 each time (i.e. on a log scale: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, etc.) until convergence behaves optimally.
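To make this concrete, here is a minimal runnable sketch of batch gradient descent for logistic regression in Python/numpy.  It is only an illustration of the pseudo-code above: the function and variable names (sigmoid, batch_gradient_descent, alpha, n_iter) are mine, X is assumed to be an n×(d+1) matrix whose first column is the constant feature x0 = 1, and y is a vector of 0/1 labels.

import numpy as np

def sigmoid(u):
    # s(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def batch_gradient_descent(X, y, alpha=0.1, n_iter=1000):
    theta = np.zeros(X.shape[1])                 # start at theta = {0, 0, ..., 0}
    for _ in range(n_iter):                      # or stop earlier once the loss stabilizes
        h = sigmoid(np.dot(X, theta))            # h_theta(x_i) for every observation
        gradient = np.dot(X.T, h - y)            # eqs. 6: sum_i { h_theta(x_i) - y_i } * x_ij
        theta = theta - alpha * gradient / len(y)   # descend; the constant 1/n is folded in here
    return theta

Tracking the loss of equation (5) every few iterations is an easy way to tune alpha as described above.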


Algorithm optimization

A similar result could be obtained by applying gradient descent to the set of gradient equations derived from the error function (equation 5) instead of from the MLE.   It is worthwhile to note that the same equations would be found for the gradient of the regression problem; only the form of the hypothesis would differ.   In fact, this is true for every distribution of the exponential family, as proved by the theory of Generalized Linear Models (GLM).

The batch algorithm is the brute-force approach, and there are alternatives, such as stochastic (or mini-batch) gradient descent, which updates the parameters from one observation (or a small subset) at a time instead of the full summation.

But these are just optimization tricks (part of the branch of mathematics called numerical methods) and do not change the nature of the problem.

Martin