Showing posts with label Analytics. Show all posts
Showing posts with label Analytics. Show all posts

Friday, July 01, 2016

Redshift Cloud-based data warehousing solution


Architecture


Redshift is implemented using the ParAccel Analytical Database or PADB (ref). From ParAccel definition:

PADB is architected as a columnar, compressed and massively parallel relational database management system that is capable of all-in-memory, if desired.

The main component is the Leader Node (interface to the outside world) whose responsibilities are to manage communication between nodes, to control session, to parse and optimize queries and to schedule execution of workload (excluding participation to any data operations).

The second component are the Compute Nodes responsible in processing, storing data and sharing workload evenly among themselves. These nodes receive query compiled code from Leader node (based on the execution plan) and send back result after execution is finished.

Each node work with its own disks against its own memory and CPU, but may depend on each other when intermediate result sets are transferred during join/aggregations operation. To increase parallelization (through multi-core capability) these nodes are furthermore partitioned into slice, with each having a dedicated portion of memory and disk space, where queries and database processing are finally executed.

The other component is the Communication Fabric using custom Gigabit Ethernet and allowing for communication between Compute and Leader to be high-bandwidth, private, close proximity and very high speed. The Compute Nodes are located on an isolated network not accessible to client applications.

SAN backend is not mandatory but useful for backup, snapshot and restore when in High Availability (HA) mode. In fact, Redshift is actually optimized to work with fast direct-attached storage (DAS) as typical or Cloud-based cluster based on commodity hardware.

Source: Amazon Redshift Database Developer Guide

Key features


Redshift main selling points are

  1. Shared-nothing MPP high performance DB not requiring to build redundant data access structure. No need to create and maintain redundant structures like indexes, materialized view, physical aggregations and summary tables, ..;
  2. Suited to a broad range of schema model. No need to denormalize or build dimensional data model for improved performance;
  3. Columnar storage offering high compression ratio and efficient utilization of I/O. Only data block needed are read, as opposed to row-oriented storage DB (BI application's queries usually involve a fraction of available attributes and thus penalized row-oriented DB where all attributes are read for each block I/O).
  4. Leverage low-cost commodity computer with DAS storage (either mechanical HDD or onboard flash SSD) common to all Cloud-based environments

Schema model


The Redshift/PADB is supposedly schema agnostic, in the sense that you don't need to favor one modeling technique over another. Let me comment on that a bit.

Data warehouse implementation using RDBMS uses a variety of techniques built solely to improve performance. We add physical-based structure such as indexes and Materialized Views, also schema-based structure like denormalized Dimensional models, and potential external one using complete different physical schema (like MOLAP). All these add up both in complexity and storage requirement. Redshift moves away from this trend by only storing the needed data (typically modeled in 3rd NF) without redundancy.

The most performance gain from Redshift is coming mainly from 1) columnar orientation and 2) MPP features. First point discourages the use of highly normalized models. These models tend to have narrow tables (with few attributes) and consequently will not gain much I/O performance (for each block read) compared to row-oriented databases. Furthermore the MPP features induces locality sensitivity during join and aggregation operations. Joining rows locally is best, but performance will degrade quickly once these rows get distributed across different nodes. This again discourages highly normalized model, especially for relationships/association modeled as m-to-n structures where rows are distributed using only one foreign key.

Personal recommendation: I would favor some level of de-normalization of the data model (especially if your model was designed for integration purposes) but without going to the extreme of defining only a few wide denormalized dimensions.

Redshift Construct


There are numerous constructs in Redshift not commonly found in typical relational DB. These need to be considered during physical data modeling and are determinant for performance aspects.

Data Distribution style

As with any cluster-based MPP database, Redshift scalability is obtained by distributing data among cluster nodes. The goal is to distribute workload uniformly among all nodes and to have data collocated (data from different tables participating in joins and aggregate stored on same computer node).

Rows can be distributed in different ways:

1. EVEN Distribution: rows are distributed evenly in a round-robin fashion.
2. KEY Distribution: rows are distributed according to a specified key (known as sharding)
3. ALL Distribution: rows are simply replicated on all nodes

During query execution, the goal is to minimize the (costly) redistribution of rows needed to perform joins and aggregation. So KEY distribution for two larges tables frequently joined together should be defined with their FK/PK pairs. Smaller and rarely updated table will benefit greatly from ALL distribution without much impact on storage space. Volatility is critical as updating tables to all computer nodes can be costly. EVEN distribution is the default style used when no clear a-priori information can dictate which distribution to use.

Column Compression

The columnar storage format implies that data belonging to same column are stored sequentially on disk. This offers good compression potential (column-data has same format and often similar in content) and consequently, better read performance as more data (RS store data in 1Mb disk block) is read per I/O.

The appropriate compression encoding must be chosen for each data column. Besides the basic raw encoding that simply stores data in raw format (no compression), one can choose between:
  1. Byte-dictionary encoding (BYTEDICT)
    • Use a BYTE dictionary where each value is mapped to (up to 256) unique value
    • Applicable to all types except BOOLEAN
    • Ideal for long characters type having less than 256 values (per block)
  2. Delta encoding (DELTA, DELTA32)
    • Store only the delta difference from preceding row
    • Applicable to types with natural ordering (INT, DATE, TIMESTAMP,..)
    • Ideal for datetime or sequential number columns as delta between rows can be represented more compactly
  3. LZO encoding (LZO)
    • Compress data using LZO scheme frequently used with text
    • Applicable to all character types (CHAR, VARCHAR,..)
    • Ideal for long free form text
  4. Mostly encoding (MOSTLY8, MOSTLY16, MOSTLY32)
    • Compress "small size" values while leaving larger one in raw format
    • Applicable to number types (INT, BIGINT, DECIMAL,..)
    • Ideal when data type used is larger than most values stored (only a small fraction needs the extra space)
  5. Run-length encoding (RUNLENGTH)
    • Store values as token plus a count giving the number of consecutive repetition
    • Applicable to all types
    • Ideal when data is often repeated in sequential rows
  6. Text encoding (TEXT255, TEXT32K)
    • Store a separate dictionary of unique words (up to 245 unique values) and store other words uncompressed
    • Applicable to VARCHAR types
    • Ideal when data contains the same words frequently

Sort key

Redshift can define one or more columns per table as sort keys. Rows will be sorted during initial load accordingly. This is especially useful during read where data block can be skipped entirely to minimize I/O, typical with range-restricted predicate (where population between 1000 and 2000). This is possible for every column as Redshift keep track of minimum/maximum values for each data block.

Sort keys can be either 1) Compound or 2) Interleaved. The first is useful when we use the same prefix keys as predicate (or a subset of), i.e. if index: K1, K2, K3 then where K1 = .. and K2 = .. is ok, but not where K2 = ... The second one is used when any combination of keys are found as predicate, without worrying about which one is used.

You define sort keys based on the query that frequently same columns as equality/range predicate. For example, if recent rows are more frequently accessed, then you'd sort on the Timestamp column. If some tables are very frequently joined together, you can even define their Sort key to be also their Distribution key and let the query optimizer use sort merge join (very fast as sorting is skipped) instead of slower hash join.

Best Practices

From the Redshift developer guide, there are a few important practices to keep in mind:
  • Make sure you define a Distribution style for each table, unless the default one EVEN is satisfactory
  • Make sure you define a good Sort key
  • Let COPY command decide on the best Compression encodings
  • Define data relationship integrity constraint (will not be enforced but still help query optimizer)
  • Use smallest format size for data type (temporary table created during long queries uses uncompress data with format defined in DDL)

Martin

Note: I’m now using Markdown for all my notes, so I can convert any part of them into html using a conversion tool such as text-to-HTML tool. Writing blog post this way is such a big time saver!

Friday, June 05, 2015

Spark data processing/analytics platform


At work we are looking to leverage a large scale data processing/analytics platform called Spark.   Before doing some hands-on work, I will always do some research to help me get started and have better insights and context on a large scale platform.   So this post summarises these notes on one of the hottest techno on Big-Data which officially past the-peak-of-inflated-expectation!  

Spark is a Data analytic/processing Platform optimised for iterative algorithm and interactive data mining and batched analytics.    

Spark provides a unified platform leveraging MapReduce (MR) parallel data processing capabilities while avoiding some limitations associated to the ancestor open source platform Hadoop.   Spark came out of functional programming model like MR, however it is also a broader generalisation of that model as more than just MR-type tasks are supported.

Underneath Spark's platform architecture lies what is called Spark execution engine.  On top of that you have extra modules like Shark for SQL capability, Spark streaming, MLib for machine learning, GraphX for Graph-based computation, and language-specific libraries (3 supported: scala, java, and python).   See here for an example of a platform integration stack of one vendor.

Also a key component is its Data storage abtraction model called Resilient Distributed Data (RDD) which was designed for coarsed-grained data transformations operating over the majority of data elements.  Consequently it is not suited to operate asynchronously over fine-grained share data, typical of RDBMS.  



Core Primitives

RDD: 

RDD can be qualified as the cornerstone of Spark data management.   In essence, these are data abstraction for immutable and fault-tolerant collections of elements built explicitly for parallelism done on a cluster.   Immutability provides support to have a more efficient fault-tolerant system critical in distributed environment.

Data is processed by chaining succession of RDD's transformation each performing bulk writes on a new RDD, each of which can be easily recovered by reapplying the same transformation definition in case of node failure or straggle (slow node).   These lead to an important property called lineage obtained directly from the resulting directed acyclic graph (DAG) of all transformation definition.  

From this DAG lineage property, it follows that any RDD could be easily re-created after a node failure by regenerating source RDDs.

Simple DAG showing RDD lineage


RDD persistence is controllable, so it is up to the user to decide which steps should be cached when we may need future re-use and/or better performance.

Cached RDD are stored as Java objects for fast access within JVM, but can also be stored on disk in serialised form (could have a smaller footprint than the Java object) in case where memory in nodes becomes scarce.

Key-Value RDD or simply pair RDD, are specialized RDD to support KV pairs, where each element of the RDD is a pair tuple (k,v), hence referred to as "pair RDD".


RDD Persistence

Every RDD can further be persisted in different way, offering a sort of data hook for re-use in other context.   These are very convenient when doing ad-hoc data analysis or machine learning where same state of a dataset can be cached and re-used in various ways.   This contrasts with the more batch-oriented synchronous operations typical of Hadoop.

Storage is characterised by:
  1. The Storage levels (memory-only, memory and disk)
  2. Explicit API calls (persist(), cache())
When deciding to persist or cache, one must consider the gain obtained compared to a full re-computation and the total memory size needed vs available in nodes.


So in a nutshell, RDD has these properties :

  • immutable once constructed
  • track lineage information to be recomputed in case of lost data or node failure
  • enable operations on collection of elements in parallel

and can be instantiated:

  • by paralellizing existing collection (ex. after generating a list in Python, Spark will split it into many partitions and distribute these among the nodes) 
  • by transforming any existing RDDs (sourcing)
  • from input files (in various storage system and format) 


and also user controls behaviour through two properties:

  1. its distribution property (number of nodes to be partitioned into) 
  2. its persistence (whether or not data should be cache for faster downstream consumption). 



Operation:  


The operations defined on RDD are either Transformation or Action.

Transformation consist of some operations (map, filter, distinct, union, grouping) happening on a given RDD and leading up to the generation of a new RDD node (from-to node).   All transformations are lazy operations that get executed only when necessary (i.e. when an action is triggered or a reduce operation is necessary).

Action consist of applying operations over all elements of one RDD (associative function like reduce or others like count, first, take sample, forEeach, ...) and producing some value output.

conceptual role of Transformations and Actions 




Data source 

Spark can interact with various input format frequently seen in high-end analytics platform:
  1. Local files (file://...), directory or SequenceFile
  2. Hadoop Distributed FileSystem (hdfs://...)
  3. HBase 
  4. S3

Closure:

These are the function literal defined in Transformation or Action.  These closures are represented as Java object which are pushed to all Workers by Spark (after serialization/deserialization).  There will be one Closure per Worker and these are sent one way: driver --> worker.

Same method is used to push any global variable as well, although this is not ideal to share information/state among Workers and back to driver.  To share state, there are better approach like special variables called Accumulators and BroadcastVariable.

Shared Variables:

Another abstraction is the concept of Shared Variable that are used for parallel operations.  These are used to share state information (through variable) between nodes themselves and/or between nodes and the driver.

Broadcast variable concept
These variables are designed to help optimise sharing global resource throughout all nodes.   They are read-only variables cached at each node avoiding the need to ship copies during tasks.   Spark will efficiently distribute these as needed with the objective to reduce global communication cost.

This can be used, for example, when we want to score a predictive model applied in parallel to a very large distributed dataset.  This model should then be broadcast to all involved nodes prior to scoring.

Accumulators
These can only be added through an associative operations (to define ..).   They are only readable by the driver program although workers are changing it.  This is supported out-of-the-box for either numeric type or standard mutable collections.   To support other types one need to develop an extension.

These are useful when we need global counters and sums done in parallel.


Spark Program lifecycle:


Spark's code normally follows a series of steps: the data source definition (input RDD), some transformations applied on each RDD's and finally the Actions used to retrieve the needed information:

  1. Get a SparkContext which is the entry point to all functionalities (this is implicit in interactive Shell env with the "sc" variable)
  2. Creation of source RDDs 
    • generating a dataset manually (sc.parallelize([collections...]))
    • sourcing from file (file://..), or HDFS (hdfs://...)
    • sourcing from Hadoop Input (sc.hadoopFile()) 
  3.  Creation of a series of Transformation happening sequentially from source
    • map(), flatMap(), filter(), reduceByKey(), groupByKey(), join()
  4.  Creation of Actions to return some values to the Driver
    • collect(), take(), count(), reduce(), saveAsText()

All these steps are lazily defined and only Action will trigger any processing.   Spark will go over these steps for each new job definition :
  1. Creation of the DAG with all RDD's lineage definition 
    • a direct mapping exist between the DAG and the user code
  2. Creation of a logical Execution plan optimised for the DAG 
    • Spark will aggressively pipeline operations, when these can be generated independently of other steps (ex. a read and map task can be fused together) 
    • Spark will also split tasks into stages to better exploit pipelining and caching
  3. Schedule and Execution of individual tasks according to this logical Execution plan (see next, runtime engine model).

Runtime Engine model

Spark applications involve the execution of independent sets of processes over a cluster.  This requires quite a bit of coordination which is handled by what s called the SparkContext (SC) object.  This object is running inside the Driver program, the main entry point where any Spark application is launched.  The SC can connect to many types of cluster manager (ex. Spark own standalone, Meso, YARN, ..) which handle resource allocation over the cluster.   

After connecting, Spark will acquire Executor on each Working node.   These are run in a single JVM instance located on cluster node and spawned to serve Spark application.  Executor's role is to run computation and to store data done via Spark Tasks  (smallest unit of work) executing in different thread inside the JVM.   Executor remain active as long as the Spark application is running, even when no jobs are running.   This feature allows Spark to start up Task very quickly and process in-memory data at speed not possible with Hadoop.    

Once Executor are spawned inside each Working node, Spark will send the application code to these executors.  This application code can be JAR files or yet python files depending on how the Application was coded inside the SC.  Then finally SC can send tasks for each executors to run. 
Source: "http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf"

Important properties of this model : No communication ever happen betweens Workers!


Task scheduling

Job scheduler's role is to run user-defined job as a sequence of stages that takes into account partition having persistent RDD in memory, possibility to pipeline transformations (individual task) with narrow dependency, separating transformation with wide dependencies (require shuffling).   Scheduler also consider data locality and preferred location.   If a task needs to process a partition available on a specific node, the scheduler will send it to that node, or if a task processes a partition for which the containing RDD provides preferred location (typical of HDFS file), again the scheduler will send it to those.

For example, this "simple" job definition:

rdd1.map(splitlines).filter("ERROR")
rdd2.map(splitlines).groupBy(key)
rdd2.join(rdd1, key).take(10)

Will yield the following DAG flow:
Source: databricks (https://databricks.com/blog/2014/03/20/apache-spark-a-delight-for-developers.html)

After examination of this DAG, the Scheduler defines stages to execute in a way to combine or pipeline all narrow transformations.  Here, RDD A-B would be stage-1 and RDD C-D-E stage-2 and the final join would be stage-3.  


In summary, Scheduling has these roles and responsibilities:
  • running all task in graph (the "full" DAG in proper sequence)
  • pipelining functions when possible (ex.  a map() and filter() transformations could be pipelined together and done in one stage, as a way to avoid processing later discarded data elements)
  • reusing cache-aware data and exploiting data locality
  • being partition-aware to reduce shuffle


Main Differentiation points:


The main selling points are actually Hadoop's biggest drawbacks that lead to the creation of many extension and add-on built around Hadoop ecosystem.   Hadoop biggest complain relate to its inefficiency toward iterative computing and interactive querying (due to low latency dispatcher), its Disk-only based approach, and its complex and verbose coding (complex algo only possible through chaining many individual MR jobs).  

By contrast, Spark offers:
  • Easy to develop environment
    • A rich API using Scala, Python or Java 
    • Leverage a real functional language namely Scala which makes it more aligned to manipulate MR (MapReduce) programming model (they figure it would minimise impedance with this programming model not directly accessible with pre-1.8 Java).
    • Scala and Python being interpreted languages, this allows Spark to have its own interactive shell which is so useful in interactive analysis (pyspark and spark_shell). I'd say this time-saver has become the norm now. 
  •  Fast to run platform
    • The general execution graph (DAG) concept is key in providing speed and lazy transformation
    • In-memory (cache) and persistent storage is really allowing the platform to move away from batch-oriented paradigm to more realtime and iterative workflow.

Versatile deployment configuration with many possible installation options.  For example, the Cluster Manager type can be deployed on these scheme :
  1. Standalone local (to run simple demo/prototyping)
  2. Standalone Cluster-based  (ex. running on same nodes as your Hadoop HDFS-based cluster) 
  3. Yarn-based 
  4. Mesos-based  


Martin

Sunday, June 01, 2014

Understanding Logistic Regression


This post describes Logistic regression (LR) from Machine Learning (ML) point of view.  This is self-contained and very little prior knowledge is required.

First, contrary to its name, LR does classification and not a regression.  It is used to predict the probability of a 2-class (or binary) outcome variable, based on a number of input features.   It can also be extended to cover the more general multi-classes problem.


The problem

In ML, we are interested in estimating (learning) an unknown target function t(x) generating an output y (i.e. y = t(x)), given a number of input variables (or features) represented by vector x of size=d.   The premise is that function t is normally quite complex and noisy so that there is no hope in finding its true analytical form. Instead we resort to finding a good function estimate (h(x), as it is often called hypothesis) that should perform well in both in-sample (training) and more importantly out-of-sample (prediction)!   

The target t(x) is not a valid function in the mathematical sense: there will likely be different outcome for the same input.  Consider any real-life binary outcome that we could be interested in predicting:  ’Approve credit’, ‘Team win’, ‘Approve access’,  ‘Failure within some period’,  etc..   Most outcomes in life are not deterministic and we all accept some level of uncertainty or randomness (stochastic process). This calls for a probabilistic approach giving us the framework for dealing with this.  

Our estimate function h should give us some confidence that the outcome will be positive or negative (y {0,1}), given a new input x and parametrized by its parameters (represented by vector θ).  This is represented by the probability equations below, obtained from the fact that two outcomes cover all possibilities:

→  p(y = 1 | x; θ) [0,1]  =  hθ(x)
→  p(y = 0 | x, θ) [0,1]  =  1 - hθ(x)                (eq. 1)

In ML, we’re aiming to find hθ taken from an Hypothesis set H defining function forms, ex: a set of simple linear, a set of more complex polynomial (2nd, 3rd, 4th degree each being a diferent set), a set complex non-linear form, or even non-tractable form (ex. neural network, support vector machine).

It would be nice if we could use a polynomial form such like hθ(x) = θt x, and leverage derivation done with regression problem.  This form implies adding an extra feature xi0 which equals to 1 for all i, so that we can treat the parameter θ0 as the others during optimization.   

But there’s no polynomial functions that produce an output y [0,1] from continuous input variables x, justifying the use of an alternative function representation.   This is where a nonlinear function like the Logistic function (sigmoid form) comes to the rescue.  whose output y [0,1] with any input [-∞,+∞].  


Sigmoid function:

If we compose a sigmoid function s(u), such that it takes as input the polynomial u = θt x, and so produces the needed output s(u) = 1 / (1 + e-u), then our hypothesis will output the needed sigmoid values [0,1] :

h(x) = s(θt x) = 11 + exp(-θt x)        (eq. 2)

compacting both y classes into a single expression, leads to a convenient way to express the probability function at point (x,y):

p(y | x ; θ) =  [11 + exp(-θt x) ] y  .  [1 - 11 + exp(-θt x)] (1-y)             (eq. 3)

This function returns the probability of each class (y=1,0) given the input vector x (new observation) and parametrized by the unknown parameters θ.



The interpretation

Using the sigmoid function, it turns out we can interpret its output as genuine probability.  The problem is equivalent to “placing” the sigmoid in such a way that it will best separate the input vectors points into two classes.   With 3-D geometry, we can imagine a line boundary whose distance with any input points (x1,x2) is calculated by θt x (dot product between point and the line equation) and their estimated probability of being class-1(y=1) is obtained with its vertical distance with the sigmoid (s(θt x)).  Such a line boundary is located on the projected line at the center of the sigmoid form (at u=0 and s(u) = ½ ) as illustrated in next figure (ref. http://www.machinelearning.ru/ ):ModelLogit.png
Figure 1. Sigmoid or Logistic function.  Boundary line can be imagined by projecting the p=½ probability line (yellow color)  into the 2-D plane (x1,x2).  

Furthermore, we can easily see how points located far from this boundary will have very large θt x values, resulting in prob ≈ 1 (and conversely:  θt x → ∞ h ≈ 1, θt x → -∞ h ≈ 0), and points near the boundary will have very small distance leading to no clear class (θt x → 0 h ≈ ½ ).   

The optimal boundary (hyperplane when d>3)  will be one where a maximum number of points are correctly classified AND points are on average, located the farthest from it.

But how do we find this location, in other words what are the best estimate for parameters θ?   In regression problems, the exact solution is found analytically by minimizing the SSE Error function (sum of squared distance error).  A unique solution must exist for θ since a quadratic form is guaranteed to have a unique global minimum (convex form).  It turns out the solution found called normal equation also correspond to the most likelihood estimate MLEθ when adopting a probabilistic approach.


However,  our issue here is that the sigmoid equation is clearly nonlinear and so will be its associated standard SSE, resulting to potentiel many local minima points typival of non-convex optimization problem.   We must abandon the hope to find a closed form solution, and rather use an iterative numerical solution.  

But first let’s discover a more suited function to optimize.  SSE assumes distance are euclidian-based, but the distance between yi and hi do not makes not much sense with an exponential function like sigmoid, clearly not linear.  

To find an alternative optimization function we turn into a probabilistic approach which will give us the best estimate θ or MLE (i.e. Maximum Likelihood estimate).    The ML idea poses this question: how likely are we to observe the dataset ⅅ given the function parameters θ, expressed as p(ⅅ | θ) or simply L(θ).     This reasoning is a bit counterintuitive, especially considering that our dataset is the only one we have and observable  (but this part of a broader debate between frequentist and bayesian and may involve overfitting issues with a limited number of observations).   

The ML’s goal is to maximize L(θ) as to find the best suited parameters θ.   To calculate L(θ), we  assume that each observation point is i.i.d. (independent and identically distributed) random variables x, which is a reasonable condition for most scenario.  Note that no restriction is imposed on probability distribution form of our variables x!   

Remember the probability of a single observation given in (3), then the Likelihood function of ⅅ  is simply the product of probability of all points (each point ), leading to this simplified equation:

L(θ)  = i=1,n  p(yi | xi ; θ)

Taking log likelihood allows to convert a product over all points to a summation and simplify the rest of the derivation:

l(θ) = log( L(θ) ) = i=1,n  log( p(yi | xi ; θ) )  

Now expressing with our function we wish to learn hθ(x):

l(θ) = i=1,n  yi log(hθ(xi) ) + (1-yi) log(1 - hθ(xi))                   (eq. 4)


Maximizing log-likelihood l(θ) or equivalently minimizing its negative form will allow us to find the ML estimate θ (MLE).  The negative form divided by number of observations (n) has similar role as SSE for regression, hence interpreted as the Error or Loss function Ɛ of logistic or sometimes called cross-entropy loss:

Ɛ(θ) = -1   n i=1,n  yi log(hθ(xi) ) + (1-yi) log(1 - hθ(xi))          (eq. 5)


Let’s continue by minimizing the loss Ɛ(θ)  in order to derive the algorithm used by iterative numerical methods like Gradient Descent (or ascent for finding maximum).

Taking partial derivative of (4) leads to,

l(θ)o = l(θ)o  1   ni=1,n -yi  log(11 + exp(-θt xi) )  - (1-yi) log(1 - 11 + exp(-θt xi))  

l(θ)1 = l(θ)1  1   n i=1,n -yi  log(11 + exp(-θt xi) )  - (1-yi) log(1 - 11 + exp(-θt xi))  

l(θ)2 =  …   for and all equivalent partial derivatives for  θ23, .., θd (d+1 equations).

Now to simplify things a bit, we’ll use a single observation (x,y) to ignore the summation because differentiate summation is equivalent to summing a differential.  This idea is actually useful when we want to improve Gradient descent performance (ref Stochastic gradient descent below), but for now let’s use this just to simplify the derivation.  Re-using the composed function s(u) where u = θt x so to exploit the chained rule in the following way:

l(θ)o= ls{-y log[s(u)]}su{s(u)}uθo(θt x) +ls{-(1-y) log(1-s(u))}su{1- s(u)}uθo(θt x)
 …

Recalling the derivative of a sigmoid d(s(u))du= s (1 - s), and obviously uθo(θt x) = x0

l(θ)o =  -ys(u)×{s(u)×(1- s(u))} × x0 + (y-1)1 - s(u)×{-s(u)×(1- s(u))} × x0
...

Eliminating some terms, lead simply to,

l(θ)o =  {-y × (1- s(u)) + (y-1)× s(u) } × x0
...

Let’s replace s(u) by our hypothesis hθ(x) to have the expression given by x, and put back the summation.   After some elimination, we easily show that the partial derivatives with respect to θi turn out to be in very simple:

l(θ)o =  i=1,n  { hθ(xi) - yi } × xi0    

l(θ)1 =  i=1,n  { hθ(xi) - yi } × xi1                              (eqs. 6)
 …

where xi0 is the i-th observation of the variable x0  same for xi1 , xi2 , .. xid (feature input space in d).

Setting these equation to 0, and solving for parameters θ would give us the best estimate parameters  MLEθ.  However as said before there is no closed form for these non-linear form hθ(x).    

It is useful to look at equation involving xi0 = 1 for all n points and its solution is trivial.   We see that the sum ∑i=1,n hθ(x) is equal to the sum of  ∑i=1,n yi all class for which yi=1, in otherword the expected value of our predictor E[hθ(x)] equal to 1/n ∑i=1,n yi.  This intuitive result confirms the unbiased likelihood estimate given by the method.

Now let’s define the most simple Gradient Descent (called batch) used to resolve for all d+1 parameters θ, in a simple pseudo-code:

Set default initial parameters θt = {0,0,....}

Iterate OR until convergence
    # move each param with its gradient d/d𝜭i descending at a certain learning rate (-𝛂)   
    θ0(cur) = θ0(prev) - 𝞪 i=1,n  {yi - hθ(xi)} × xi0
    θ1(cur) = θ1(prev) - 𝞪 i=1,n  {yi - hθ(xi)} × xi1
   …. for all θi(cur)

By starting at an initial params value (any point) and keep iterating toward the direction of largest slope we continue descending by re-assigning θi values to new location.    

We refer to it as batch because at each iteration we go over the full summation over all points.  It is also advisable to calculate the loss value to appreciate the speed of convergence and adjust the learning rate parameter 𝛂 accordingly.  This rate should neither be too small (very low convergence), nor too big (may not decrease on every step or worst may no convergence as it will overshoot).  A simple heuristic, is to use small value (eg. 0.001) and increase by a factor of 3 each time (i.e. log-scale as we multiply by 3 from previous values until we converge in an optimal way:   0.01, 0.03, 0.1, 0.3, 1, etc..).  


Algorithm optimization

Similar result could be obtained by applying Gradient Descent, on the set of Gradient equations derived from the error function (equation 5) instead of the MLE.   It it worthwhile to note that same equations would be found for Gradient of regression problem, only the form or the hypothesis would differ.   In fact, this is true for every exponential and is proved by the Generalized Linear Model theory (GLM).   

Batch algorithm is the brute force, and there are alternatives:

But these are just optimization tricks (part of mathematics called numerical methods) and do not change the nature of the problem.

Martin