First steps: Linear Classification¶
Two traditional machine learning models for linear classification are:
- The Perceptron and
- The Adaptive Linear Neuron (Adaline).
The ideas for these two models go back to:
“A logical calculus of the ideas immanent in nervous activity.”
—McCulloch & Pitts [9], 1943
“The perceptron, a perceiving and recognizing automaton.”
—Rosenblatt [15], 1957
“An adaptive ‘Adaline’ neuron using chemical memistors.”
—Widrow [22], 1960
These models also provide a good entry point to modern machine learning algorithms as:
- they are very simple;
- but they readily generalize to the concept of deep networks.
This section is in parts inspired by the introductory chapters of the book Python Machine Learning by Sebastian Raschka [13].
Binary Classification Problems¶
As an example of a typical binary classification problem let us consider:
- a sequence of data points $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, N$;
- each having $n$ characteristic features $x^{(i)} = \left( x^{(i)}_1, \ldots, x^{(i)}_n \right)$;
- and the task to assign to each element $x^{(i)}$ a label $y^{(i)} \in \{-1, +1\}$;
- thereby dividing the data points into two classes labeled $-1$ and $+1$.

Fig. 2 Labelled $2$-dimensional example data points $\left( x^{(i)}, y^{(i)} \right)$ describing the sepal length and sepal width, i.e., $x^{(i)}_1$ and $x^{(i)}_2$, respectively, of specimens of the Iris flower. The class label names ‘setosa’ and ‘other’, i.e., $y^{(i)} = +1$ and $y^{(i)} = -1$, respectively, are encoded in the colors red and blue.
The goal of the classification problem is, given some pre-labeled training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$ with $M \le N$, to make the machine find a function $f: \mathbb{R}^n \to \{-1, +1\}$ that:
- predicts accurately the labels of the pre-labeled training data, i.e., for most indices $i = 1, \ldots, M$ it should hold that $f(x^{(i)}) = y^{(i)}$;
- and generalizes well to the remaining data points $x^{(i)}$ for $i > M$ or even completely unknown data.
A general approach to this problem is to specify a space of candidates for $f$, the hypotheses set. Then the art of the game is to find sensible and mathematically precise objects encoding the vague expressions ‘accurately’, ‘most’, and ‘generalizes’, and to find, in that sense, an optimal function $f$.
- Typically one tries to find an adequate coordinatization of the hypotheses set, so that the search for an ‘optimal’ $f$ can be recast into a search for finitely many ‘optimal’ coordinates – one often refers to the choice of coordinatization and potential functions $f$ as the ‘model’ and to the particular coordinates as the ‘parameters of the model’;
- in which sense some parameters are better or worse than others is usually encoded by a non-negative function on the set of possible parameters and the entire training data set, often called ‘loss’, ‘regret’, ‘energy’ or ‘error’ function;
- the search for optimal parameters is then recast into a search for minima of this loss function.
Definition (Classification Problem) For $n, M, k \in \mathbb{N}$, a set $Y$ of $k$ labels, and a sequence $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$ with $x^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in Y$, one calls the problem of finding a function $f: \mathbb{R}^n \to Y$ such that
$$f(x^{(i)}) = y^{(i)} \qquad \text{for all } i = 1, \ldots, M$$
an $n$-dimensional classification problem with $k$ classes.
- The set $\left\{ \left( x^{(i)}, y^{(i)} \right) \,\middle|\, i = 1, \ldots, M \right\}$ is called training data or pre-labeled data.
- In case $k = 2$, one refers to it as a binary classification problem.
- Furthermore, the problem is called a linear classification problem if the given data points $x^{(i)}$ can be separated according to their respective labels $y^{(i)}$ by means of hyperplanes. If this is not the case, one refers to the problem as non-linear.
The following plot shows the data points of the iris data set shown above with a possible hyperplane as decision boundary between the two different classes.

Fig. 3 Decision boundaries for a possible classification function $f$. The dots denote unknown data points, e.g., $x^{(i)}$ for $i > M$, and the crosses denote pre-labeled data points $\left( x^{(i)}, y^{(i)} \right)$ for $1 \le i \le M$, which were used to train the model in order to find an optimal $f$.
Note that in Fig. 3, although the classification of the pre-labeled data points (the crosses) seems to be perfect, the classification of the unknown data (the dots) is not. This may be due to the following reasons:
- the data is simply not separable using just a hyperplane, i.e., it is a non-linear classification problem,
- there are errors in the pre-labeled data,
- or the classifier function $f$ is not yet optimal.
It is quite typical that a perfect classification is not possible. It is therefore important to specify mathematically in which sense we allow for errors and what can be done to minimize them – this will be encoded in the mathematical sense given to the expressions ‘accurately’ and ‘generalizes’, usually by means of the choice of loss function, as discussed above.
In the following we will specify two such senses, which lead to the models of the Perceptron and Adaline.
Perceptron¶
The first model we will take a look at is the so-called Perceptron Model. It is a mathematical model inspired by a nerve cell depicted in Fig. 4.

Fig. 4 A sketch of a neuron (source).
The mathematical model can be sketched as in Fig. 5.
- Let $n$ be the number of input signals;
- the input signals are given as a vector $x = (x_0, x_1, \ldots, x_n) \in \mathbb{R}^{n+1}$;
- these input signals are weighted by the weight vector $w = (w_0, w_1, \ldots, w_n) \in \mathbb{R}^{n+1}$ and then summed by means of the inner product $w \cdot x = \sum_{j=0}^{n} w_j x_j$.
- The first coefficient in the input vector $x$ is always assumed to be one, i.e., $x_0 = 1$, and thus, the first coefficient $w_0$ in the weight vector is a threshold term, which renders $x \mapsto w \cdot x$ an affine linear as opposed to a just linear map.
- Finally, the signum function
$$\operatorname{sign}(z) = \begin{cases} +1 & \text{for } z \ge 0, \\ -1 & \text{for } z < 0, \end{cases}$$
is employed to infer from $w \cdot x$ the discrete class labels $\pm 1$.

This results in a hypothesis set of functions
$$f_w: \mathbb{R}^{n+1} \to \{-1, +1\}, \qquad f_w(x) = \operatorname{sign}(w \cdot x), \tag{1}$$
parametrized by $w \in \mathbb{R}^{n+1}$, where we shall often drop the subscript $w$.
Since our hypothesis set only contains (affine) linear functions composed with the signum function, we may only expect it to be big enough for linear (or approximately linear) classification problems.
Note
- In the previous section the data points $x$ were assumed to be from $\mathbb{R}^n$ and $f$ was assumed to be a $\mathbb{R}^n \to \{-1, +1\}$ function;
- thus, an affine linear activation would amount to a function of the form $x \mapsto w \cdot x + w_0$ for weights $w \in \mathbb{R}^n$ and threshold $w_0 \in \mathbb{R}$;
- in the following we absorb the threshold $w_0$ into the weight vector and therefore add the coefficient $1$ at the first position of all data vectors $x$, i.e.,
$$\tilde x = (1, x_1, \ldots, x_n) \qquad \text{and} \qquad \tilde w = (w_0, w_1, \ldots, w_n),$$
so that
$$\tilde w \cdot \tilde x = w \cdot x + w_0.$$
Instead of an overset tilde, we will use the following convention to distinguish between vectors in $\mathbb{R}^n$ and $\mathbb{R}^{n+1}$: we simply write $x = (1, x_1, \ldots, x_n)$ and $w = (w_0, w_1, \ldots, w_n)$ for the extended vectors, and it will be clear from the context whether a vector is considered with or without its leading coefficient.
Example: The bitwise AND-gate
Let us pause and consider what such a simple model (1) is able to describe. This is a question of whether our hypothesis set is big enough to contain a certain function.
The bitwise AND-gate operation is given by the following table:

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 0      |
| 1       | 0       | 0      |
| 1       | 1       | 1      |
- In order to answer the question whether our hypothesis set is big enough to model the AND-gate, it is helpful to represent the above table as a graph similar to the Iris data above.
- The features of each data point $x = (1, x_1, x_2)$ are the two input signals, and the output values $0$ and $1$ are encoded by the class labels $y = -1$ and $y = +1$, respectively.
- The colors red and blue denote the output values $0$ and $1$ of the AND-gate.
- Note that the data points are linearly separable: the two classes can be well separated by a hyperplane (in this case a 1d straight line). Hence, it is easy to find a good weight vector $w$. For instance:
$$w = (w_0, w_1, w_2) = \left( -\tfrac{3}{2}, 1, 1 \right). \tag{2}$$
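As a quick sanity check, the weight vector in (2) can be tested against all four inputs of the AND-gate. The following is a minimal sketch using the conventions of (1), i.e., the leading coefficient $1$ and the signum with $\operatorname{sign}(0) = +1$:

```python
import numpy as np

# weight vector from (2): threshold w_0 = -1.5 and weights w_1 = w_2 = 1
w = np.array([-1.5, 1.0, 1.0])

# all four inputs of the AND-gate, each prepended with the coefficient 1
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])

# f_w(x) = sign(w . x) as in (1)
print(np.where(X.dot(w) >= 0.0, 1, -1))  # prints [-1 -1 -1  1]
```

Only the input $(1, 1)$ yields a positive activation $w \cdot x = 0.5$; all other inputs give negative activations, as required.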
Homework
Learning rule¶
Having settled for a hypothesis set such as the functions $f_w$, $w \in \mathbb{R}^{n+1}$, given in (1), the task is to learn good parameters, i.e., in our case a good weight vector $w$, in the sense discussed in the previous section.
- This is now done by adjusting the weight vector $w$ appropriately depending on the training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$,
- in a way that minimizes the classification errors, i.e., the number of indices $i$ for which $f_w(x^{(i)}) \neq y^{(i)}$.
The algorithm by which the ‘learning’ is facilitated shall be called learning rule and can be spelled out as follows:
Algorithm: (Perceptron Learning Rule)
INPUT: Pre-labeled training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$.
STEP 1: Initialize the weight vector $w$ to zero or conveniently distributed random coefficients.
STEP 2: Pick a data point $\left( x^{(i)}, y^{(i)} \right)$ in the training samples at random:
a. Compute the output $\hat y^{(i)} = f_w(x^{(i)})$.
b. Compare $\hat y^{(i)}$ with $y^{(i)}$:
   - If $\hat y^{(i)} = y^{(i)}$, go back to STEP 2.
   - Else, update the weight vector $w$ appropriately according to an update rule, and go back to STEP 2.
The following sketch is a visualization of the feedback loop for the learning rule:
The important step is the update rule which we discuss next.
Update rule¶
Let us spell out a possible update rule and then discuss why it does what we want. Suppose the data point $\left( x^{(i)}, y^{(i)} \right)$ was picked in STEP 2 and the current prediction is
$$\hat y^{(i)} = f_w(x^{(i)}). \tag{3}$$
Then the weight vector $w$ is replaced by
$$w \to w' = w + \Delta w \tag{4}$$
with
$$\Delta w = \eta \left( y^{(i)} - \hat y^{(i)} \right) x^{(i)} \tag{5}$$
for some fixed parameter $\eta > 0$, the so-called learning rate.
Why does this update rule lead to a good choice of weights $w$?
Assume that STEP 2 b. of the learning rule identified a misclassification and calls the update rule. There are two possibilities:
1. $y^{(i)} = +1$ and $\hat y^{(i)} = -1$: This means that the model predicted $-1$ although the correct label is $+1$.
   - Hence, by definition of $f_w$ in (1), the value of $w \cdot x^{(i)}$ is too low;
   - this can be fixed by adjusting the weights according to (4) and (5);
   - next time when this data point is examined one finds
     $$w' \cdot x^{(i)} = w \cdot x^{(i)} + \eta \left( y^{(i)} - \hat y^{(i)} \right) \left\| x^{(i)} \right\|^2 > w \cdot x^{(i)}$$
     because, as $y^{(i)} - \hat y^{(i)} = 2 > 0$ and the square is non-negative (in fact positive since $x^{(i)}_0 = 1$), the last summand on the right is positive.
   - Hence, the new weight vector is changed in such a way that, next time, it is more likely that $f_{w'}$ will predict the label of $x^{(i)}$ correctly.
2. $y^{(i)} = -1$ and $\hat y^{(i)} = +1$: This means that the model predicted $+1$ although the correct label is $-1$.
   - By the same reasoning as in case 1. one finds
     $$w' \cdot x^{(i)} < w \cdot x^{(i)}$$
     because now we have $y^{(i)} - \hat y^{(i)} = -2 < 0$, and again, the correction works in the right direction.
The model (1) for $f_w$, i.e., the hypothesis set, together with this particular learning and update rule is what defines the ‘Perceptron’.
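To see the update rule at work, here is a minimal numeric sketch of a single update step (3)-(5) with a hypothetical learning rate $\eta = 0.1$ on a misclassified data point:

```python
import numpy as np

eta = 0.1                           # hypothetical learning rate
w = np.zeros(3)                     # initial weights, cf. STEP 1
x = np.array([1.0, 1.0, 1.0])       # data point with leading coefficient 1
y = -1                              # correct label

y_hat = 1 if w.dot(x) >= 0 else -1  # prediction (3); here +1, a misclassification
w += eta * (y - y_hat) * x          # update (4)-(5); w becomes [-0.2, -0.2, -0.2]

print(w.dot(x))                     # activation dropped from 0.0 to -0.6 < 0
```

After the update the activation input is negative, so the same data point would now be classified correctly as $-1$.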
Convergence¶
Now that we have a heuristic understanding why the learning and update rule chosen for the Perceptron works, we have a look at what can be said mathematically; see [21] for a more detailed discussion.
First let us make precise what we mean by ‘linear separability’ in our setting:
Definition: (Linear separability) Let $A, B$ be two sets in $\mathbb{R}^{n+1}$. Then:
- $A, B$ are called linearly separable if there is a $w \in \mathbb{R}^{n+1}$ such that
$$\forall x \in A: \ w \cdot x \ge 0 \qquad \text{and} \qquad \forall x \in B: \ w \cdot x < 0;$$
- $A, B$ are called absolutely linearly separable if there is a $w \in \mathbb{R}^{n+1}$ such that
$$\forall x \in A: \ w \cdot x > 0 \qquad \text{and} \qquad \forall x \in B: \ w \cdot x < 0.$$
The learning and update rule of the Perceptron can be combined into the following algorithm:
Algorithm: (Perceptron Learning and Update Rule)
PREP:
Prepare the training data. Let $P$ and $N$ be the sets of elements $x^{(i)}$ whose class labels fulfill $y^{(i)} = +1$ and $y^{(i)} = -1$, respectively.
START:
Initialize the weight vector $w_0$ with random numbers and set $t = 0$.
STEP:
Choose an $x \in P \cup N$ at random:
- If $x \in P$ and $w_t \cdot x \ge 0$: goto STEP.
- If $x \in P$ and $w_t \cdot x < 0$: goto UPDATE.
- If $x \in N$ and $w_t \cdot x < 0$: goto STEP.
- If $x \in N$ and $w_t \cdot x \ge 0$: goto UPDATE.
UPDATE:
- If $x \in P$, then set $w_{t+1} = w_t + x$, increment $t$, and goto STEP.
- If $x \in N$, then set $w_{t+1} = w_t - x$, increment $t$, and goto STEP.

- Note that for an implementation of this algorithm we will also need an exit criterion so that the algorithm does not run forever.
- This is usually done by specifying how many times the entire training set is run through STEP, a number which is often referred to as the number of epochs.
- Note further that, for the sake of brevity, the learning rate $\eta$ was chosen to equal $\frac{1}{2}$; compare to (5).
Frank Rosenblatt already showed convergence of the algorithm above in the case of finite and linearly separable training data:
Theorem: (Perceptron convergence)
Let $P, N$ be finite and linearly separable sets; then the number of updates performed by the Perceptron algorithm stated above is finite.
- As a first step, we observe that since $P, N$ are finite sets that are linearly separable, they are also absolutely linearly separable due to:
Proposition:
Let $A, B$ be finite sets in $\mathbb{R}^{n+1}$. Then: $A, B$ are linearly separable $\iff$ $A, B$ are absolutely linearly separable.
Furthermore, we observe that without restriction of generality (WLOG) we may assume the vectors $x \in P \cup N$ to be normalized, because
$$\operatorname{sign}(w \cdot x) = \operatorname{sign}\left( w \cdot \frac{x}{\|x\|} \right).$$
Note that this means that, for such normalized $x$, the first coefficient does not equal one in general, and hence, we break our convention. However, the reason for this convention was to ensure that $x \mapsto w \cdot x$ is an affine linear map with a potential bias term $w_0$. As long as the first coefficient is non-zero this is the case, and any necessary scaling will be encoded into the choice of $w$ during training.
Let us define $\hat P := P \cup (-N)$, i.e., $\hat P$ is the union of $P$ and the elements of $N$ multiplied by $-1$.
Since $P, N$ are absolutely linearly separable, there is a $w^* \in \mathbb{R}^{n+1}$ such that for all $x \in \hat P$
$$w^* \cdot x > 0. \tag{6}$$
And moreover, we may WLOG assume that $w^*$ is normalized, i.e., $\|w^*\| = 1$.
Let us assume that some time after the $t$-th update a point $x \in \hat P$ is picked in STEP that leads to a misclassification, i.e., $w_t \cdot x < 0$, so that UPDATE will be called, which updates the weight vector according to
$$w_{t+1} = w_t + x.$$
Note that both cases of UPDATE are treated with this update since in the definition of $\hat P$ we have already included the ‘minus’ sign.
Now in order to infer a bound on the number of updates in the Perceptron algorithm above, consider the quantity
$$\cos\rho = \frac{w^* \cdot w_{t+1}}{\|w_{t+1}\|}, \tag{7}$$
i.e., the cosine of the angle $\rho$ between $w^*$ and $w_{t+1}$, which is bounded from above by one.
To bound this quantity also from below, we consider first the numerator of (7). Thanks to (6) and the finiteness of $\hat P$, we know that
$$\delta := \min_{x \in \hat P} w^* \cdot x > 0.$$
This facilitates the estimate
$$w^* \cdot w_{t+1} = w^* \cdot w_t + w^* \cdot x \ge w^* \cdot w_t + \delta,$$
which, by induction, gives
$$w^* \cdot w_{t+1} \ge w^* \cdot w_0 + (t+1)\,\delta. \tag{8}$$
Second, we consider the denominator of (7):
$$\|w_{t+1}\|^2 = \|w_t\|^2 + 2\, w_t \cdot x + \|x\|^2.$$
Recall that $x$ was misclassified by the weight vector $w_t$ so that $w_t \cdot x < 0$. This yields the estimate
$$\|w_{t+1}\|^2 \le \|w_t\|^2 + \|x\|^2.$$
Again by induction, and recalling the assumption that the data points were normalized, we get:
$$\|w_{t+1}\|^2 \le \|w_0\|^2 + (t+1). \tag{9}$$
Both bounds, (8) and (9), together with (7), give rise to the inequalities
$$1 \ge \cos\rho = \frac{w^* \cdot w_{t+1}}{\|w_{t+1}\|} \ge \frac{w^* \cdot w_0 + (t+1)\,\delta}{\sqrt{\|w_0\|^2 + (t+1)}}. \tag{10}$$
The right-hand side would grow as $O(\sqrt{t})$ for $t \to \infty$ but has to remain smaller than one. Hence, $t$, i.e., the number of updates, must be bounded by a finite number.
Homework
- What is the geometrical meaning of the quantity $\cos\rho$ in (7) in the proof above?
- Consider the case $w_0 = 0$ and give an upper bound on the maximum number of updates.
- Carry out the analysis above including an arbitrary learning rate $\eta$. How does $\eta$ influence the number of updates?
Finally, though this result is reassuring, it needs to be emphasized that it is rather academic.
- The convergence theorem only holds in the case of linear separability of the training data, which in most interesting cases is not given.
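Nevertheless, the theorem is easy to observe in practice. The following sketch generates a synthetic, absolutely linearly separable data set (an assumption made here purely for illustration) and counts the number of updates until one full pass over the data produces no misclassification:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic separable data: labels are assigned by a fixed 'true' hyperplane
w_true = np.array([0.5, 1.0, -1.0])
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
Y = np.where(X.dot(w_true) > 0, 1, -1)

w = np.zeros(3)
updates = 0
while True:
    misclassified = False
    for x, y in zip(X, Y):
        if y * x.dot(w) <= 0:   # treat activation zero as a misclassification
            w += y * x          # UPDATE step from the algorithm above
            updates += 1
            misclassified = True
    if not misclassified:
        break

print(updates)  # a finite number, in accordance with the convergence theorem
```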
Python implementation¶
Next, we discuss a Python implementation of the Perceptron discussed above.
The mathematical model of the function $f_w$ in (1), i.e., the hypothesis set, and the learning and update rule will be implemented as a Python class:

```python
import numpy as np


class Perceptron:

    def __init__(self, num):
        '''
        initialize class for `num` input signals
        '''
        # weights of the perceptron, initialized to zero
        # note the '1 + ' as the first weight entry is the threshold
        self.w_ = np.zeros(1 + num)
        return
```
The constructor `__init__` takes as argument the number of input signals `num` and initializes the variable `w_`, which will be used to store the weight vector $w \in \mathbb{R}^{1+n}$, where $n =$ `num`. The constructor is called when an object of the `Perceptron` class is created, e.g., by:

```python
ppn = Perceptron(2)
```

In this example, it initializes a Perceptron for $n = 2$ input signals.
The first method `activation_input` of the `Perceptron` class takes as argument an array of data points `X`, i.e., $(x^{(i)})_i$ with the leading one-coefficients omitted, and returns the array of activation inputs $w \cdot x^{(i)}$ for all $i$ using the weight vector stored in the variable `w_`:

```python
def activation_input(self, X):
    '''
    calculate the activation input of the neuron
    '''
    return np.dot(X, self.w_[1:]) + self.w_[0]
```
The second method `classify` takes again an array of data points `X` as argument. It uses the previous method `activation_input` to compute the activation inputs and then applies the signum function to the values in the array:

```python
def classify(self, X):
    '''
    classify the data by sending the activation input
    through a step function
    '''
    return np.where(self.activation_input(X) >= 0.0, 1, -1)
```
This method is the implementation of the function $f_w$ in (1).
Finally, the next method implements the learning and update rule:
```python
def learn(self, X_train, Y_train, eta=0.01, epochs=10):
    '''
    fit training features X_train with labels Y_train
    according to learning rate `eta` and total number of
    epochs `epochs` and log the misclassifications in errors_
    '''
    # reset internal list of misclassifications for the logging
    self.train_errors_ = []

    # repeat `epochs` many times
    for _ in range(epochs):
        err = 0
        # for each pair of features and corresponding label
        for x, y in zip(X_train, Y_train):
            # compute the update for the weight coefficients
            update = eta * (y - self.classify(x))
            # update the weights
            self.w_[1:] += update * x
            # update the threshold
            self.w_[0] += update
            # increment the number of misclassifications if update is not zero
            err += int(update != 0.0)

        # append the number of misclassifications to the internal list
        self.train_errors_.append(err)
    return
```
- It takes as input arguments the training data in form of the two arrays `X_train` and `Y_train`, and furthermore, the learning rate `eta`, i.e., $\eta$, and an additional number called `epochs`.
- The latter number `epochs` specifies how many times the learning rule runs over the whole training data set – see the `for` loop in line number 11.
- In the body of the first `for` loop a variable `err` is set to zero, and a second `for` loop over the set of training data points is carried out.
- The body of the latter `for` loop implements the update rule (3)-(5).
- Note that there are two types of updates, i.e., lines 18 and 20. This is due to the fact that above we used the convention that the first coefficient of $x$ was fixed to one in order to keep the notation slim.
- In line 22, `err` is incremented each time a misclassification occurred. The number of misclassifications per epoch is then appended to the list `train_errors_`.
After loading the training data set $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$ into the two arrays `X_train` and `Y_train`, the Perceptron can be initialized and trained as follows:

```python
ppn = Perceptron(X.shape[1])
ppn.learn(X_train, Y_train, eta=0.1, epochs=100)
```
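For completeness, here is one possible way to obtain such arrays, assuming scikit-learn is available; the accompanying notebooks may load the Iris data differently, and the encoding of ‘setosa’ as $+1$ is an assumption matching Fig. 2:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# sepal length and sepal width of the first 100 samples
X = iris.data[:100, :2]
# encode the class 'setosa' as +1 and the other species as -1
Y = np.where(iris.target[:100] == 0, 1, -1)

# for simplicity we train on the full set; a train/test split would be better
X_train, Y_train = X, Y
```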
SOURCECODE: first_steps/001_iris_perceptron.ipynb
A first full implementation of the Perceptron discussed above.
Problems with the Perceptron¶
- As discussed, the convergence of the Perceptron algorithm is only guaranteed in the case of linearly separable training data.
- If linear separability is not given, each epoch will see at least one update, which results in an oscillatory behavior of the chosen $w$.
- Thus, in general we need a good exit criterion for the algorithm to bound the maximum number of updates.
- The updates stop the very instant the entire test data is classified correctly, which might lead to poor generalization properties of the resulting classifier to unknown data.
SOURCECODE: first_steps/002_iris_perceptron_convergence.ipynb
In this implementation several learning behaviors of the Perceptron with the Iris data set are demonstrated.
SOURCECODE: first_steps/003_bitwise-AND-and-XOR-gates_perceptron.ipynb
Furthermore, it makes sense to look at the learning behavior in the more tractable cases of both linearly and non-linearly separable data sets. In this implementation the learning behavior for the logical AND gate as well as the failure to learn the XOR gate can be observed.
1. Implement training scenarios for the Perceptron in which you can observe the qualitative behavior described above, i.e., the possible oscillations and the abrupt stop in training.
2. What effect does the learning rate have? Examine a situation in which the learning rate is too high and one in which it is too low, and discuss both cases.
3. What happens when the training data cannot be separated by a hyperplane? Examine problematic situations and discuss these – for example, by generating fictitious data points.
4. Note that the instant all training data is classified correctly, the Perceptron stops updating the weight vector. Is this a feature or a bug?
5. Discuss the dependency of the learning success on the order in which the training data is presented to the Perceptron. How could the dependency be suppressed?
Adaline¶
The Adaline algorithm will overcome some of the shortcomings of the Perceptron.
The basic design is almost the same:
The first difference w.r.t. the Perceptron is the additional activation function $\sigma$. We shall call $w \cdot x$ the activation input and $\sigma(w \cdot x)$ the activation output.
We will discuss different choices of activation functions later. For now let us simply use the identity:
$$\sigma(z) = z. \tag{11}$$
The second difference is that the activation output is used in the feedback loop for the update rule.
The advantage is that, provided $\sigma$ is regular enough, we may make use of analytic optimization theory in order to find an, in some sense, ‘optimal’ choice of weights $w$.
This was not possible in the case of the Perceptron because the signum function is not differentiable.
Update rule¶
Recall that an ‘optimal’ choice of weights $w$ should fulfill two properties:
- It should ‘accurately’ classify the training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$,
- and it should ‘generalize’ well to unknown data.
In order to make use of analytic optimization theory, one may attempt to encode a measure of the optimality of weights w.r.t. these two properties in the form of a function that attains smaller and smaller values the better the weights fulfill these properties.
This function is called many names, e.g., ‘loss’, ‘regret’, ‘cost’, ‘energy’, or ‘error’ function. We will use the term ‘loss function’.
Of course, depending on the classification task, there are many choices. Maybe one of the simplest examples is:
$$L(w) = \frac{1}{2} \sum_{i=1}^{M} \left( y^{(i)} - \sigma\!\left( w \cdot x^{(i)} \right) \right)^2, \tag{12}$$
which is the accumulated squared euclidean distance between the particular labels $y^{(i)}$ of the training data and the corresponding predictions $\sigma(w \cdot x^{(i)})$ given by Adaline for the current weight vector $w$.
Note that the loss function depends not only on $w$ but also on the entire training data set $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$. The latter, however, is assumed to be fixed, which is why the dependence of $L$ on it will be suppressed in our notation.
From its definition, the loss function in (12) has the desired property that it tends to grow and decrease as the number of misclassifications grows or decreases, respectively.
Furthermore, it does so smoothly, which allows for the use of analytic optimization theory.
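To make (12) concrete, here is a sketch of how the loss can be evaluated with numpy, following the conventions of the implementation below, in which the features are stored without the leading one-coefficient and the threshold is kept separately in `w[0]`:

```python
import numpy as np

def loss(w, X_train, Y_train):
    ''' squared-error loss (12) for the linear activation sigma(z) = z '''
    output = X_train.dot(w[1:]) + w[0]            # activation inputs w . x^(i)
    return 0.5 * ((Y_train - output) ** 2).sum()  # accumulated squared distance
```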
Learning and update rule¶
Having encoded the desired properties of ‘optimal’ weights $w$ as a global minimum of the loss function $L$, the only task left to do is to find this global minimum.
Depending on the function $L$, i.e., on the training data, this task may be arbitrarily simple or difficult.
Consider the following heuristics in order to infer a possible learning strategy:
Say, we start with a weight vector $w$ and want to make an update
$$w \to w' = w + \Delta w$$
in a favourable direction $\Delta w$.
An informal Taylor expansion of $L$ reveals
$$L(w + \Delta w) = L(w) + \nabla_w L(w) \cdot \Delta w + O(\|\Delta w\|^2).$$
In order to make the update ‘favourable’ we want that $L(w') \le L(w)$.
Neglecting the higher orders, this would mean:
$$\nabla_w L(w) \cdot \Delta w \le 0. \tag{13}$$
In order to get rid of the unknown sign of $\nabla_w L(w) \cdot \Delta w$ we may choose:
$$\Delta w = -\eta \, \nabla_w L(w) \tag{14}$$
for some learning rate $\eta > 0$.
Then, for the choice (14) the linear order (13) becomes negative and we note that
$$L(w + \Delta w) = L(w) - \eta \, \|\nabla_w L(w)\|^2 + O(\eta^2).$$
Hence, the update may work to decrease the value of $L$ – at least in the linear order of perturbation.
Concretely, for our case we find:
$$\nabla_w L(w) = -\sum_{i=1}^{M} \left( y^{(i)} - \sigma\!\left( w \cdot x^{(i)} \right) \right) \sigma'\!\left( w \cdot x^{(i)} \right) x^{(i)}, \tag{15}$$
where $\sigma'$ denotes the derivative of $\sigma$; for the linear activation (11) one simply has $\sigma' \equiv 1$. Here the notation $\nabla_w$ denotes the gradient $\nabla_w = \left( \frac{\partial}{\partial w_0}, \ldots, \frac{\partial}{\partial w_n} \right)$, and (15) makes sense as the right-hand side is a vector in $\mathbb{R}^{n+1}$.
In conclusion, we may formulate the Adaline algorithm as follows:
Algorithm: (Adaline Learning and Update Rule)
INPUT: Pre-labeled training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$.
STEP 1: Initialize the weight vector $w$ to zero or conveniently distributed random coefficients.
STEP 2: For a certain number of epochs:
a. Compute the gradient $\nabla_w L(w)$ as in (15);
b. update the weights $w$ according to
$$w \to w' = w - \eta \, \nabla_w L(w). \tag{16}$$
Prove that, even in the linearly separable case, the above Adaline algorithm need not converge. Do this by constructing a simple example of training data and a special choice of learning rate $\eta$.
SOURCECODE: 004_adaline-non-convergence.ipynb
A small implementation that shows such a bad case.
What is the influence of large or small values of $\eta$? Discuss qualitatively – we will discuss this in detail in the next section.
Discuss qualitatively the advantages and disadvantages of immediate weight updates after each misclassification, as was the case for the Perceptron, versus batch updates, as is the case for Adaline – also this we will discuss in more detail in the next section.
Python implementation¶
As we have already noted, the basic design of Adaline is almost the same as that of the Perceptron. Hence, for the linear activation $\sigma(z) = z$, we only need to change the learning rule implemented in the method `learn` of the `Perceptron` class. The `Adaline` class can therefore be created as follows:
```python
class Adaline(Perceptron):

    def learn(self, X_train, Y_train, eta=0.01, epochs=1000):
        '''
        fit training features X_train with labels Y_train according to
        learning rate `eta` and total number of epochs `epochs`
        '''
        # we initialize two lists, one for the misclassifications and
        # one for the values of the loss function
        self.train_errors_ = []
        self.train_loss_ = []

        # for all the epochs
        for _ in range(epochs):
            # classify the training features
            Z = self.classify(X_train)
            # count the misclassifications for the logging
            err = 0
            for z, y in zip(Z, Y_train):
                err += int(z != y)
            # and save them in the list for later use
            self.train_errors_.append(err)

            # compute the activation input of the entire training features
            output = self.activation_input(X_train)
            # and then the deviation from the labels
            delta = Y_train - output
            # the following is an implementation of the Adaline update rule
            self.w_[1:] += eta * X_train.T.dot(delta)
            self.w_[0] += eta * delta.sum()

            # and finally, we record the loss function
            loss = (delta ** 2).sum() / 2.0
            # and save it for later use
            self.train_loss_.append(loss)
        return
```
- Line 1 defines the `Adaline` class as a child of the `Perceptron` class. It thus inherits all the methods and variables of the `Perceptron` class.
- Line 11 introduces a variable `train_loss_` similar to `train_errors_` that will store the value of the loss function per epoch.
- Line 14 is again the `for` loop over the epochs.
- In line 16 the classification of all training data points is conducted.
- Lines 17-22 only count the number of misclassifications, which is then appended to the list `train_errors_`.
- The update rule is implemented in lines 24-30. First, the activation input of all the training data is computed, and the array `delta` stores the deviations $y^{(i)} - \sigma(w \cdot x^{(i)})$ from the labels.
- This `delta` array is then used to compute the updated weight vector stored in `w_` in lines 29-30.
- The last two lines in this `for` loop compute the loss value (12) for this epoch and store it in the list `train_loss_`.
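A short usage sketch, with a hypothetical, deliberately small learning rate (standardized features, as discussed below, usually tolerate larger values):

```python
ada = Adaline(X_train.shape[1])
ada.learn(X_train, Y_train, eta=0.0001, epochs=100)
# the recorded loss values should decrease over the epochs
print(ada.train_loss_[:3], ada.train_loss_[-3:])
```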
SOURCECODE: first_steps/004_iris_adaline.ipynb
This is a first full implementation of the Adaline discussed above.
Learning behavior¶
- We have introduced the learning rate $\eta$ in an ad-hoc fashion;
- not even in the linearly separable case may we expect convergence of the Adaline algorithm;
- we can only expect to find a good approximation of the optimal choice of weights $w$;
- but the approximation will depend on the choice of the learning rate $\eta$ as well as on the initial choice of $w$.
In the figure below, the learning rate was chosen too large.
Instead of approximating the minimum value, the gradient descent algorithm even
diverges.
In the next figure, the learning rate has been chosen too small.
In case the loss function has other local minima, the initial weight vector is coincidentally chosen near such a local minimum, and the learning rate is too small, the gradient descent algorithm will converge to the nearest local minimum instead of the global minimum.
In the special case of a linear activation $\sigma(z) = z$ and a quadratic loss function such as (12), which we also used in our Python implementation, such a behavior cannot occur due to strict convexity. For more general activations $\sigma$ and loss functions $L$ that we will discuss later, there are often non-trivial landscapes of local minima.
Here is another bad scenario where we see that the gradient descent algorithm does not converge:
Many improvements can be made with respect to the gradient descent algorithm, and they tend to work well in different situations. A behavior such as in the scenario above can be tempered by adapting the learning rate, for instance, by choosing:
$$\eta_t = \frac{c_1}{c_2 + t},$$
where $c_1, c_2 > 0$ are two constants and $t$ is the number of updates. The reasoning behind the adaptation of the learning rate is not to overshoot the minimum: as the updates get smaller and smaller, the iteration is forced to settle down.
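As a sketch, such a schedule with hypothetical constants $c_1, c_2$ can be implemented and queried in place of a fixed `eta`:

```python
def eta_t(t, c1=0.01, c2=100.0):
    ''' decaying learning rate after t updates; c1 and c2 are hypothetical constants '''
    return c1 / (c2 + t)
```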
Other algorithms implement Newtonian dynamics including friction on the loss surface defined by $L$ (so that initially there is good potential to escape local minima due to the momentum, but as the friction accumulates, the updates become smaller over time) or add a temperature to the motion (in order not to get stuck in a local minimum). Here is a nice overview of popular optimization algorithms used in machine learning:
An overview of gradient descent optimization algorithms by Sebastian Ruder.
SOURCECODE: first_steps/004_iris_adaline.ipynb
In this implementation several learning behaviors of the Adaline with the Iris data set are demonstrated.
SOURCECODE: first_steps/005_iris_adaline_convergence.ipynb
Again, as for the Perceptron, it makes sense to look at the learning behavior in the more tractable cases of both linearly and non-linearly separable data sets. In this implementation the learning behavior for the logical AND gate as well as the failure to learn the XOR gate can be observed.
- Implement training scenarios for the Adaline in which you can observe the qualitative behavior described above, i.e., situations in which the learning rate is too high or too low.
- What happens when the training data cannot be separated by a hyperplane? Examine problematic situations and discuss these – for example, by generating fictitious data points.
- What happens after the first epoch of training in which the data was successfully separated? Compare with your results for the Perceptron.
Standardization of training data¶
As can be seen from (14), the learning rate $\eta$ controls the magnitude of the update of the weight vector. The scale on which the learning rate controls the update is given by the factor $\nabla_w L(w)$, which by definition of $L$ in (12) is given in terms of a sum of $M$ summands. Hence, if the size of the training set varies, the respective scales of the learning rate will in general not be comparable. To reduce the dependence of the learning rate scale on the size of the training data, it is a good idea to replace $L$ by the average:
$$L(w) = \frac{1}{2M} \sum_{i=1}^{M} \left( y^{(i)} - \sigma\!\left( w \cdot x^{(i)} \right) \right)^2.$$
Furthermore, the learning rate will depend on the fluctuations in the features of your training data set. In order to have comparable results, it is therefore a good idea to normalize the training data. A standard procedure to do this is to transform the training data according to the following map:
$$x^{(i)}_j \mapsto \frac{x^{(i)}_j - \mu_j}{s_j}, \qquad j = 1, \ldots, n,$$
where
$$\mu_j = \frac{1}{M} \sum_{i=1}^{M} x^{(i)}_j$$
is the empirical average and
$$s_j = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( x^{(i)}_j - \mu_j \right)^2} \tag{17}$$
is the standard deviation. This procedure is called standardization of the training data.
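In numpy, standardization amounts to a few lines; note that `X.std(axis=0)` implements (17), while the corrected estimator mentioned in the side remark below corresponds to `X.std(axis=0, ddof=1)`:

```python
import numpy as np

def standardize(X):
    ''' standardize features: zero mean and unit standard deviation per column '''
    mu = X.mean(axis=0)   # empirical average of each feature
    s = X.std(axis=0)     # standard deviation (17) of each feature
    return (X - mu) / s
```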
A little side remark from statistics: For small samples, (17) underestimates the standard deviation on average. A better estimate is therefore:
$$s_j = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} \left( x^{(i)}_j - \mu_j \right)^2}.$$
Prove that for independent identically distributed random variables the square of this estimator is unbiased thanks to the peculiar factor $\frac{1}{M-1}$. However, in general our training samples will be large enough that this goes unnoticed.
Online learning versus batch learning¶
The learning and update rule of the Perceptron and Adaline have a crucial difference:
- The Perceptron updates its weights upon inspection of a single data point $\left( x^{(i)}, y^{(i)} \right)$ of the training data – this is usually referred to as ‘online’ (in the sense of ‘real-time’) or ‘stochastic’ (in the sense that training data points are chosen at random) learning.
- The Adaline conducts an update only after inspecting the whole training data by means of computing the gradient of the loss function (12), which depends on the entire set of training data – this is usually referred to as ‘batch’ learning (in the sense that the whole batch of training data is used to compute an update of the weights).
While online learning may have a strong dependence on the sequence of training data points presented to the learner and produces updates that may be extreme for extreme outliers, batch learning averages out the updates but therefore is usually computationally very expensive.
Clearly, we can also make Adaline become an online learner by repeatedly presenting it training data consisting only of one randomly chosen point. This method is called stochastic gradient descent.
In turn, we can make the Perceptron become a batch learner simply by computing the update for each element of the entire training data set, averaging these updates, and then performing a single update with the average.
A compromise between the two extremes, online and batch learning, is the so-called ‘mini-batch’ learning.
The entire batch of training data $\left( x^{(i)}, y^{(i)} \right)_{i = 1, \ldots, M}$ is then split up into disjoint mini-batches
$$B_k = \left\{ \left( x^{(i)}, y^{(i)} \right) \,\middle|\, M_{k-1} < i \le M_k \right\}, \qquad k = 1, \ldots, K,$$
for an appropriate strictly increasing sequence of indices $0 = M_0 < M_1 < \ldots < M_K = M$ such that $\bigcup_{k=1}^{K} B_k$ comprises the entire training data.
For each mini-batch $B_k$ the mini-batch loss function
$$L_k(w) = \frac{1}{2} \sum_{(x, y) \in B_k} \left( y - \sigma(w \cdot x) \right)^2$$
is computed and the update of the weight vector $w$ is performed accordingly.
For appropriately chosen mini-batch sizes, this mini-batch learning rule often proves to be faster than online or batch learning.
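A minimal sketch of a mini-batch variant of the Adaline update, with a hypothetical batch size and reusing the conventions from the implementation above:

```python
import numpy as np

def learn_minibatch(w, X_train, Y_train, eta=0.01, epochs=100, batch_size=16):
    ''' mini-batch gradient descent for the Adaline loss (12) '''
    rng = np.random.default_rng()
    n = len(X_train)
    for _ in range(epochs):
        # shuffle once per epoch so the mini-batches differ between epochs
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = perm[start:start + batch_size]
            X_b, Y_b = X_train[batch], Y_train[batch]
            # gradient step (16) restricted to the mini-batch loss
            delta = Y_b - (X_b.dot(w[1:]) + w[0])
            w[1:] += eta * X_b.T.dot(delta)
            w[0] += eta * delta.sum()
    return w
```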