1) What is an Artificial Neural Network?
An extremely simplified model of the brain
Essentially a function approximator
Transforms inputs into outputs to the best of its ability
Composed of many "neurons" that co-operate to perform the desired function
What Are They Used For?
Classification
Pattern recognition, feature extraction, image matching
Noise Reduction
Recognize patterns in the inputs and produce noiseless outputs
Prediction
Extrapolation based on historical data
Ability to learn
NNs figure out how to perform their function on their own
Determine their function based only upon sample inputs
Ability to generalize
i.e., produce reasonable outputs for inputs it has not been taught how to deal with
How do Neural Networks Work?
The "building blocks" of neural networks are the neurons.
In technical systems, we also refer to them as units or nodes.
Basically, each neuron
receives input from many other neurons,
changes its internal state (activation) based on the current input,
sends one output signal to many other neurons, possibly including its input neurons (recurrent network)
Information is transmitted as a series of electric impulses, so-called spikes.
The frequency and phase of these spikes encodes the information.
In biological systems, one neuron can be connected to as many as 10,000 other neurons.
Usually, a neuron receives its information from other neurons in a confined area, its so-called receptive field.
NNs are able to learn by adapting their connectivity patterns so that the organism improves its behavior in terms of reaching certain (evolutionary) goals.
The strength of a connection, or whether it is excitatory or inhibitory, depends on the state of a receiving neuron's synapses.
The NN achieves learning by appropriately adapting the states of its synapses
The output of a neuron is a function of the weighted sum of the inputs plus a bias
The function of the entire neural network is simply the computation of the outputs of all the neurons
An entirely deterministic calculation
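This computation can be sketched in a few lines (the names and the choice of a sigmoid output function are illustrative, not prescribed by the text above):

```python
import math

def neuron_output(inputs, weights, bias):
    """Output = f(weighted sum of inputs + bias); here f is a sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# Entirely deterministic: the same inputs always yield the same output.
y = neuron_output([1.0, 0.5], [0.4, -0.2], 0.1)
```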
2- Explain a) Gaussian Neurons
Another type of neuron overcomes the limitations of threshold units by using a Gaussian activation function:
[Figure: Gaussian activation function fi(neti(t)), plotted against neti(t) from -1 to 1, with outputs between 0 and 1.]
Gaussian neurons are able to realize non-linear functions. Therefore, networks of Gaussian units are in principle unrestricted with regard to the functions that they can realize. The drawback of Gaussian neurons is that we have to make sure that their net input does not exceed 1. This adds some difficulty to the learning in Gaussian networks.
b) Sigmoidal Neurons
Sigmoidal neurons accept any vectors of real numbers as input, and they output a real number between 0 and 1.
Sigmoidal neurons are the most common type of artificial neuron, especially in learning networks.
A network of sigmoidal units with m input neurons and n output neurons realizes a network function
f: Rm → (0,1)n
[Figure: sigmoidal activation function fi(neti(t)) plotted against neti(t) from -1 to 1, rising smoothly from 0 toward 1.]
The sigmoid function has the form fi(neti(t)) = 1 / (1 + e^(-(neti(t) - θ)/τ)). The parameter τ controls the slope of the sigmoid function, while the parameter θ controls the horizontal offset of the function in a way similar to the threshold neurons.
In backpropagation networks, we typically choose τ = 1 and θ = 0.
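A minimal sketch of this parameterized sigmoid (the parameter names tau and theta stand for the slope and offset parameters described above):

```python
import math

def sigmoid(net, tau=1.0, theta=0.0):
    """tau controls the slope, theta the horizontal offset."""
    return 1.0 / (1.0 + math.exp(-(net - theta) / tau))

# With the typical backpropagation choice tau = 1, theta = 0:
mid = sigmoid(0.0)   # the curve is centered at net = 0
```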
c) Correlation Learning
Hebbian Learning (1949):
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes place in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Weight modification rule:
Δwi,j = c·xi·xj
Eventually, the connection strength will reflect the correlation between the neurons' outputs.
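The weight modification rule can be sketched in one line (the value of the learning constant c is illustrative):

```python
def hebbian_update(w, x_i, x_j, c=0.1):
    """Delta w_ij = c * x_i * x_j: the connection grows when both units fire."""
    return w + c * x_i * x_j

w = 0.0
for _ in range(10):                 # repeated co-activation of both neurons
    w = hebbian_update(w, 1.0, 1.0)
```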
d) Competitive Learning
Nodes compete for inputs
Node with highest activation is the winner
Winner neuron adapts its tuning (pattern of weights) even further towards the current input
Individual nodes specialize to win competition for a set of similar inputs
Process leads to most efficient neural representation of input space
Typical for unsupervised learning
e) Linear Neurons
Obviously, the fact that threshold units can only output the values 0 and 1 restricts their applicability to certain problems.
We can overcome this limitation by eliminating the threshold and simply turning fi into the identity function, so that we get fi(neti(t)) = neti(t).
With this kind of neuron, we can build feedforward networks with m input neurons and n output neurons that compute a function f: Rm → Rn.
Linear neurons are quite popular and useful for applications such as interpolation.
However, they have a serious limitation: Each neuron computes a linear function, and therefore the overall network function f: Rm → Rn is also linear.
This means that if an input vector x results in an output vector y, then for any factor λ, the input λx will result in the output λy.
Obviously, many interesting functions cannot be realized by networks of linear neurons.
f) Gradient Descent
Gradient descent is a very common technique for finding a minimum of a function (in general, a local one).
It is especially useful for high-dimensional functions. We will use it to iteratively minimize the network's (or neuron's) error by finding the gradient of the error surface in weight-space and adjusting the weights in the opposite direction.
Gradient-descent example: Finding the absolute minimum of a one-dimensional error function f(x):
[Figure: one step of gradient descent on f(x): at the current position x0, the slope f'(x0) is evaluated and the next position is x1 = x0 - η·f'(x0).]
Repeat this iteratively until for some xi, f'(xi) is sufficiently close to 0.
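This iteration can be sketched for a one-dimensional error function (the learning-rate name eta and the example function are illustrative):

```python
def gradient_descent_1d(df, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Repeat x <- x - eta * f'(x) until the slope is close to 0."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if abs(slope) < tol:
            break
        x -= eta * slope           # move against the gradient
    return x

# f(x) = (x - 3)^2 has f'(x) = 2(x - 3) and its minimum at x = 3
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
```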
Gradients of two-dimensional functions:
The two-dimensional function in the left diagram is represented by contour lines in the right diagram, where arrows indicate the gradient of the function at different locations. Obviously, the gradient is always pointing in the direction of the steepest increase of the function. In order to find the function's minimum, we should always move against the gradient.
3-Explain the different layers of a Neural Network?
4-Develop a Perceptron Training Algorithm?
Algorithm Perceptron;
Start with a randomly chosen weight vector w0;
Let k = 1;
while there exist input vectors that are
misclassified by wk-1, do
Let ij be a misclassified input vector;
Let xk = class(ij)·ij, implying that wk-1·xk < 0;
Update the weight vector to wk = wk-1 + η·xk, where η > 0 is the learning rate;
Increment k;
end-while;
For example, for some input i with class(i) = -1,
if w·i > 0, then we have a misclassification.
Then the weight vector needs to be modified to w + Δw
with (w + Δw)·i < w·i to possibly improve classification.
We can choose Δw = -η·i, because
(w + Δw)·i = (w - η·i)·i = w·i - η·(i·i) < w·i,
and i·i is the square of the length of vector i and is thus positive.
If class(i) = 1, things are the same but with opposite signs; we introduce x = class(i)·i to unify these two cases.
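The algorithm above can be sketched as follows (the data and the fixed learning rate are illustrative; the loop mirrors the pseudocode):

```python
def train_perceptron(samples, n, eta=1.0, max_epochs=100):
    """samples: list of (input vector i, class(i)) with class in {-1, +1}."""
    w = [0.0] * n                         # weight vector w0
    for _ in range(max_epochs):
        any_misclassified = False
        for i_vec, cls in samples:
            # misclassified iff class(i) * (w . i) <= 0
            if cls * sum(wi * xi for wi, xi in zip(w, i_vec)) <= 0:
                any_misclassified = True
                w = [wi + eta * cls * xi for wi, xi in zip(w, i_vec)]
        if not any_misclassified:
            return w                      # all samples classified correctly
    return w

# Linearly separable toy data; the constant 1.0 acts as a bias input.
data = [([2.0, 1.0], 1), ([1.5, 1.0], 1), ([-1.0, 1.0], -1), ([-2.5, 1.0], -1)]
w = train_perceptron(data, 2)
```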
5- Develop an Adaline Learning Algorithm?
The Adaline uses gradient descent to determine the weight vector that leads to minimal error.
Error is defined as the MSE between the neuron's net input netj and its desired output dj (= class(ij)) across all training samples ij.
The idea is to pick samples in random order and perform (slow) gradient descent in their individual error functions.
This technique allows incremental learning, i.e., refining of the weights as more training samples are added.
The Adaline uses gradient descent to determine the weight vector that leads to minimal error.
The gradient is then given by (up to a constant factor, which can be absorbed into the learning rate):
∇E(w) = -Σj (dj - netj)·ij
For gradient descent, Δw should be a negative multiple of the gradient:
Δw = η·Σj (dj - netj)·ij, with learning rate η > 0
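An incremental-learning sketch of this rule, using the per-sample update Δw = η(d - net)·x (the data, learning rate, and epoch count are illustrative):

```python
import random

def train_adaline(samples, n, eta=0.05, epochs=200):
    """samples: list of (input vector, desired output d)."""
    w = [0.0] * n
    order = list(samples)
    for _ in range(epochs):
        random.shuffle(order)             # pick samples in random order
        for x, d in order:
            net = sum(wi * xi for wi, xi in zip(w, x))
            # gradient-descent step on this sample's squared error
            w = [wi + eta * (d - net) * xi for wi, xi in zip(w, x)]
    return w

# Desired outputs follow d = 2*x1 - x2 exactly, so the error can reach 0.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = train_adaline(data, 2)
```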
6-Develop a Backpropagation Learning Algorithm?
Similar to the Adaline, the goal of the Backpropagation learning algorithm is to modify the network's weights so that its output vector
op = (op,1, op,2, …, op,K)
is as close as possible to the desired output vector
dp = (dp,1, dp,2, …, dp,K)
for K output neurons and input patterns p = 1, …, P.
The set of input-output pairs (exemplars)
{(xp, dp) | p = 1, …, P} constitutes the training set.
Also similar to the Adaline, we need a cumulative error function that is to be minimized:
We can choose the mean square error (MSE) once again (but the 1/P factor does not matter):
E = (1/P) Σp Ep
where Ep is the error for the p-th pattern, defined below.
For input pattern p, the i-th input layer node holds xp,i. Net input to the j-th node in the hidden layer:
netj(1) = Σi wj,i(1,0)·xp,i
Output of the j-th node in the hidden layer:
op,j = S(netj(1))
Net input to the k-th node in the output layer:
netk(2) = Σj wk,j(2,1)·op,j
Output of the k-th node in the output layer:
op,k = S(netk(2))
Network error for p:
Ep = Σk (dp,k - op,k)²
As E is a function of the network weights, we can use gradient descent to find those weights that result in minimal error.
For individual weights in the hidden and output layers, we should move against the error gradient (omitting index p):
Output layer: derivative easy to calculate.
Hidden layer: derivative difficult to calculate.
When computing the derivative with regard to wk,j(2,1), we can disregard any output units except ok.
Remember that ok is obtained by applying the sigmoid function S to netk(2), which is computed by netk(2) = Σj wk,j(2,1)·oj.
Therefore, we need to apply the chain rule twice:
∂E/∂wk,j(2,1) = (∂E/∂ok)·(∂ok/∂netk(2))·(∂netk(2)/∂wk,j(2,1))
We know that ∂E/∂ok = -2(dk - ok). Since ok = S(netk(2)), we have ∂ok/∂netk(2) = S'(netk(2)), and ∂netk(2)/∂wk,j(2,1) = oj. This gives us (absorbing the constant factor 2 into the learning rate):
∂E/∂wk,j(2,1) = -(dk - ok)·S'(netk(2))·oj
For the derivative with regard to wj,i(1,0), notice that E depends on it through netj(1), which influences each ok with k = 1, …, K. Using the chain rule of derivatives again:
∂E/∂wj,i(1,0) = -Σk (dk - ok)·S'(netk(2))·wk,j(2,1)·S'(netj(1))·xi
This gives us the following weight changes at the output layer:
Δwk,j(2,1) = η·δk·oj, with δk = (dk - ok)·S'(netk(2))
… and at the inner layer:
Δwj,i(1,0) = η·δj·xi, with δj = S'(netj(1))·Σk δk·wk,j(2,1)
As you surely remember from a few minutes ago: S'(x) = S(x)·(1 - S(x)).
Then we can simplify the generalized error terms:
δk = (dk - ok)·ok·(1 - ok)
And:
δj = oj·(1 - oj)·Σk δk·wk,j(2,1)
The simplified error terms δk and δj use variables that are calculated in the feedforward phase of the network and can thus be calculated very efficiently.
Now let us state the final equations again and reintroduce the subscript p for the p-th pattern:
Δwk,j(2,1) = η·δp,k·op,j, with δp,k = (dp,k - op,k)·op,k·(1 - op,k)
Δwj,i(1,0) = η·δp,j·xp,i, with δp,j = op,j·(1 - op,j)·Σk δp,k·wk,j(2,1)
Algorithm Backpropagation;
Start with randomly chosen weights;
while MSE is above desired threshold and computational
bounds are not exceeded, do
for each input pattern xp, 1 ≤ p ≤ P,
Compute hidden node inputs;
Compute hidden node outputs;
Compute inputs to the output nodes;
Compute the network outputs;
Compute the error between output and desired output;
Modify the weights between hidden and output nodes;
Modify the weights between input and hidden nodes;
end-for
end-while.
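A compact sketch of the whole algorithm for one hidden layer (the network sizes, data, random seed, and learning rate are illustrative; the updates use the standard sigmoid-network error terms):

```python
import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_backprop(patterns, n_in, n_hid, n_out, eta=0.5, epochs=2000, seed=0):
    """patterns: list of (input vector x, desired output vector d)."""
    rnd = random.Random(seed)
    W1 = [[rnd.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
    W2 = [[rnd.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, d in patterns:
            # feedforward phase: hidden outputs h, network outputs o
            h = [sigmoid(sum(W1[j][i] * x[i] for i in range(n_in)))
                 for j in range(n_hid)]
            o = [sigmoid(sum(W2[k][j] * h[j] for j in range(n_hid)))
                 for k in range(n_out)]
            # generalized error terms for output and hidden layers
            dk = [(d[k] - o[k]) * o[k] * (1 - o[k]) for k in range(n_out)]
            dj = [h[j] * (1 - h[j]) * sum(dk[k] * W2[k][j]
                  for k in range(n_out)) for j in range(n_hid)]
            for k in range(n_out):            # hidden -> output weights
                for j in range(n_hid):
                    W2[k][j] += eta * dk[k] * h[j]
            for j in range(n_hid):            # input -> hidden weights
                for i in range(n_in):
                    W1[j][i] += eta * dj[j] * x[i]
    return W1, W2

def mse(patterns, W1, W2):
    err = 0.0
    for x, d in patterns:
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
        err += sum((dk - ok) ** 2 for dk, ok in zip(d, o))
    return err / len(patterns)

# Learn logical OR; the constant third input serves as a bias.
pats = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [1])]
W1, W2 = train_backprop(pats, 3, 4, 1)
```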
7-Explain the difference between Internal Representation Issues and External Interpretation Issues?
Internal Representation Issues
As we said before, in all network types, the amplitude of input signals and internal signals is limited:
analog networks: values usually between 0 and 1
binary networks: only values 0 and 1 allowed
bipolar networks: only values -1 and 1 allowed
Without this limitation, patterns with large amplitudes would dominate the network's behavior.
A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small.
External Interpretation Issues
From the perspective of the embedding application, we are concerned with the interpretation of input and output signals.
These signals constitute the interface between the embedding application and its NN component.
Often, these signals only become meaningful when we define an external interpretation for them.
This is analogous to biological neural systems: The same signal takes on a completely different meaning when it is interpreted by different brain areas (motor cortex, visual cortex, etc.).
Without any interpretation, we can only use standard methods to define the difference (or similarity) between signals.
For example, for binary patterns x and y, we could…
… treat them as binary numbers and compute their difference as |x - y|
… treat them as vectors and use the cosine of the angle between them as a measure of similarity
… count the number of digits that we would have to flip in order to transform x into y (Hamming distance)
Example: Two binary patterns x and y:
x = 00010001011111000100011001011001001
y = 10000100001000010000100001000011110
These patterns seem to be very different from each other. However, given their external interpretation…
…x and y actually represent the same thing.
8-Explain the process of data representation?
Most networks process information in the form of input pattern vectors.
These networks produce output pattern vectors that are interpreted by the embedding application.
All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals.
In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value.
[Figure: an analog signal and a discrete signal, each bounded between a minimum and a maximum amplitude.]
The main question is:
How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network?
We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern.
Relevant features are those that enable the network to generate the desired output pattern.
Similarly, we also need to define a set of desired outputs that the network can actually produce.
Often, a "natural" representation of the output data turns out to be impossible for the network to produce.
We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations.
9-Explain the process of Multiclass Discrimination?
Often, our classification problems involve more than two classes. For example, character recognition requires at least 26 different classes. We can perform such tasks using layers of perceptrons or Adalines.
A four-node perceptron for a four-class problem in n-dimensional input space
Each perceptron learns to recognize one particular class, i.e., output 1 if the input is in that class, and 0 otherwise.
The units can be trained separately and in parallel.
In production mode, the network decides that its current input is in the k-th class if and only if ok = 1 and oj = 0 for all j ≠ k; otherwise, the input is considered misclassified.
For units with real-valued output, the neuron with maximal output can be picked to indicate the class of the input.
This maximum should be significantly greater than all other outputs, otherwise the input is misclassified.
10-Explain difference between Supervised and unsupervised learning?
Supervised learning:
An archaeologist determines the gender of a human skeleton based on many past examples of male and female skeletons.
Unsupervised learning:
The archaeologist determines whether a large number of dinosaur skeleton fragments belong to the same species or multiple species. There are no previous data to guide the archaeologist, and no absolute criterion of correctness.
11. Explain different ways of representing the data in the neural network system? 10
12. Explain temporal data representations? Give example. 10
13. Write a note on following: 3+3+3
Adaptive Networks
As you know, there is no equation that would tell you the ideal number of neurons in a multi-layer network.
Ideally, we would like to use the smallest number of neurons that allows the network to do its task sufficiently accurately, because of:
the small number of weights in the system,
fewer training samples being required,
faster training,
typically, better generalization for new test samples.
So far, we have determined the number of hidden-layer units in BPNs by "trial and error."
However, there are algorithmic approaches for adapting the size of a network to a given task.
Some techniques start with a large network and then iteratively prune connections and nodes that contribute little to the network function.
Other methods start with a minimal network and then add connections and nodes until the network reaches a given performance level.
Finally, there are algorithms that combine these "pruning" and "growing" approaches.
Cascade correlation
None of these algorithms are guaranteed to produce "ideal" networks.
(It is not even clear how to define an "ideal" network.)
However, numerous algorithms exist that have been shown to yield good results for most applications.
We will take a look at one such algorithm named "cascade correlation."
It is of the "network growing" type and can be used to build multi-layer networks of adequate size.
However, these networks are not strictly feed-forward in a level-by-level manner.
This learning algorithm is much faster than backpropagation learning, because only one neuron is trained at a time.
On the other hand, its inability to retrain neurons may prevent the cascade correlation network from finding optimal weight patterns for encoding the given function.
Covariance and Correlation
For a dataset (xi, yi) with i = 1, …, n, the covariance is:
Cov(x, y) = (1/n)·Σi (xi - x̄)(yi - ȳ)
Covariance tells us something about the strength and direction (directly vs. inversely proportional) of the linear relationship between x and y.
For many applications, it is useful to normalize this variable so that it ranges from -1 to 1.
The result is the correlation coefficient r, which for a dataset (xi, yi) with i = 1, …, n is given by:
r = Cov(x, y) / (sx·sy)
where sx and sy are the standard deviations of x and y.
In the case of high (close to 1) or low (close to -1) correlation coefficients, we can use one variable as a predictor of the other one.
To quantify the linear relationship between the two variables, we can use linear regression, whose best-fit line has slope r·sy/sx and passes through (x̄, ȳ):
y = ȳ + r·(sy/sx)·(x - x̄)
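The correlation coefficient can be sketched directly from these definitions (population versions, dividing by n):

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = Cov(x, y) / (s_x * s_y), in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)
```

A perfectly proportional dataset gives r close to 1, an inversely proportional one gives r close to -1.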
14. What are the benefits to have smallest number of neurons in the network? 4
15. Develop a cascade correlation algorithm? Why it is used for? What are its advantages? 10
We start with a minimal network consisting of only the input neurons (one of them should be a constant offset = 1) and the output neurons, completely connected as usual.
The output neurons (and later the hidden neurons) typically use output functions that can also produce negative outputs; e.g., we can subtract 0.5 from our sigmoid function for a (-0.5, 0.5) output range.
Then we successively add hidden-layer neurons and train them to reduce the network error step by step:
Weights to each new hidden node are trained to maximize the covariance of the node's output with the current network error.
Covariance:
S(w) = Σk | Σp (yp - ȳ)·(ep,k - ēk) |
w: vector of weights to the new node
yp: output of the new node for the p-th input sample
ep,k: error of the k-th output node for the p-th input sample before the new node is added
ȳ, ēk: averages over the training set
Since we want to maximize S (as opposed to minimizing some error), we use gradient ascent:
Δwi = η·∂S/∂wi = η·Σp,k σk·(ep,k - ēk)·f'(netp)·xp,i
xp,i: i-th input for the p-th pattern
σk: sign of the correlation between the node's output and the k-th network output
η: learning rate
f'(netp): derivative of the node's activation function with respect to its net input, evaluated at the p-th pattern
If we can find weights so that the new node's output perfectly covaries with the error in each output node, we can set the new output node weights and offsets so that the new error is zero.
More realistically, there will be no perfect covariance, which means that we will set each output node weight so that the error is minimized.
To do this, we can use gradient descent or linear regression for each individual output node weight.
The next added hidden node will further reduce the remaining network error, and so on, until we reach a desired error threshold.
This learning algorithm is much faster than backpropagation learning, because only one neuron is trained at a time.
On the other hand, its inability to retrain neurons may prevent the cascade correlation network from finding optimal weight patterns for encoding the given function.
16. What are input space clusters and radial basic functions (RBFs)? 6
To achieve such local "receptive fields," we can use radial basis functions, i.e., functions whose output only depends on the Euclidean distance between the input vector and another ("weight") vector.
A typical choice is a Gaussian function:
φ(D) = e^(-D²/(2c²))
where c determines the "width" of the Gaussian.
However, any radially symmetric, non-increasing function could be used.
17. Explain linear interpolation for one dimensional and multidimensional case? 5
For function approximation, the desired output for new (untrained) inputs could be estimated by linear interpolation.
As a simple example, how do we determine the desired output of a one-dimensional function at a new input x0 that is located between known data points x1 and x2?
f(x0) = f(x1) + (x0 - x1)·(f(x2) - f(x1))/(x2 - x1), which simplifies to:
f(x0) = (f(x1)/D1 + f(x2)/D2) / (1/D1 + 1/D2)
with distances D1 and D2 from x0 to x1 and x2, respectively.
In the multi-dimensional case, hyperplane segments connect neighboring points so that the desired output for a new input x0 is determined by the P0 known samples that surround it:
f(x0) = (Σp f(xp)/Dp) / (Σp 1/Dp), for p = 1, …, P0
where Dp is the Euclidean distance between x0 and xp, and f(xp) is the desired output value for input xp.
Example for f: R2 → R1 (with desired output values indicated):
For the four nearest neighbors, the desired output for x0 is the distance-weighted average of their four desired output values, using the formula above with P0 = 4.
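The distance-weighted interpolation can be sketched as follows (the function names are illustrative):

```python
import math

def interpolate(x0, samples):
    """samples: list of (xp, f(xp)); weights are inverse Euclidean distances."""
    num = den = 0.0
    for xp, fp in samples:
        d = math.dist(x0, xp)
        if d == 0.0:
            return fp                 # x0 coincides with a known sample
        num += fp / d
        den += 1.0 / d
    return num / den

# Halfway between a sample with output 0 and one with output 2 -> output 1
y = interpolate((1.0, 0.0), [((0.0, 0.0), 0.0), ((2.0, 0.0), 2.0)])
```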
18. Explain different types of learning methods? What are counter propagation networks? 10
Unsupervised/Supervised Learning ….
The counterpropagation network (CPN) is a fast-learning combination of unsupervised and supervised learning.
Although this network uses linear neurons, it can learn nonlinear functions by means of a hidden layer of competitive units.
Moreover, the network is able to learn a function and its inverse at the same time.
However, to simplify things, we will only consider the feedforward mechanism of the CPN.
19. Explain the process of learning in radial basic function networks? 5
If we are using such linear interpolation, then our radial basis function (RBF) φ that weights an input vector based on its distance D to a neuron's reference (weight) vector is φ(D) = 1/D.
For the training samples xp, p = 1, …, P0, surrounding the new input x, we find for the network's output o:
o = (Σp f(xp)·φ(Dp)) / (Σp φ(Dp))
(In the following, to keep things simple, we will assume that the network has only one output neuron. However, any number of output neurons could be implemented.)
Since it is difficult to define what "surrounding" should mean, it is common to consider all P training samples and use any monotonically decreasing RBF φ:
o = (Σp f(xp)·φ(Dp)) / (Σp φ(Dp)), now with p = 1, …, P
This, however, implies a network that has as many hidden nodes as there are training samples. This is unacceptable because of its computational complexity and likely poor generalization ability: the network would resemble a look-up table.
It is more useful to have fewer neurons and accept that the training set cannot be learned 100% accurately:
Here, ideally, each reference vector μi of these N neurons should be placed in the center of an input-space cluster of training samples with identical (or at least similar) desired output values.
To learn near-optimal values for the reference vectors and the output weights, we can – as usual – employ gradient descent.
20. Write a note on distance and similarity functions with respect to counterpropagation network? 5
In the hidden layer, the neuron whose weight vector is most similar to the current input vector is the "winner."
There are different ways of defining such maximal similarity, for example:
(1) Maximal cosine similarity (equal to the net input for normalized vectors):
sim(w, x) = (w·x) / (||w||·||x||)
(2) Minimal Euclidean distance:
d(w, x)² = Σi (wi - xi)²
(no square root necessary for determining the winner)
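Both similarity measures can be sketched directly (the helper names are illustrative):

```python
import math

def cosine_winner(x, weight_vectors):
    """(1) Winner = unit whose weight vector has maximal cosine similarity."""
    def cos(a, b):
        dot = sum(ai * bi for ai, bi in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(range(len(weight_vectors)), key=lambda j: cos(x, weight_vectors[j]))

def euclidean_winner(x, weight_vectors):
    """(2) Winner = minimal squared distance; no square root is needed."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(weight_vectors)), key=lambda j: d2(x, weight_vectors[j]))

ws = [[1.0, 0.0], [0.0, 1.0]]
```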
21. Develop a counterpropagation network learning algorithm? 10
A simple CPN with two input neurons, three hidden neurons, and two output neurons can be described as follows:
The CPN learning process (general form for n input units and m output units):
Randomly select a vector pair (x, y) from the training set.
If you use the cosine similarity function, normalize (shrink/expand to "length" 1) the input vector x by dividing every component of x by the magnitude ||x||, where ||x|| = √(x1² + … + xn²).
Initialize the input neurons with the resulting vector and compute the activation of the hidden-layer units according to the chosen similarity measure.
In the hidden (competitive) layer, determine the unit W with the largest activation (the winner).
Adjust the connection weights between W and all N input-layer units according to the formula:
wW,i(new) = wW,i(old) + α·(xi - wW,i(old)), with learning rate 0 < α < 1
Repeat steps 1 to 5 until all training patterns have been processed once.
Repeat step 6 until each input pattern is consistently associated with the same competitive unit.
Select the first vector pair in the training set (the current pattern).
Repeat steps 2 to 4 (normalization, competition) for the current pattern.
Adjust the connection weights between the winning hidden-layer unit W and all M output-layer units according to the equation:
wj,W(new) = wj,W(old) + β·(yj - wj,W(old)), with learning rate 0 < β < 1
Repeat steps 9 and 10 for each vector pair in the training set.
Repeat steps 8 through 11 for several epochs.
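The competitive (hidden-layer) phase of this process can be sketched as follows, assuming the common winner-update rule w_new = w_old + α(x - w_old) with Euclidean-distance competition (the α value and the data are illustrative):

```python
def cpn_competitive_epoch(weights, inputs, alpha=0.5):
    """weights: one weight vector per hidden unit; inputs: training vectors.
    Each input moves only its winner's weight vector toward the input."""
    for x in inputs:
        dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
        win = dists.index(min(dists))                     # competition
        weights[win] = [wi + alpha * (xi - wi)
                        for wi, xi in zip(weights[win], x)]
    return weights

# Two hidden units specialize on two well-separated input clusters.
ws = [[0.0, 0.0], [10.0, 10.0]]
data = [[1.0, 1.0], [9.0, 9.0], [1.0, 0.0], [9.0, 10.0]]
for _ in range(20):
    ws = cpn_competitive_epoch(ws, data)
```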
22. Develop a Quickprop learning algorithm? 10
The assumption underlying Quickprop is that the network error as a function of each individual weight can be approximated by a paraboloid.
Based on this assumption, whenever we find that the gradient for a given weight switched its sign between successive epochs, we should fit a paraboloid through these data points and use its minimum as the next weight value.
Illustration (sorry for the crummy paraboloid):
Newton's method: for the minimum of E we must have ∂E/∂w = 0.
Fitting a parabola through the two most recent measurements of the gradient g = ∂E/∂w and setting its derivative to zero yields the Quickprop weight update:
Δw(t) = Δw(t-1)·g(t) / (g(t-1) - g(t))
Notice that this method cannot be applied if the error gradient has not decreased in magnitude and has not changed its sign at the preceding time step.
In that case, we would ascend in the error function or make an infinitely large weight modification.
In most cases, Quickprop converges several times faster than standard backpropagation learning.
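The core update can be sketched in a few lines, using the parabola-fit formula Δw(t) = Δw(t-1)·g(t)/(g(t-1) - g(t)) (variable names are illustrative):

```python
def quickprop_step(grad_now, grad_prev, dw_prev):
    """Parabola fit through the two latest gradient measurements."""
    return dw_prev * grad_now / (grad_prev - grad_now)

# For the quadratic E(w) = (w - 3)^2, the gradient is g(w) = 2(w - 3).
# Suppose we stepped from w = 0 (g = -6) by +1 to w = 1 (g = -4):
step = quickprop_step(grad_now=-4.0, grad_prev=-6.0, dw_prev=1.0)
# The parabola fit lands exactly on the minimum: w = 1 + 2 = 3.
```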
23. Develop an Rprop learning algorithm? 10
Resilient Backpropagation (Rprop)
The Rprop algorithm takes a very different approach to improving backpropagation as compared to Quickprop.
Instead of making more use of gradient information for better weight updates, Rprop only uses the sign of the gradient, because its size can be a poor and noisy estimator of required weight updates.
Furthermore, Rprop assumes that different weights need different step sizes for updates, which vary throughout the learning process.
The basic idea is that if the error gradient for a given weight wij had the same sign in two consecutive epochs, we increase its step size Δij, because the weight's optimal value may be far away.
If, on the other hand, the sign switched, we decrease the step size.
Weights are always changed by adding or subtracting the current step size, regardless of the absolute value of the gradient.
This way we do not "get stuck" with extreme weights that are hard to change because of the shallow slope in the sigmoid function.
Formally, the step size update rules are:
Δij(t) = η+·Δij(t-1) if the gradient kept its sign (bounded above by Δmax),
Δij(t) = η-·Δij(t-1) if the gradient changed its sign (bounded below by Δmin).
Empirically, best results were obtained with initial step sizes of 0.1, η+ = 1.2, η- = 0.5, Δmax = 50, and Δmin = 10^-6.
Weight updates are then performed as follows: each weight wij is changed by its current step size Δij(t), in the direction opposite to the sign of its error gradient (and not changed at all if the gradient is zero).
It is important to remember that, as in Quickprop, in Rprop the gradient needs to be computed across all samples (per-epoch learning).
The performance of Rprop is comparable to Quickprop; it also considerably accelerates backpropagation learning. Compared to both the standard backpropagation algorithm and Quickprop, Rprop has one advantage:
Rprop does not require the user to estimate or empirically determine a step size parameter and its change over time. Rprop will determine appropriate step size values by itself and can thus be applied "as is" to a variety of problems without significant loss of efficiency.
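The per-weight step-size rules and the sign-based weight update can be sketched as follows (parameter defaults follow the typical values mentioned above):

```python
def rprop_update(grad_now, grad_prev, step,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """Returns (new step size, weight change) for one weight."""
    if grad_now * grad_prev > 0:          # same sign: enlarge the step
        step = min(step * eta_plus, step_max)
    elif grad_now * grad_prev < 0:        # sign flip: we overshot, shrink
        step = max(step * eta_minus, step_min)
    if grad_now > 0:                      # move against the gradient sign
        return step, -step
    if grad_now < 0:
        return step, step
    return step, 0.0
```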
24. What are Maxnets? Give example. 5
A maxnet is a recurrent, one-layer network that uses competition to determine which of its nodes has the greatest initial input value.
All pairs of nodes have inhibitory connections with the same weight -ε, where typically ε ≤ 1/(# nodes).
In addition, each node has a self-excitatory connection to itself, whose weight θ is typically 1.
The nodes update their net input and their output by the following equations:
neti(t+1) = θ·oi(t) - ε·Σj≠i oj(t)
oi(t+1) = max(0, neti(t+1))
All nodes update their output simultaneously.
With each iteration, the neurons' activations will decrease until only one neuron remains active.
This is the "winner" neuron that had the greatest initial input.
Maxnet is a biologically plausible implementation of a maximum-finding function.
In parallel hardware, it can be more efficient than a corresponding serial function.
We can add maxnet connections to the hidden layer of a CPN to find the winner neuron.
Example of a Maxnet with five neurons, θ = 1, and ε = 0.2:
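A sketch of such a Maxnet, assuming the update out(t+1) = max(0, θ·out(t) - ε·(sum of the other outputs)):

```python
def maxnet(activations, epsilon=0.2, theta=1.0, max_iter=100):
    """Iterate until at most one neuron remains active; returns final outputs."""
    out = list(activations)
    for _ in range(max_iter):
        total = sum(out)
        out = [max(0.0, theta * o - epsilon * (total - o)) for o in out]
        if sum(1 for o in out if o > 0) <= 1:
            break
    return out

# Five neurons, theta = 1, epsilon = 0.2: the largest initial input wins.
winner = maxnet([1.0, 0.5, 0.9, 0.7, 0.3])
```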
25. Write a note on Kohonen maps? 5
Self-Organizing Maps (Kohonen Maps)
As you may remember, the counterpropagation network employs a combination of supervised and unsupervised learning. We will now study Self-Organizing Maps (SOMs) as examples for completely unsupervised learning (Kohonen, 1980). This type of artificial neural network is particularly similar to biological systems (as far as we understand them).
In the human cortex, multi-dimensional sensory input spaces (e.g., visual input, tactile input) are represented by two-dimensional maps.
The projection from sensory inputs onto such maps is topology conserving.
This means that neighboring areas in these maps represent neighboring areas in the sensory input space.
For example, neighboring areas in the sensory cortex are responsible for the arm and hand regions.
Such topology-conserving mapping can be achieved by SOMs:
Two layers: input layer and output (map) layer
Input and output layers are completely connected.
Output neurons are interconnected within a defined neighborhood.
A topology (neighborhood relation) is defined on the output layer.
Network structure:
Common output-layer structures:
A neighborhood function φ(i, k) indicates how closely neurons i and k in the output layer are connected to each other. Usually, a Gaussian function of the distance between the two neurons in the layer is used:
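Such a Gaussian neighborhood function on grid positions can be sketched as follows (the width-parameter name sigma is an assumption):

```python
import math

def neighborhood(pos_i, pos_k, sigma=1.0):
    """Gaussian of the squared grid distance between output neurons i and k."""
    d2 = sum((a - b) ** 2 for a, b in zip(pos_i, pos_k))
    return math.exp(-d2 / (2.0 * sigma ** 2))

near = neighborhood((0, 0), (1, 0))
far = neighborhood((0, 0), (3, 0))
```

A neuron is maximally coupled to itself, and the coupling falls off smoothly with grid distance.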
26. Describe Adaptive resonance theory with an example? 10
Adaptive Resonance Theory (ART) networks perform completely unsupervised learning.
Their competitive learning algorithm is similar to the first (unsupervised) phase of CPN learning.
However, ART networks are able to grow additional neurons if a new input cannot be categorized appropriately with the existing neurons.
A vigilance parameter ρ determines the tolerance of this matching process.
A greater value of ρ leads to more, smaller clusters (= input samples associated with the same winner neuron).
ART networks consist of an input layer and an output layer.
We will only discuss ART-1 networks, which receive binary input vectors.
Bottom-up weights are used to determine output-layer candidates that may best match the current input.
Top-down weights represent the "prototype" for the cluster defined by each output neuron.
A close match between input and prototype is necessary for categorizing the input.
Finding this match can require multiple signal exchanges between the two layers in both directions until "resonance" is established or a new neuron is added.
ART networks tackle the stability-plasticity dilemma:
Plasticity: They can always adapt to unknown inputs (by creating a new cluster with a new weight vector) if the given input cannot be classified by existing clusters.
Stability: Existing clusters are not deleted by the introduction of new inputs (new clusters will just be created in addition to the old ones).
Problem: Clusters are of fixed size, depending on ρ.
A. Initialize each top-down weight tl,j(0) = 1;
B. Initialize each bottom-up weight bj,l(0) = 1/(1 + n);
C. While the network has not stabilized, do
1. Present a randomly chosen pattern x = (x1, …, xn) for learning;
2. Let the active set A contain all nodes; calculate
yj = bj,1·x1 + … + bj,n·xn for each node j ∈ A;
3. Repeat
Let j* be a node in A with largest yj, with ties being broken arbitrarily;
Compute s* = (s*1, …, s*n), where s*l = tl,j*·xl;
Compare the similarity between s* and x with the given vigilance parameter ρ:
similarity = (Σl s*l) / (Σl xl)
if similarity < ρ then remove j* from set A
else associate x with node j* and update the weights:
bj*,l(new) = s*l / (0.5 + Σi s*i); tl,j*(new) = s*l
Until A is empty or x has been associated with some node j*
4. If A is empty, then create a new node whose weight vector coincides with the current input pattern x;
end-while