Another interesting fact about modern AIs, including LLMs like DeepSeek, is that the main internal functional block is the Perceptron, a simulated neuron invented back in the 1950s (originally implemented using analogue electronics). It performs a weighted sum of its inputs followed by a non-linear transfer function which squashes the output into the range 0.0 to 1.0.
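To make that concrete, here is a minimal sketch (my own, not something DeepSeek produced, and with arbitrary made-up weights chosen purely for illustration) of a single three-input Perceptron in BBC BASIC: a weighted sum of the inputs plus a bias, pushed through a sigmoid transfer function so the result lands between 0.0 and 1.0:
Code:
10 REM Illustrative example: one Perceptron with three inputs (weights are arbitrary)
20 DIM w(3) : REM one weight per input
30 w(1) = 0.4 : w(2) = -0.7 : w(3) = 0.2 : bias = 0.1
40 x1 = 1 : x2 = 0 : x3 = 1
50 sum = x1*w(1) + x2*w(2) + x3*w(3) + bias
60 output = 1 / (1 + EXP(-sum)) : REM sigmoid keeps the result between 0 and 1
70 PRINT "Perceptron output = "; output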
Although Perceptrons are now digital, and modern AIs contain huge numbers of them, the basic concept hasn't changed. Apart from sheer scale, the only major ways in which a modern AI differs from the original Perceptron are that Multi-Layer Perceptrons are now used (the original single-layer Perceptron could only solve linearly-separable problems) and that an extra stage called Attention is added to the process.
The complete AI works by alternately applying Attention and Perceptron steps in the hope that the result will iteratively converge to the required answer (in the case of an LLM that will be a single token or word, which is then output). By some 'magic' the scaling-up process imbues the AI with more capabilities than you might expect from its construction.
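For flavour, here is a rough sketch (again my own, with made-up numbers purely for illustration) of what a single Attention step amounts to: one 'query' is compared with each 'key' by a dot product, the scores are turned into weights with a softmax, and those weights blend the corresponding 'values' together. Real LLMs do this with enormous vectors and many heads, but the arithmetic is the same in miniature:
Code:
10 REM Illustrative sketch of one Attention step: 1 query, 3 keys/values of length 2
20 DIM q(2), k(3,2), v(3,2), score(3), out(2)
30 q(1) = 0.5 : q(2) = -0.1
40 k(1,1) = 0.2 : k(1,2) = 0.7 : v(1,1) = 1.0 : v(1,2) = 0.0
50 k(2,1) = -0.3 : k(2,2) = 0.1 : v(2,1) = 0.0 : v(2,2) = 1.0
60 k(3,1) = 0.9 : k(3,2) = 0.4 : v(3,1) = 0.5 : v(3,2) = 0.5
70 REM Dot-product scores, scaled by the square root of the key length
80 total = 0
90 FOR i% = 1 TO 3
100 score(i%) = (q(1)*k(i%,1) + q(2)*k(i%,2)) / SQR(2)
110 score(i%) = EXP(score(i%))
120 total = total + score(i%)
130 NEXT
140 REM Softmax weights, then a weighted sum of the values
150 FOR i% = 1 TO 3
160 w = score(i%) / total
170 out(1) = out(1) + w * v(i%,1)
180 out(2) = out(2) + w * v(i%,2)
190 NEXT
200 PRINT "Attention output: "; out(1); " "; out(2)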
The simplest possible Multi-Layer Perceptron is one with two inputs and two layers. Famously, this can learn to carry out an exclusive-or operation on its inputs, a problem which is not linearly-separable and cannot be solved by a single-layer Perceptron. I asked DeepSeek to write BBC BASIC code for a two-input, two-layer Perceptron and this is what it produced. It worked first time:
Code:
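5 REM XOR truth table: two inputs and the target (exclusive-or) output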
10 DATA 0,0,0
20 DATA 0,1,1
30 DATA 1,0,1
40 DATA 1,1,0
50 DIM inputs(4,2), targets(4)
60 FOR i% = 1 TO 4
70 READ inputs(i%,1), inputs(i%,2), targets(i%)
80 NEXT
90 REM Initialize weights
100 DIM W1(3,2), W2(3,1)
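105 REM Rows 1-2 of each weight array hold the input weights, row 3 holds the bias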
110 FOR i% = 1 TO 3
120 FOR j% = 1 TO 2
130 W1(i%,j%) = RND(1) - 0.5
140 NEXT
150 NEXT
160 FOR i% = 1 TO 3
170 W2(i%,1) = RND(1) - 0.5
180 NEXT
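185 REM eta is the learning rate, epochs% is the number of training passes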
190 eta = 0.5
200 epochs% = 10000
210 FOR epoch% = 1 TO epochs%
220 FOR ex% = 1 TO 4
230 x1 = inputs(ex%,1)
240 x2 = inputs(ex%,2)
250 t = targets(ex%)
260 REM Forward pass to hidden layer
270 h1_sum = x1*W1(1,1) + x2*W1(2,1) + W1(3,1)
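275 REM Sigmoid activation 1/(1+EXP(-x)) squashes each sum into the range 0 to 1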
280 h1_act = 1 / (1 + EXP(-h1_sum))
290 h2_sum = x1*W1(1,2) + x2*W1(2,2) + W1(3,2)
300 h2_act = 1 / (1 + EXP(-h2_sum))
310 REM Forward pass to output
320 o_sum = h1_act*W2(1,1) + h2_act*W2(2,1) + W2(3,1)
330 o_act = 1 / (1 + EXP(-o_sum))
340 REM Compute output delta
350 error = t - o_act
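355 REM Delta = error times the sigmoid derivative o_act*(1-o_act)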
360 delta_output = error * o_act * (1 - o_act)
370 REM Compute hidden deltas
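375 REM Output delta fed back through W2, scaled by each hidden unit's sigmoid derivative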
380 delta_h1 = delta_output * W2(1,1) * h1_act * (1 - h1_act)
390 delta_h2 = delta_output * W2(2,1) * h2_act * (1 - h2_act)
400 REM Update output weights
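405 REM Gradient-descent step: weight = weight + eta * delta * input to that weight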
410 W2(1,1) = W2(1,1) + eta * delta_output * h1_act
420 W2(2,1) = W2(2,1) + eta * delta_output * h2_act
430 W2(3,1) = W2(3,1) + eta * delta_output
440 REM Update hidden weights (h1)
450 W1(1,1) = W1(1,1) + eta * delta_h1 * x1
460 W1(2,1) = W1(2,1) + eta * delta_h1 * x2
470 W1(3,1) = W1(3,1) + eta * delta_h1
480 REM Update hidden weights (h2)
490 W1(1,2) = W1(1,2) + eta * delta_h2 * x1
500 W1(2,2) = W1(2,2) + eta * delta_h2 * x2
510 W1(3,2) = W1(3,2) + eta * delta_h2
520 NEXT ex%
530 NEXT epoch%
540 REM Test the network
550 PRINT "Trained results:"
560 FOR ex% = 1 TO 4
570 x1 = inputs(ex%,1)
580 x2 = inputs(ex%,2)
590 h1_sum = x1*W1(1,1) + x2*W1(2,1) + W1(3,1)
600 h1_act = 1 / (1 + EXP(-h1_sum))
610 h2_sum = x1*W1(1,2) + x2*W1(2,2) + W1(3,2)
620 h2_act = 1 / (1 + EXP(-h2_sum))
630 o_sum = h1_act*W2(1,1) + h2_act*W2(2,1) + W2(3,1)
640 o_act = 1 / (1 + EXP(-o_sum))
650 PRINT "Input ";x1;" ";x2;" Output ";o_act
660 NEXT
670 END
This Perceptron has nine weights: each of the two hidden neurons takes two inputs plus a bias (six values in the first layer), and the output neuron takes the two hidden activations plus a bias (three more in the second layer). These are iteratively adjusted over 10,000 training passes to closely approximate the optimum solution for the exclusive-or problem.
I also asked DeepSeek to code a two-layer Perceptron with more than two inputs, but that seemed to give it indigestion!