Array scan challenge

This started out as a series of Advent challenges a couple of Christmases ago, but is now open for anyone to post challenges on any topic!
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Sun 01 Feb 2026, 18:03 Brandy BASIC is also written in C and is much faster than mine, showing that there should be room for improvement.
I'm pretty sure that Acorn versions of BBC BASIC, and possibly Brandy too, optimise the 'whole array' operations (like SUM(), which both of the two highest-scoring submissions use) more aggressively than my versions do. One reason for this, I suspect, is that the 'original' BBC BASIC v5 had only two numeric data types - 32-bit integer and 40-bit float - which meant it was perfectly sensible to code the functions independently for each type.

But my 'modern' BASICs have more numeric types, including 8-bit unsigned integers, 64-bit signed integers, and 64-bit and 80-bit floats. As a result, I chose to implement SUM(), MOD() etc. more generically: they share the same code between the different data types and call the same general-purpose, type-agnostic addition and multiplication routines at their heart.

But whilst this made them easier (and safer) to code, they are significantly slower than would be the case if each data type had its own dedicated routines.
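
By way of illustration only (none of the names below come from the interpreter source, and the layout is greatly simplified), the shared approach looks roughly like this: every element pays for a type-tag test and a call into one generic routine, whereas a dedicated per-type loop would not.

Code: Select all

/* Hypothetical illustration of the shared, type-agnostic approach;
   none of these names come from the interpreter source. */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    int tag ;                          /* 0 = 64-bit integer, 1 = double */
    union { int64_t i ; double f ; } v ;
} value ;

/* One general-purpose addition shared by SUM() and anything else that
   needs to add two numeric values, whatever their types. */
static value add_values (value a, value b)
{
    value r ;
    if (a.tag || b.tag)                /* either operand is a float */
        {
            r.tag = 1 ;
            r.v.f = (a.tag ? a.v.f : (double) a.v.i)
                  + (b.tag ? b.v.f : (double) b.v.i) ;
        }
    else                               /* both operands are integers */
        {
            r.tag = 0 ;
            r.v.i = a.v.i + b.v.i ;
        }
    return r ;
}

/* SUM() built on the generic routine: correct for every numeric type,
   but every element pays for a call and two tag tests. */
static value sum_generic (const value *a, size_t n)
{
    value total = { 0 } ;
    while (n--)
        total = add_values (total, *a++) ;
    return total ;
}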
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Sun 01 Feb 2026, 18:23 As a result I chose to implement SUM(), MOD() etc. more generically.
Here's the C code for SUM():

Code: Select all

	v.i.t = 0 ;				// zero the accumulator
	v.i.n = 0 ;
	type &= ~BIT6 ;
	while (count--)
	    {
		v = math (v, '+', loadn (ptr, type)) ; // n.b. type can be 40
		ptr += type & 15 ;		// step to the next array element
	    }
If this is running significantly more slowly than the assembly-language version, which could easily be checked in isolation, then either loadn() or math() - more probably the latter - is likely to be responsible. One change that could be tried is in-lining the addition routine by copying it from the math() function.
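
One hypothetical shape such in-lining could take is sketched below. It is not the interpreter's actual code: VAR stands in for whatever the accumulator's type is really called, the idea that a type tag of zero marks a 64-bit integer is an assumption of mine, and GCC's __builtin_add_overflow built-in is used to fall back to math() whenever the integer sum would overflow.

Code: Select all

	/* Sketch only: VAR stands in for the accumulator's real type, and
	   the assumption that a type tag of zero means a 64-bit integer is
	   mine, not taken from the source. */
	v.i.t = 0 ;
	v.i.n = 0 ;
	type &= ~BIT6 ;
	while (count--)
	    {
		VAR x = loadn (ptr, type) ;	// n.b. type can be 40
		long long s ;
		if ((v.i.t == 0) && (x.i.t == 0) &&
		    !__builtin_add_overflow (v.i.n, x.i.n, &s))
			v.i.n = s ;		// fast path: in-line 64-bit add
		else
			v = math (v, '+', x) ;	// overflow or float: generic routine
		ptr += type & 15 ;
	    }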
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Sun 01 Feb 2026, 18:46 If this is running significantly more slowly than the assembly-language version, which could easily be checked in isolation...
Performing just such a test here, it seems that the SUM() function is indeed about twice as slow as the assembler version when running on the 32-bit coded-in-C interpreter, and about three times as slow when running on the 64-bit coded-in-C interpreter.

Whilst compiler efficiency can probably be blamed for the performance of the 32-bit C version, I do not understand why the 64-bit C version (which after all uses exactly the same source code) is even slower. The ability to use 64-bit registers ought, I would have thought, to make it faster than the 32-bit version, not slower.

I know that there can be issues of cache efficiency because 64-bit x86 code is bigger than the equivalent 32-bit code, potentially causing more cache misses, but that seems unlikely to be sufficient for a 50% slowdown.

Would any C experts out there care to speculate on what's going on?
Phil_Thatcham
Posts: 9
Joined: Tue 03 Apr 2018, 11:59

Re: Array scan challenge

Post by Phil_Thatcham »

Thank you, Richard. I appreciate the thought you've put into this.

pt
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Mon 02 Feb 2026, 01:04 Whilst compiler efficiency can probably be blamed for the performance of the 32-bit C version, I do not understand why the 64-bit C version (which after all uses exactly the same source code) is even slower.
Here's the (32-bit) assembly language version of the loop used for SUM():

Code: Select all

sum2:   push    eax
        push    ebp
        push    edi
        push    ebx
        push    ecx
        push    edx
        call    loadn                   ;get value to bx:ecx:edx
        pop     edi
        pop     eax
        mov     [esp+2],bx
        pop     ebx
        rol     ebx,16
        mov     ebp,'+' & 0FH
        call    op                      ;accumulate
        pop     edi
        pop     ebp
        mov     eax,[esp]
        and     eax,byte 1111B
        add     ebp,eax
        pop     eax
        dec     edi
        jnz     sum2
and here's the equivalent loop as compiled by GCC for x86-64:

Code: Select all

.L657:
	movzx	r8d, r8b
	mov	rcx, rsi
	call	loadn
	movdqa	xmm1, XMMWORD PTR -1008[rbp]
	movdqa	xmm2, XMMWORD PTR -880[rbp]
	lea	rdx, -1120[rbp]
	mov	r8d, 43
	lea	rcx, -1072[rbp]
	lea	r9, -1136[rbp]
	movaps	XMMWORD PTR -1120[rbp], xmm1
	movaps	XMMWORD PTR -1136[rbp], xmm2
	call	math
	movzx	r8d, BYTE PTR -1039[rbp]
	movdqu	xmm3, XMMWORD PTR -1072[rbp]
	mov	rdx, r8
	movaps	XMMWORD PTR -1008[rbp], xmm3
	and	edx, 31
	add	rdx, QWORD PTR -1016[rbp]
	mov	QWORD PTR -1016[rbp], rdx
	sub	ebx, 1
	jne	.L657
They contain roughly the same number of instructions, but the compiled C version makes greater use of the stack than the assembler version.

An apparent limitation of the compiler here is that it is passing structure parameters to the math() function in memory - in practice the stack - despite them being small enough (16 bytes) to be passed in registers.

According to DeepSeek AI, this is likely because the structure (actually a union) can contain a long double, which is apparently mandated by the ABI to be passed in memory rather than in registers. This is not a limitation I have in the hand-crafted assembler code, of course.
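
That claim can be checked in isolation with a small stand-alone test like the one below (the union is purely illustrative, not the interpreter's actual layout). Compiling it with gcc -O2 -S shows the caller copying the 16-byte union to the stack for the by-value call, whereas passing a pointer avoids the copy.

Code: Select all

/* Illustrative only - this is NOT the interpreter's real layout.  A
   16-byte union containing a long double member is passed in memory
   under the x86-64 calling conventions, so the caller must copy it to
   the stack; passing a pointer avoids that copy. */
#include <stdio.h>

typedef union
{
    struct { long long n ; short t ; } i ;
    long double f ;
} arg ;

__attribute__((noinline)) long long by_value (arg a)
{
    return a.i.n ;                 /* caller had to copy 'a' to memory */
}

__attribute__((noinline)) long long by_pointer (const arg *a)
{
    return a->i.n ;                /* only an address is passed */
}

int main (int argc, char **argv)
{
    arg a = { .i = { argc, 0 } } ;
    printf ("%lld %lld\n", by_value (a), by_pointer (&a)) ;
    return 0 ;
}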

Passing parameters in memory rather than registers also adds the overhead that they must be copied first, otherwise the called subroutine could modify them (in the code above I assume that's what the copying of the parameters via xmm1 and xmm2 is all about).

So there's probably nothing much I can do about this specific code. It's not clear whether it is significantly contributing to the poor performance.
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Mon 02 Feb 2026, 15:57 So there's probably nothing much I can do about this specific code. It's not clear whether it is significantly contributing to the poor performance.
There's no doubt that the specific issue at hand (summing a 'byte' array) could be accelerated, because it's safe to assume that you can add all the elements in such an array and not overflow a 64-bit integer accumulator. All it would need is a dedicated routine for that data type.

But I think that's the only case in which SUM() is guaranteed never to need to promote the sum to a float because of integer overflow, so the more complicated (and slower) existing routine would most straightforwardly be retained for all the other cases.
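
A dedicated routine for that case could be as simple as the sketch below (the names are mine, not the interpreter's): even a multi-gigabyte array of values in the range 0 to 255 comes nowhere near overflowing a signed 64-bit accumulator, so the loop needs neither type dispatch nor promotion to a float.

Code: Select all

/* Hypothetical dedicated SUM() loop for 8-bit unsigned ('byte') arrays;
   the names are illustrative, not taken from the interpreter source. */
#include <stddef.h>
#include <stdint.h>

int64_t sum_bytes (const uint8_t *ptr, size_t count)
{
    int64_t total = 0 ;
    while (count--)
        total += *ptr++ ;          /* plain integer add, nothing else */
    return total ;
}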

Whether it would be worth singling out the byte array for attention is doubtful. What do people think?
JeremyNicoll
Posts: 86
Joined: Sun 26 Jul 2020, 22:22
Location: Edinburgh

Re: Array scan challenge

Post by JeremyNicoll »

Nigel Brumpton posted elsewhere:

Code: Select all

dim x&(999999)
x&(654321) = 1

A%=^x&(0)
A%?1000000=13

time=0

for T%=1 to 10000
  R%=instr($A%,chr$(1))-1
next

printR%,time/100;" msecs"
(Yes, I know this code only looks for the value 1.)

Does the A%?1000000=13 not put CR into memory past the end of the array? That's the sort of thing that crashes programs...

I think the code should examine x&(999999) first to see if it is the non-zero item - job done very fast if it is! If not, the CR should be put into x&(999999).
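
This is the classic 'sentinel' search: guarantee that the value being sought is present at the end of the array, so the inner loop needs no separate bounds test. A rough C sketch of the general idea (purely illustrative, nothing to do with BBC BASIC's internals or with INSTR itself):

Code: Select all

/* Sentinel search sketch - illustrative C, not interpreter code.
   Returns the index of the first element equal to 'target', or -1. */
#include <stddef.h>
#include <stdint.h>

ptrdiff_t find_byte (uint8_t *a, size_t n, uint8_t target)
{
    if (n == 0) return -1 ;
    uint8_t last = a[n - 1] ;          /* save the real last element */
    a[n - 1] = target ;                /* plant the sentinel */
    size_t i = 0 ;
    while (a[i] != target) i++ ;       /* no per-step bounds test needed */
    a[n - 1] = last ;                  /* restore the original value */
    if (i < n - 1) return (ptrdiff_t) i ;
    return (last == target) ? (ptrdiff_t)(n - 1) : -1 ;
}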


Also I think the timing arithmetic is out by a factor of 10. I ran the code here with a slight change to save TIME after the end of the FOR loop. I found:

10,000 searches took 1125 cs, i.e. 11.25 secs (which matched wall-clock time), or 11,250 ms,
so 10 searches took 11.25 ms
and 1 search took 1.125 ms.

The program reports the time for 10 searches, not 1.
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

JeremyNicoll wrote: Mon 02 Feb 2026, 19:02 Does the A%?1000000=13 not put CR into memory past the end of the array? That's the sort of thing that crashes programs...
Indeed it does, and it's pretty much guaranteed to result in a crash sooner or later because there's no 'safety gap' between the end of the array and the next entry on the heap.

Also, as noted elsewhere, the coded-in-C versions of BBC BASIC limit a 'fixed string' to a maximum length of 65,535 bytes, so this approach won't work at all on those versions.

I needed to set a limit in the 64-bit versions because in principle the string in memory could be longer than the 32-bit length field in the string descriptor can represent. But I can't remember now why I chose that value, other than it being the maximum string length in Brandy BASIC.
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Mon 02 Feb 2026, 21:24 I needed to set a limit in 64-bit versions because in principle the memory containing the string could be larger than the 32-bit length field in the string descriptor.
Additionally, the source string in memory - even if quite short - might not be located within the 32-bit-addressable region, in which case it's necessary for BBC BASIC to copy the string into the heap first (it will do this transparently when required). Since the copy it makes won't be freed (other than on a CLEAR or RUN), it's desirable to restrict the maximum length of the string.

But 64 Kbytes was still an arbitrary limit, I think, and could be relaxed somewhat without causing major issues. The worst likely consequence is that the program will run out of heap memory, reporting a 'No room' error, without it being obvious why. In the specific case under consideration, the source string is within the heap and no copy is necessary.
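
Purely as an illustration of the kind of decision involved (this is not the interpreter's actual descriptor or heap code, and every name below is hypothetical):

Code: Select all

/* Hypothetical sketch - not the interpreter's real code.  It illustrates
   the two constraints described above: the fixed-string length limit,
   and the need to copy a string into the heap when it is not wholly
   within the 32-bit-addressable region. */
#include <stdint.h>

#define MAX_FIXED_LEN 65535u             /* the current limit under discussion */

typedef enum { USE_IN_PLACE, NEEDS_HEAP_COPY, TOO_LONG } string_action ;

string_action classify_string (const void *src, uint64_t len)
{
    if (len > MAX_FIXED_LEN)
        return TOO_LONG ;                 /* exceeds the fixed-string limit */
    if ((uintptr_t) src + len > 0xFFFFFFFFull)
        return NEEDS_HEAP_COPY ;          /* copy lives until CLEAR or RUN */
    return USE_IN_PLACE ;                 /* a 32-bit descriptor can reach it */
}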
Richard Russell
Posts: 582
Joined: Tue 18 Jun 2024, 09:32

Re: Array scan challenge

Post by Richard Russell »

Richard Russell wrote: Sun 01 Feb 2026, 18:23 But whilst this made them easier (and safer) to code they are significantly slower than would be the case if each data type had its own dedicated routines.
I have experimentally copied the 'addition' code from the generic math() function into the SUM() routine, to see if it's the overhead of copying the parameters into - and the result out of - the called function which is responsible for the poor performance compared with the assembler code.

And it is! Here are the revised results of the challenge when run on the 64-bit version of BBCSDL (on Windows) after the modification:

challenge.png

You can see that the rankings are now similar to what they were when run in the 32-bit version. The question, of course, is whether this kind of 'micro optimisation' is sensible and whether I should update the release versions. What do you think?