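Below are some alternatives that work correctly on UTF-8 strings, including plain 7-bit ASCII. ANSI text containing bytes above &7F is not valid UTF-8 and will be miscounted.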
Here are replacements for INSTR, MID$, LEFT$, RIGHT$ and LEN, plus a helper function that returns the number of bytes occupied by a given number of characters.
These should work on all current BBC BASIC versions. They are slower than the original keywords, of course, so you would not normally use them unless UTF-8 strings were involved.
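A small illustration of the problem the replacements address, as a sketch only. The string is built from explicit byte values so that the result does not depend on the editor's encoding; the variable names are just for this example.

```bbcbasic
REM Illustration only: "café" encoded as UTF-8, built from explicit byte values
a$ = "caf" + CHR$(&C3) + CHR$(&A9)
PRINT LEN(a$) : REM 5, the byte count, although there are only 4 characters
b$ = LEFT$(a$, 4) : REM cuts the two-byte character in half
PRINT ~ASC(RIGHT$(b$, 1)) : REM C3, a lead byte left without its continuation byte
```

The native keywords count bytes, so any string containing multi-byte characters needs the character-aware versions.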
REM Library functions (corrected to have explicit integer parameters)
DEF FN_uinstr(a$,b$,st%) :REM st% is the start character number (not optional). Set to 0 or 1 to search from the start.
LOCAL S%, T%
S%=FN_ucount(a$,0,st%-1)+1 :REM start position converted to bytes
T%=INSTR(a$,b$,S%) :REM get match result in bytes
IF T% THEN a$=LEFT$(a$,T%) : =FN_ulen(a$) :REM slice off in bytes and measure its length in chars
=0
DEF FN_umid(a$,st%,num%)
REM UTF-8 MID$ replacement
LOCAL S%, N%
S%=FN_ucount(a$,0,st%-1)+1
N%=FN_ucount(a$,S%-1,num%)-S%+1
=MID$(a$,S%,N%)
DEF FN_uleft(a$,num%)
REM UTF-8 LEFT$ replacement
=LEFT$(a$,FN_ucount(a$,0,num%))
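DEF FN_uright(a$,num%) :REM Searching from the right is fastest with small selections.
LOCAL A%, L%, I%, J%
IF num%<=0 OR a$="" THEN ="" :REM Nothing to return; also stops the loop reading before the start of the string
A%=!^a$
FOR I%=LENa$-1 TO 0 STEP -1
J%=A%?I%
IF (J% AND &C0) <> &80 THEN L%+=1 :REM Anything not a continuation byte is the start of a character.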
IF L%=num% EXIT FOR
NEXT
=RIGHT$(a$,LENa$-I%)
DEF FN_ulen(a$)
REM Finds Length in characters of UTF-8 string.
LOCAL A%, L%, I%, J%
A%=!^a$
WHILE I%<LENa$
J%=A%?I%
CASE TRUE OF
REM Character start bytes
WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
WHEN (J% AND &80) = 0 : L%+=1 : I%+=1
WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
OTHERWISE I%+=1 : REM Invalid byte (&F8 to &FF): skip it so the loop cannot hang
ENDCASE
ENDWHILE
=L%
DEF FN_ucount(a$, I%, nchars%)
REM I% is the byte offset to start counting from. Returns the byte offset reached after advancing nchars% characters from there.
LOCAL A%, L%, J%
A%=!^a$ :REM Address of start of string
WHILE L%<=nchars%-1 AND I%<LENa$
J%=A%?I% :REM Get next byte
CASE TRUE OF
REM Compare byte with UTF-8 start bytes. The WHEN clauses are ordered with the most likely match first.
WHEN (J% AND &E0) = &C0 : L%+=1 : I%+=2
WHEN (J% AND &80) = 0 : L%+=1 : I%+=1
WHEN (J% AND &F0) = &E0 : L%+=1 : I%+=3
WHEN (J% AND &F8) = &F0 : L%+=1 : I%+=4
WHEN (J% AND &C0) = &80 : I%+=1 : REM Continuation byte. Should never execute!
OTHERWISE I%+=1 : REM Invalid byte (&F8 to &FF): skip it so the loop cannot hang
ENDCASE
ENDWHILE
=I% :REM bytes used for nchars
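A quick check of the functions above, as a minimal sketch rather than a full test. It assumes the library definitions are placed after this program's END (or otherwise made available to it), and it builds the test string "café €5" from explicit byte values so the result does not depend on the editor's encoding. The variable names euro$ and a$ are just for this example.

```bbcbasic
REM Usage sketch: assumes the FN_u... definitions follow this program's END
REM Test string "café €5" as UTF-8: 10 bytes, 7 characters
euro$ = CHR$(&E2) + CHR$(&82) + CHR$(&AC)
a$ = "caf" + CHR$(&C3) + CHR$(&A9) + " " + euro$ + "5"
PRINT LEN(a$)                 : REM 10 (bytes)
PRINT FN_ulen(a$)             : REM 7 (characters)
PRINT FN_uinstr(a$, euro$, 1) : REM 6 (character position of the euro sign)
PRINT LEN(FN_uleft(a$, 4))    : REM 5 (bytes in "café")
PRINT LEN(FN_umid(a$, 6, 1))  : REM 3 (bytes in the euro sign)
PRINT LEN(FN_uright(a$, 2))   : REM 4 (bytes in "€5")
END
```

The results are checked with LEN rather than printed directly, so the output is readable even where the display is not set up for UTF-8.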