Tokeniser
by JGH, June 2006
BBC BASIC programs are tokenised, that is, BASIC keywords are stored as one-byte values. This results in programs which execute faster and are more compact.
A tokenised line can easily be detokenised, or expanded, as there is a one-to-one mapping between token values and the expanded string. For example, code similar to the following would expand a tokenised line:
quote%=FALSE
REPEAT
  IF (?addr%<128 AND ?addr%>31) OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
  IF ?addr%=34 quote%=NOT quote%
  addr%=addr%+1
UNTIL ?addr%=13
Tokenising, however, is more fiddly. Keywords can be abbreviated on entry, and character sequences are only tokenised in certain parts of the line. For instance, in the following line:
ON NOON GOTO 1,2
the first 'ON' is the token ON, but the second 'ON' is part of the variable 'NOON'. The second 'ON' must be left untokenised.
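The rule at work in this example can be illustrated with a toy scanner. This is a simplified sketch written for Windows BASIC, not the algorithm BASIC itself uses: the function name FNtoy is hypothetical, it knows only the keyword ON, it treats only upper-case letters as part of a name, and it marks the keyword as <ON> rather than emitting a real token byte.

```bbcbasic
REM Toy scanner: recognise ON only at the start of a word
PRINT FNtoy("ON NOON GOTO 1,2")
END
:
DEF FNtoy(A$)
LOCAL out$,c$,i%,inword%
i%=1
WHILE i%<=LEN(A$)
  c$=MID$(A$,i%,1)
  IF NOT inword% AND MID$(A$,i%,2)="ON" THEN
    REM At the start of a word, so treat it as the keyword
    out$+="<ON>":i%+=2
  ELSE
    REM Copy the character; a letter means we are now inside a name
    out$+=c$:i%+=1
    inword%=(c$>="A" AND c$<="Z")
  ENDIF
ENDWHILE
=out$
```

Only the first ON is marked, because the second one is reached while the scanner is already inside the name NOON.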
The EVAL function tokenises the supplied string and evaluates it as an expression. Usefully, the tokenised string can be retrieved from where BASIC has stored it.
In Windows BASIC:
B%=EVAL("0:"+A$)
token$=$(!332+2)
This code may fail if an event interrupt (e.g. ON TIME) occurs between the two statements. To avoid this, use the following alternative, which (in BBC BASIC for Windows version 6 only) does not allow an intervening interrupt:
IF EVAL("1:"+A$) token$=$(!332+2)
The input and output share the same memory buffer, which is OK so long as the tokenising process shortens the code (which is almost always the case) but can cause a crash if it lengthens the code. That can happen only in exceptional circumstances such as the following code:
ON A% GOTO 10,20,30,40,50
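The tokenising process encodes each line number in a special internal format, so the overall length increases from 25 to 31 bytes. To reduce the chance of this causing a crash, the string can be prefixed with a long keyword, which shrinks when tokenised and so leaves some slack in the buffer (note the offset of 3 rather than 2, to skip the tokenised prefix):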
IF EVAL("1RECTANGLE:"+A$) token$=$(!332+3)
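The growth from 25 to 31 bytes in the ON GOTO example can be checked by counting. The sketch below assumes that each keyword is stored as one byte, as described at the start of this article, and that each encoded line number occupies four bytes. The four-byte figure is an assumption, chosen because it reproduces the 31-byte total; the variable names are arbitrary.

```bbcbasic
REM Byte count for the ON GOTO example
REM Assumes one byte per keyword and four bytes per encoded line number
text$="ON A% GOTO 10,20,30,40,50"
before%=LEN(text$)        :REM 25 characters of plain text
after%=before%
after%-=LEN("ON")-1       :REM ON shrinks from 2 bytes to 1
after%-=LEN("GOTO")-1     :REM GOTO shrinks from 4 bytes to 1
after%+=5*(4-2)           :REM Five 2-digit line numbers grow to 4 bytes each
PRINT before%,after%      :REM 25 and 31
```

The two keywords save four bytes between them, but the five line numbers add ten, so the line ends up six bytes longer.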
In ARM BASIC:
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
B%=EVAL("0:"+A$)
token$=$(A%-14)
In 6502 BASIC:
A%=EVAL("0:"+A$)
token$=$((!4 AND &FFFF)-LENA$-1)
By preceding the code you want to tokenise with "0:", you can safely pass it to EVAL without provoking a Syntax error. You can then extract the tokenised code from memory, so long as you do it immediately after calling EVAL.
This can be written as functions as follows:
DEF FNTokenise_Win(A$):LOCAL A%,B%
WHILE LEFT$(A$,1)=" ":A$=MID$(A$,2):ENDWHILE
B%=EVAL("0:"+A$):=$(!332+2)
:
DEF FNTokenise_ARM(A$):LOCAL A%,B%
SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
B%=EVAL("0:"+A$):=$(A%-13)
:
DEF FNTokenise_65(A$):LOCAL A%
A%=EVAL("0:"+A$):=$((!4 AND &FFFF)-LENA$-1)
These functions are used in full in the 'Tokenise' BASIC library at mdfs.net.
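A quick way to see the effect is to tokenise a short statement and dump the result. This is an illustrative sketch for Windows BASIC which assumes that FNTokenise_Win, as defined above, is present later in the same program; the variable names are arbitrary.

```bbcbasic
REM Tokenise one statement and show the resulting bytes
A$="PRINT ""Hello"""
T$=FNTokenise_Win(A$)
PRINT ;LEN(A$);" characters in, ";LEN(T$);" bytes out"
REM Dump each byte of the tokenised string in hexadecimal
FOR I%=1 TO LEN(T$)
  PRINT ;~ASC(MID$(T$,I%));" ";
NEXT
PRINT
END
```

The output should be shorter than the input, since the keyword PRINT collapses to a single token byte while the quoted string is left unchanged.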
A text file can then be tokenised using the following code:
in%=OPENIN(text$)
out%=OPENOUT(basic$)
line%=1                                  :REM Start from an arbitrary line number
REPEAT
  line$=FNTokenise_Win(GET$#in%)         :REM Read line and tokenise it
  BPUT#out%,LENline$+4                   :REM Output line length
  BPUT#out%,line%:BPUT#out%,line%DIV256  :REM Output line number
  BPUT#out%,line$;:BPUT#out%,13          :REM Output line and <cr>
  line%+=1                               :REM Increment line number
UNTIL EOF#in%
BPUT#out%,0:BPUT#out%,&FF:BPUT#out%,&FF  :REM Output program terminator
CLOSE#out%:out%=0
CLOSE#in%:in%=0
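The resulting file can be checked by walking it line by line. The sketch below is for Windows BASIC and assumes exactly the layout written by the code above: a length byte, the line number low byte then high byte, the tokenised text and a <cr>, with a zero length byte marking the end of the program. The variable names size%, lo% and hi% are arbitrary.

```bbcbasic
REM Walk the file written above and report each line
in%=OPENIN(basic$)
REPEAT
  size%=BGET#in%                 :REM Line length byte, zero at the terminator
  IF size%<>0 THEN
    lo%=BGET#in%:hi%=BGET#in%    :REM Line number, low byte first
    PRINT "Line ";lo%+256*hi%;" holds ";size%-4;" bytes of tokenised text"
    PTR#in%=PTR#in%+size%-3      :REM Skip the text and its <cr>
  ENDIF
UNTIL size%=0 OR EOF#in%
CLOSE#in%:in%=0
```

Each reported byte count should match the length of the string returned by FNTokenise_Win for that line, because the length byte was written as the text length plus four.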
Notes
Acorn BBC BASIC programs are stored slightly differently. See Format and relevant pages on Acorn-specific sites for details.
This technique may fail if the tokenised code is longer than the original text version, which can happen if it contains an ON GOTO or ON GOSUB statement. This problem may be mitigated to some extent as follows (for Windows BASIC):
B%=EVAL("0OTHERWISE:"+A$)
token$=$(!332+3)
See also
Using the tokeniser on BeebWiki for details of using the tokeniser on 6502, Z80, 32000, ARM, DOS and Windows BASIC.
References
Richard Russell, “Using the tokeniser”, BBC BASIC for Windows Yahoo! group message 86.