Tokeniser

by JGH, June 2006

BBC BASIC programs are tokenised: BASIC keywords are stored as one-byte values rather than as text. As a result, programs are more compact and execute faster.

A tokenised line can easily be detokenised, or expanded, as there is a one-to-one mapping between token values and the expanded string. For example, code similar to the following would expand a tokenised line:

        quote%=FALSE
        REPEAT
          IF (?addr%<128 AND ?addr%>31) OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
          IF ?addr%=34 quote%=NOT quote%
          addr%=addr%+1
        UNTIL ?addr%=13
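
As a minimal sketch of this loop in action, the following builds a tiny token$() table and a one-line test buffer by hand. The token values used (&F1 for PRINT, &E5 for GOTO, &EE for ON) are the usual Acorn-style values and are an assumption here; a real table needs an entry for every token in the BASIC being used. The name buf% and the three-entry table are illustrative only.

        DIM token$(255)
        DIM buf% 31
        token$(&E5)="GOTO":token$(&EE)="ON":token$(&F1)="PRINT"
        $buf%=CHR$(&F1)+" ""ON"""     :REM Hand-built line: <PRINT token> "ON" <cr>
        addr%=buf%:quote%=FALSE
        REPEAT
          IF (?addr%<128 AND ?addr%>31) OR quote% THEN VDU ?addr% ELSE PRINT token$(?addr%);
          IF ?addr%=34 THEN quote%=NOT quote%
          addr%=addr%+1
        UNTIL ?addr%=13
        PRINT

With that table the loop should display PRINT "ON": the token byte is expanded from the table, while everything between the quotes is sent to the screen unchanged.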

Tokenising, however, is more fiddly. Keywords can be abbreviated on entry, and text is only tokenised in certain parts of the line. For instance, in the following line:

        ON NOON GOTO 1,2

the first 'ON' is the token ON, but the second 'ON' is part of the variable 'NOON'. The second 'ON' must be left untokenised.

The EVAL function tokenises the supplied string and evaluates it as an expression. Usefully, the tokenised string can be retrieved from where BASIC has stored it.

In Windows BASIC:

        B%=EVAL("0:"+A$)
        token$=$(!332+2)

This code may fail if an event interrupt (e.g. ON TIME) occurs between the two statements. To avoid this, use the following alternative, which (in BBC BASIC for Windows version 6 only) does not allow an intervening interrupt:

        IF EVAL("1:"+A$) token$=$(!332+2)

The input and output share the same memory buffer. This is safe so long as the tokenising process shortens the code (which is almost always the case), but it can cause a crash if it lengthens the code. That can happen only in exceptional circumstances, such as with the following code:

        ON A% GOTO 10,20,30,40,50

The tokenising process encodes the line numbers in a special internal format, which results in the overall length increasing from 25 to 31 bytes. To reduce the chance of this causing a crash, the tokenising routine can be adapted as follows:

        IF EVAL("1RECTANGLE:"+A$) token$=$(!332+3)
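
The 25 and 31 byte figures can be checked by hand. The sketch below is only arithmetic and does not call the tokeniser; it assumes each keyword shrinks to one byte and each encoded line number occupies four bytes, which is consistent with the lengths quoted above. The variable names are illustrative.

        src%=LEN("ON A% GOTO 10,20,30,40,50")  :REM 25 characters of text
        saved%=(2-1)+(4-1)                     :REM ON and GOTO shrink to one byte each
        added%=5*(4-2)                         :REM Five 2-digit numbers grow to 4 bytes each
        PRINT src%-saved%+added%               :REM 25-4+10 = 31

On the same reasoning, the prefix "1RECTANGLE:" is 11 characters of text but, judging by the offset of 3 used when reading the result back, only 3 bytes once tokenised, so it appears to buy 8 bytes of headroom before the output can overtake the input.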


In ARM BASIC:

        SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
        B%=EVAL("0:"+A$)
        token$=$(A%-14)


In 6502 BASIC:

        A%=EVAL("0:"+A$)
        token$=$((!4 AND &FFFF)-LENA$-1)


By preceding the code you want to tokenise with “0:” you can safely pass it to EVAL without provoking a Syntax error. You can then extract the tokenised code from memory, so long as you do it immediately after calling EVAL.

This can be written as functions as follows:

        DEF FNTokenise_Win(A$):LOCAL A%,B%
        WHILELEFT$(A$,1)=" ":A$=MID$(A$,2):ENDWHILE
        B%=EVAL("0:"+A$):=$(!332+2)
        :
        DEF FNTokenise_ARM(A$):LOCAL A%,B%
        SYS "XOS_GenerateError",0,STRING$(255,"*") TO ,A%
        B%=EVAL("0:"+A$):=$(A%-13)
        :
        DEF FNTokenise_65(A$):LOCAL A%
        A%=EVAL("0:"+A$):=$((!4 AND &FFFF)-LENA$-1)
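
As a usage sketch, the string returned by one of these functions can be dumped byte by byte to see which parts became tokens. PROCdump is a hypothetical helper, not part of the library. The test line reuses the NOON variable from earlier, but with ON ... PROC rather than ON ... GOTO, so that no line numbers are involved and the tokenised form is shorter than the text.

        PROCdump(FNTokenise_Win("ON NOON PROCa,PROCb"))
        END
        :
        DEF PROCdump(t$):LOCAL i%,c%
        FOR i%=1 TO LEN(t$)
          c%=ASC(MID$(t$,i%))                 :REM Code of the i%th character
          IF c%>127 THEN PRINT "[";~c%;"] "; ELSE PRINT CHR$(c%);
        NEXT
        PRINT
        ENDPROC

Each keyword should appear as a single bracketed hex value, while the letters of NOON stay as plain text.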


These functions are used in full in the 'Tokenise' BASIC library at mdfs.net.

A text file can then be tokenised using the following code:

      in%=OPENIN(text$)
      out%=OPENOUT(basic$)
      line%=1                                  :REM Start from an arbitrary line number
      REPEAT
        line$=FNTokenise_Win(GET$#in%)         :REM Read line and tokenise it
        BPUT#out%,LENline$+4                   :REM Output line length
        BPUT#out%,line%:BPUT#out%,line%DIV256  :REM Output line number
        BPUT#out%,line$;:BPUT#out%,13          :REM Output line and <cr>
        line%+=1                               :REM Increment line number
      UNTIL EOF#in%
      BPUT#out%,0:BPUT#out%,&FF:BPUT#out%,&FF  :REM Output program terminator
      CLOSE#out%:out%=0
      CLOSE#in%:in%=0
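
The stored line length is a single byte, so a tokenised line longer than 251 bytes cannot be written in this format. A guarded version of the per-line output might look like the sketch below. PROCput_line and the error number 100 are illustrative, not part of the original code, and the ERROR statement assumes BBC BASIC for Windows or another version that provides it.

      DEF PROCput_line(out%,line%,line$)
      IF LEN(line$)>251 THEN ERROR 100,"Tokenised line too long"
      BPUT#out%,LEN(line$)+4                   :REM Length byte: at most 255
      BPUT#out%,line%:BPUT#out%,line%DIV256    :REM Line number, low byte then high
      BPUT#out%,line$;:BPUT#out%,13            :REM Tokenised text and <cr>
      ENDPROC

Inside the loop, the three BPUT lines would then be replaced by PROCput_line(out%,line%,line$).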


Notes

Acorn BBC BASIC programs are stored slightly differently. See Format and relevant pages on Acorn-specific sites for details.

This technique may fail if the tokenised code is longer than the original text version, which can happen if it contains an ON GOTO or ON GOSUB statement. This problem may be mitigated to some extent as follows (for Windows BASIC):

        B%=EVAL("0OTHERWISE:"+A$)
        token$=$(!332+3)


See also

See "Using the tokeniser" on BeebWiki for details of using the tokeniser in 6502, Z80, 32000, ARM, DOS and Windows BASIC.

References

Richard Russell, “Using the tokeniser”, BBC BASIC for Windows Yahoo! group message 86.

tokeniser.txt · Last modified: 2024/01/05 00:21 by 127.0.0.1