Using regular expressions

by Richard Russell, December 2006

Regular expressions provide a means to specify a pattern of characters, or syntax rule, which a string (or part of a string) must match. Certain metacharacters have special significance: for example a dot (.) matches any single character, square brackets [] enclose a list of characters any one of which may match, a caret (^) at the start of such a list negates it, and so on. Here are some simple examples:

a..d
matches “abcd”, “axyd”, “a12d” etc.
[abc]
matches “a”, “b” or “c”
[a-z]
matches any lowercase letter
[^b]at
matches “cat”, “fat”, “hat” etc. but not “bat”


For more information on the syntax of regular expressions see this Wikipedia article.

You can make use of regular expressions in your BBC BASIC program by means of the gnu_regex DLL which can be downloaded from here[1]. To start with you must load the DLL in the usual way:

        SYS "LoadLibrary", "gnu_regex.dll" TO gnu_regex%
        IF gnu_regex% = 0 ERROR 100, "Cannot load gnu_regex.dll"
        SYS "GetProcAddress", gnu_regex%, "regcomp" TO regcomp%
        SYS "GetProcAddress", gnu_regex%, "regexec" TO regexec%

For this to work gnu_regex.dll needs to be in the current directory, the Windows directory (often C:\WINDOWS), the Windows system directory (often C:\WINDOWS\SYSTEM32) or one of the directories specified in the PATH environment variable. Alternatively you can copy the file to your BBC BASIC for Windows library folder and load it explicitly from there:

        SYS "LoadLibrary", @lib$+"gnu_regex.dll" TO gnu_regex%

The code below is a very simple example: it compiles a pattern and then tests each string entered by the user against it:

        DIM buffer% 255
 
        pattern$ = "[abcxyz]"
        SYS regcomp%, buffer%, pattern$, 0 TO result%
        IF result% ERROR 101, "Failed to compile regular expression"
 
        REPEAT
          INPUT "Enter a string: " test$
          SYS regexec%, buffer%, test$, 0, 0, 0 TO result%
          IF result% PRINT "Not matched" ELSE PRINT "Matched"
        UNTIL FALSE

You should ensure that buffer% points to a memory buffer large enough for regcomp to use. In the POSIX interface this buffer holds only a small fixed-size structure (regex_t), and regcomp allocates the storage for the compiled expression itself, so the 256 bytes reserved here should be ample. As always, make sure you execute the DIM statement only once, or use DIM LOCAL, to avoid a memory leak and an eventual No room error.
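
If your program compiles many different patterns, the memory that regcomp allocates internally for each one can be released with regfree once the pattern is no longer needed. The following is only a sketch: it assumes that gnu_regex.dll exports regfree, as the POSIX interface does, which has not been verified here. The variable regfree% and the error number 102 are chosen purely for illustration:

        REM Assumption: gnu_regex.dll exports regfree (part of the POSIX interface)
        SYS "GetProcAddress", gnu_regex%, "regfree" TO regfree%
        IF regfree% = 0 ERROR 102, "regfree not found in gnu_regex.dll"

        SYS regcomp%, buffer%, pattern$, 0 TO result%
        IF result% ERROR 101, "Failed to compile regular expression"
        REM ... call regexec% with buffer% as often as required ...
        SYS regfree%, buffer% : REM Release the compiled pattern

After regfree has been called the same buffer% can be passed to regcomp again to compile a different pattern.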

In this example the pattern matches if any of the characters “a”, “b”, “c”, “x”, “y” or “z” occurs anywhere in the string. The program as listed provides no information on where in the string the match occurred. You can discover that information by amending the program as follows:

        DIM offsets{start%, finish%}
        REPEAT
          INPUT "Enter a string: " test$
          SYS regexec%, buffer%, test$, 1, offsets{}, 0 TO result%
          IF result% PRINT "Not matched" ELSE PRINT "Matched at ";offsets.start%
        UNTIL FALSE

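Here offsets.start% is set to the offset, counting from zero at the beginning of the string, of the first match. offsets.finish% is set to the offset of the character immediately following the match.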
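
As a sketch of how the two offsets might be used together, the loop below prints the matched text itself. It relies on buffer%, regexec% and offsets{} having been set up as above, and it assumes (as in the POSIX regexec interface) that offsets.finish% is the offset of the first character after the match. The variable length% is introduced just for this illustration:

        REPEAT
          INPUT "Enter a string: " test$
          SYS regexec%, buffer%, test$, 1, offsets{}, 0 TO result%
          IF result% THEN
            PRINT "Not matched"
          ELSE
            REM Offsets count from zero, but MID$ counts from one
            length% = offsets.finish% - offsets.start%
            PRINT "Matched: ";MID$(test$, offsets.start% + 1, length%)
          ENDIF
        UNTIL FALSE

With the pattern "[abcxyz]" the matched text is always a single character, so this is most useful with patterns that can match strings of varying length.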

You can specify that the matching is case insensitive by changing the final parameter of regcomp from 0 to 2 as follows:

        _REG_ICASE = 2
        SYS regcomp%, buffer%, pattern$, _REG_ICASE TO result%

You can also specify the use of extended regular expressions by setting the final parameter to 1:

        _REG_EXTENDED = 1
        SYS regcomp%, buffer%, pattern$, _REG_EXTENDED TO result%

In this mode additional metacharacters are recognised, for example the vertical bar (|) signifies alternatives:

abc|def
matches “abc” or “def”
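
The two options can also be combined. As a sketch, assuming the flag values given above are individual bits that are ORed together (as they are in the POSIX regcomp interface), the following compiles an extended pattern that ignores case:

        _REG_EXTENDED = 1
        _REG_ICASE = 2
        pattern$ = "abc|def"
        REM Combine the two options with a bitwise OR
        SYS regcomp%, buffer%, pattern$, _REG_EXTENDED OR _REG_ICASE TO result%
        IF result% ERROR 101, "Failed to compile regular expression"

Compiled this way, the pattern should also match “ABC” or “Def”.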



[1] When last checked, the file gnu_regex.exe was corrupted (missing the last byte). To repair it you can use this simple BBC BASIC program:

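        F% = OPENUP("gnu_regex.exe")
        IF F% = 0 ERROR 103, "Cannot open gnu_regex.exe"
        PTR#F% = EXT#F% : REM Move to the end of the file
        BPUT #F%,0      : REM Append the missing byte
        CLOSE #F%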