=====Counting the characters in a Unicode string=====

//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\  The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:

  * UTF-8 is represented as a //byte stream//, which is compatible with BBC BASIC's string variables, functions and operators.
  * Regular 7-bit ASCII text is represented identically in UTF-8 and ANSI, making it extremely easy to work with such text.
  * UTF-8 is compatible with BBC BASIC's [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin8.html#intro8|VDU codes]]; you can mix UTF-8 text and VDU sequences in the same string and PRINT them.
  * You can embed UTF-8 text within a program as string constants or DATA statements (although they will not display as expected in the program editor); UTF-16 cannot be used in this way.
  * UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
  * UTF-8 is the preferred Unicode encoding for emails and web pages.

UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding, in which each character in the Basic Multilingual Plane occupies between one and three bytes. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and the features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\ To overcome this disadvantage the function **FNulen** is listed below. It takes a Unicode (UTF-8) string as its parameter and returns the length of the string in characters:

<code bb4w>
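        DEF FNulen(U$)
        LOCAL L%
        REM Windows code page identifier for UTF-8
        CP_UTF8 = 65001
        REM With no output buffer, the API returns the required size in wide characters
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
        = L%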
</code>
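
As a brief illustration of the difference between **LEN** and **FNulen**, the following sketch builds the UTF-8 encoding of "café" byte by byte, so that it does not depend on how the program editor displays non-ASCII text. The variable name is arbitrary, and the sketch assumes that **FNulen** is defined elsewhere in the program:

<code bb4w>
        REM "é" (U+00E9) is encoded in UTF-8 as the two bytes &C3 &A9
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        PRINT LEN(U$)      : REM length in bytes: 5
        PRINT FNulen(U$)   : REM length in characters: 4
</code>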

If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\  If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:

<code bb4w>
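        DEF PROCuextent(hdc%, U$, size{})
        LOCAL L%, U%
        REM FNulen also sets the global CP_UTF8 used below
        L% = FNulen(U$)
        REM Reserve two bytes per UTF-16 character, then round up to an even address
        DIM U% LOCAL 2*L%
        U% = (U% + 1) AND -2
        REM Convert to UTF-16 and measure with the wide-character API
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), U%, L%
        SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
        ENDPROC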
</code>

If passed a string containing only 7-bit ASCII text, the procedure will return the same size as the Windows API function **GetTextExtentPoint32** would return for that string.
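
The procedure might be called along the following lines. This is only a sketch: the structure and member names (**size{}**, **cx%**, **cy%**) are illustrative choices that mirror the layout of the Windows SIZE structure, and **@memhdc%** is assumed to be the appropriate device context for measuring text sent to the screen (a printer device context would be needed for a printout):

<code bb4w>
        REM size{} receives the width (cx%) and height (cy%) in pixels
        DIM size{cx%, cy%}
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        PROCuextent(@memhdc%, U$, size{})
        PRINT "Width in pixels:  "; size.cx%
        PRINT "Height in pixels: "; size.cy%
</code>

The values returned depend on the font currently selected into the device context, so the measurement should be made after any font change.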