=====Counting the characters in a Unicode string=====

//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\ The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:

  * UTF-8 is represented as a //byte stream//, which is compatible with BBC BASIC's string variables, functions and operators.
  * Regular 7-bit ASCII text is represented identically in UTF-8 and ANSI, making it extremely easy to work with such text.
  * UTF-8 is compatible with BBC BASIC's [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin8.html#intro8|VDU codes]]; you can mix UTF-8 text and VDU sequences in the same string and PRINT them.
  * You can embed UTF-8 text within a program as string constants or DATA statements (although they will not display as expected in the program editor); UTF-16 cannot be used in this way.
  * UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
  * UTF-8 is the preferred Unicode encoding for emails and web pages.

UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected.
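The difference between bytes and characters can be seen in a short experiment. The sketch below is an illustration only, not part of the original article: **FNucount** is a hypothetical helper that uses a different technique from the Windows API, counting every byte that is not a UTF-8 continuation byte. For the four-character test string it should print 5 (bytes) followed by 4 (characters).

<code>
REM "cafe" with an acute accent on the e, which occupies
REM two bytes in UTF-8 (&C3 &A9)
U$ = "caf" + CHR$(&C3) + CHR$(&A9)
PRINT LEN(U$)      : REM length in bytes
PRINT FNucount(U$) : REM length in characters
END

REM Hypothetical helper: counts the bytes that are NOT UTF-8
REM continuation bytes (continuation bytes have the form 10xxxxxx)
DEF FNucount(U$)
LOCAL I%, N%
IF U$ = "" THEN = 0 : REM a FOR loop always executes at least once
FOR I% = 1 TO LEN(U$)
  IF (ASC(MID$(U$, I%)) AND &C0) <> &80 THEN N% += 1
NEXT
= N%
</code>

This assumes the string is well-formed UTF-8; no validation is attempted.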
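Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\ To overcome this disadvantage the function **FNulen** is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:

<code>
DEF FNulen(U$)
LOCAL L%
CP_UTF8 = 65001 : REM global, also used by PROCuextent
REM With no output buffer (the last two parameters are zero) the API
REM returns the number of wide characters the conversion would need
SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
= L%
</code>

If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\ If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:

<code>
DEF PROCuextent(hdc%, U$, size{})
LOCAL L%, U%
L% = FNulen(U$) : REM this call also sets CP_UTF8
REM Reserve a local buffer for the UTF-16 text (two bytes per character)
DIM U% LOCAL 2*L%
U% = (U% + 1) AND -2 : REM align the buffer on an even address
REM Convert to UTF-16, then measure the text in the font selected into hdc%
SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), U%, L%
SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
ENDPROC
</code>

The **size{}** parameter is a structure with two 32-bit integer members, which receive the width and the height respectively. If passed a string containing only 7-bit ASCII text, the procedure will return the same size as **GetTextExtentPoint32**.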
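As an illustration of the centring use mentioned above, the following is a minimal sketch rather than part of the original article. **FNucentre** and its **width%** parameter are hypothetical names; the function relies on **PROCuextent** and **FNulen** as listed on this page, and assumes the device context uses the default mapping mode, so that sizes are in pixels.

<code>
REM Hypothetical helper, not from the original article: returns the
REM horizontal offset, in pixels, that centres U$ within width% pixels
DEF FNucentre(hdc%, U$, width%)
LOCAL size{}
DIM size{cx%, cy%} : REM same layout as the Windows SIZE structure
PROCuextent(hdc%, U$, size{})
= (width% - size.cx%) DIV 2
</code>

It might be called as **X% = FNucentre(@memhdc%, U$, 640)** to find the left-hand pixel position for text centred in a 640-pixel-wide window, given that the font currently selected into the device context is the one the text will be drawn in.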