--- counting_20the_20characters_20in_20a_20unicode_20string [2018/03/31 13:19] – external edit 127.0.0.1
+++ counting_20the_20characters_20in_20a_20unicode_20string [2024/01/05 00:22] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 =====Counting the characters in a Unicode string=====
-//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\  The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:\\
+//by Richard Russell, March 2010//\\ \\ //BBC BASIC for Windows// provides native support for the [[http://en.wikipedia.org/wiki/Unicode|Unicode]] Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main [[http://www.bbcbasic.co.uk/bbcwin/manual/bbcwin2.html#utf8|Help documentation]] describes how to enable Unicode support.\\ \\  The Unicode encoding used by //BBC BASIC for Windows// is [[http://en.wikipedia.org/wiki/UTF-8|UTF-8]]. This is used in preference to other encodings (for example UTF-16) for the following reasons:
   * UTF-8 is represented as a //byte stream//, which is compatible with BBC BASIC's string variables, functions and operators.
@@ Line 9: / Line 9: @@
   * UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
   * UTF-8 is the preferred Unicode encoding for emails and web pages.
-\\  UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\  To overcome this disadvantage the function **FNulen** is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:\\
+UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding, whereas within the Basic Multilingual Plane every UTF-16 character occupies exactly two bytes. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.\\ \\  To overcome this disadvantage the function **FNulen** is listed below. This takes as a parameter a Unicode (UTF-8) string, and returns the length of the string in characters:
+<code bb4w>
         DEF FNulen(U$)
         LOCAL L%
@@ Line 15: / Line 18: @@
         SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
         = L%
-If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\  If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:\\
+</code>
+If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.\\ \\  If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, such as you might if you want to centre it on the screen or a printout, you can use the following procedure:
+<code bb4w>
         DEF PROCuextent(hdc%, U$, size{})
         LOCAL L%, U%
@@ Line 24: / Line 31: @@
         SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
         ENDPROC
+</code>
 If passed a string containing only 7-bit ASCII text, the procedure will return the same size as **GetTextExtentPoint32**.
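\\ \\  To illustrate why **LEN** and the character count differ, the count can also be sketched in plain BASIC without an API call. This is only a minimal illustration and not part of the original listing: the name **FNulen_basic** is hypothetical, and the sketch assumes the string is well-formed UTF-8. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do //not// match that pattern counts the characters:
<code bb4w>
        REM Hypothetical helper, assuming U$ holds well-formed UTF-8
        DEF FNulen_basic(U$)
        LOCAL I%, N%
        FOR I% = 1 TO LEN(U$)
          REM Skip continuation bytes (top two bits are 10)
          IF (ASC(MID$(U$, I%)) AND &C0) <> &80 THEN N% += 1
        NEXT
        = N%
</code>
For example, if U$ is set to **"caf" + CHR$(&C3) + CHR$(&A9)** (the UTF-8 encoding of "café"), **LEN(U$)** is 5 whereas **FNulen_basic(U$)** is 4. For characters in the Basic Multilingual Plane this should agree with the value returned by **FNulen**.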