Porting static strings containing accented characters

hellomike · Post by **hellomike** » Wed 16 Aug 2023, 11:27

In the following BB4W code, the static text contains a few accented characters, i.e. bytes greater than &7F.

   10 MODE 0
   20 Test$="Téxt has áccentèd charácters"
   30 PRINT
   40 PRINT Test$
   50 PRINT LENTest$

When run, the string is printed correctly in a MODE 0 window and the reported LENgth is 28 bytes.

BBCSDL (1.36a) seems to have a problem with this code when I port it. I tried two ways:

1) When I copy/paste the code from the BB4W editor window into the BBCSDL editor, the source displays fine in the editor, but when run the string isn't displayed as expected and the reported LENgth is 32 bytes.

2) When I load the tokenised .BBC source in BBCSDL, the static text is shown awkwardly in the editor. When run, the string is displayed correctly and the reported LENgth is 28 bytes. In addition, the string is hard to edit in the editor. For example, in line 20, try clicking between the last 's' and the closing quote. The caret won't go there.

For sure all this is due to ANSI vs Unicode, or single-byte characters in a string vs multibyte characters.
When porting sources from BB4W to BBCSDL, what is the least painful way to deal with literal strings containing accented characters?

Mike

Hated Moron · Post by **Hated Moron** » Wed 16 Aug 2023, 13:13

hellomike wrote: ↑Wed 16 Aug 2023, 11:27 I copy/paste the code from the BB4W editor window into the BBCSDL editor, the source in the editor displays fine but when run, the string isn't displayed as expected and reported LENgth is 32 bytes.

I suspect that you are forgetting that it is necessary to enable UTF-8 support at run-time using VDU 23,22..... Here's an example which should work correctly if copied-and-pasted into BBCSDL or BB4W so long as Unicode is enabled in the Options menu:

   10 VDU 23,22,640;512;8,16,2,8 : REM Equivalent to MODE 0 but with UTF-8 enabled
   15 INSTALL @lib$ + "utf8lib"
   20 Test$ = "Téxt has áccentèd charácters"
   30 PRINT
   40 PRINT Test$
   50 PRINT FN_ulen(Test$)

The key differences from your original are that Unicode (UTF-8) mode has been enabled in line 10, and that the length of the string is measured using the library function FN_ulen(), which returns the length in characters, whereas LEN() returns the length in bytes. These differences have nothing to do with porting from BB4W to BBCSDL, however; the modified code runs correctly in both.
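
The 28 and 32 reported earlier in the thread can be reproduced without depending on the editor's encoding by building the string from explicit byte values. This is only a sketch: FNcharcount is a hypothetical helper written for illustration (it is not the utf8lib implementation of FN_ulen), and it assumes the ANSI code page is Windows-1252. It counts characters by skipping UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.

```bbcbasic
REM Each accented letter is one byte in ANSI (Windows-1252) but two bytes in UTF-8
ansi$ = "T" + CHR$(&E9) + "xt has " + CHR$(&E1) + "ccent" + CHR$(&E8) + "d char" + CHR$(&E1) + "cters"

eacute$ = CHR$(&C3) + CHR$(&A9) : REM UTF-8 encoding of e-acute
aacute$ = CHR$(&C3) + CHR$(&A1) : REM UTF-8 encoding of a-acute
egrave$ = CHR$(&C3) + CHR$(&A8) : REM UTF-8 encoding of e-grave
utf8$ = "T" + eacute$ + "xt has " + aacute$ + "ccent" + egrave$ + "d char" + aacute$ + "cters"

PRINT LEN(ansi$)         : REM 28 bytes, one per character
PRINT LEN(utf8$)         : REM 32 bytes, the four accented letters take two each
PRINT FNcharcount(utf8$) : REM 28 characters
END

REM Hypothetical helper: count every byte that is not a continuation byte
DEF FNcharcount(s$)
LOCAL I%, N%
FOR I% = 1 TO LEN(s$)
  IF (ASC(MID$(s$, I%)) AND &C0) <> &80 THEN N% += 1
NEXT
= N%
```

So the extra four bytes seen after copy-and-paste are simply the second byte of each of the four accented letters.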

it is hard to edit the string in the editor. I.e. in line 20, try clicking between the last 's' and closing quote. The caret won't go there.

What font are you using? I'm using the default DejaVuSansMono (at 11pt size) and I have no difficulty in positioning the caret between the final 's' and the closing quote. If you tell me what font you are using I will try to reproduce the effect you describe.

For sure all this is due to ANSI vs Unicode, or single-byte characters in a string vs multibyte characters.

There's nothing particularly wrong with ANSI so long as you restrict yourself to the ASCII character set, that is characters with codes in the range &20 to &7E. What you should avoid - in both BB4W and BBCSDL - is using the old Code Page based approach, whereby codes &80 to &FF are mapped to a subset of 'foreign' characters depending on the currently-selected Code Page.

That was a terrible kludge introduced in the last century to provide limited support for foreign alphabets whilst retaining a one-byte-per-character encoding. Unicode made that obsolete decades ago and whilst there is still limited support for it in both BB4W and BBCSDL it should ideally not be used.

When porting sources from BB4W to BBCSDL, what is the least painful way to deal with literal strings containing accented characters?

The 'least painful' way is to use Unicode (UTF-8) everywhere.

hellomike · Post by **hellomike** » Wed 16 Aug 2023, 14:02

No, I did not forget the VDU 23 that enables UTF-8, nor the functions in your UTF8LIB library.

But before replacing the MODE statement(s) and the LEN() function(s) myself, I wanted to know what was the best way to approach this.

I didn't change the font, so the editor is using the same font and size that you mentioned. I checked.
As it's difficult to simply embed screenshots into forum posts I will email you.

Mike

Hated Moron · Post by **Hated Moron** » Wed 16 Aug 2023, 14:29

hellomike wrote: ↑Wed 16 Aug 2023, 14:02 As it's difficult to simply embed screenshots into forum posts I will email you.

It's not difficult to embed screenshots, so long as they don't exceed the maximum size allowed for attachments. Here's what I see after clicking between the final 's' and the closing quote (Windows 11):

screenshot.png

Hated Moron · Post by **Hated Moron** » Wed 16 Aug 2023, 14:43

And here's a screenshot showing that ANSI mode works in BBCSDL as well! Although I recommend using UTF-8 you could continue to use ANSI exactly as you have been doing in BB4W (but you can't easily copy-and-paste it into SDLIDE):

ansimode.png

Hated Moron · Post by **Hated Moron** » Wed 16 Aug 2023, 14:57

One difference between the BB4W and BBCSDL (SDLIDE) editors, although not between the interpreters, is that BB4W attempts to determine automatically whether a program uses ANSI or Unicode encoding, when it is loaded, and switches the mode if necessary.

This detection isn't 100% reliable, and therefore I didn't attempt to replicate it in SDLIDE (the Unicode setting is purely under manual control in the Options menu); it would also be a relatively slow test. However BB4W gets it right more often than it gets it wrong.

Do you think I should consider adding automatic (although unreliable, and slow) ANSI/Unicode detection to SDLIDE?

hellomike · Post by **hellomike** » Thu 17 Aug 2023, 07:48

All is well and understood now!
You were, as usual, completely correct. The misunderstanding arose because I had been spoiled by the automatic ANSI/Unicode detection in BB4W, so I never really understood the implications and never manually switched to Unicode.

SDLIDE starts up in Unicode mode and then when loading the mentioned source, the string in line 20 has unprintable characters in it and is hard to edit. Simply switching to ANSI solves this.

To answer your question: No, do not add automatic ANSI/Unicode detection to SDLIDE.

What is maybe left is the difference between BB4IDE and SDLIDE. In BB4IDE, an ANSI string containing accented characters that was typed in ANSI mode remains displayed as ANSI even after switching to Unicode, whereas SDLIDE tries to display the >&7F characters as Unicode. SDLIDE's behaviour seems the more logical of the two.
Also, it seems that when typing an accented character in SDLIDE, it interprets it as Unicode even when in ANSI mode...

Regards,

Mike

Hated Moron · Post by **Hated Moron** » Thu 17 Aug 2023, 11:11

hellomike wrote: ↑Thu 17 Aug 2023, 07:48 SDLIDE starts up in Unicode mode

It should start up in whatever mode it was left in previously; like the other settings and preferences, the ANSI/Unicode state is stored in the SDLIDE.ini file. This can be found in the @usr$ directory, which in Windows is %appdata%\bbcbasic.

If this file is deleted, or when BBCSDL is run for the very first time, I expect it defaults to Unicode.

To answer your question: No, do not add automatic ANSI/Unicode detection to SDLIDE.

OK, although it would have been helpful to you when loading a program that you had previously been editing in ANSI mode in BB4W.

The reason automatic detection is unreliable, of course, is that any sequence of bytes (including those corresponding to valid UTF-8 characters) can legitimately be present in an ANSI file, so it could be misidentified as UTF-8 when it isn't. But statistically that's not very likely.
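
To illustrate the point, here is a rough sketch of the kind of test such detection might perform. It is not the algorithm BB4W actually uses, and FNvalidutf8 is a hypothetical name. It checks only that every lead byte is followed by the right number of continuation bytes; it does not reject overlong encodings or other finer points of the UTF-8 rules.

```bbcbasic
ansi$ = "caf" + CHR$(&E9)             : REM e-acute as a single ANSI byte
utf8$ = "caf" + CHR$(&C3) + CHR$(&A9) : REM e-acute as a UTF-8 byte pair
IF FNvalidutf8(ansi$) THEN PRINT "ansi$ looks like UTF-8" ELSE PRINT "ansi$ is not valid UTF-8"
IF FNvalidutf8(utf8$) THEN PRINT "utf8$ looks like UTF-8" ELSE PRINT "utf8$ is not valid UTF-8"
END

REM Hypothetical helper: TRUE if s$ is a structurally valid UTF-8 byte sequence
DEF FNvalidutf8(s$)
LOCAL I%, C%, N%
I% = 1
WHILE I% <= LEN(s$)
  C% = ASC(MID$(s$, I%))
  N% = -1                              : REM -1 means "not a legal lead byte"
  IF C% < &80 THEN N% = 0              : REM plain ASCII
  IF (C% AND &E0) = &C0 THEN N% = 1    : REM lead byte of a 2-byte sequence
  IF (C% AND &F0) = &E0 THEN N% = 2    : REM lead byte of a 3-byte sequence
  IF (C% AND &F8) = &F0 THEN N% = 3    : REM lead byte of a 4-byte sequence
  IF N% < 0 THEN = FALSE
  I% += 1
  WHILE N% > 0
    IF I% > LEN(s$) THEN = FALSE       : REM sequence is cut short
    IF (ASC(MID$(s$, I%)) AND &C0) <> &80 THEN = FALSE
    I% += 1
    N% -= 1
  ENDWHILE
ENDWHILE
= TRUE
```

The lone byte &E9 fails the test because no continuation bytes follow it, which is why most ANSI text is recognised correctly. The pair &C3 &A9 passes, yet in a Windows-1252 file those same two bytes could legitimately stand for the two characters "Ã©", and that is the case where detection would guess wrong.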

On balance I feel it would be an improvement; what do others think?

Also it seems that when typing an accented character in SDLIDE, it interprets it as Unicode even when in ANSI mode....

Oh, that does sound like a bug. It arises, no doubt, from the fact that Windows has both ANSI and Unicode support at the OS API level, whereas SDL2 is UTF-8 only internally. I will investigate and fix, thank you.

Hated Moron · Post by **Hated Moron** » Thu 17 Aug 2023, 13:04

Hated Moron wrote: ↑Thu 17 Aug 2023, 11:11 I will investigate and fix, thank you.

I have updated the Windows edition (only) to 1.36d which hopefully fixes this issue and accepts keyboard input correctly in both Unicode and ANSI modes.

The bug will remain in the editions for other platforms until the next major release, but since all Unix-like OSes (including Linux and macOS) have been fully Unicode internally for a long time, the likelihood of anybody wanting to use the program editor in ANSI mode on those systems is small.

The simplified IDE for Android, iOS and in-browser editions (touchide.bbc) has no support for ANSI input or editing at all.

Hated Moron · Post by **Hated Moron** » Thu 17 Aug 2023, 14:41

Hated Moron wrote: ↑Thu 17 Aug 2023, 13:04 I have updated the Windows edition (only) to 1.36d...

I feel I should add a reminder that the various IDEs (BBCEdit and SDLIDE for the desktop editions, touchide for the mobile and browser editions) are themselves BBC BASIC programs. Not only that, but they are programs which run in both BBCSDL and BB4W, which necessarily means that they do not depend on OS-specific features and contain hardly any SYS (API) calls.

The GUI features of these programs (particularly the desktop IDEs), including Dialogue Boxes, Toolbars, Menu Bars, Scroll Bars, Editing Panes etc., are all achieved using standard BBC BASIC graphics statements such as MOVE, DRAW and PLOT. Keyboard input is via INKEY and mouse control is achieved using MOUSE and ON MOUSE.
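
As a rough illustration of that approach (this is not code taken from SDLIDE or BBCEdit; the button geometry, label and variable names are made up for the example), a clickable 'button' needs nothing more than a drawn rectangle, some text printed at the graphics cursor, and MOUSE polling:

```bbcbasic
MODE 8
REM Hypothetical button position and size, in graphics units
btnx% = 400 : btny% = 400 : btnw% = 320 : btnh% = 96

GCOL 0, 7
RECTANGLE btnx%, btny%, btnw%, btnh%
VDU 5                          : REM print text at the graphics cursor
MOVE btnx% + 32, btny% + 64
PRINT "Click me";
VDU 4                          : REM back to the text cursor

REPEAT
  MOUSE X%, Y%, B%             : REM B% is non-zero while a mouse button is pressed
  inside% = X% >= btnx% AND X% < btnx% + btnw% AND Y% >= btny% AND Y% < btny% + btnh%
  WAIT 2                       : REM pause for 2 centiseconds between polls
UNTIL B% <> 0 AND inside%
PRINT "Button clicked"
```

A real IDE would presumably use ON MOUSE rather than a polling loop, as mentioned above, but the hit test against the drawn rectangle is the same idea.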

As such these programs are quite straightforward (albeit fairly large) and should be understandable, and therefore modifiable, by anybody. I am always happy to receive submissions of custom versions which are enhanced in some way, with a view to the changes possibly being incorporated in the distributed versions.

BBC BASIC forum
