Ceefax - can you help?

RichardRussell · Post by **RichardRussell** » Sat 12 Sep 2020, 22:14

The Ceefax.bbc program (supplied as an example with all editions of BBC BASIC for SDL 2.0, except the in-browser edition) relies on being able to extract the 'relevant text' from the HTML code of the BBC's web pages. This is non-trivial; once upon a time HTML was a simple markup language and it was easy to distinguish the formatting commands from the actual content, but no more.

Ceefax.bbc generally succeeds quite well with this, but recently the BBC has changed the format of some of its pages and this is completely breaking the code which tries to extract the relevant text for display. I've looked at the HTML source of one of these pages and I can't see any reliable way of filtering it to get at the wanted content.

So I'm asking for help. Before anybody thinks "If Richard can't manage it nor can I" I want to remind you that I am suffering from 'cognitive decline' (tentatively diagnosed as Alzheimer's disease) so the fact that I can't see a way of doing it means nothing. It could just be that my 'lateral thinking' skills are lacking (which I know for sure they are).

So please don't be put off. The HTML source of one of the troublesome pages is here and the wanted page content starts with "A new Covid-19 contact-tracing app will be launched across England and Wales on 24 September...". If somebody can find a way of parsing the HTML to extract the wanted text, without being too specific to this particular example, I would be very grateful.

michael · Post by **michael** » Sun 13 Sep 2020, 01:21

I might be the least likely person to come up with a final solution to this, BUT in the past I have written programs that generate HTML and CSS pages.

Here is what I have discovered:

What a person can do is do a line scan at the front of each line looking for
[{"type":
and then scan for keyword with punctuation to signify the upcoming relevant text.
{"text":"Welsh Health and Social Services Minister Vaughan Gething said launching a joint app with England was \"the most practical option\"

Basically, you would search for a series of keywords to dig out each line of text.
You wouldn't need to scan every line in full, just the lines that contain the keyword-and-punctuation markers shown above.

It also appears that some keywords controlling page structure follow a "text": entry, and these would need to be scanned for to avoid extracting irrelevant material. It is almost as if you would need a degree of AI to sort out the garbage.
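As a rough illustration of the scan described above (a sketch in Python rather than BBC BASIC, and not code from Ceefax.bbc), the following looks for each `"text":` marker and decodes the JSON string that follows it, so that escaped quotes such as `\"` come out correctly. The function name `extract_text_values` and the sample string are made up for this example, and it makes no attempt at the filtering of structural entries mentioned above.

```python
import json

def extract_text_values(page):
    """Return the decoded string that follows every "text": marker in page."""
    marker = '"text":'
    decoder = json.JSONDecoder()
    found = []
    pos = page.find(marker)
    while pos != -1:
        start = pos + len(marker)
        # Allow optional whitespace between the colon and the opening quote.
        while start < len(page) and page[start] in " \t\r\n":
            start += 1
        resume = start
        if start < len(page) and page[start] == '"':
            try:
                # raw_decode parses one JSON value starting at 'start' and
                # returns it together with the index just past its end.
                value, resume = decoder.raw_decode(page, start)
                found.append(value)
            except json.JSONDecodeError:
                # Malformed or truncated string: skip it and keep scanning.
                resume = start + 1
        pos = page.find(marker, resume)
    return found

# Hypothetical fragment in the style of the page source quoted above.
sample = ('[{"type":"paragraph","model":{"text":"The minister said it was '
          '\\"the most practical option\\"","blocks":[]}}]')
print(extract_text_values(sample))
```

Because the decoder handles the string escapes, the extracted text contains plain quotation marks rather than backslash sequences.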

I can look further and see if there are more possibilities.

I have tried not to be too specific to the content of your sample. I think I am well within subject restraints and fair use limits.
Best wishes to you Richard.

Michael

RichardRussell · Post by **RichardRussell** » Sun 13 Sep 2020, 02:41

michael wrote: ↑Sun 13 Sep 2020, 01:21 What a person can do is do a line scan at the front of each line looking for
[{"type":
and then scan for keyword with punctuation to signify the upcoming relevant text.

That's pretty much what the current Ceefax.bbc does. It looks for a few such 'markers':

      I% = INSTR(htm$, "<p class=""story") : M% = 1
      IF I% = 0 I% = INSTR(htm$, "<div class=""vxp-media__summary") : M% = 0
      IF I% = 0 I% = INSTR(htm$, "<div class=""story-body") : M% = 0
      IF I% = 0 I% = INSTR(htm$, "<div class=""main_article") : M% = 0
      IF I% = 0 I% = INSTR(htm$, "<p class=""sp-story-body") : M% = 1

I'll see if adding '{"text":' to the list looks promising. Thanks for the suggestion.
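For readers who don't use BBC BASIC, the same first-match-wins search can be sketched in Python with the markers held in a list. This is only an illustration: the marker strings and flag values are copied from the excerpt above, the names `MARKERS` and `find_story_start` are invented here, and the meaning of the `M%` flag (called `mode` below) is not explained in the thread.

```python
# (marker, mode) pairs in priority order, copied from the BASIC excerpt.
MARKERS = [
    ('<p class="story', 1),
    ('<div class="vxp-media__summary', 0),
    ('<div class="story-body', 0),
    ('<div class="main_article', 0),
    ('<p class="sp-story-body', 1),
]

def find_story_start(html, markers=MARKERS):
    """Return (index, mode) for the first marker, in list order, found in html.

    Unlike INSTR, which is 1-based and returns 0 when nothing is found,
    str.find is 0-based and returns -1, so 'not found' is (-1, None) here.
    """
    for marker, mode in markers:
        index = html.find(marker)
        if index != -1:
            return index, mode
    return -1, None

page = '<html><div class="story-body"><p>Some text</p></div></html>'
print(find_story_start(page))
```

With this layout, trying an extra marker such as `{"text":` means adding one more entry to the list, although which flag value it should carry would have to be decided by whoever knows what the flag does.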

RichardRussell · Post by **RichardRussell** » Mon 14 Sep 2020, 11:22

RichardRussell wrote: ↑Sun 13 Sep 2020, 02:41 I'll see if adding '{"text":' to the list looks promising. Thanks for the suggestion.

A better marker seems to be <div data-component="text-block" but the suggestion was sound.
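A sketch of how text inside such blocks might be collected, using the HTML parser from Python's standard library rather than BBC BASIC, and not the code that went into Ceefax.bbc. The class and function names are made up, and the assumption that the wanted text always sits inside a `div` carrying `data-component="text-block"` comes only from the marker quoted above.

```python
from html.parser import HTMLParser

class TextBlockExtractor(HTMLParser):
    """Collect the text found inside <div data-component="text-block"> blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # <div> nesting inside the current text block (0 = outside)
        self._skip = 0      # greater than 0 while inside <style> or <script>
        self._parts = []    # text fragments of the block being read
        self.blocks = []    # finished blocks, one string each

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1
        elif tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside a text block
            elif dict(attrs).get("data-component") == "text-block":
                self.depth = 1           # entering a new text block
                self._parts = []

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip = max(0, self._skip - 1)
        elif tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.blocks.append(text)

    def handle_data(self, data):
        if self.depth > 0 and self._skip == 0:
            self._parts.append(data)

def extract_text_blocks(html):
    parser = TextBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks

demo = ('<div data-component="text-block"><style>.a > .b{color:red}</style>'
        '<div><p>A new <b>app</b> will launch.</p></div></div>'
        '<div data-component="links-block"><p>Skip me</p></div>')
print(extract_text_blocks(demo))
```

The parser treats the contents of `style` and `script` elements as raw text, so a `>` inside CSS does not disturb the tag tracking.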

My comment about "lateral thinking" turned out to be spot-on. The real reason I was having so much trouble parsing the file turns out to be that I had assumed (naturally enough, I think) that the number of < in the file and the number of > would match. That has always been true in my experience of HTML to date, so my code incremented a count every time it saw a < and decremented it on a > to keep track of 'level'.

But that's not true on these pages! Apparently it's legitimate for a lone > to appear in 'inline CSS' and that's what's happening: as a result there is an imbalance and my code was completely confused. Stripping out those lone > made all the difference, and things started to make more sense. The code I've got now works, but it's still quite specific to that particular page format. That's unfortunate but may be unavoidable.
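The effect of the imbalance can be shown with a small sketch, in Python rather than BBC BASIC. It is only a guess at the general shape of such a level counter, not the actual Ceefax.bbc code; the function names and the sample string are invented. The first function trusts every `>`; the second ignores a `>` that arrives when no tag is open.

```python
def strip_tags_naive(html):
    """Count '<' up and '>' down, keeping only characters seen at level 0."""
    level = 0
    out = []
    for ch in html:
        if ch == "<":
            level += 1
        elif ch == ">":
            level -= 1      # a lone '>' pushes the level below zero
        elif level == 0:
            out.append(ch)
    return "".join(out)

def strip_tags_tolerant(html):
    """Same counter, but a '>' seen at level 0 is treated as stray and dropped."""
    level = 0
    out = []
    for ch in html:
        if ch == "<":
            level += 1
        elif ch == ">":
            if level > 0:
                level -= 1  # closes a tag that really was open
        elif level == 0:
            out.append(ch)
    return "".join(out)

# A CSS child selector puts a lone '>' between the style tags.
snippet = "<style>.a > .b {}</style><p>Hello</p>"
print(repr(strip_tags_naive(snippet)))
print(repr(strip_tags_tolerant(snippet)))
```

In the naive version the level is off by one after the stray `>`, so tag names are kept as if they were content and the real content is discarded. The tolerant version recovers the content, although the CSS text itself still leaks through and would need separate filtering.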

I've updated the released versions with the new Ceefax.bbc, if only as an interim measure to restore functionality.
