Simplest editor for a String?

Baldilocks · Post by **Baldilocks** » Sat 30 Nov 2019, 11:24

The problem:
I've got hundreds of bank statements that I need to analyse (very long story).
I can't get electronic copies of the account (Thanks HSBC).
I've scanned them all in, then OCR'd them but I need to correct the OCR errors, many caused by pale printing, so that I can put them into a spreadsheet.
For example, lQ DCT iS CHO JONES 123.45 should read:
10 OCT 15 CHQ JONES 123.45

I can positively identify some errors by position and spacing - I know what the first example should read because of simple rules that I have already programmed, but lQ DCT iSCHO JONES 123.45 fails two simple tests (Date and transaction type) because of a missing space.

In this case I would like my program to stop and display a questionable line in a way that I can carry out a simple edit.
I have seen Richard's TEXTEDIT program but it is so far above my programming competency that I don't know where to start.

Any ideas, anyone? Please!

My program is something like:
Open OCR_File
Open Corrected_OCR_File
Repeat
  Input OCRLine$
  Proc_Correct-All-Obvious-Errors(OCRLine$, Hunky_Dory)
  IF NOT Hunky_Dory Then Proc_Edit(OCRLine$)
  Print corrected OCRLine$ To Corrected_OCR_File
Until End of OCR_File

RichardRussell · Post by **RichardRussell** » Sat 30 Nov 2019, 12:17

Baldilocks wrote: ↑Sat 30 Nov 2019, 11:24 lQ DCT iS CHO JONES 123.45 should read:
10 OCT 15 CHQ JONES 123.45

If everything is (or should be) strictly tabulated in columns it makes your life easier. Suppose for example you want to check that columns 8 and 9 contain a two digit numeric value; you could do something like:

      number$ = MID$(record$,8,2)
      IF STR$(VAL(number$)) <> number$ THEN PRINT "Columns 8 and 9 do not contain a two-digit number"
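
One caveat with the STR$(VAL()) comparison is that a value with a leading zero, such as "05", fails it, because STR$(VAL("05")) gives "5". A minimal sketch of a stricter character-by-character check follows; FNalldigits is a hypothetical helper name and the sample record is made up for illustration.

```bbcbasic
REM Sketch only: FNalldigits is a hypothetical helper, not part of BBC BASIC
record$ = "10 OCT 05 CHQ JONES 123.45"
IF FNalldigits(MID$(record$,8,2)) THEN PRINT "Columns 8 and 9 are numeric" ELSE PRINT "Columns 8 and 9 need checking"
END

REM Returns TRUE only if s$ is non-empty and every character is a decimal digit
DEF FNalldigits(s$)
LOCAL i%, ok%
ok% = LEN(s$) > 0
FOR i% = 1 TO LEN(s$)
  IF INSTR("0123456789", MID$(s$,i%,1)) = 0 THEN ok% = FALSE
NEXT
= ok%
```

The loop runs to completion rather than returning early, which keeps the control flow simple at the cost of a few extra comparisons.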

As far as actual editing of the string is concerned, 'in place' editing (i.e. editing that does not involve insertions or deletions) is easily done as follows:

      MID$(record$,8,2) = "15"

Insertions or deletions are a little more fiddly because you've got to break the string down, so for example to insert a space at column 10:

      record$ = LEFT$(record$,9) + " " + MID$(record$,10)
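
Deletion works the same way, by joining the text on either side of the unwanted character. A small sketch, assuming a record with a surplus space at column 8 (the sample string and col% are illustrative only):

```bbcbasic
REM Sketch only: delete the single character at column col%
record$ = "10 OCT  15 CHQ JONES 123.45"
col% = 8
record$ = LEFT$(record$, col%-1) + MID$(record$, col%+1)
PRINT record$
```

Here LEFT$ keeps columns 1 to 7 and MID$ resumes at column 9, so the second of the two spaces is dropped.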

Baldilocks · Post by **Baldilocks** » Sun 01 Dec 2019, 10:33

Thanks Richard. What I was hoping for was more along the lines of your last suggestion about insertions and deletions. The OCR'd text is without columns. I tried about 8 free OCR programs and only two made any attempt with columns; both of them assumed that they were reading newspaper columns and put column 2 below column 1, wrecking the horizontal relationships by removing blank lines. They all coped badly with the pale text. The best one, Capture2text, scored because I can open the TIF scan in a graphics viewer (IrfanView), magnify it (which interpolates the characters and 'improves' their resolution), then pop up the OCR to capture the text off the screen.

This proved to be more reliable - and faster - than the others, with a much lower error rate. The cost was that all long spaces are reduced to a single space!

I've already got a routine worked out for the dates which looks at a template and gives a score. A date is always 10 characters (including the terminator space) at the beginning of a line.
If character 1 (capitalised) matches d1$ then it gets 2 points; if it matches one of d1_alt$ (errors that I've seen) then it gets 1 point.
I'm just starting to build up a list of the errors that I've seen (or consider reasonable) for the others.
      d1$ = "0123"
      d1_alt$ = "ODCL!I8"
      d2$ = "0123456789"
      d2_alt$ = "ODCL!I8"
      m1$ = "JFMASOND"
      m1_alt$ = ""
      m2$ = "AEPUCO"
      m2_alt$ = ""
      m3$ = "NBRYLQPTVC"
      m3_alt$ = ""
      y1$ = "1"
      y1_alt$ = "L!I"
      y2$ = "0123456789"
      y2_alt$ = "ODCL!I8"

Each date can be tested by position for each of the above 7 characters plus the 3 spaces and get a 'perfect' score of 20. Most errors that I've worked on have been a single 'substitution' error and would give a score of 19, so I would be happy to get it to autocorrect that, possibly with my approval, e.g.:

ERROR
"l5 MAR 16 " Should it read
"15 MAR 16 "? Y/N
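
The scoring scheme described above might be sketched roughly as follows. This is an illustration under assumptions, not the actual routine: the arrays good$() and alt$() and the function FNdatescore are hypothetical names, and the templates are simply the d1$ to y2$ strings listed earlier with a space template added for positions 3, 7 and 10.

```bbcbasic
REM Sketch only: good$(), alt$() and FNdatescore are hypothetical names
DIM good$(10), alt$(10)
FOR i% = 1 TO 10
  READ good$(i%), alt$(i%)
NEXT
DATA "0123","ODCL!I8", "0123456789","ODCL!I8", " ",""
DATA "JFMASOND","", "AEPUCO","", "NBRYLQPTVC","", " ",""
DATA "1","L!I", "0123456789","ODCL!I8", " ",""
PRINT FNdatescore("15 MAR 16 ")
PRINT FNdatescore("l5 MAR 16 ")
END

REM 2 points for an expected character, 1 for a known misreading, 0 otherwise
DEF FNdatescore(line$)
LOCAL i%, c$, score%
FOR i% = 1 TO 10
  c$ = MID$(line$, i%, 1)
  REM Capitalise before comparing, as the templates are upper case
  IF c$ >= "a" AND c$ <= "z" THEN c$ = CHR$(ASC(c$) - 32)
  IF c$ <> "" THEN
    IF INSTR(good$(i%), c$) THEN
      score% += 2
    ELSE
      IF INSTR(alt$(i%), c$) THEN score% += 1
    ENDIF
  ENDIF
NEXT
= score%
```

With these templates a clean date scores 20, and the "l5 MAR 16 " example scores 19 because the capitalised "L" is found only in the alternatives for position 1.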

The same rule-based analysis can be applied to the important parts of the data. Automatic corrections can then be checked with a second pass through, because dates are sequential and the various transaction types ("SO CR DR DD BP DO OD ATM TFR VIS CHQ ") are all either credits or debits, so the Day Balances and Page Balances should make sense.

I can certainly identify lines with multiple errors and re-type them:

ERROR: This line contains multiple errors. Re-type it in its entirety:
lQ DCT iS CHO JONES 123.45

But I would be much happier editing the displayed line, if I can find a way.

Thanks again.

RichardRussell · Post by **RichardRussell** » Sun 01 Dec 2019, 12:29

Baldilocks wrote: ↑Sun 01 Dec 2019, 10:33 But I would be much happier editing the displayed line, if I can find a way.

In what way does my earlier reply not provide the solution? You presumably have your OCRd text in a file, or in a string, or in an array, so why can't you directly apply my suggestions in order to edit it?

Baldilocks · Post by **Baldilocks** » Sun 01 Dec 2019, 15:36

The problem is not one string, but thousands. I've got two corporate bank accounts to trawl, for about 6.5 years. Each month has between 4 and 6 pages of entries for each account with 20-40 transactions per page (depending on how verbose each one is).

RichardRussell · Post by **RichardRussell** » Sun 01 Dec 2019, 16:00

Baldilocks wrote: ↑Sun 01 Dec 2019, 15:36 The problem is not one string, but thousands.

I gained the impression from your original post that you had already OCRd them into one or more files, in which case processing thousands of strings is no more difficult than one (apart from the time taken), that's what loops are for! But perhaps I misunderstood the format in which all these records currently exist.

If it was my problem, my first step would be to get the OCRd data into a format in which BBC BASIC can easily access it, so for example one or more 'plain text' files containing one record per line (say delimited by CRLF terminators for compatibility with standard text editors or spreadsheets). Then I would write a program that processed these records individually, in a loop, writing the 'corrected' data to a new file.

If it's OK to treat the records in isolation, then it's a simple matter of reading each record into a string, processing the string to fix the errors, then writing the corrected string to another file; rinse-and-repeat until the entire input file has been processed. On the other hand if correcting the errors requires cross referencing with other records, I would read the entire input file into an array of strings, then process each element of the array (referring to other records as required), and then write the entire corrected array to a new file.

The latter approach, if it is necessary, would of course require considerably more storage to contain the array(s). If your dataset is so large that it will not fit into the default amount of memory available (2 Mbytes) you should be able to raise HIMEM sufficiently to make enough room (with a limit of around 512 Mbytes).
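
The whole-file approach could look something like the sketch below. The filenames, the max% limit and the FNfix placeholder are all assumptions made for illustration; the real correction rules would go inside FNfix, or directly in the processing loop where neighbouring records can be consulted.

```bbcbasic
REM Sketch only: filenames, max% and FNfix are illustrative assumptions
infile$ = "ocr.txt"
outfile$ = "corrected.txt"
max% = 20000
DIM record$(max%)

REM Read every non-empty record into the array
in% = OPENIN(infile$)
IF in% = 0 THEN PRINT "Cannot open ";infile$ : END
n% = 0
WHILE (NOT EOF#in%) AND (n% < max%)
  a$ = GET$#in%
  REM With CRLF-terminated files GET$# can return an empty string between records
  IF a$ <> "" THEN n% += 1 : record$(n%) = a$
ENDWHILE
CLOSE #in%

REM Process each record; record$(i%-1) and record$(i%+1) are available for cross-referencing
IF n% > 0 THEN
  FOR i% = 1 TO n%
    record$(i%) = FNfix(record$(i%))
  NEXT
ENDIF

REM Write the corrected records; BPUT# adds LF, so appending CR gives CRLF
out% = OPENOUT(outfile$)
IF out% = 0 THEN PRINT "Cannot create ";outfile$ : END
IF n% > 0 THEN
  FOR i% = 1 TO n%
    BPUT #out%, record$(i%) + CHR$13
  NEXT
ENDIF
CLOSE #out%
END

DEF FNfix(r$)
REM Placeholder: apply the rule-based corrections here
= r$
```

Note that skipping empty strings also discards genuinely blank lines in the input, which is assumed to be acceptable here.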

KenDown · Post by **KenDown** » Sun 01 Dec 2019, 18:32

I wonder if something like this is what you have in mind?

      a$="01 Mar 16 Cash transaction 255.73"
      PRINTTAB(2,10)a$
      REPEAT
        MOUSEmx%,my%,mb%
      UNTILmb%=4
      my%=1140

      c%=(mx%-34)DIV16

      VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4

      REPEAT
        g%=GET
        VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4
        CASEg%OF
          WHEN8
            a$=LEFT$(a$,c%-1)+MID$(a$,c%+1)
          WHEN13
          WHEN136:mx%-=16:c%-=1
          WHEN137:mx%+=16:c%+=1
          WHEN138
          WHEN139
          OTHERWISE
            a$=LEFT$(a$,c%)+CHR$g%+MID$(a$,c%+1)
            mx%+=16:c%+=1
        ENDCASE
        PRINTTAB(2,10)a$
        VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4
      UNTILg%=13

This will only process one line and you have to click on the line to start the editing. It allows you to insert or delete (if you want to overwrite I'm sure you can easily implement that). The string variable a$ is fixed at the beginning of the code, but in your actual code you would read it in from a file and perform whatever automatic alterations you want before displaying it on the screen and doing manual editing. Once you press Return (g%=13) you save the string to your output file and read in the next string from your input file.
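
For the overwrite case mentioned above, one possibility is to use MID$ as a statement rather than rebuilding the string. This is only a sketch: it assumes, as in the code above, that c% counts the characters to the left of the cursor and g% holds the key code, and the sample values are made up.

```bbcbasic
REM Sketch only: overwrite the character under the cursor instead of inserting
a$ = "lQ DCT iS CHO JONES 123.45"
c% = 0
g% = ASC("1")
IF c% < LEN(a$) THEN
  REM Replace the existing character in place
  MID$(a$, c%+1, 1) = CHR$(g%)
ELSE
  REM Cursor is at the end of the line, so append instead
  a$ += CHR$(g%)
ENDIF
c% += 1
PRINT a$
```

In the editing loop this would take the place of the insertion line in the OTHERWISE clause, with the cursor position updated in the same way.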

Baldilocks · Post by **Baldilocks** » Mon 02 Dec 2019, 11:55

Richard wrote:

If it was my problem, my first step would be to get the OCRd data into a format in which BBC BASIC can easily access it, so for example one or more 'plain text' files containing one record per line (say delimited by CRLF terminators for compatibility with standard text editors or spreadsheets). Then I would write a program that processed these records individually, in a loop, writing the 'corrected' data to a new file.

If it's OK to treat the records in isolation, then it's a simple matter of reading each record into a string, processing the string to fix the errors, then writing the corrected string to another file; rinse-and-repeat until the entire input file has been processed. On the other hand if correcting the errors requires cross referencing with other records, I would read the entire input file into an array of strings, then process each element of the array (referring to other records as required), and then write the entire corrected array to a new file.

Phew. My days of being a frequent, if not particularly fluent, programmer are long gone but I'm very glad to see that I didn't go too far astray on this.

My OCR program puts its output to the clipboard so I just dropped that into a Notepad++ file. I process the individual lines within arrays because of yet another quirk of HSBC statements - they are very mean with dates! They only use a date to indicate the beginning of a month's statement and a new date within that statement. A particular date is continued, even across 2 pages, until it is terminated by a Balance value.
They also allow very little space for the transaction details and each transaction can take up to 4 lines and also be split over 2 pages.
All this needs to be identified and acted upon, so my program does this analysis within 'boolean' arrays to identify HasDate, HasBalance etc. and reconstructs every transaction with its own date and balance so that it forms a complete and correct entry within the final spreadsheet.

At the moment I am refining all the different rules (called "Yak shaving" apparently) by running my program on a sample OCR page, then on several, until the majority of errors are identifiable and I can comfortably let it run and export everything to the spreadsheet - when error finding might start again!

Thank you.

Baldilocks · Post by **Baldilocks** » Mon 02 Dec 2019, 16:30

Yes KenDown, that's exactly what I was looking for!

Or at least, I think it probably is.

I have to confess that I have never (that I can remember) used VDU, MOUSE or GCOL commands so this is ice-breaking for me.
It's not working properly on my PC but I suspect that I have a different resolution set from yours, or something of that ilk.
I'm working on it!

Many thanks.

Baldi

KenDown · Post by **KenDown** » Wed 04 Dec 2019, 04:13

Well, my setup is 1920x1080, so if yours is different you will have to alter the setting for my%. Insert the word STOP immediately after the MOUSE command, then run the program, click on the string and see what value you get with PRINTmy%.

The simplest way of making the program portable is to set the size of BASIC's output window right at the start. Here is what I hope is an improved version. It runs with a fixed value for a$. To get it to work with a disk file, remove any REM that is followed by BASIC code (i.e. REM PRINTa$ will become PRINTa$) and delete any line which ends with REM Remove this line. This sets a red "cursor" at the start of the text, which you move around with the left and right arrow keys. There is no mouse clicking involved.

      PROCinit
      REMF%=OPENINinfile$
      REMG%=OPENOUToutfile$
      REPEAT
        a$=FNget(F%)
        a$=FNpreprocess(a$)
        PRINTTAB(2,4)a$
        l%=LENa$
        mx%=16:my%=1320
        c%=0
        REPEAT
          VDU5:GCOL3,6:MOVEmx%+8,my%:PRINT"|":VDU4
          g%=GET
          VDU5:GCOL3,6:MOVEmx%+8,my%:PRINT"|":VDU4
          CASEg%OF
            WHEN8:REM Backspace
              IFc%>0THEN
                a$=LEFT$(a$,c%-1)+MID$(a$,c%+1)
                PRINTTAB(2,4)a$+" "
                c%-=1:mx%-=16
                l%-=1
              ENDIF
            WHEN9:REM Tab
            WHEN13:REM Return
            WHEN130:REM Home
              c%=0:mx%=16
            WHEN131:REM End
              c%=l%:mx%=16+c%*16
            WHEN136:REM Left arrow key
              IFc%>0c%-=1:mx%-=16
            WHEN137:REM Right arrow key
              IFc%<l%c%+=1:mx%+=16
            OTHERWISE
              a$=LEFT$(a$,c%)+CHR$g%+MID$(a$,c%+1)
              PRINTTAB(2,4)a$
              c%+=1:mx%+=16
              l%+=1
          ENDCASE
        UNTILg%=13
        STOP:REM Remove this line when reading from file
        REMBPUT#G%,a$+CHR$13
        REMUNTILEOF#F%
      UNTILFALSE:REM Remove this line when reading from file
      CLOSE#F%
      CLOSE#G%
      END
      :
      DEFFNpreprocess(a$)
      REM Insert preprocessing here
      =a$
      :
      DEFFNget(h%):LOCALa$
      REMa$=GET$#h%:IFa$=""a$=GET$#h%
      a$="10 Oct 2019 Cash transaction 255.73":REM Remove this line when reading from file
      =a$
      :
      DEFPROCinit
      SYS"GetWindowLong",@hwnd%,-16TOws%
      SYS"SetWindowLong",@hwnd%,-16,ws%AND&FFFBFFFF AND&FFFEFFFF
      REM These calls give you the screen dimensions in pixels (Windows units), not BBC BASIC graphics units
      SYS"GetSystemMetrics",0TOscreenx%
      SYS"GetSystemMetrics",1TOscreeny%
      winwide%=500*2+12:REM The width of your main dialog box*2+12 to allow for the window borders
      winhigh%=200*2+58:REM The height of your main dialog box*2+58 to allow for borders and menu bar
      sx%=(screenx%-winwide%)DIV2
      sy%=(screeny%-winhigh%)DIV2-30
      SYS"SetWindowPos",@hwnd%,0,sx%,sy%,winwide%,winhigh%,0
      ENDPROC

BBC BASIC forum
