Simplest editor for a String?
-
- Posts: 8
- Joined: Sat 30 Nov 2019, 10:04
Simplest editor for a String?
The problem:
I've got hundreds of bank statements that I need to analyse (very long story).
I can't get electronic copies of the account (Thanks HSBC).
I've scanned them all in, then OCR'd them but I need to correct the OCR errors, many caused by pale printing, so that I can put them into a spreadsheet.
lQ DCT iS CHO JONES 123.45 should read:
10 OCT 15 CHQ JONES 123.45
I can positively identify some errors by position and spacing - I know what the first example should read because of simple rules that I have already programmed, but lQ DCT iSCHO JONES 123.45 fails two simple tests (Date and transaction type) because of a missing space.
In this case I would like my program to stop and display a questionable line in a way that I can carry out a simple edit.
I have seen Richard's TEXTEDIT program but it is so far above my programming competency that I don't know where to start.
Any ideas, anyone? Please!
My program is something like:
Open OCR_File
Open Corrected_OCR_File
Repeat
Input OCRLine$
Proc_Correct-All-Obvious-Errors(OCRLine$, Hunky_Dory)
IF NOT Hunky_Dory Then Proc_Edit(OCRLine$)
Print Correct OCRLine$ To Corrected_OCR_File
Until End of OCR_File
Re: Simplest editor for a String?
Baldilocks wrote: ↑Sat 30 Nov 2019, 11:24
lQ DCT iS CHO JONES 123.45 should read:
10 OCT 15 CHQ JONES 123.45
If everything is (or should be) strictly tabulated in columns it makes your life easier. Suppose for example you want to check that columns 8 and 9 contain a two-digit numeric value; you could do something like:
Code: Select all
number$ = MID$(record$,8,2)
IF STR$(VAL(number$)) <> number$ THEN PRINT "Columns 8 and 9 do not contain a two-digit number"
Code: Select all
MID$(record$,8,2) = "15"
Code: Select all
record$ = LEFT$(record$,9) + " " + MID$(record$,10)
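The three fragments above test a fixed-width field, overwrite it in place, and insert a missing space. As a minimal sketch of how those ideas might be wrapped into reusable routines, something like the following could work. FNisdigits, FNinsert and FNdelete are hypothetical names made up for illustration; they are not built into BBC BASIC and are not from the original reply. The digit test checks each character separately, so (unlike the STR$(VAL()) comparison) it should also accept a value with a leading zero such as "05".

```bbcbasic
REM Sketch only: FNisdigits, FNinsert and FNdelete are hypothetical helper names
record$ = "lQ DCT iSCHO JONES 123.45"
IF NOT FNisdigits(MID$(record$,8,2)) THEN PRINT "Columns 8 and 9 are not both digits"
MID$(record$,8,2) = "15" : REM overwrite two characters in place
record$ = FNinsert(record$, 10, " ") : REM restore the missing space
PRINT record$
END

REM TRUE if s$ is non-empty and consists only of the characters 0 to 9
DEF FNisdigits(s$)
LOCAL i%, c%, ok%
ok% = (s$ <> "")
FOR i% = 1 TO LEN(s$)
  c% = ASC(MID$(s$,i%,1)) : REM ASC gives -1 for an empty string
  IF c% < 48 OR c% > 57 THEN ok% = FALSE
NEXT
= ok%

REM Insert t$ so that it starts at character position p% (1-based)
DEF FNinsert(s$, p%, t$) = LEFT$(s$, p%-1) + t$ + MID$(s$, p%)

REM Delete n% characters starting at position p% (1-based)
DEF FNdelete(s$, p%, n%) = LEFT$(s$, p%-1) + MID$(s$, p%+n%)
```

With the sample record this should report the bad field and then print lQ DCT 15 CHO JONES 123.45; the remaining substitution errors would still need their own rules.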
-
- Posts: 8
- Joined: Sat 30 Nov 2019, 10:04
Re: Simplest editor for a String?
Thanks Richard. What I was hoping for was more along the lines of your last suggestion about insertions and deletions. The OCR'd text is without columns. I tried about 8 free OCR programs and only two made any attempt with columns; both of them assumed that they were reading newspaper columns and put column 2 below column 1, wrecking the horizontal relationships by removing blank lines. They were all pretty naff, coping badly with the pale text. The best one, Capture2text, scored because I can open the TIF scan in a graphics viewer (IrfanView), magnify it (which interpolates the characters and 'improves' their resolution), then pop up the OCR to capture the text off the screen.
This proved to be more reliable, and faster, than the others, with a much lower error rate. The cost was that all long spaces are reduced to a single space!
I've already got a routine worked out for the dates which looks at a template and gives a score. A date is always 10 characters (including the terminator space) at the beginning of a line.
If character 1 (capitalised) matches d1$ then it gets 2 points; if it matches one of d1_alt$ (errors that I've seen) then it gets 1 point.
I'm just starting to build up a list of the errors that I've seen (or consider reasonable) for the others.
d1$ = "0123"
d1_alt$ = "ODCL!I8"
d2$ = "0123456789"
d2_alt$ = "ODCL!I8"
m1$ = "JFMASOND"
m1_alt$ = ""
m2$ = "AEPUCO"
m2_alt$ = ""
m3$ = "NBRYLGPTVC"
m3_alt$ = ""
y1$ = "1"
y1_alt$ = "L!I"
y2$ = "0123456789"
y2_alt$ = "ODCL!I8"
Each date can be tested by position for each of the above 7 characters plus the 3 spaces and get a 'perfect' score of 20. Most errors that I've worked on have been a single 'substitution' error and would give a score of 19, so I would be happy to get it to autocorrect that, possibly with my approval, e.g.:
ERROR
"l5 MAR 16 " Should it read
"15 MAR 16 "? Y/N
The same rule-based analysis can be applied to the important parts of the data, and automatic corrections can be checked with a second pass through because dates are sequential and the various transaction types "SO CR DR DD BP DO OD ATM TFR VIS CHQ " are all either credits or debits, so the Day Balances and Page Balances should make sense.
I can certainly identify lines with multiple errors and re-type them:
ERROR: This line contains multiple errors. Re-type it in its entirety:
lQ DCT iS CHO JONES 123.45
But I would be much happier editing the displayed line, if I can find a way.
Thanks again.
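The template scoring described in this post could be sketched roughly as below. This is only an illustration under stated assumptions: FNdatescore, FNupper and the arrays good$() and alt$() are hypothetical names, the three space positions are assumed to score 2 points each (which is what makes a perfect date total 20), and the third-letter set is assumed to contain G for AUG.

```bbcbasic
REM Sketch only: all names here are hypothetical, not from the original program
DIM good$(10), alt$(10)
good$(1) = "0123"       : alt$(1) = "ODCL!I8"
good$(2) = "0123456789" : alt$(2) = "ODCL!I8"
good$(3) = " "
good$(4) = "JFMASOND"
good$(5) = "AEPUCO"
good$(6) = "NBRYLGPTVC" : REM assumption: G for AUG
good$(7) = " "
good$(8) = "1"          : alt$(8) = "L!I"
good$(9) = "0123456789" : alt$(9) = "ODCL!I8"
good$(10) = " "

PRINT FNdatescore("10 OCT 15 ") : REM perfect date: 20
PRINT FNdatescore("l5 MAR 16 ") : REM one known substitution: 19
END

REM 2 points per expected character, 1 per known misread, 0 otherwise
DEF FNdatescore(d$)
LOCAL i%, ch$, s%
REM Pad short strings with a character that is in none of the sets
d$ = FNupper(LEFT$(d$ + STRING$(10,"~"), 10))
FOR i% = 1 TO 10
  ch$ = MID$(d$, i%, 1)
  IF INSTR(good$(i%), ch$) THEN
    s% += 2
  ELSE
    IF INSTR(alt$(i%), ch$) THEN s% += 1
  ENDIF
NEXT
= s%

REM Return s$ with a-z converted to A-Z
DEF FNupper(s$)
LOCAL i%, c%
FOR i% = 1 TO LEN(s$)
  c% = ASC(MID$(s$, i%, 1))
  IF c% >= 97 AND c% <= 122 THEN MID$(s$, i%, 1) = CHR$(c% - 32)
NEXT
= s$
```

A score of 19 could then trigger the "Should it read ...? Y/N" prompt, and anything lower could be passed to manual editing.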
Re: Simplest editor for a String?
Baldilocks wrote: ↑Sun 01 Dec 2019, 10:33
But I would be much happier editing the displayed line, if I can find a way.
In what way does my earlier reply not provide the solution? You presumably have your OCRd text in a file, or in a string, or in an array, so why can't you directly apply my suggestions in order to edit it?
-
- Posts: 8
- Joined: Sat 30 Nov 2019, 10:04
Re: Simplest editor for a String?
The problem is not one string, but thousands. I've got two corporate bank accounts to trawl, for about 6.5 years. Each month has between 4 and 6 pages of entries for each account, with 20-40 transactions per page (depending on how verbose each one is).
Re: Simplest editor for a String?
I gained the impression from your original post that you had already OCRd them into one or more files, in which case processing thousands of strings is no more difficult than one (apart from the time taken), that's what loops are for! But perhaps I misunderstood the format in which all these records currently exist.
If it was my problem, my first step would be to get the OCRd data into a format in which BBC BASIC can easily access it, so for example one or more 'plain text' files containing one record per line (say delimited by CRLF terminators for compatibility with standard text editors or spreadsheets). Then I would write a program that processed these records individually, in a loop, writing the 'corrected' data to a new file.
If it's OK to treat the records in isolation, then it's a simple matter of reading each record into a string, processing the string to fix the errors, then writing the corrected string to another file; rinse-and-repeat until the entire input file has been processed. On the other hand if correcting the errors requires cross referencing with other records, I would read the entire input file into an array of strings, then process each element of the array (referring to other records as required), and then write the entire corrected array to a new file.
The latter approach, if it is necessary, would of course require considerably more storage to contain the array(s). If your dataset is so large that it will not fit into the default amount of memory available (2 Mbytes) you should be able to raise HIMEM sufficiently to make enough room (with a limit of around 512 Mbytes).
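For the simpler case, where each record can be treated in isolation, the read-fix-write loop might look something like this sketch. The file names "ocr.txt" and "corrected.txt" and the function FNfix are hypothetical placeholders, and the input is assumed to be plain text with one record per line.

```bbcbasic
REM Sketch only: the file names and FNfix are hypothetical placeholders
in% = OPENIN("ocr.txt")
IF in% = 0 THEN PRINT "Cannot open input file" : END
out% = OPENOUT("corrected.txt")
WHILE NOT EOF#in%
  rec$ = GET$#in%
  REM With CRLF line endings GET$# may return an empty string between
  REM records, so empty strings (and genuinely blank lines) are skipped
  IF rec$ <> "" THEN
    rec$ = FNfix(rec$)
    BPUT#out%, rec$ + CHR$13 : REM BPUT# adds the LF, giving a CRLF terminator
  ENDIF
ENDWHILE
CLOSE#in% : CLOSE#out%
END

DEF FNfix(r$)
REM Placeholder: apply the automatic corrections to r$ here
= r$
```

If blank lines in the input are significant, the empty-string test would need to be replaced with something that distinguishes them from the second half of a CRLF pair.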
-
- Posts: 327
- Joined: Wed 04 Apr 2018, 06:36
Re: Simplest editor for a String?
I wonder if something like this is what you have in mind?
This will only process one line and you have to click on the line to start the editing. It allows you to insert or delete (if you want to overwrite I'm sure you can easily implement that). The string variable a$ is fixed at the beginning of the code, but in your actual code you would read it in from a file and perform whatever automatic alterations you want before displaying it on the screen and doing manual editing. Once you press Return (g%=13) you save the string to your output file and read in the next string from your input file.
Code: Select all
a$="01 Mar 16 Cash transaction 255.73"
PRINTTAB(2,10)a$
REPEAT
MOUSEmx%,my%,mb%
UNTILmb%=4
my%=1140
c%=(mx%-34)DIV16
VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4
REPEAT
g%=GET
VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4
CASEg%OF
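WHEN8:REM Backspace: delete the character to the left of the cursor
IFc%>0 THEN a$=LEFT$(a$,c%-1)+MID$(a$,c%+1):mx%-=16:c%-=1
WHEN13
WHEN136:mx%-=16:c%-=1
WHEN137:mx%+=16:c%+=1
WHEN138
WHEN139
OTHERWISE
a$=LEFT$(a$,c%)+CHR$g%+MID$(a$,c%+1)
mx%+=16:c%+=1
ENDCASE
PRINTTAB(2,10)a$+" ":REM Trailing space erases the leftover character after a deletion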
VDU5:GCOL3,2:MOVEmx%-8,my%:PRINT"|":VDU4
UNTILg%=13
-
- Posts: 8
- Joined: Sat 30 Nov 2019, 10:04
Re: Simplest editor for a String?
Richard wrote:
If it was my problem, my first step would be to get the OCRd data into a format in which BBC BASIC can easily access it, so for example one or more 'plain text' files containing one record per line (say delimited by CRLF terminators for compatibility with standard text editors or spreadsheets). Then I would write a program that processed these records individually, in a loop, writing the 'corrected' data to a new file.
If it's OK to treat the records in isolation, then it's a simple matter of reading each record into a string, processing the string to fix the errors, then writing the corrected string to another file; rinse-and-repeat until the entire input file has been processed. On the other hand if correcting the errors requires cross referencing with other records, I would read the entire input file into an array of strings, then process each element of the array (referring to other records as required), and then write the entire corrected array to a new file.
Phew. My days of being a frequent, if not particularly fluent, programmer are long gone but I'm very glad to see that I didn't go too far astray on this.
My OCR program puts its output to the clipboard so I just dropped that into a Notepad++ file. I process the individual lines within arrays because of yet another quirk of HSBC statements - they are very mean with dates! They only use a date to indicate the beginning of a month's statement and a new date within that statement. A particular date is continued, even across 2 pages, until it is terminated by a Balance value.
They also allow very little space for the transaction details and each transaction can take up to 4 lines and also be split over 2 pages.
All this needs to be identified and acted upon, so my program does this analysis within 'boolean' arrays to identify HasDate, HasBalance etc. and reconstructs every transaction with its own date and balance so that it forms a complete and correct entry within the final spreadsheet.
At the moment I am refining all the different rules (called "Yak shaving" apparently) by running my program on a sample OCR page, then on several, until the majority of errors are identifiable and I can comfortably let it run and export everything to the spreadsheet - when error finding might start again!
Thank you.
-
- Posts: 8
- Joined: Sat 30 Nov 2019, 10:04
Re: Simplest editor for a String?
Yes KenDown, that's exactly what I was looking for!
Or at least, I think it probably is.
I have to confess that I have never (that I can remember) used VDU, MOUSE or GCOL commands so this is ice-breaking for me.
It's not working properly on my PC but I suspect that I have a different resolution set from yours, or something of that ilk.
I'm working on it!
Many thanks.
Baldi
-
- Posts: 327
- Joined: Wed 04 Apr 2018, 06:36
Re: Simplest editor for a String?
Well, my setup is 1920x1080, so if yours is different you will have to alter the setting for my%. Insert the word STOP immediately after the MOUSE command, then run the program, click on the string and see what value you get with PRINTmy%.
The simplest way of making the program portable is to set the size of BASIC's output window right at the start. Here is what I hope is an improved version. It runs with a fixed value for a$. To get it to work with a disk file remove any REM that is followed by BASIC code (ie. REM PRINTa$ will become PRINTa$) and any line which is followed by REM Remove this line. This sets a red "cursor" at the start of the text, which you move around with the left and right arrow keys. There is no mouse clicking involved.
Code: Select all
PROCinit
REMF%=OPENINinfile$
REMG%=OPENOUToutfile$
REPEAT
a$=FNget(F%)
a$=FNpreprocess(a$)
PRINTTAB(2,4)a$
l%=LENa$
mx%=16:my%=1320
c%=0
REPEAT
VDU5:GCOL3,6:MOVEmx%+8,my%:PRINT"|":VDU4
g%=GET
VDU5:GCOL3,6:MOVEmx%+8,my%:PRINT"|":VDU4
CASEg%OF
WHEN8:REM Backspace
IFc%>0THEN
a$=LEFT$(a$,c%-1)+MID$(a$,c%+1)
PRINTTAB(2,4)a$+" "
c%-=1:mx%-=16
l%-=1
ENDIF
WHEN9:REM Tab
WHEN13:REM Return
WHEN130:REM Home
c%=0:mx%=16
WHEN131:REM End
c%=l%:mx%=16+c%*16
WHEN136:REM Left arrow key
IFc%>0c%-=1:mx%-=16
WHEN137:REM Right arrow key
IFc%<l%c%+=1:mx%+=16
OTHERWISE
a$=LEFT$(a$,c%)+CHR$g%+MID$(a$,c%+1)
PRINTTAB(2,4)a$
c%+=1:mx%+=16
l%+=1
ENDCASE
UNTILg%=13
STOP:REM Remove this line when reading from file
REMBPUT#G%,a$+CHR$13
REMUNTILEOF#F%
UNTILFALSE:REM Remove this line when reading from file
CLOSE#F%
CLOSE#G%
END
:
DEFFNpreprocess(a$)
REM Insert preprocessing here
=a$
:
DEFFNget(h%):LOCALa$
REMa$=GET$#h%:IFa$=""a$=GET$#h%
a$="10 Oct 2019 Cash transaction 255.73":REM Remove this line when reading from file
=a$
:
DEFPROCinit
SYS"GetWindowLong",@hwnd%,-16TOws%
SYS"SetWindowLong",@hwnd%,-16,ws%AND&FFFBFFFF AND&FFFEFFFF
REM These calls give you the screen dimensions in graphics units, not Windows units
SYS"GetSystemMetrics",0TOscreenx%
SYS"GetSystemMetrics",1TOscreeny%
winwide%=500*2+12:REM The width of your main dialog box*2+12 to allow for the window borders
winhigh%=200*2+58:REM The height of your main dialog box*2+58 to allow for borders and menu bar
sx%=(screenx%-winwide%)DIV2
sy%=(screeny%-winhigh%)DIV2-30
SYS"SetWindowPos",@hwnd%,0,sx%,sy%,winwide%,winhigh%,0
ENDPROC