Reflowing a paragraph

General discussions related to the Altair 8800 Clone

Reflowing a paragraph

Postby mail@gabrielegan.com » May 4th, 2016, 4:05 pm

Well that was harder than I expected! I've been working on an assembler routine to 'reflow' a text paragraph, by which I mean changing the maximum allowed length of a line while word-wrapping so that line-breaks occur only between words.

With a short line length (30 characters) a word-wrapped paragraph might look like this:

<< Emma Woodhouse, handsome,
clever, and rich, with a
comfortable home and happy
disposition, seemed to unite
some of the best blessings
of existence; and had lived
nearly twenty-one years in
the world with very little
to distress or vex her. >>

With a longer line length (50 characters) the same word-wrapped paragraph would look like this:

<< Emma Woodhouse, handsome, clever, and rich, with
a comfortable home and happy disposition, seemed
to unite some of the best blessings of existence;
and had lived nearly twenty-one years in the world
with very little to distress or vex her. >>

By 'reflowing' a paragraph I mean transforming a block of text in memory from the first to the second example above simply by specifying a new maximum line length (known as the 'the measure' in type-setting).

It turned out to be particularly hard because of the convention of representing line-ends with two ASCII bytes: CR+LF. This convention means that the total number of characters in a paragraph changes when you reflow it since one-byte spaces become two-byte line endings and two-byte line endings become one-byte spaces. This means that when reflowing to a narrower measure the paragraph occupies more bytes than before (because it gains line endings) and when reflowing
to a wider measure the paragraph occupies fewer bytes than before (because it loses line endings).

This is a pain for memory management--you have to leave some space after the paragraph for it to grow--and it makes the programming fiddly. If only there were a single character to mark the end of a line the programming for reflowing would be easy and the paragraph would stay the same length in terms of bytes: all you'd do is change some spaces for line-ends and vice versa.

Then I remembered that Apple Macintosh computers use a single character for line endings (whereas MS-DOS/Windows machines use CR+LF) and I got to wondering if this is connected to the first Apple Macs being considered amazing because their GUIs could reflow paragraphs as you altered the size of the window the text was in. Until then, all that personal computers could do was 'reflow' as a command, not as a dynamic and visible on-screen GUI activity. Maybe, I thought, Apple made the decision to have one character line endings precisely so they could have fast, simple reflowing of paragraphs in their GUI

Through the kind offices of his publisher, O'Reilly Books, I was able to ask Andy Hertzfeld, the designer of the system software on the original Macintosh, whether this consideration was part of his thinking, and in fact it wasn't. He explained that the choice of a single byte line terminator as a standard on the Mac was inherited from the Apple Lisa computer, which itself inherited the standard from the USCD Pascal system used to develop the Lisa and Mac. He expressed a preference for one-byte line termination in any case, since "it's simpler and smaller (one character vs two) and cleaner and better defined (no opportunity for ambiguity -- in the two character approach, what happens if one character is missing?)"

My paragraph reflower is copied below. It's a subroutine that, for the purposes of testing and showing to others, is here accompanied by other subroutines (using CP/M BDOS system calls) for entering the paragraph at the keyboard and for displaying it after it has been reflowed. My reflower takes three passes through the paragraph: the first strips out the existing line endings and puts spaces in their places thus shortening the byte-count), the second applies the new measure by changing some of those spaces to single-byte line terminators (just a CR), and then the third pass turns these one-byte CR line terminators into two-byte CR+LF terminators, thus increasing the byte-count.

I reckon that with some thought I could combine the first two passes, stripping out the existing line endings and applying the new ones on-the-fly. But I believe that the third pass has to remain a separate operation because it needs to work backwards from the end of the paragraph, since the paragraph may grow. (At least, that's essential if the reflower works on the paragraph 'in-situ', as mine done, rather than making a copy of it elsewhere in memory.)

I'd be interested in critiques of this code, and in particular:

1) Is there a more efficient algorithm I could have used? I don't mean better reflowing in itself, as in Donald Knuth's 'Minimum raggedness' routine that counts the sum of the squares of the number of spaces at the ends of the lines. I mean just a smarter way to achieve the simple 'Minimum number of lines' result used in my program.

2) Does the program commit any stylistic errors that a beginner should seek to avoid? I've tried to be scrupulous about each subroutine leaving every register the way it found it and may have overdone the PUSHing and POPping needed to achieve this. To make sure I didn't overflow the CP/M allocated stack I set up my own 64-byte stack as the guide I'm using, 'Soul of CP/M' by Michael Waite and Robert Lafore, recommends doing that.

Last thought: am I wrong to think that the CP/M BDOS routines are really annoying in the way that they trash the registers each time you call them? Surely good programming practice would have been to avoid doing this? I've had to put a lot of work into PUSHing and POPping registers every time I call a BDOS routine because of this problem. CP/M is brilliant and all, but isn't this bad practice? Or is there a good reason that they do it that I haven't though of?

Regards

Gabriel Egan


Code: Select all
;---------------------------------------------------------------------
; IN-SITU PARAGRAPH REFLOWER. Parameters are passed to this
;   subroutine thus:
;
; On entry, BC must contain the address of the start of the paragraph
;   (hereafter called SOP)
;
; On entry DE must contains the new width ('measure') to be applied
;
; On entry, the paragraph must have ASCII CR/LF at the end of every
;   line and CR/LF/CR/LF to mark the end of the paragraph. The same
;   formatting will structure the reflowed paragraph on exit.
;
; During execution the passed parameters are stored in
;   the subroutine's memory locations SOP and MEAS (allocated
;   by the assembler), registers BC and DE get used as a pair of
;   pointers to locations within the paragraph, and HL gets
;   used to count the number of CRs inserted to reflow the
;   paragraph. The use of 16-bit values for 'measure'
;   and CR-count means there is no limit (other than the
;   processor's own) on the width or length of the
;   paragraph. Because two-byte line-endings (CR/LF) are
;   necessarily changed to one-byte spaces and vice versa to
;   perform reflowing, the resulting paragraph will be shorter
;   in total bytes (if reflowed to a wider 'measure') or
;   longer in total bytes (if reflowed to a narrower 'measure')
;   than the original paragraph. It is the user's responsibility
;   to ensure that enough memory locations are free directly
;   after the end of the original paragraph to allow for any
;   expansion.
;
; On exit, all registers are restored to their on-entry values.
;
; CP/M's stack pointer is saved in memory and another, larger stack
;    is created for this program (using the 64 bytes after the program's
;    end), and then CP/M's stack pointer is restored on exit
;
;--------------------------------------------------------------------

;--------------------------------------------------------------------
; Assign some symbols to make the code more legible
;--------------------------------------------------------------------
BDOS    equ 05h ; The jump vector into BDOS
GETCHAR equ 01h ; BDOS code for getting one char from keyboard
WRTCHAR equ 02h ; BDOS code for writing a char to console
WRTLINE equ 09h ; BDOS code for writing a line to the console
CR      equ 0Dh ; ASCII carriage return character
LF      equ 0Ah ; ASCII line feed character
SPACE   equ 20h ; ASCII space character
CTRLC   equ 03h ; ASCII Control-C character

;----------------------------------------------------------------
; Here a short bit of code to call the REFLOW routine to show
; it works. First call GETKEY to have the user enter a
; paragraph, then call REFLOW to reflow it, and then call
; DISPlay to show the reflowed paragraph
;----------------------------------------------------------------
    org 0100h

; Save all registers we came in with
    PUSH PSW    ; First lets save all the registers
    PUSH B      ;   on the stack so we can restore
    PUSH D      ;   them upon exitting the routine
    PUSH H

; Set up our own stack starting 64 bytes after this program
    LXI H, 0h        ; Init HL to 0
    DAD SP           ; HL = HL + SP (so LH = SP)
    SHLD OLDSP       ; Stash away the CP/M stack pointer
    LXI SP, STKTOP   ; And put the address of our stack into SP
   
; Set the variables for REFLOW, call DISP, then REFLOW, then DISP again
    LXI B, 8000h     ; Put address of string-start into BC
    LXI D, 30h       ; Put desired measure (here 48 decimal) into DE
    CALL GETKEY      ; Let use type in a paragraph
    CALL REFLOW      ; Call the subroutine for flowing
    CALL DISP        ; Throw the flowed paragraph onto console

; Put back CP/M stack
    LHLD OLDSP       ; Put into HL the 16-bit address of CP/M's
                     ;   stack pointer we stashed away earlier
    SPHL             ; And put in the SP register

; Restore all the registers we came in with
    POP H            ; Before we Return lets get back all the registers
    POP D   
    POP B
    POP PSW

    RET              ; go back to CP/M
;-------------------------------------------------------------------------
; END OF MAIN PROGRAM. The rest is subroutines called by above.
;-------------------------------------------------------------------------

;--------------------------------------------------------------------------
; REFLOW actually starts here
;-----------------------------------------------------------------
REFLOW  PUSH PSW    ; First lets save all the registers
        PUSH B      ;   on the stack so we can restore
        PUSH D      ;   them upon exitting the routine
        PUSH H

; Now save in memory the two variables (SOP and 'measure')
; passed to the subroutine via regs BC and DE
        LXI H, SOP      ; Put the addresss of the variable SOP in HL
        MOV M, B        ;   and store there the contents of reg B
        LXI H, SOP+1    ; Put the address of second btye of SOP in HL
        MOV M, C        ;   and store there the contents of reg C
        LXI H, MEAS     ; Do the same for 'measure' passed in DE
        MOV M, D
        LXI H, MEAS+1
        MOV M, E

;-------------------------------------------------------------------
; Because we've saved their contents, we can use BC and DE as pointers
; and HL as the count of CRs. BC already contains the address of the
; start-of-block we wish to reflow
;--------------------------------------------------------------------

;------------------------------------------------------------------
; FIRST PASS. Turn all CR/LF pairs (except the two at the paragraph
; end) into single spaces, contracting the paragraph as necessary.
; BC will be the lo-pointer and DE the hi-pointer above it that
; skips over LFs we don't want to copy down to the lo-pointer
; location. In this pass we work our way up thru memory.
;------------------------------------------------------------------
        MOV D, B        ; Init. hi-pointer DE by
        MOV E, C        ;   copying lo-pointer BC

LOOP1   LDAX D          ; Copy one character from hi-pointer to
        STAX B          ;   lo-pointer
        CPI CR          ; Is the char we just moved a CR?
        JNZ NOTCR       ; If not, skip CR-handling
        INX D           ; Ok, we hit a CR, so is it the beginning of
        INX D           ;   a CR/LR/CR/LF run? Look two chars ahead
        LDAX D          ;   and see if that is a CR
        CPI CR          ;
        JNZ LINEND      ; No, then skip ahead to line-end handler
                        ; Yes, so we are at paragraph-end so let's tidy
                        ;   up the the end of new paragraph at lo-pointer.
                        ; We already put a CR at lo-pointer to start
        MVI A, LF       ;   the CR/LF/CR/LF run at lo-pointer,
        INX B           ;   so increment lo-pointer and
        STAX B          ;   put a LR there
        MVI A, CR       ; Then increment lo-pointer again and
        INX B           ;   put a CR there
        STAX B          ;
        MVI A, LF       ; Then do the same to put in the
        INX B           ;   last LF of CR/LF/CR/LF
        STAX B          ; 
        JMP PASS2       ; The para-end is now tidy so proceed to PASS2
LINEND                  ; If we get here the current char is CR but
                        ;   it's not the start of a CR/LF/CR/LF run.
        MVI A, SPACE    ; We just placed at lo-pointer a CR that we
        STAX B          ;   need to make a space, so do that.
        DCX D           ; We moved hi-pointer up 2 bytes to check for
        DCX D           ;   CR/LF/CR/LF so lets put it back, and then
        INX D           ;   move it up one to jump over the unwanted LF
NOTCR   INX B           ; Increment lo-pointer
        INX D           ;   and hi-pointer
        JMP LOOP1       ; And go back for next character

;---------------------------------------------------------------------------------
; SECOND PASS. We apply the new measure by turning the space after the
; last-word-that-fits (=the space before the first-word-that-won't-fit)
; on each line into a CR. (In the third pass we will turn these CRs into
; CR/LFs.) In this pass, BC will be lo-pointer, there is no hi-pointer,
; and DE (=space left) will start out holding the passed 'measure' and
; be decremented each time we accept a character and when it hits
; zero we've run out of room on the present line. DE (= space left)
; will be reset to 'measure' each time we start to process another
; HL will hold a count of the CRs we insert in this pass, which info
; is needed for the final pass.
;---------------------------------------------------------------------------------
PASS2   LXI H, SOP      ; Lets get SOP pointer back into BC (since we used
        MOV B, M        ;   BC as lo-pointer in PASS1) by recovering it
        LXI H, SOP+1    ;   from memory locations SOP and SOP+1
        MOV C, M
        LXI H, MEAS     ; And do the same for DE to hold Measure (since
        MOV D, M        ;   we used DE as hi-pointer in PASS1)
        LXI H, MEAS+1   ;
        MOV E, M
        LXI H, 0        ; Set HL (counter of CRs inserted) to zero

LOOP2   LDAX B          ; Get the char. currently lo-pointed to.
        CPI CR          ; Is it a CR? If so we've reach end-of-para
        JZ PASS3        ;   and should proceed to PASS3
                        ; So, it's not a CR. Do we need to put in a
                        ;   line-break because DE (=space left) = 0?
        MOV A, D        ; Put MSB of DE (=space left) into accumm.
        ORA E           ;   and OR with LSB of DE. If the result is not
        JNZ GOTROOM     ;   zero then DE (space left) still has room
                        ; Okay, DE (=space left) is zero so we need to
                        ; back up one char at a time to find a space
                        ; and put a CR there
NOTSPC  DCX B           ; Move lo-pointer back one character
        LDAX B          ; Get the char. pointed to
        CPI SPACE       ; Is it a space?
        JNZ NOTSPC      ; No, go around again to back up another char
        MVI A, CR       ; Yes, it's a space so lets put a CR here,
        STAX B          ;   overwriting that space
        LDA MEAS        ; Since we are thus starting a new line we
        MOV D, A        ;   need to restore DE (=space left)
        LDA MEAS+1      ;   to the passed value of Measure
        MOV E, A   
        INX B           ; Move lo-pointer up on to first byte after
                        ;    the CR we just placed, so it points to
                        ;    the zeroeth char of this new line
        INX H           ; Increment count of how many CRs we inserted
GOTROOM INX B           ; Point to next character in paragraph
        DCX D           ; Decrement DE (= space left) to allow for
                        ; the char we just accepted
    JMP LOOP2           ; Go back and process the next char.

;-----------------------------------------------------------------------
; THIRD PASS. Turn the single CRs that mark line-endings in the reflowed
; paragraph into CR/LF pairs, expanding the paragraph as necessary.
; We will use BC as lo-pointer and DE as hi-pointer. BC already
; points to the CR in the paragraph-ending CR/LF/CR/LF string and
; HL already contains the count of how many CRs we inserted in PASS2.
; We will work backwards in memory, starting with hi-pointer =
; lo-pointer + 3 (to allow for LR/CR/LF) + HL (to allow for one
; extra byte for each CR that we have to expand to CR/LF).
;-----------------------------------------------------------------------
PASS3   MOV D, B    ; Init point hi-pointer DE by copying
        MOV E, C    ;   lo-pointer BC
        INX D       ; Move hi-pointer DE up 3 bytes up so that it
        INX D       ;   it points to the last byte in the
        INX D       ;   paragraph (= last LF of the CR/LR/CR/LF run)

        XCHG        ; Add HL (count of CRs) to hi-pointer, bearing
        DAD D       ;   in mind that DAD (which adds DE to HL)
        XCHG        ;   leaves its result in HL and we want it in
                    ;   DE instead, so we swap HL<>DE before we do
                    ;   the addition and swap them again after

        MVI A, LF   ; Now write out the CR/LF/CR/LF string that ends
        STAX D      ;   the new reflowed paragraph. In reverse order
        MVI A, CR   ;   we put LF, CR, LF, CR into hi-pointer,
        DCX D       ;   hi-pointer-1, hi-pointer-2, and hi-pointer-3,
        STAX D      ;   by decrementing hi-pointer each time
        MVI A, LF
        DCX D
        STAX D
        MVI A, CR
        DCX D
        STAX D
                    ; lo-pointer BC now points to the first CR at
                    ; the end of the old paragraph and hi-pointer DE
                    ; points to the first CR at the end of the new
                    ; longer paragraph
LOOP3   DCX B       ; Move both pointers down one byte
        DCX D       ;   to get ready to process next char.
        LDAX B      ; Get the BE lo-pointer character into the accum.
        STAX D      ; Put it in the DE hi-pointer location
        CPI 0Dh     ; Was it a CR?
        JNZ LOOP3   ; No, then go back for next char
        MVI A, LF   ; Yes, so we have CR->CR/LF handling to do.
                    ; First, put an LF in the hi-pointer location to
        STAX D      ;   overwrite the CR we just stored there
        DCX D       ; Then back up one byte to the location before
        MVI A, CR   ;   that LF and put an CR there
        STAX D      ;
        DCX H       ; We've just handled a CR->CR/LF so lower the
                    ;   count of line-ends still to do
        MOV A, H    ; Is HL=0? Find out by putting MSB of HL
        ORA L       ;   into accumm and ORing with LSB of HL
        JNZ LOOP3   ; If the result not 0 we've more lines to do
                    ; If result is zero, we've just done the last
                    ; CR->CR/LF expansion and are done

; Write out closing message "Reflow done."
        LXI D, DONE     ; Load DE with the address of the start
                        ;   of the prompt string
        MVI C, WRTLINE  ; Load C with the BDOS call number for
                        ;  outputting a string to console
        CALL BDOS       ; Make the BDOS call

        POP H           ; Before we Return lets get back all the registers
        POP D   
        POP B
        POP PSW
        RET
;---------------------------------------------------------------------
; End of REFLOW subroutine
;---------------------------------------------------------------------

;---------------------------------------------------------------------
; DISPlay. This subroutine displays the paragraph that begins at the
; location pointed to by BC on entry. We will use the CP/M
; BDOS call "Console Output" in which the character to be displayed
; must be put in register E and the value 02h must be put in
; register C before CALLing BDOS (at location 0005h). All registers
; are restored before return
;---------------------------------------------------------------------
DISP    PUSH PSW        ; First lets save all the registers
        PUSH B          ;   on the stack so we can restore
        PUSH D          ;   them upon exitting the routine
        PUSH H
LOOP0   PUSH B          ; Save BC (because BDOS call will trash it)
        LDAX B          ; Get the character pointed to by BC
        MOV E, A        ; Put it into E
        MVI C, WRTCHAR  ; Put 02h into C (thereby trashing BC)
        CALL BDOS       ; Make the call that displays this character
        POP B           ; Restore BC
        LDAX B          ; Get the character we just displayed
        INX B           ; Point to next character
        CPI LF          ; Was it a line-feed we just displayed?
        JNZ LOOP0       ; No, so go back for the next character
                        ; Yes, so lets see if it was the LF at
                        ;   the end of a CR/LF/CR/LF run
        DCX B           ; Back up three memory locations
        DCX B           ;   (not two because we already incremented BC)
        DCX B           ;
        LDAX B          ; Get that character
        INX B           ;
        INX B           ; Put BC back where it was
        INX B           ;
        CPI LF          ; Is that also a LF?
        JNZ LOOP0       ; No, then merely a line-end, so go round again
                        ; Yes, so we've displayed the entire paragraph
                        ;   and can stop displaying
        POP H           ; Before we Return lets get back all the registers
        POP D   
        POP B
        POP PSW
        RET
;-----------------------------------------------------------------------
; End of DISPlay subroutine
;-----------------------------------------------------------------------   

;-----------------------------------------------------------------------
; GETKEY subroutine. This uses the get-one-char BDOS call
;-----------------------------------------------------------------------
GETKEY  PUSH PSW        ; First lets save all the registers
        PUSH B          ;   on the stack so we can restore
        PUSH D          l;   them upon exitting the routine
        PUSH H
LOOP    PUSH B          ; Save BC because BDOS call will trash it
        MVI C, 01h      ; Load up C with the BDOS call number
        CALL 05h        ; Make BDOS call to get a char and display it
        POP B           ; Get back the BC that BDOS just trashed
        CPI 03h         ; Did the user just hit CTRL-C?
        JZ CTC          ; Yes, then we're done
        STAX B          ; No, then store at end-of-block the incoming char
        CPI 0Dh         ; Was it a carriage-return?
        JNZ NOTLE       ; No, then skip line-end handler
        INX B           ; Yes, then point BC to next free memory location
        MVI A, 0Ah      ; Get a line-feed
        STAX B          ; Save it after the CR we just save
        PUSH B          ; We need another BDOS call to show this LF
        MOV E, A        ;   so save BC, put the LF into E,
        MVI C, 02h      ;   load up C with the BDOS call number,
        CALL 05h        ;   and make the BDOS call to display this LF
        POP B           ; Get back the BC that the BDOS call trashed
NOTLE   INX B           ; Point BC to the next free memory location
        JMP LOOP        ; Go back for another character from keyboard
CTC     POP H           ; Before we Return lets get back all the registers
        POP D   
        POP B
        POP PSW
        RET
;------------------------------------------------------------------------
; End of GETKEY subroutine
;------------------------------------------------------------------------

;------------------------------------------------------------------------
; Data Area
;------------------------------------------------------------------------
DONE    db  CR,LF,'Reflow done.',CR,LF,CR,LF,'$'
SOP     DS  2    ; Memory location SOP stores passed-B,
                 ;   and memory location SOP+1 stores passed-C
MEAS    DS  2    ; MEAS stores passed-D, MEAS+1 store passed-E
OLDSP   DS  2    ; Two bytes to stash CP/M's stack pointer
        DS  64   ; Reserve 64 bytes for our stack
STKTOP:          ; Set this label to the address at the top of
                 ; our stack

mail@gabrielegan.com
 
Posts: 104
Joined: October 11th, 2014, 8:12 am

Re: Reflowing a paragraph

Postby AltairClone » May 6th, 2016, 7:38 pm

Great work! The best way to improve your coding skills and style is to simply start writing and debugging code - which you've done. There are a number of things in the code that could be changed for speed and/or space efficiency, but that doesn't really matter yet. Great start with a fairly complex program!

Mike
AltairClone
Site Admin
 
Posts: 632
Joined: April 5th, 2013, 10:55 am


Return to General Discussions

Who is online

Users browsing this forum: No registered users and 20 guests