D_Times_E: ; HL = D times E LD HL, 0 ; Use HL to keep track of the product XOR A ; Need to check if either factor is zero OR D RET Z OR E RET Z LD B, D ; Store one of the factors in the loop counter LD D, H ; Clear D so DE hold the other factor Loop: ADD HL, DE DJNZ Loop RETThis looks like a nice, simple routine that does the job. However, it has a little inconvenience. When D is a very large number, there are a lot of additions and the product is calculated very slowly.
579 × 163 1728 = 579×3 34740 = 579×60 + 57900 = 579×100 94368Each digit of the multiplicand is multiplied by each digit of the multiplier. The partial products are then added together to give the result.
If we do this in base-2, we follow the same procedure, but it looks less complicated:
%00001101 (13) × %00000110 (6) %00000000 %00011010 %00110100 + %00000000 %01001110 (78)You can see that the multiplicand is being multiplied by either zero or one, so each partial product is either zero, or the original multiplicand itself, shifted an appropriate amount.
To convert this procedure for assembly:
.module DE_Times_A DE_Times_A: ; HL = DE × A LD HL, 0 ; Use HL to store the product LD B, 8 ; Eight bits to check _loop: RRCA ; Check least-significant bit of accumulator JR NC, _skip ; If zero, skip addition ADD HL, DE _skip: SLA E ; Shift DE one bit left RL D DJNZ _loop RETThis routine will run much faster than the previous one, since the speed isn't based on the value of the multiplier, but rather the amount of 1s.
If we limit the factors to 8 bits, we can make an even faster routine by storing the multiplier and the product in one register:
.module H_Times_E H_Times_E: ; HL = H × E LD D, 0 ; Zero D and L LD L, D LD B, 8 _loop: ADD HL, HL ; Get most-significant bit of HL JR NC, _skip ADD HL, DE _skip: DJNZ _loop RETYou know from Day 9 that ADD HL, HL effectively shifts HL one bit to the left. We are therefore checking the multiplier (H) from its most-significant end rather than the least-significant. In other words, we perform DE×128, DE×64... instead of DE×1, DE×2...
So, if we add DE on the first iteration, the result will get shifted over 7 times, i.e. DE×27.
You might have an uneasy feeling that repeated addition of DE will corrupt our factor in H. However, this is impossible because the result of an 8-bit number plus another 8-bit number can never be more than 9 bits. By the time this can happen, the multiplier has vacated the lower two bits of H:
For example, if we tried 2552 (the original value of H is blue):
Iteration | Command | Binary Value of HL |
---|---|---|
1 | ADD HL, HL ADD HL, DE |
11111110 00000000 11111110 11111111 |
2 | ADD HL, HL ADD HL, DE |
11111101 11111110 11111110 11111101 |
3 | ADD HL, HL ADD HL, DE |
11111101 11111010 11111110 11111001 |
4 | ADD HL, HL ADD HL, DE |
11111101
11110010 11111110 11110001 |
8026 12 | 96315 -96 03 - 0 31 -24 75 -72 3We take the first digit of the dividend and subtract the largest multiple of the divisor that will fit. We then take the next digit of the dividend and weld it to the remainder. This is repeated until we run out of digits.
If done in binary, we only have to subtract zero or the divisor:
00101010 101 | 11010110 -101 110 -101 111 -101 100The general algorithm, in English, for dividing a number in n bits by a number in m bits,
.module Div_HL_D Div_HL_D: ; HL = HL ÷ D, A = remainder XOR A ; Clear upper eight bits of AHL LD B, 16 ; Sixteen bits in dividend _loop: ADD HL, HL ; Do a SLA HL RLA ; This moves the upper bits of the dividend into A JR C, _overflow CP D ; Check if we can subtract the divisor JR C, _skip ; Carry means D > A _overflow: SUB D ; Do subtraction for real this time INC L ; Set bit 0 of quotient _skip: DJNZ _loop RETThe dividend is HL and the temp area is A. This is not strictly in keeping with the requirement that the temp area be m+1 bits (unless the divisor is restricted to seven or fewer bits), so in cases where D > L > $80, there will be an overflow.
For example, if HL = $8C00 and D = $90,
Iteration | Commands | Binary value of HL | Binary value of A |
---|---|---|---|
1 | ADD HL, HL RLA |
0001100000000000 |
00000001 |
2 | ADD HL, HL RLA |
0011000000000000 |
00000010 |
3 | ADD HL, HL RLA |
0110000000000000 |
00000100 |
4 | ADD HL, HL RLA |
1100000000000000 |
00001000 |
5 | ADD RLA |
1000000000000000 |
00010001 |
6 | ADD HL, HL RLA |
0000000000000000 |
00100011 |
7 | ADD HL, HL RLA |
0000000000000000 |
00100011 |
8 | ADD HL, HL RLA |
0000000000000000 |
01000110 |
8 | ADD HL, HL RLA |
0000000000000000 |
10001100 |
9 | ADD HL, HL RLA SUB D INC L |
0000000000000000 0000000000000001 |
00011000 10001000 |
10 | ADD HL, HL RLA SUB D INC L |
0000000000000010 0000000000000011 |
00010000 10000000 |
11 | ADD HL, HL RLA SUB D INC L |
0000000000000110 0000000000000111 |
00010000 10000000 |
12 | ADD HL, HL RLA SUB D INC L |
0000000000001110 0000000000001111 |
00000000 01110000 |
13 | ADD HL, HL RLA SUB D INC L |
0000000000011110 0000000000011111 |
10100000 00010000 |
14 | ADD HL, HL RLA 0000000000111110 |
0010000 | |
15 | ADD HL, HL RLA 0000000001111100 |
0100000 | |
15 | ADD HL, HL RLA 0000000011111000 |
1000000 |
ADC A, { reg8 | imm8 | (HL) } | Adds the operand and the carry flag to the accumulator. | ||||||||
ADC HL, reg16 | Adds reg16 and the carry flag to HL. | ||||||||
|
1 7695 + 2182 9877In the tens position, 9 + 8 = 17, which "overflowed". So you write down '7' and carry the '1'. Add 6 + 1 with the carry to compensate, and everything works out all right. This is exactly how ADC is meant to work (amazing, eh? All those years in elementary school, you were learning assembly and didn't even know it). In an assembly implementation, you work on bytes or words instead of digits, but the theory is the same. So let's try it.
Example: Add 32-bit number dword1 with 32-bit number dword2.
LD HL, (dword1) ; Get least-significant word of dword1 LD DE, (dword2) ; Get least-significant word of dword2 ADD HL, DE ; Add them LD (result), HL ; Store least-significant result LD HL, (dword1 + 2) ; Get most-significant word of dword1 LD DE, (dword2 + 2) ; Get most-significant word of dword2 ADC HL, DE ; Add them with the carry from the previous addition LD (result + 2), HL ; Store most-significant result RET dword1: .DB $B3, $90, $12, $32 ; Each dword is stored with the least-significant dword2: .DB $F1, $15, $06, $B8 ; bytes first. You could just as easily have stored result: .DB $00, $00, $00, $00 ; them in big-endian, but because of how registers are ; loaded from RAM, it wouldn't work.This will end up adding $321290B3 + $B80615F1.
SBC A, { reg8 | imm8 | (HL) } | Subtracts the operand and the carry flag from the accumulator. | ||||||||
SBC HL, reg16 | Subtracts reg16 and the carry flag from HL. | ||||||||
|
7 17 1Ok, 6 - 1 = 5. Next is a problem, can't do 7 - 9. Solution: add 10 and subtract 1 from the next pair to compensate. SBC works in the same way. When a subtraction result is negative, 256 is effectively added to the byte in the minuend and the carry flag is set.8 76 - 6 9 1 1 1 8 5
LD HL, (dword1) ; Get least-significant word of dword1 LD DE, (dword2) ; Get least-significant word of dword2 SBC HL, DE ; Add them LD (result), HL ; Store least-significant result LD HL, (dword1 + 2) ; Get most-significant word of dword1 LD DE, (dword2 + 2) ; Get most-significant word of dword2 SBC HL, DE ; Add them with the carry from the previous addition LD (result + 2), HL ; Store most-significant result RET dword1: .DB $B3, $90, $12, $32 dword2: .DB $F1, $15, $06, $B8 result: .DB $00, $00, $00, $00This routine looks okay, but it has a subtle bug in it, and maybe the more observant of you have noticed. Maybe you're thinking, "Gee, Sean, what happens if the carry flag is set at the start?" and the answer to that is, "The answer's gonna be off by one." And now maybe you're thinking "Gee, Sean, doesn't that make us screwed?" and the answer to that is, "Yes, it does."
Ah. It appears that boolean operations will reset the carry. An OR A before should set things right.
Take the two values $38A4 and $9B4C. Just by comparing the MSBs tells you which one is bigger. In fact, only when the MSBs are the same do you need to compare both bits, and the carry is reset in such a case.
; Do a jump to "success" if the dword at HL is greater than the dword at DE LD DE, dword1 LD HL, dword2 LD B, 4 CmpLoop: LD A, (DE) CP (HL) JR C, success JR NZ, failure INC HL INC DE DJNZ CmpLoop failure: ; Code here deals with (DE) >= (HL)
; Do a jump to "success" if the dword at HL is greater than the dword at DE LD DE, dword1 LD HL, dword2 LD B, 4 CmpLoop: LD A, (DE) SUB (HL) JP PO, $+5 XOR $80 JP M, success ; This code snippet restores the Z flag that got changed by XOR JP PO, $+5 XOR $80 JR NZ, failure ; Since the byte (DE) is >= the byte (HL), ; then an inequality means we failed INC HL INC DE DJNZ CmpLoop failure: ; Code here deals with (DE) >= (HL)
LD HL, qword1 LD DE, qword2 LD B, 8 BoolLoop: LD A, (DE) AND (HL) LD (DE), A INC HL INC DE DJNZ BoolLoop
#define MAX 4 ; Number of bytes LD HL, dword AND A ; Clear carry LD B, MAX - 1 NegLoop: LD A, 0 ; Cannot use XOR A because it would disturb carry SBC A, (HL) LD (HL), A ; Store result INC HL ; Next byte DJNZ NegLoop
LD B, 3 ; Number of bits to shift ShiftLoop: LD HL, dword SRL (HL) INC HL RR (HL) INC HL RR (HL) INC HL RR (HL) DJNZ ShiftLoop dword: .DB $B3, $90, $12, $32
LD HL, dword+3 RL (HL) DEC HL RL (HL) DEC HL RL (HL) DEC HL RL (HL) dword: .DB $B3, $90, $12, $32For RLC-type rotation:
LD HL, dword+3 PUSH HL SLA (HL) DEC HL RL (HL) DEC HL RL (HL) DEC HL RL (HL) POP HL JR NC, $+3 INC (HL) ; Set last bit of (HL) if carry was set
The routine that is given below is certainly not the most efficient, only the most general. Regardless, it is probably one of the most complicated pieces of coding in this entire guide.
.module XMul ;B = Size of multiplier ;C = Size of multiplicand ;DE = Address of multiplier ;HL = Address of multiplicand ;IX = Address of product buffer (B + C bytes, you can use logarithms to see why this is so.) ; ;All registers including IY are destroyed XMul: LD IYH, B LD IYL, CFirst of all, we will have to use the size counters multiple times, so we save them into IY.
XOR A PUSH IX _Clear1: LD (IX), A INC IX DJNZ _Clear1 LD B, C _Clear2: LD (IX), A INC IX DJNZ _Clear2 POP IXNow we initialize the product area of memory by setting it all to zeros.
_LoopA: LD C, (HL) LD B, IYH PUSH HL PUSH IX PUSH DEGet one byte of the multiplicand into C. Then restore the size of the multiplier into B. Finally save all the pointers.
_LoopB: LD A, (DE) LD H, A CALL mul_hc LD A, L ADD A, (IX) LD (IX), A LD A, H ADC A, (IX + 1) LD (IX + 1), AOkay, we get one byte of the multiplier into H and multiply H by C using the routine below, getting the product in HL. Now we take this partial product and integrate it into the current full product.
JR NC, _EndB PUSH IX POP HL INC HL INC HL _CyLoop: INC (HL) INC HL JR Z, _CyLoopNow this definately requires some explanation. We have just added HL to two bytes of the product, but there might have been a carry out of this addition, so the next byte of the product is incremented. However the product could be something like $12FFFFFFFFFF, so we need to keep propagating the carry as far as necessary. This brings up a slight problem in that INC does not affect the carry flag. But this can be remedied with a little trick. Imagine for a second that INC actually did affect carry. You should quickly discover that the only circumstance under which carry will be set is when the incremented byte goes from $FF to $00, and in this case zero will be set! Sooooo, all we need to do is to just blindly increment bytes as long as INC (HL) sets Z.
_EndB: INC IX INC DE DJNZ _LoopBWe're done with the current byte of the multiplier so we increment the pointer, and we increment the product pointer to work in the next partial product. And do our calculations again.
POP DE POP IX POP HL INC HL INC IX DEC IYL JP NZ, _LoopA RETSo we're finished with the entire multiplier at this point and we pop the original values of all our pointers back. Now we increment our multiplicand pointer to get the next byte, and increment the product pointer (analogous to indenting a partial product when multiplying on paper).
.module mul_hc mul_hc: PUSH BC LD L, 0 LD B, L LD A, 8 _loop: ADD HL, HL JR NC, _skip ADD HL, BC _skip: DEC A JP NZ, _loop POP BC RETThis is the H_Times_E routine from the beginning of this chapter with some modifications to work in this particular context.
.module XDiv ;IX = Address of dividend ;BC = Size of dividend ;E = Divisor XDiv: LD D, 0 _loop: LD H, D LD L, (IX) CALL DivHLByE LD (IX), L LD D, A DEC IX DEC BC LD A, B OR C JR NZ, _loop RET .module DivHLByE DivHLByE: PUSH BC XOR A LD B, 16 _loop: ADD HL, HL RLA JR C, _overflow CP E JR C, _skip _overflow: SUB E INC L _skip: DJNZ _loop POP BC RET
; Sign-extends E into DE LD A, E RLCA ; Move sign bit into carry flag SBC A, A ; A is now 0 or 11111111, depending on the carry LD D, A ; Sign-extends DE into HLDE LD H, D LD L, E ADD HL, HL ; Move sign bit into carry flag SBC HL, HL ; HL is now 0 or 11111111 11111111, depending
So what exactly is fixed-point? It is a form of computer math that uses an integer to contain both the characteristic and the mantissa by means of scaling. To fully understand how this works, let us take a short detour into some more place value theory.
Take the decimal number 605.916. We know without thinking that the 605 is really a shorthand way to depict 6×102 + 0×101 + 5×100. The decimal half can easily be figured by taking geometric progression of the 10ns and extending it:
6×102 + 0×101 + 5×100 + 9×10-1 + 1×10-2 + 6×10-3
Since the fractional part of the number is just more terms of place value, we can actually apply it to any radix. We might, therefore, encounter a duodecimal such as 12B0.293. This can be converted to a more familiar decimal number:
1×123 + 2×122 + 11×121 + 0×120 + 2×12-1 + 9×12-2 + 3×12-3 = 2149.123699
How does this relate to programming. Let's consider for a moment, that for the binary base there is an infinite continuum of place values that can model all binary real numbers, and that an eight-bit register can be superimposed on this continuum to reveal an 8-place "snapshot" of a binary number, and as has normally been the case, that snapshot is of the first eight positive places:
2oo | ··· | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 2-1 | 2-2 | 2-3 | 2-4 | ··· | 2-oo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Zeros continuing to infinity |
0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | Zeros continuing to infinity |
2oo | ··· | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 2-1 | 2-2 | 2-3 | 2-4 | ··· | 2-oo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Zeros continuing to infinity |
0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | Zeros continuing to infinity |
Now we should set up some ground rules so that we are all on the same wavelength. First, while we could place the binary point in any of the 17 positions, I will set it to between bits 7 and 8. This decision has two clear advantages: we can easily extract either the whole or fractional part, and it provides the most balanced compromise between numerical range and fractional precision. Rule #2 is more of a procedure, but we can convert a number to fixed point by either a left shift by 8 or a multiplication by 256.
If we were to add (or subtract) two FP numbers, the internal calculation would be 256(a) + 256(b) = 256(a + b). The result is valid since the scaling factor stayed constant. Thus, fixed point numbers may be added or subtracted using ordinary integer arithmetic instructions.
Suppose we were to get a product of two FP numbers: 256(a) × 256(b) = 65536(ab). The answer has a scaling factor of 65536, which means we get only the fractional part of the result. We must correct this somehow:
Contrary to multiplication, division suffers an almost inverse scaling problem: 256(a) ÷ 256(b) = 1(a÷b). The solutions to this problem are similar to multiplication, but again the best method may well be to scale the dividend to 32 bits and craft a routine to divide it by a 16-bit divisor.
The whole premise is based on the fact that if you have a number x, then division by x is the same operation as multiplication by 1/x. This ain't that pansy quotient of 256 that gave only approximate results, but the bona fide x-1 multiplicative inverse. The reciprocal is in, you guessed it, fixed point.
E.g., let's use a divisor of 15. The reciprocal is 1/15, or 0.0666.
Now I'm asking you, how many bits should we use so that (a) we get a result that is accurate enough, and (b) involves no more than the absolute minimum of arithmetic?
We are now going to calculate the product as this:
q = x >> 4; q = (q + x) >> 4; q = (q + x) >> 4;Did you get all that? Heh-heh-heh, okay, here's a blow-by-blow analysis of each operation.
q = x >> 4Okay, when we start, q = x × m, where m is initially 1 and will eventually become the reciprocal. A right shift by four will make m = .0001.
q = (q + x) >> 4If we add the original number, we get a multiplier m = 1.0001, see? If this is then shifted right by four, where the first shift is shifting in the carry from (q + x), m now equals 0.00010001. If we do it again, we get a multiplier of 0.000100010001. Isn't that just magickal?
Were you the cautious type you might want to quit while we're ahead, but forget it! The next step is to go the whole nine yards and get the remainder too. It's straightforward enough, remultiply the remainder by 15.
r = (q << 4) - q; r = x - r;Enough of all this C, let's see it in assembler already.
Div_15: ; IN A dividend ; OUT C quotient ; A remainder LD B, A RRA RRA RRA RRA AND $0F ; A = q * 0.0001 ADD A, B ; A = q * 1.0001 RRA RRA RRA RRA AND $1F ; A = q * 0.00010001 ADD A, B ; A = q * 1.00010001 RRA RRA RRA RRA AND $1F ; A = q * 0.000100010001 LD C, A ADD A, A ADD A, A ADD A, A ADD A, A SUB C ; A = r * 15 SUB B NEG ; A = -(r * 15 - x) = x - r * 15 RET