In this "issue":
A relatively short StuChat this time, but that isn't to say it isn't useful. Indeed, the cache article is probably the most researched article this year and offers you the chance to really speed your project up with some careful thought.
The rest of this StuChat is very much lighthearted, but that's because it's Christmas, and it's my last "chat" of the year.
So. Brains in gear. Wake up at the back.
This article will discuss the PowerPC caches; what they are, what they do and how to use them. You can achieve a worthwhile increase from effective use of the cache and related instructions. You can also completely destroy your performance through misunderstanding the operation of the cache and PowerPC memory architecture.
1. Some processors have just one "unified" cache (601) and some have two (code and data) caches. For the purposes of this article the word "cache" can be taken to mean "data cache".
2. This article generally assumes the memory being accessed is marked as caching allowed.
3. The 601 is used by way of example - other processor "differences" are highlighted where necessary.
4. There are some differences between processors within the current PowerPC family. One of the major differences you need to bear in mind is that it is easy to "block" a 603 whereas 601 or 604 are non blocking. "Non blocking" means a "cache miss" does not block the use of the cache for other operations during the next cycle.
5. There are some differences in performance between little endian and big endian with respect to alignment - this article assumes the normal Macintosh big-endian mode.
The cache provides instructions and data for the processor to operate on. We are primarily concerned with data, but for the sake of completeness the next few paragraphs describe how the processor receives instructions.
The processor takes instructions out of an "instruction
queue". A 601 holds eight instructions in the queue. The
instructions can be "dispatched" from positions q0 to
q3 where q0 is the end of the queue and q7 is the start. These
instructions are fed into the instruction queue from the cache
at q7.
For example, an F.P.U. instruction may be taken out at q3 if the
F.P.U. is not busy.
The instruction cache feeds the instruction queue directly; as instructions are processed they are removed from the queue and new instructions are fed into the top. On a 601 the instruction queue is 8 words in size and a cache "sector" or block is 8 words long, therefore the processor can "load" eight instructions from the cache into the queue in one cycle.
Impressive fact: The bus between the cache and the queue is 8 words wide and at 66Mhz can transfer at a rate of 2.1 Gbytes/second.
When the processor needs either instructions or data it "looks" first in the on chip level one cache. If the data is there no external access will occur and the data is read quickly. If the data isn't in the level one cache, the memory interface will look in any level 2 cache fitted. If not found in the level two cache (or if no level 2 cache is fitted), the data will be retrieved from main memory.
Level 1 cache is extremely fast - nanoseconds. Level 2 cache is fast - 10-15 nanoseconds. Main memory is slow - 50 to 70 nanoseconds. (Access times given).
Caches come in three basic designs (there are others):
1. Direct Mapping. Here the cache very simply maps directly onto memory. It is very fast as there is no searching. This type of cache is expensive because it ideally has to be as large as possible to prevent misses or severe bus traffic with certain types of data access - for example cyclic access where the physical amount of data being accessed is larger than the cache.
2. Fully Associative. Here a block in memory can be placed in any of the available "lines" and a "tag" is stored that represents the address of the data stored at this line. When checking for data in the cache, it's possible that all the tags will have to be compared to either find or not find the required address. Again expensive as fast searches require fast hardware.
3. Set Associative.This is an associative cache but made cheaper by grouping the available lines into "sets". Once the particular set has been identified. the number of searches required to prove a miss are reduced by a factor "w" where "w" is the number of sets available.
The PowerPC 601 processor is described as having an "eight-way set-associative" unified cache. From the above three points, we can now see why.
PowerPC caches are physically addressed - that is the "tag" bits within a set represent the "real"* upper (20) bits of the address the datum resides in in real memory. Thus to scan the cache for a datum the processor must convert the "virtual" address to a physical address - this is carried out in parallel with other operations and takes no time.
For your information, a 603 has two two-way associative 8k
caches, a 603e has two four-way associative 16k caches, a 604
has 2 four-way 16k caches and a 604e has two four-way 32k caches.
A 620 has two eight-way set-associative instruction and data caches;
data cache is 2-way interleaved.
*physical
A problem specific to unified cache designs (601 only) is that the cache unit can only handle one request at any one time. But on any given cycle the processor may generate cache requests that require instructions and data simultaneously.
The 601 provides a three entry cache retry buffer to efficiently handle this problem. This contains buffering for one outstanding floating-point store, one integer load/store or floating-point load and an instruction fetch.
It doesn't matter how small the size of data being accessed is; if it isn't in cache (a read miss) a whole block will be read irrespective of whether one is reading or storing.
Cache blocks are 32 bytes (8 words) and are always aligned on 32-byte boundaries. You may well want to align data to a 32 or 64 byte boundary to prevent unnecessary data transfer. (The 601 aligns a sector (two blocks) on a 16 word boundary). See alignment below for more details.
Important point; because the cache is of a finite size, this also means that a whole cache block may well need to be written (cast out) before a new block can be read thus generating more memory bus overhead.
Processors with two blocks per sector (601,604) may well try to read in the next block as a low priority* bus operation which again will cause the second block of the cast sector to be written if modified - a second-cycle cast-out.
When the cache is full and needs to read a line in, deciding
which line to write out to main memory is controlled by a "Least
Recently Used (LRU)" algorithm. This LRU algorithm is strictly
enforced, so even if there is "dirty" data in the cache
that needs writing to memory the LRU algorithm will still write
out one of the oldest lines.**
*Low priority is defined
as "both system interface read-queue elements being free".
**We have been unable to find out
exactly how many bits are used for the LRU "modification
time" within a block. We suspect the LRU "time"
is indicated by storing a sequential counter when the block is
actually modified.
601 and 604:
Each cache "line" is two blocks (16 words or 64 bytes)
and on the 601 there are 32 cache lines per "set" and
a grand total of eight sets (64 bytes per line times 32 lines
per "set" times 8 sets = 32k).
Thus a block is 32 bytes, a line is 64 bytes, a set is 4096 bytes.
603: Each cache line is 32 bytes, or one block.
The PowerPC family has a rich set of cache control instructions that may be unfamiliar to those used to other processors.
The one operation that most will be familiar with is cache touching - effectively warning the processor that you will be operating on data shortly; thus now would be a good time to retrieve the data if it isn't already in cache. If your program tries to read data that isn't in the cache, the processor effectively stops; waiting for the data to appear.
Imagine you are in a loop moving many words of data. The actual loop code will be contained in the level 1 cache, so the only data on the bus is the data you are moving (if it's not possibly in cache already - you should know this!). So if you are pretty sure the data is not in the cache, why not touch the next block as many cycles as possible before you need it? See "dcbz" below for further optimisation.
There are two touch instructions, one for reading and one for writing; dcbt and dcbtst. The processor treats them both the same, but level two cache may not, so use the right one.
There are a few problems with touching. Firstly touching addresses already in cache simply wastes cycles. Touching addresses you don't need increases data bus traffic unnecessarily. You can easily overload the touching mechanism which will have a different affect on different processors - a 603 will just block if a touch is executed whilst a touch is in progress. Finally, a 603 uses just one buffer to hold the touch buffer - if you don't use the data before another touch is issued the contents of that cache line are thrown away.
As mentioned above, the cache is only so big. When a line is read, if the cache is full, a line has to be written. If the line going is marked as modified it has to be physically written to memory generating more bus traffic when you possibly don't need it.
The data cache block store "dcbst" instruction physically writes a line to memory and thus the line is no longer marked as modified. This is handy if you know the processor is going to be busy actually computing as you can get the memory bus working whilst the C.P.U. is busy executing instructions from cache.
Now when a new line needs to be read in, the line being replaced doesn't have to be written because you wrote it when the bus would have been idle (as long as the LRU doesn't get in the way). Hence writing out newish data isn't worth it. Writing out data you'll be using again shortly is a waste of cycles. Writing out "old" data in anticipation of later cache reads is a big win.
Finally, the most dangerous instruction is data cache block
zero - "dcbz".
This physically zero's out a line of data which is dangerous but
prevents a lot of bus traffic.
Normally when writing to "cacheable" memory, the area surrounding the address is brought into the cache (this can be many reads on a 64 bit bus). This is a waste if you'll be writing to EVERY byte in the cache block.
Data Cache Block Zero simply prevents the reading from taking place as the data is established in the cache line as zero bytes. This instruction is dangerous as it is very easy to have one two many at the end of a loop which will zero out data that shouldn't be wiped!
If the block is marked as write through or caching inhibited, then the system alignment exception handler is invoked - for example V.R.A.M. is marked as caching disabled.
Performance is dependent on the following factors:
The first three are the most easy to program for. Firstly operand size and alignment. The processor performs best when the operand is "naturally" aligned. Load/store operations that access unaligned operands may access two aligned double words (it's a 64 bit data bus remember) thus the access takes at least twice as long as it does if the data was aligned.
Therefore a half should be even aligned, a word/single should
be quad aligned, a double should be octal aligned. You can ensure
this is the case with the align, rs_align and rs_align_auto directives.
A byte is always naturally aligned.
The third item in the list above - crossing a cache block boundary can be taken care of through careful design of structures. You may want to pad structures out, or even completely skew the logical alignment onto a physical alignment which will minimalise block crossing during access. Again, use the align directives to take care of this.
The fourth item above refers to Load/Store accesses that cross a 4-Kbyte page boundary or a 256-Mbyte segment boundary. These accesses may take an alignment exception and rerun the executing instruction if not a single register load/store. It may be worth ensuring any frequently accessed data buffers less than 4k in size are aligned to a 4k boundary by creating a buffer 8k in size and setting your logical start to the 4k boundary. Easily achieved by ANDing the original address with 4095 (0xfff) subtracting this from 4096 (0x1000) and then adding this result to the initial address.
There will be a performance degradation when crossing a cache block because the cache has to check whether the address of the datum in the next block is resident within the cache and valid.
With respect to crossing a page boundary; if the boundary is a protection boundary then an exception will be generated. This normally happens if the page has not been accessed by the program previously or because the processor needs to check for a change in storage control attributes.
For more details on alignment, see section 7.3.3.3.4 in the 601 user manual or Book II of the PowerPC Architecture - Virtual Architecture..
Note: Unaligned external control access addresses will almost certainly generate an exception. - for example to an I/O device.Cache control instructions, along with careful structure design can provide big performance improvements - but don't forget that every cache control instruction you use takes 1 cycle to execute.
Data alignment is a must and should be thought out carefully.
60x User Manuals. I.B.M.
The PowerPC architecture. Books I, II, III .I.B.M.
comp.sys.powerpc.tech.. Maynard Handley.
PowerPC 601 Microprocessor. Charles R. Moore
"The application of skewed-associative memories to cache only memory architectures" - Muller, Stallard, Warren - D.C.S. University of Bristol. Aug. 95. First published in Proceedings of the 1995 International Conference on Parallel Processing. ICPP Vol I pp150-154. CRC press.
"Technik des Wissenschaftlichen Rechnens" (Techniques of Scientific Computing) TU München - Ulrich Ruede, 22 Feb.1995
Many thanks to:
Rob Probin and James Hague for technical reading.
Celine Dione and Chris Rea for providing the music at this end
of term bash..
This has been an intense, hard graft. Sifting through all the registrations. Averaging everything out. Reams of paper. Hundreds of registration forms. Tedious to the point of Guinness. But interesting.
If you believe that load of old tosh I have a second hand car you may be interested in.
I lied. It's a computerised database.
But it's a database with a problem....
This particular database's problem stems from the fact that
there isn't an average Fantasm user. No; there are three forms
of the average Fantasm user.
This then is the root of the database's problem. And if the truth
be known, in this situation the space used by the database would
be better occupied by desktop patterns.
The astute amongst you will have realised there can be only one "average" and I am talking nonsense.
Thus I lied about lying and obviously this isn't a serious piece of writing at all. But that's O.K. too as it is Christmas after all and that piece on caching was hard work (and rather funny at one point, well I thought so anyway...), so just what or who is the average Fantasm user?
The first type is the professional hi-tech type. Users in this category generally cite medical imaging, mathematical analysis, jet propulsion etc. as their main use. It would be unwise to take the mickey out of this group, so instead I'll move onto the second group.
The second group are hardened game writers, demo coders (normally ex-Amiga) and just plain hackers. I'll come back to this group in a sec. because they just know I'm going to have a go at them :-) (Because they've been having a go at me all bloody year!)
The third and final group are "beginners" - not complete beginners, but beginners at assembly language . This group know why they want Fantasm - to learn assembly language and get promoted to group #2. These people I have a lot of time for.
Lighthearted bitNow then, moving back to wonderful group 2. These people are highly intelligent, well motivated, have high speed Internet access and think they know it all. They generally love Fantasm but unfortunately when something goes wrong or is missing we get death threats and they expect the thing fixed yesterday. If you believed these people you'd think that Fantasm has NO useful directives. I have a list as long as your arm as to what these people want - "write your own matrix multiply code" that's what I say.
My absolute favourites are demo writers - oops, sorry - "coderz". Rob's definition of a demo coder -"Can't handle big systems, so writes small demo's instead" - yes that's right - failed game authors (you have his email address :-)). It's true. I'd bet quite a lot that you'll never see a scrolling Mac demo - you see scrolling Mac games (but not many), but not scrolling Mac demos. Why? 'Cause it's too hard for the demo coders to keep track of all the printing operations and manage the scrolling at the same time. How many scrolling Amiga demos were there - lots. How many "copper" demos were there - lots. Why? 'Cause it was easy.
Ooh, I feel a lot better now- I've been waiting all year for that <BG>.
Well that was 1996 then. Was it good for you? It was good for me (by the end - the start was a tadge iffy).
Back in January we were finishing off the first release version of PF, and at the very same time the Mac (according to the press) appeared to be dying rather rapidly. We were not terribly happy. Indeed, if it wasn't for the fact that we actually needed PF it is very debatable it would have been finished and now we'd be writing for Intel boxes (as an aside, I see Intel are looking for people with VLW interests, but that's another story). To this day you can still see remnants of what we were thinking in Fants scroller.
So, we weren't happy, but decided to ignore the press and carry on. Finally, round about February we saw V4.00 released with amazingly very few actual assembler bugs.
So we pressed on and I'm actually glad we did - people bought PF. The cynics said "nobody" wants to actually program in PPC assembly language - I can only imagine these people really have no idea of what they are talking about. Being reasonably proficient in both 68k and PPC I can certainly say that PPC makes 68k look rather old fashioned if not down right hostile! There is so much you can do at an assembly language level that a compiler would have great difficulty doing (cache control for example - see above) which can have a real dramatic affect on the run time performance. Most everything is 32 bit so no sizing to worry about, and when you get your head into the PPC ethos it all makes sense - even the parallelism.
The rest of 1996 saw progressive upgrades and bug fixes along with some parallel work into little sidelines. I think the most amazing thing about 1996 is that the compiler still isn't finished. We have notes in the master design document exclaiming about how old some entries are!
So, that was 1996 and that was the last StuChat and a jolly
good Job it was too. NeXT up it's 1997. Notice the way I steered
clear of Apple and O.S. issues. Personally I would have far preferred
Be. But that's life. As long as we get messaging and memory protection
I'm happy.
Coming up pretty much immediately in the New Year we have Fant416
which has been detailed in other places. Also we'll be providing
an index into the technical articles on this site for quick reference
and sometime in 1997 have FantV5 and Fanta_C coming.
Have a very merry Christmas, have a Happy New year and till then
Code on!
All at Lightsoft.