You ask for it....I'll find it or write it...
I will be posting a complete watercooling guide soon.
I hope this helps you A64 guys out there.
------------------------------------------------------------------------------------------------------------------------
Intro to A64 Architecture
Traditionally, a Northbridge sits between the memory bus and the CPU, and the link that carries data between the Northbridge and the CPU is known as the front side bus. However, the Athlon64’s memory controller is on-die, so there is no Northbridge in the memory path, and no front side bus. The Athlon64 has two independent buses: one between the memory and the on-die controller, and another that communicates with the other system devices, the HyperTransport bus. The CPU’s clock speed is determined by the HyperTransport speed multiplied by a clock multiplier, which is why it’s often suggested to view the HyperTransport bus as if it were the front side bus.
However, this is about where the similarities between the two end. Traditionally, the memory speed is derived off of the front side bus, and can be manipulated by FSB/memory ratios. In contrast, in the A64, memory speed is derived off of the CPU speed in CPU/memory ratios. This is why it’s rather inaccurate to say that the memory is ever running “synchronously.” The memory is always running asynchronously with respect to the CPU speed, off of which it’s derived. How fast it’s running with respect to the HyperTransport bus does not matter at all. There is no latency hit in running the memory slower than the HyperTransport bus.
The HyperTransport bus’ effective speed is determined by an LDT (Lightning Data Transport) multiplier. While the front side bus could’ve been traditionally double or quad-pumped, the HyperTransport’s effective data rate can be anywhere from 1x to 5x its base speed.
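To make the relationships above concrete, here is a rough sketch of how the three clocks fall out of the HyperTransport speed. The function names and the sample values (200MHz, 10x, divider of 10, 4x LDT) are only illustrative assumptions, not settings from any particular board.

```python
# Illustrative sketch only: names and sample numbers are assumptions.

def cpu_clock(htt_mhz, multiplier):
    """CPU speed = HyperTransport speed x clock multiplier."""
    return htt_mhz * multiplier

def memory_clock(cpu_mhz, divider):
    """Memory speed is derived from the CPU speed by a whole-number divider."""
    return cpu_mhz / divider

def ht_effective(htt_mhz, ldt_multiplier):
    """Effective HyperTransport rate = HyperTransport speed x LDT multiplier."""
    return htt_mhz * ldt_multiplier

htt = 200
cpu = cpu_clock(htt, 10)      # CPU clock in MHz
mem = memory_clock(cpu, 10)   # memory clock in MHz, derived from the CPU
ht = ht_effective(htt, 4)     # effective HyperTransport rate in MHz
print(cpu, mem, ht)
```

Note that the memory function never looks at the HyperTransport speed directly; it only sees the CPU clock.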
Core variants
Clawhammer (754) - Comes in both 512k and 1024k cache variants, with a 64-bit wide, single channel DDR SDRAM controller. They come in speeds ranging from 1.6GHz to 2.4GHz, with PR ratings ranging from 2800+ to 3700+. There are two revisions of this core. The first is the C0, which came in both 512k and 1024k flavors. The newer revision, the CG, comes only in the 1024k variety. The last two letters of the OPN code indicate its revision: AP for C0’s, and AR for CG’s. Both desktop and mobile parts are available on this core. Desktops and desktop replacement mobiles have a default voltage of 1.5v, and an approximate heat output of 81.9W. Mobiles have a default voltage of 1.4v, and an approximate heat output of 62W.
Newcastle (754) - These are all 512k parts, and are about identical to the Clawhammers in every other respect. There are no C0 Newcastles. All Newcastles currently are CG’s, as denoted by their OPN code of AX. These range in speeds from 1.6GHz to 2.4GHz, PR ratings from 2700+ to 3200+. The mobiles have a default core voltage of only 1.2v, and heat output of only 35W.
Sledgehammer (940)- Found in the Opterons and the FX51, and some FX53’s. These all have 1024k of L2 cache, and have a 144-bit wide dual channel DDR SDRAM memory controller, supporting only registered memory. Speeds range from 1.4GHz to 2.4GHz. These are all based on a 940-pin socket package. The OPN codes that you will likely see are AG, which are B3’s, AK, which are C0’s, and AT, which are CG’s. They have stock voltages of 1.50v, and wattage ranges from 82.9W to 89W. The Sledgehammers have up to three HyperTransport links for multi-way processing. These are factory enabled/disabled as appropriate, and cannot be modified.
Clawhammer (939)- Found in the FX53, and soon to be in the FX55. Currently only the FX53 exists, with a clock speed of 2.4GHz, OPN code AS, revision CG. Can you tell I’m getting tired yet? Stock voltage of 1.5v, wattage of 89W. Sports a 128-bit wide, dual channel memory controller, and supports unregistered memory.
Newcastle (939)- Exactly the same as above, except these have only 512k of L2 cache, and are found in the 3500+ and 3800+, OPN code AW, revision CG. Unlike the 940, the 939's have only one HyperTransport link, although for the end-user, this doesn't actually make any difference.
Available Chipsets
VIA K8T800- One of the highest performance A64 chipsets. Supports a maximum HyperTransport effective rate of 800MHz, but unfortunately, lacks AGP and PCI locks. AGP and PCI rates are determined by either a 1/6 or 1/7 divider off of the HyperTransport bus.
VIA K8T800 Pro- HyperTransport effective rate of 1000MHz supported, some motherboards have AGP/PCI locks, some don’t. Available for all sockets. Also has native support for SATA RAID 0+1, which is advantageous, as the PCI bus wouldn’t be used.
nVidia nForce3 150- Slightly lower in performance compared to the VIA’s, supports a max effective HyperTransport rate of 600MHz, but sports AGP and PCI locks in most boards, a huge plus. For the boards that don't have PCI locks (some SFF boards), PCI dividers up to HTT/9 are available, so even these shouldn't have trouble overclocking.
nVidia nForce3 250- Same as above, except supports a 1000MHz HyperTransport rate.
nVidia nForce3 250GB- Same as above, but with a richer feature set. Native support for SATA RAID 0+1, Gigabit Ethernet, and has a built-in firewall feature. This is the preferred chipset; with such rich native support, the PCI bus can be kept quite clean.
SiS 755- A very promising chipset, humbling both the nVidia’s and VIA’s handily at the same speed, and sports AGP/PCI locks. However, motherboard support for this chipset is quite lacking, and there isn’t a solid solution for overclocking that utilizes it to date.
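For boards without AGP/PCI locks, it can help to see what a given HyperTransport speed does to the other buses. This is a rough sketch, assuming the PCI clock is simply the HyperTransport speed divided by the chosen divider, and that AGP runs at twice the PCI clock (the usual 66/33MHz relationship). That 2x relationship and the function name are my assumptions, so check your board's behaviour.

```python
# Illustrative sketch only: assumes PCI = HTT / divider and AGP = 2 x PCI.

def unlocked_bus_speeds(htt_mhz, divider):
    """Return (pci_mhz, agp_mhz) for a board with no AGP/PCI lock."""
    pci = htt_mhz / divider
    agp = 2 * pci  # assumption: AGP tracks PCI at 2x
    return pci, agp

for htt in (200, 220, 240):
    for divider in (6, 7):
        pci, agp = unlocked_bus_speeds(htt, divider)
        print(f"HTT {htt} with 1/{divider}: PCI {pci:.1f}MHz, AGP {agp:.1f}MHz")
```

The further the PCI figure drifts from 33MHz, the more likely drive controllers and other cards are to misbehave.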
Configuring an A64 System
For your everyday overclocker, one of the most cost-effective solutions is to opt for a s754 512k 3000+, a CG Newcastle. CG Clawhammers are better performers, but are difficult to find outside of mobile parts. Most boards to date have certain issues with these.
The 939 Newcastles are rather pricey. Their dual-channel compensates for the lack of cache and adds a bit of performance as well, but nothing especially significant. However, if mathematical applications, etc. are a must, this may be worth consideration.
For the best of the best, there is no substitute for the 939 Clawhammer, AMD’s flagship line. The FX53 is virtually untouchable by any other processor once you get it going.
As far as chipsets go, the nF3 250GB is the best choice for most. 939 boards based on this chipset are right around the corner, as well. However, the K8T800 Pro has a slight edge in performance. Some VIA boards have PCI/AGP locks, and others don’t. The ones that do can sometimes be temperamental. For hardcore overclockers, the VIA may be the best route. Currently, good choices for socket 754 are the EPoX 8KDAJ/3+, based on the nF3 250GB chipset, the Gigabyte K8NS Pro, which, although it does not have the GB chipset, is still a very solid overclocker, and the VIA-based Abit KV8 Pro. The KV8’s seem to have PCI/AGP locks, and sport a wide range of voltage options. For socket 939, the VIA-based Asus A8V and nVidia-based Gigabyte K8NSNXP are the front-runners. I would opt for the latter personally, because of its no-fuss vdimm mod, and guaranteed PCI/AGP lock. (The Asus appears to be hit or miss).
Memory is where things get tricky. Running at memory voltages of over 3.3v has proven to be rather dangerous, so getting low latency memory, such as BH5/6, up to the extremely quick speeds that the A64 can take is usually not possible. Still, IMHO, it is a much better choice than high latency, high speed memory, even without a voltage mod. The tight latencies tend to be more useful than slightly higher bandwidth. Another complication with the A64 is that it tends to be very, very picky with double-sided modules. When running two at a time, it can be very difficult to get a decent overclock, and even with one, don’t expect to get as far as you may on other platforms. This is another reason why it’s a good idea to use low latency memory. Hitting a relatively low speed wall is quite possible, and if you’re using high latency memory, you’re pretty much trapped with low performance in this situation. Low latency, low speed, on the other hand, is still very competitive. My top picks would be memory based on Micron mT (e.g. OCZ 3200/3500/3700 EB), Winbond BH5/6 (e.g. Mushkin 222 Special), or Samsung TCCD (e.g. Corsair 3200XL), in that order. All are solid choices, and can be found in dirt cheap value memory as well, if you look hard enough. If you plan to go the high speed route, Hynix D5 is just about the only choice that I can recommend, as it can reach 280MHz+ speeds, and give low latency memory a run for its money. Only use single-sided memory in this situation.
If you are planning to use a motherboard without PCI/AGP locks, it is advisable to use a PCI66 compliant SATA/ATA controller card, to allow you to use high PCI speeds without corruption. Also desirable in this case are video cards that can support high AGP speeds. (Especially nVidia-based cards)
Overclocking Technique
You are essentially in control of the speeds of four different…ummm let’s call em data paths: the CPU, the memory bus, the HyperTransport bus, and the HyperTransport’s effective data rate. Overclocking the CPU is done pretty much as it always has been, except that the HyperTransport substitutes for the front side bus.
The HyperTransport can almost always go just as high as you need it, provided that you don’t exceed the motherboard’s maximum supported data rate by too much. For systems that support a 1000MHz HyperTransport data rate, for example, one could use a 200MHz HyperTransport bus with a 5x LDT multiplier. However, using 5x250 would result in an effective 1250MHz, which would almost certainly lead to instability. The LDT could be dropped to 4x, allowing for a stable 250MHz HyperTransport bus speed while resulting in the same 1000MHz effective data rate as default. Increasing the HyperTransport bus alone doesn’t accomplish anything, unlike increasing the front side bus on other platforms. Even raising the effective data rate doesn’t add any noticeable increase in performance, as the bus is already so wide that saturating it isn’t very likely. For this reason, the nF3 250’s and 150’s perform quite similarly to one another. My suggestion would be to leave the effective rate as close to stock as possible, and raise the HyperTransport bus speed only as much as necessary. Unless you’re using an 8x multiplier, there shouldn’t be much reason to go far above 300MHz in many cases.
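The 5x200 versus 4x250 reasoning above can be sketched as a small helper that picks the highest LDT multiplier keeping the effective rate at or under the board's rated maximum. The function name and the 1x-5x option list are assumptions for illustration. Treating the rated maximum as a hard limit is the conservative choice, since in practice a small overshoot may still be stable.

```python
# Illustrative sketch only: treats the chipset's rated rate as a hard ceiling.

def highest_safe_ldt(htt_mhz, max_effective_mhz, ldt_options=(1, 2, 3, 4, 5)):
    """Return the largest LDT multiplier with htt * ldt <= max_effective_mhz,
    or None if even the smallest option exceeds the limit."""
    candidates = [m for m in ldt_options if htt_mhz * m <= max_effective_mhz]
    return max(candidates) if candidates else None

for htt in (200, 250, 300):
    ldt = highest_safe_ldt(htt, 1000)
    print(f"HTT {htt}MHz on a 1000MHz board: LDT {ldt}x, effective {htt * ldt}MHz")
```

The same helper works for 800MHz and 600MHz boards by changing the second argument.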
What complicates the choice of multiplier and HyperTransport speed are the CPU/memory dividers. No motherboard allows you to manipulate them directly. Instead, they provide “maximum memory clocks,” or supposed HTT/memory ratios. Make no mistake, no such thing exists. The memory is derived off of the CPU speed, but it’s never made clear, and the dividers need to be manipulated indirectly. Also, CPU/mem dividers are integers only; there are no half dividers, so it’s advisable not to use half multipliers. Ok, what I just said probably doesn’t make much sense, so here are some examples of how to get certain CPU/mem dividers:
CPU/5-8 - Set memory to 200 and multiplier to desired divider
CPU/9
Memory to 200, multi to 9x(if available)
Memory to 183*, multi to 8x
CPU/10
Memory to 200, multi to 10x(if available)
Memory to 183*, multi to 9x(if available)
Memory to 166, multi to 8x
CPU/11
Memory to 200, multi to 11x(if available)
Memory to 183*, multi to 10x(if available)
Memory to 166, multi to 9x(if available)
Memory to 150, multi to 8x
CPU/12
Memory to 200, multi to 12x(if available)
Memory to 183, multi to 11x(if available)
Memory to 166, multi to 10x(if available)
Memory to 150, multi to 9x(if available)
Memory to 133, multi to 8x
Doesn’t make much sense? Don’t worry, it shouldn’t. You will probably need to experiment with different multipliers and max mem clocks to find the CPU/mem divider that you desire. Using half multipliers complicates things further, as the memory is divided integrally. Just to eliminate variables, drop the LDT multiplier down to 3x if your board supports a 1000MHz HyperTransport speed, or down to 2x if 800MHz or 600MHz is its maximum. You can also increase the HTT/LDT voltage on some motherboards, which can give you an extra 20-50MHz on your effective rate in some cases.
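One way to make sense of the list above is a rule of thumb that happens to reproduce every entry in it: divide the stock CPU speed (multiplier x 200) by the max memory clock setting, and round up to the next whole number. The sketch below assumes that rule, and assumes the 183, 166 and 133 settings really mean 183.33, 166.67 and 133.33MHz. Both are my assumptions, and integer multipliers are assumed throughout.

```python
from fractions import Fraction
from math import ceil

# Assumed exact values behind the BIOS "max memory clock" labels.
MEM_SETTINGS = {
    200: Fraction(200),
    183: Fraction(550, 3),   # 183.33MHz
    166: Fraction(500, 3),   # 166.67MHz
    150: Fraction(150),
    133: Fraction(400, 3),   # 133.33MHz
}

def cpu_mem_divider(multiplier, mem_setting):
    """Assumed rule: divider = stock CPU speed / memory setting, rounded up."""
    return ceil(Fraction(multiplier) * 200 / MEM_SETTINGS[mem_setting])

def actual_memory_mhz(htt_mhz, multiplier, mem_setting):
    """Real memory speed once the HyperTransport speed is raised."""
    return htt_mhz * multiplier / cpu_mem_divider(multiplier, mem_setting)

for setting in MEM_SETTINGS:
    print(f"8x with memory at {setting}: CPU/{cpu_mem_divider(8, setting)}")
```

With the divider known, `actual_memory_mhz` shows where the memory really lands for a given HyperTransport speed.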
As I've always stressed, overclocking needs to be done carefully and systematically. This is especially important with the A64. Focus on one area that you wish to overclock, and overclock it alone. For example, if you wish to overclock your memory, drop your LDT and CPU multipliers as low as they can go, and see how far your memory can go with everything else clocked low enough to not hinder stability. For overclocking the CPU, drop the LDT as low as it can go, and set the max memory clock as low as it can go as well. Once you find your maximum memory speed and CPU clock, play around with the max memory and CPU multiplier to find the suitable CPU/mem ratio. Once you've already got in mind how far the CPU and memory each can go, this isn't too difficult. I cannot stress enough how important it is to isolate variables. It's all too common that people try to max everything out at once, fail, and then give up out of frustration. Take your time, be patient, and have fun. Dividing and conquering can make the task of overclocking the A64 a lot less daunting.
This excellent tool by Cpjk can be very helpful in removing some of the confusion in figuring out how to run things.
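If you'd rather see the idea on paper, the search for a suitable CPU/mem ratio can be roughed out as below. This is not how any existing tool works; it is my own sketch. It assumes the round-up divider rule (stock CPU speed divided by the max memory clock setting, rounded up), assumes the 183/166/133 settings mean 183.33/166.67/133.33MHz, and prefers the highest CPU clock first, then the highest memory clock. The function name, the 8x-12x multiplier range and the 300MHz HyperTransport cap are all placeholders.

```python
from fractions import Fraction
from math import ceil

# Assumed exact values behind the BIOS "max memory clock" labels.
MEM_SETTINGS = {200: Fraction(200), 183: Fraction(550, 3), 166: Fraction(500, 3),
                150: Fraction(150), 133: Fraction(400, 3)}

def best_settings(max_cpu_mhz, max_mem_mhz, multipliers=range(8, 13), max_htt_mhz=300):
    """Try every multiplier / memory setting pair and keep the combination
    with the highest CPU clock, then the highest memory clock, that stays
    within the limits you found by testing each part on its own."""
    best = None
    for mult in multipliers:
        # Highest whole-MHz HyperTransport speed that keeps the CPU in range.
        htt = min(max_cpu_mhz // mult, max_htt_mhz)
        cpu = htt * mult
        for setting, nominal in MEM_SETTINGS.items():
            divider = ceil(Fraction(mult) * 200 / nominal)  # assumed rule
            mem = cpu / divider
            if mem > max_mem_mhz:
                continue  # memory would be pushed past its tested limit
            if best is None or (cpu, mem) > (best["cpu_mhz"], best["mem_mhz"]):
                best = {"multiplier": mult, "htt_mhz": htt, "mem_setting": setting,
                        "divider": divider, "cpu_mhz": cpu, "mem_mhz": mem}
    return best

print(best_settings(2500, 220))
```

Pass in only the multipliers your chip actually offers; the default range is just a placeholder.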
If you're lucky enough to have an FX, though, you don't need to bother with finding the right CPU/memory ratio. Simply find your maximum memory clock, and then increase the multiplier as necessary to max out the processor.
On a related note, the absolute maximum core voltage for A64’s rated by AMD is 1.65v, as opposed to 2.25v (I believe) for Bartons. The heat output of A64’s at the same speeds as AXP’s is roughly equivalent to what they’d put out with 0.2v less. For this reason, it usually is not too beneficial to exceed a core voltage of 1.7v or so on air cooling. On your everyday R404A setup, 1.8-1.85v usually appears to be all that’s needed for an optimal overclock.
One rather important memory-related setting is the command rate, a.k.a. CPU Interface on many other boards. The default for C0 processors is 1t, and the default for the CG’s is 2t. 1t is quicker, but makes overclocking the memory with double-sided sticks especially difficult in many cases. Running at 2t, however, costs only about 1 sec in SuperPI and PIFast, and makes a couple hundred point difference in 3DMark01. The one benchmark where it takes a significant toll is the Sandra Memory Bandwidth Benchmark, where it takes 10%, or 300-400 MB/sec, off. I don’t see having to run at 2t as the end of the world, unless you’re a Sandra fanatic. The difference between 1t and 2t is actually less than that between tRCD2 and tRCD3 in my experience. Again, there is no one size fits all solution. It may take some experimentation to see what combination of command rate, latencies and memory speeds is optimal for you. Low tRCD is highly recommended, but CAS doesn't matter very much. tRAS at 10, and nothing else, seems to deliver the best performance, while backing it down much lower begins to hurt.
Some notes on Windows Tweaking/Overclocking
Overclocking A64s within Windows was originally done because high HTT’s caused BIOS corruption; however, this doesn’t appear to be an issue today. It still can be very convenient, and for some boards like mine that don’t allow overclocking in the BIOS with mobiles, can be a godsend.
ClockGen is a Windows-based utility for overclocking. It allows multiplier, voltage, HyperTransport speed, and PCI/AGP bus speed manipulation. Changing the voltage doesn’t work on all boards, and the CPU/mem ratio and LDT multiplier cannot be changed using the utility, so some settings must be set in the BIOS. It also allows for profiles, so you can quickly change speeds on the fly. To make a profile, put the signature of your board, as found on the website, in brackets on the first line, e.g. [CG-NVNF3] for nForce3’s, and then the values you want to change in the succeeding lines, e.g. FID=9.0, HTT=250. For nForce3 boards, if you set your AGP rate or HTT rate anywhere above spec in the BIOS, the AGP/PCI lock is enabled, so you can increase the HTT easily in Windows. Setting the HTT to 201 in the BIOS is the most common technique.
The nVidia System utility is a nice tool to have for nVidia-based boards. It allows manipulation of the tRAS, tRCD and tRP within Windows, and also the changing of the HyperTransport and AGP/PCI speeds.
A64 Tweaker is an excellent utility written by CodeRed. It allows manipulation of just about every memory-related setting on the fly in Windows. It’s made my life dozens of times easier when trying to test things out. It also has much more functionality than you’ll find in most BIOSes.
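Going only by the ClockGen profile description above, a profile for an nForce3 board running a 9x multiplier at 250MHz HTT would look something like the fragment below. The signature and values are just the examples already mentioned, and the one-setting-per-line layout is my reading of that description. Use the signature listed for your own board.

```text
[CG-NVNF3]
FID=9.0
HTT=250
```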
------------------------------------------------------------------------------------------------------------------------
Thanks to Nuclear for help in putting this guide together.