Pentium 4 Prescott - the worst CPU ever made

**Darth-Apple** · June 5th, 2019 at 1:50 AM

Most of us probably remember the "blazing fast" Pentium 4 CPUs that debuted in the early 2000s. By today's standards, they would barely handle a modern system, if at all. Even on a single core, modern CPUs are many times faster. The Pentium 4 was especially controversial because its architecture was heavily reworked from earlier generations (Pentium 3 and below) for one main purpose: to achieve the highest clock speed possible, largely for marketing purposes. This came at a considerable performance penalty on a clock-for-clock basis. As a result, at the same clock speed, a Pentium 3 would considerably outperform a Pentium 4. Pentium 4s were only faster because they could clock to 3 GHz and beyond, where the Pentium 3 could only clock up to roughly 1 GHz.

This architecture was, as history has shown, released at a time when Intel was having quite a bit of trouble deciding where it was going. At the time, P4-based systems were fast, but they have not stood the test of time. Today, the Pentium 4 is known as perhaps the biggest architectural flop in the company's history. It consumed enormous amounts of power, generated a great deal of heat, and was very slow on a clock-for-clock basis compared to every architecture that came after it and most of the architectures that preceded it. However, because it was the first processor from Intel to go far beyond the 1 GHz barrier, reaching 3 GHz and beyond, it was still an important piece of Intel's processor history.

TL;DR: The Pentium 4 was fast because it pushed into uncharted territory and explored high clock speeds for the first time in Intel's history. However, it was an architectural nightmare, was highly inefficient, and created an unholy amount of heat with unrealistic power consumption. It was a poor design, and it was scrapped when Intel came out with the Core 2 Duo series, which was instead based on a descendant of the Pentium 3 architecture (by way of the Pentium M).

The original Pentium 4 was better than the revised versions that followed it.

Usually, when you revise a processor, the new one is supposed to be better than the one it replaces. With the Pentium 4, this was not the case. The original Pentium 4 cores (Willamette, then Northwood) were still considered decent for their time. They were fast, could reach high clock speeds, and quickly became a hit. Despite having worse clock-for-clock performance than a Pentium 3, the higher clock speeds more than made up for the difference.

A few years later, Intel created a new revision of the architecture, known as Prescott, to replace Northwood in all Pentium 4 CPUs. In the history of Intel processors, this is known as one of the worst revisions the company has ever made. These Prescott CPUs generated more heat, required more power, and performed worse clock-for-clock than the Northwood CPUs they were designed to replace. The goal was to clock the Prescott cores higher than their Northwood counterparts to make up the difference. However, the Prescott cores were so inefficient compared to the Northwood cores that it was highly impractical to raise the clock speeds high enough to do so. Almost all Prescott cores, even when clocked higher than the old cores, were considerably slower than the old Pentium 4s. Often the performance difference was 15% or more.

Why would a newer core be slower?

The main reason for this is a concept that is somewhat elusive in processor architecture: pipelining. It may seem that a longer pipeline is better, but this is usually not the case. The Pentium 3 CPUs had a pipeline of about 10 stages. The Northwood Pentium 4s had a pipeline of about 20 stages. The Prescott Pentium 4 chips had a pipeline of about 31 stages. Modern Intel CPUs, for comparison, have pipelines of about 14-19 stages. Much better. In general, the longer the pipeline is, the less work the processor gets done per clock cycle. However, longer pipelines are often used to reach higher clock speeds, which shorter pipelines can't achieve as easily.

But what is a pipeline?

It turns out that CPUs do not actually complete a single instruction in a single clock cycle. It takes many cycles to fully execute an instruction. The instruction has to be loaded from cache (one or more cycles), decoded (usually a few cycles), executed (at least one, sometimes several cycles), and written back to the registers and memory. Some of these operations take several cycles because of the complex gate circuitry in the processor. It takes time for a transistor to flip from one state to another, and if a stage contains very deep logic circuitry, the processor won't be able to reach high clock speeds because that stage will take too long to complete.
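To put rough numbers on the last point, here is a small sketch of the idea that the clock can only tick as fast as the slowest stage allows. The stage delays are made-up values for illustration, not measurements of any real chip.

```python
def max_clock_ghz(stage_delays_ns):
    """The clock period must be at least as long as the slowest stage."""
    return 1.0 / max(stage_delays_ns)  # 1 / nanoseconds = GHz


# Hypothetical delays in nanoseconds: fetch, decode, execute, write-back.
deep_logic = [0.8, 1.0, 2.0, 0.7]

# Same work, but the slow execute step is cut into four shorter stages.
shallow_logic = [0.8, 1.0, 0.5, 0.5, 0.5, 0.5, 0.7]

print(max_clock_ghz(deep_logic))     # limited by the 2.0 ns stage
print(max_clock_ghz(shallow_logic))  # limited by the 1.0 ns stage
```

In this toy model, cutting the slowest stage into smaller pieces doubles the achievable clock, at the cost of three extra stages.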

A pipeline splits these instruction logic processes into a series of much simpler steps that can be completed much more quickly. Each step represents one clock cycle, and with smaller steps, the processor can reach much higher clock speeds. A pipeline loads many instructions at once in an assembly line fashion. As a result, even though a single instruction may take ~20 cycles to execute, the processor is also working on 20 instructions at any given time, so the effective result is that one instruction is still done every clock cycle.

[url=https://en.wikipedia.org/wiki/Pipeline_(computing)]Wikipedia[/url] has a much better explanation for this than I could provide here.
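The assembly-line idea can be sketched in a few lines. This assumes an idealized pipeline with no stalls, where one instruction enters per cycle; the function name and the depths tried are just for illustration.

```python
def total_cycles(num_instructions, pipeline_depth):
    """Cycles to finish a stream of instructions with no stalls."""
    if num_instructions == 0:
        return 0
    # The first instruction needs `pipeline_depth` cycles to come out the
    # other end; every later one finishes one cycle after the one before it.
    return pipeline_depth + (num_instructions - 1)


for depth in (10, 20, 31):
    cycles = total_cycles(1000, depth)
    print(f"depth {depth}: {cycles} cycles, {cycles / 1000:.3f} cycles per instruction")
```

Once the pipeline is full, the depth barely matters: the cost per instruction stays very close to one cycle whether the pipeline has 10 stages or 31.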

So why are longer pipelines slower if they are constantly being fed with instructions anyway?

The problem is the unruly branch statement. Consider the following code:

Code:
# Pseudocode: the three functions and `password` are placeholders.
value = getPasswordFromUser()

if value == password:
    log_user_in()
else:
    ask_for_password_again()

This is a branch statement. It says that if a certain value matches some criteria, the code should do one thing. If not, it should do something else instead. These types of branching statements are very, very common. So common that a 20-stage pipeline will, on average, have a couple of them sitting in it at any given time.

The issue is that the processor doesn't actually know what value is going to be until the instruction that produces it has made it all the way through the pipeline. Because of this, even though the instructions for the if statement are sitting right behind it in the pipeline, the processor has no idea which side of the branch to execute. It has to wait for the pipeline to drain so that it knows what value is equal to. By the time it has done this, it has wasted about 20 cycles.

This is why the Prescott cores were so bad. Because of their longer pipeline, they had to waste 31 cycles every time there was an instruction like this. And if there was an instruction like this every 5-10 instructions, then the processor would execute 5-10 instructions, then stall for 31 cycles.
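The arithmetic is easy to check. This sketch uses the simplified model from the paragraph above: no branch prediction at all, and every branch stalls for one full pipeline length. Real chips do better than this, so treat the numbers as a worst case.

```python
def cycles_per_instruction(instructions_between_branches, pipeline_depth):
    """Average cycles per instruction if every branch stalls a full pipeline."""
    useful = instructions_between_branches  # one cycle each while flowing
    stalled = pipeline_depth                # cycles wasted waiting on the branch
    return (useful + stalled) / instructions_between_branches


for depth in (20, 31):
    for gap in (5, 10):
        cpi = cycles_per_instruction(gap, depth)
        print(f"depth {depth}, branch every {gap} instructions: "
              f"{cpi:.1f} cycles per instruction")
```

With a branch every 5 instructions, the 20-stage pipeline averages 5.0 cycles per instruction under this model and the 31-stage pipeline averages 7.2, so the longer pipeline needs a much higher clock just to break even.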

If so many cycles are wasted on every branch statement, how is this mitigated?

Modern processors mitigate this through a concept known as branch prediction. There is quite a bit of complex circuitry involved, but essentially, when the processor sees an if or else statement, it tries to guess which side is more likely to be taken and speculatively executes that side, without actually knowing the answer until the preceding instructions exit the pipeline. If the processor's guess was correct, it keeps executing. Otherwise, it throws away the results and starts over with the correct branch. As a result, it only wastes these cycles when it guesses wrong, which significantly reduces the cost. Modern processors are extremely good at this.
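As a rough illustration of the guessing part, here is a sketch of the classic textbook 2-bit saturating counter. Real predictors are far more elaborate, and this is not a description of what any particular Intel chip does; the class and function names are made up for the example.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 1  # start at "weakly not taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, staying within 0..3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Run one branch's history through the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then falls through once, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(count_mispredictions(loop_branch), "misses out of", len(loop_branch))
```

On this loop pattern the counter guesses wrong 11 times out of 100, so the pipeline is only flushed on about one branch in nine instead of on every branch.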

The Pentium 4 era CPUs fared much worse by comparison. They performed branch prediction, but nowhere near as accurately as modern chips, and when they had no history for a branch they fell back on simple fixed guesses. With a pipeline this long, every wrong guess was expensive, so this resulted in a lot of very long pipelines being flushed.

This is one of the reasons modern processors are so much faster, despite often running at lower clock speeds. Intel and AMD have done a fantastic job integrating complicated circuitry that lets the processor be used to its full capacity. Unfortunately, the Pentium 4 was a flop on the way to this goal. Modern Core-based processors descend from the Pentium 3 architecture (by way of the Pentium M) and scrapped most of the work that went into the Pentium 4. The P4 was so bad that Intel threw it away and started over, basing the new design on the very architecture that the P4 was supposed to replace.

Nevertheless, design mistakes such as these are a part of modern computing history, and they allowed Intel to learn how to build the powerful processors it has today. Rest assured, those are a true marvel of engineering.

Thomas · June 5th, 2019 at 2:49 AM

Very interesting! Thank you
