
Your iPhone probably outperforms your computer.

#1
Five years ago, phones were considerably limited in performance compared to their desktop counterparts. Raw computing power belonged to more capable machines: ones that could cool high-power Intel CPUs with larger fans and feed them from larger batteries or a wall outlet. 

Today, this is not so much the case. Apple's custom CPU designs are so powerful that they can outperform virtually any dual-core Intel processor made to date. That includes chips with Turbo Boost, significantly higher TDPs, full fan cooling, Hyper-Threading, and of course much higher clock speeds. The phone in your pocket can outperform almost all of them without a hitch, and it does so with a ~2.5 GHz CPU and a roughly 3-watt TDP. 

Apple has done a fantastic job designing these chips, and one of the latest rumors is that they will be designing chips for Mac computers in the near future. Although this would break compatibility with existing x86 binaries, it would still be revolutionary: these chips are not only more powerful, they also draw far less power, allowing for significant battery life gains. 

But how is this done? How can a tiny, fanless cell-phone CPU outperform a beefy Intel dual-core counterpart? The secret lies in a number of factors, but one of them is captured by the term RISC, which stands for reduced instruction set computer. Modern x86-based processors are known as CISC, or complex instruction set computers. These processors have thousands of instructions, and very deep circuitry is involved in decoding them. Once decoded, they are turned into micro-ops (or uops for short), which form a much simpler stream of operations that is actually sent to the processor's execution units. All of this decoding requires power and circuitry. In other words, Intel processors decode their CISC instructions into simpler operations on the fly, and those operations resemble RISC instructions by the time they are executed. 

The difference is that RISC binaries are already compiled into this simpler format, while x86 CPUs require heavy decoding logic to translate their instructions in real time.
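
As a rough illustration of the difference, here's a toy sketch in Python. The instruction strings and micro-op names are invented for illustration; real decoders are fixed-function hardware, not software, but the expansion step is the idea.

Code:

# Toy model (hypothetical encodings): a CISC-style instruction that adds a
# register to a value in memory gets split into simpler RISC-like micro-ops.

def decode_cisc(instruction):
    """Expand one complex instruction into simpler micro-ops."""
    if instruction == "ADD [mem], reg":      # read-modify-write memory add
        return [
            "LOAD tmp, [mem]",               # fetch the operand from memory
            "ADD tmp, reg",                  # do the actual arithmetic
            "STORE [mem], tmp",              # write the result back
        ]
    return [instruction]                     # simple ops pass through as-is

for uop in decode_cisc("ADD [mem], reg"):
    print(uop)

A RISC compiler would emit those three simple operations directly, so the chip spends no power on the expansion step, at the cost of a somewhat larger binary.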

The origins of the Apple CPU.

Apple CPUs have always been ARM-based rather than x86-based. ARM is an entirely different architecture, with radically different design principles. Interestingly enough, ARM, as a company, does not actually manufacture its chips. It merely designs them and allows individual manufacturers to license them and to manufacture (or modify) them as they please. Because of this, there is quite a large degree of variety in the ARM CPU market. ARM CPUs are everywhere: virtually every cell phone and tablet, most Chromebooks, and a few laptops are powered by them. Despite their variety, they share a strictly defined instruction set, and one ARM CPU is binary-compatible with another. 

However, ARM CPUs are designed for extremely low power consumption and high efficiency, which differs fundamentally from x86-based designs. They are built not necessarily for raw performance, but for efficiency. There are few complex instructions that must be decoded into simpler ones; instead, the binary that gets sent to the processor has already been broken down into these simpler instructions by the compiler. There are generally more instructions in total, making binaries larger, but each one is significantly easier to decode, saving time and power in the process. 

This is, of course, an incredibly simplified explanation of just one of the reasons ARM CPUs consume so much less power. In reality there are many reasons, most of which go beyond the scope of this post. The CISC-versus-RISC comparison also goes beyond this article in other ways, since there are real advantages to both. For example, because CISC processors can accomplish more in fewer instructions, they often generate less instruction-fetch traffic to memory; ARM CPUs need more cache to mitigate some of this. 

However, as evidenced by benchmarks of Apple's implementations of the ARM architecture, ARM has a lot of potential and quite a bit of room to grow. Apple's A11 and A12 are the first ARM implementations that have managed to outperform Intel CPUs in any significant capacity. 

Why are these CPUs so fast?

Well, Apple hasn't published much about the specifics of its architecture. Little is known about the deeper workings of these processors, but we do know a few things that may point to why they excel in benchmarks: 
  • These processors have a massive 8 MB L2 cache on-chip. 
  • These processors are incredibly wide architectures, so wide that no desktop CPU is in the same class. 
  • Thanks to that width, these processors can decode 7 instructions in a single clock cycle and dispatch them to any of 13 execution units. For comparison, Intel's modern processors (6th generation and newer) can decode 5 instructions per cycle and send them to 8 execution ports. 

This last point is likely one of the major reasons that this chip is so fast. It’s an incredibly wide architecture. 
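
To put rough numbers on what width alone buys, here's a deliberately idealized sketch. It assumes no stalls, perfect caches, and fully independent instructions, so it shows only the theoretical best case for the decode widths quoted above.

Code:

import math

# Best-case cycles to decode a run of fully independent instructions.
# Idealized on purpose: no stalls, no dependencies, perfect caches.
def decode_cycles(num_instructions, width):
    return math.ceil(num_instructions / width)

for name, width in [("modern Intel (5-wide)", 5), ("Apple A12 (7-wide)", 7)]:
    print(f"{name}: {decode_cycles(700, width)} cycles for 700 instructions")

In this ideal case, the 7-wide front end finishes the same work in 100 cycles instead of 140. The hard part, covered below, is keeping that width fed with independent instructions.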

What is a “wide architecture”? How does a chip dispatch seven instructions at once?  

Notably, modern processors do not execute code in the order it is written. Consider this: 

Code:
 
A = 1        # line 1
B = 2        # line 2
X = A + B    # line 3: needs A and B
C = 5        # line 4: independent
D = X + C    # line 5: needs X and C
F = 1        # line 6: independent

Your processor will execute these instructions in an entirely different order than they are written here. It will execute the assignments to A and B in parallel, on two different execution units, in the same clock cycle. There are generally several execution units available, including a few ALUs for integer arithmetic, at least one floating-point unit, and additional units for address calculation and for loading and storing data. There is usually more than one of each type, but not all execution units can process all types of instructions. The processor has a dedicated scheduling unit whose job is to determine which execution units are free and to send each instruction to an appropriate unit. 
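
Here's a minimal sketch of that dependency bookkeeping, applied to the snippet above. It assumes unlimited issue width and one-cycle instructions, which no real chip has; the point is just to show which instructions could issue together.

Code:

# Each entry is (destination, registers it reads). An instruction may
# issue once every register it reads has already been produced.
program = [
    ("A", []),          # A = 1
    ("B", []),          # B = 2
    ("X", ["A", "B"]),  # X = A + B
    ("C", []),          # C = 5
    ("D", ["X", "C"]),  # D = X + C
    ("F", []),          # F = 1
]

done, cycle, pending = set(), 0, list(program)
while pending:
    ready = [ins for ins in pending if all(r in done for r in ins[1])]
    print(f"cycle {cycle}: issue {[dest for dest, _ in ready]}")
    done.update(dest for dest, _ in ready)
    pending = [ins for ins in pending if ins not in ready]
    cycle += 1

Six sequential-looking instructions finish in three cycles: A, B, C, and F issue together, then X, then D.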

The processor will wait until both of those results are ready before executing line 3, because line 3 depends on the previous two lines. However, a wide processor has room to execute lines 4 and 6 before line 3 even begins. If a result isn't ready yet, the processor keeps looking ahead in the code for other instructions that don't depend on preceding ones. Wide architectures can execute several instructions at once, so long as the code contains instructions without unresolved dependencies. 

This concept is called out-of-order execution. And rest assured, it's an extremely complex feature that takes quite a bit of circuitry to implement. Virtually every modern processor, even in relatively low-end cell phones (with the exception of bottom-end Cortex-A53 based devices), implements out-of-order execution. On the desktop, the feature has been around since the Pentium Pro. 

If a processor had to complete these instructions strictly in order, total throughput would drop dramatically. 

The catch is that a lot of code is inherently sequential, leaving little for out-of-order execution to work with. For example: 

Code:
A = someMath         # everything below depends on this
B = A + 1            # needs A
C = sqrt(B)          # needs B
D = C - 3            # needs C
E = D                # needs D
E = D + 1            # needs D, and overwrites E
If (E = 5): 
    A = 1
Else: 
    A = 5

There is hardly an instruction in this code that can execute in parallel with another. Each line needs the result of the line before it, one after the other. 

However, as discussed in the previous post about modern processor architectures, branch prediction comes into play in this example. The processor can look ahead and see a branch statement: the if/else block. A unit known as the branch predictor will attempt to determine which outcome is more likely, and will speculatively execute that outcome. 

How do you keep results correct when they are processed in seemingly scrambled order?

How this is done in modern processors is highly complicated. The CPU typically keeps a table of previously seen branches and tracks the direction each one has taken. If, for example, the code runs in a loop and the branch above has been taken many times before, the processor can safely assume it will be taken again. Modern processors track this at lightning speed and can exceed 95% accuracy in many cases. They are so good at predicting branches that much of the performance penalty of longer pipelines is mitigated by the branch predictor unit. 
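
One classic building block for this kind of history table is a 2-bit saturating counter per branch: values 0-1 predict "not taken", values 2-3 predict "taken", and every actual outcome nudges the counter. The sketch below is a minimal illustration of that scheme (the branch address is made up); real predictors layer much more sophistication on top.

Code:

counters = {}  # branch address -> 2-bit counter (0..3)

def predict(branch):
    return counters.get(branch, 1) >= 2            # start weakly not-taken

def update(branch, taken):
    c = counters.get(branch, 1)
    counters[branch] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch that is taken 9 times, then falls through on loop exit:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += predict(0x400ab0) == taken
    update(0x400ab0, taken)
print(f"{correct}/{len(outcomes)} predictions correct")

After a single training iteration the counter locks onto the loop, and the only remaining miss is the final loop exit.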

However, on a brand-new branch the code has never hit before, there is no history to consult. In these cases a modern processor falls back on static branch prediction, which usually assumes that a forward branch is not taken. It speculatively executes the predicted path in parallel while the preceding instructions are completed. If the guess turns out to be correct, the processor keeps the results it calculated ahead of time. If it guessed incorrectly, it throws the results away, flushes the pipeline, and starts over down the correct path. 

Compilers are generally aware of how processors statically predict branches. If a compiler can determine that a branch is unlikely to be taken, it lays out the generated code so that the processor's default guess lands on the likely path. In other words, the compiler looks at your code, decides which outcome is more probable, and arranges the binary so the processor falls through into it. (Some compilers also accept explicit hints, such as GCC and Clang's __builtin_expect.) 

Whoa mate, we have a problem.  

Suppose that the actual code execution sequence gets reorganized, and looks like this: 

Code:

A = someMath
If (E = 5): 
    A = 1
Else: 
    A = 5

B = A + 1
C = sqrt(B) 
D = C - 3
E = D
E = D + 1

There is a slight problem with this reorganization of the code.

When it reaches the branch statement, the processor speculatively sets A to 1 earlier than it is supposed to. We don't yet know whether that's the value we want, but the processor executes it speculatively and checks later. The trouble is that the instructions coming after the branch in this reordered sequence depend on A still holding someMath, and we've speculatively overwritten it with 1. Not only do we have to verify the branch prediction after the fact, we also have to throw away everything computed afterward, because every instruction that followed used a version of A that was never supposed to be reinitialized yet. 

As a result, whether or not the initial branch prediction was correct, all of the results that follow it have to be thrown out. That's a waste of work if there ever was one, and it's a genuine problem for out-of-order execution. 

This is solved, in part, with register renaming.  

What the hell? Register renaming? What? 

In modern CPU architectures, there are actually multiple physical registers standing behind each register name that instructions can use. In other words, if your code uses register A, the processor holds several physical copies of it, even though only one of them is logically visible to the program. The compiler can only see one. 

To the programmer, these duplicate registers do not exist. The instruction set exposes exactly one register per name to the compiler, and thus to the programmer. 

The benefit, however, is that the CPU can track multiple versions of a register. In the code example above, two physical registers are both designated as A, with each carrying a different version of A's value. When the if/else block is speculatively executed, its result lands in a separate physical register, which only replaces the original once all the code that logically precedes it has finished executing. In other words, the CPU holds different versions of A and knows which one to use depending on where it is in the program. 
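
Here's a minimal sketch of the idea, assuming a simple dictionary standing in for the rename table (the values are made up; real hardware uses rename tables and a free list, not Python dictionaries):

Code:

physical = {}   # physical register file: preg -> value
rename = {}     # logical name -> physical register currently holding it
next_preg = 0

def write(logical, value):
    """Every write gets a fresh physical register (a new 'version')."""
    global next_preg
    preg, next_preg = next_preg, next_preg + 1
    physical[preg] = value
    rename[logical] = preg
    return preg

def read(logical):
    return physical[rename[logical]]

old = write("A", "someMath result")   # the original version of A
write("A", 1)                         # speculative A = 1 from the branch
print("old A:", physical[old], "| newest A:", read("A"))

If the speculation turns out to be wrong, the CPU simply points A back at the older physical register; nothing downstream has to be recomputed just because a stale copy was overwritten.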

Modern CPUs actually keep anywhere from 128 to 200+ instructions in flight in a buffer before they retire, and this buffer is the window the CPU reorders instructions from. The code above is only a few lines, but a modern CPU can look ahead on the order of a hundred or more instructions, hunting for ones that don't depend on results that aren't finished yet. 

And what we ultimately have is a super powerful processor that can dispatch multiple instructions in parallel, from seemingly random parts of the program. The term for this is a superscalar architecture.

If a cell phone processor can outperform Intel's offerings, we've accomplished something pretty astounding. 

There's quite a bit of additional information on Reddit if you're interested.

#2
So... Mac is going to go back to its roots in more than just case design, eh?

(btw, I love the new Mac Pro's case design)

Well, as a fan of the PowerPC Macs, I say go for it!

Just make sure not to get beaten on cost versus power.

Plus, with cross-platform libraries increasingly becoming a thing thanks to Vulkan, Steam, and other groups, I don't doubt that various programs will show up on the new Mac CPUs, whether through some Wine port, software patches, or straight-up native builds.
"I reject your reality and subsitute my own." - Adam Savage, Mythbusters
#3
macOS will likely include another translation layer when this happens. When the original switch from PowerPC to Intel happened, Apple included a translation layer (Rosetta) in the OS that could translate executables on the fly, with a bit of a performance penalty of course. At the time, Intel processors were fast enough to more or less hide the difference. Hopefully the same is true the second time around.

And I love the new design myself, though it's a bit pricey. I'm baffled that the default SSD is only 256 GB, and I've heard from some people that the iMac's GPU actually outperforms the base GPU option, but nobody buying a machine like this is going to go for the base configuration anyway. It's about time they had an upgradable desktop again, and this time around, I think they did it right.
