How to choose the right Amazon EC2 GPU instance for deep learning training and inference — from best performance to the most cost-effective and everything in between

Just a decade ago, if you wanted access to a GPU to accelerate your data processing or scientific simulation code, you'd either have to get hold of a PC gamer or contact your friendly neighborhood supercomputing center. Today, you can log on to your AWS console and choose from a range of GPU-based Amazon EC2 instances. What GPUs can you access on AWS, you ask? You can launch GPU instances with different GPU memory sizes (8 GB, 16 GB, 24 GB, 32 GB, 40 GB), NVIDIA GPU generations (Ampere, Turing, Volta, Maxwell, Kepler), different capabilities (FP64, FP32, FP16, INT8, Sparsity, Tensor Cores, NVLink), different numbers of GPUs per instance (1, 2, 4, 8, 16), paired with different CPUs (Intel, AMD, Graviton2). You can also select instances with different vCPU counts (core thread count), system memory and network bandwidth, and add a range of storage options (object storage, network file systems, block storage, etc.) — in summary, you have options. My goal with this blog post is to provide you with guidance on how to choose the right GPU instance on AWS for your deep learning projects. I'll discuss the key features and benefits of the various EC2 GPU instances, and the workloads best suited for each instance type and size. If you're new to AWS, or new to GPUs, or new to deep learning, my hope is that you'll find the information you need to make the right choice for your projects.
Key recommendations for the busy data scientist/ML practitioner

In a hurry? Just want the final recommendation without the deep dive? I've got you covered. Here are 5 GPU instance recommendations that should serve the majority of deep learning use cases. However, I do recommend you come back and review the rest of the article so you can make a more informed decision.

1. Highest-performing multi-GPU instance on AWS. Instance:
2. Highest-performing single-GPU instance on AWS. Instance:
3. Best performance/cost, single-GPU instance on AWS. Instance:
4. Best performance/cost, multi-GPU instance on AWS. Instance:
5. High-performance GPU instance on a budget on AWS. Instance:

With that, you should have enough information to get started with your project. If you're still itching to learn more, let's dive deep and geek out on every instance type and GPU type on AWS, their features, and when and why you should consider each of them.

Why you should choose the right "GPU instance", not just the right "GPU"
A GPU is the workhorse of a deep learning system, but the best deep learning system is more than just a GPU. You have to choose the right amount of compute power (CPUs, GPUs), storage, networking bandwidth and optimized software that can maximize utilization of all available resources. Some deep learning models need higher system memory or a more powerful CPU for data pre-processing; others may run fine with fewer CPU cores and less system memory. This is why you'll see many Amazon EC2 GPU instance options, some with the same GPU type but different CPU, storage and networking options. If you're new to AWS, or new to deep learning on AWS, making this choice can feel overwhelming. Let's start with the high-level EC2 GPU instance nomenclature on AWS. There are two families of GPU instances, the P family and the G family of EC2 instances, and the chart below shows the various instance generations and instance sizes.

Amazon EC2 GPU instances for deep learning

Historically, the P instance type represented GPUs better suited for high-performance computing (HPC) workloads, characterized by higher performance (higher wattage, more CUDA cores) and support for double precision (FP64) used in scientific computing. The G instance types had GPUs better suited for graphics and rendering, characterized by their lack of double precision and their lower cost and lower performance (lower wattage, fewer CUDA cores). All this has started to change as machine learning workloads on GPUs have grown rapidly in recent years. Today, the newer-generation P and G instance types are both suited for machine learning. The P instance type is still recommended for HPC workloads and demanding machine learning training workloads, and I recommend the G instance type for machine learning inference deployments and less compute-intensive training. All this will become clearer in the following sections, when we discuss specific GPU instance types.
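To make the P-family vs. G-family guidance above concrete, here is a small illustrative helper that encodes the rule of thumb: P instances for HPC and demanding training, G instances for inference and less compute-intensive training. The function name and workload categories are my own, not an AWS API — it's just the decision logic written down.

```python
def recommend_family(workload: str) -> str:
    """Rule-of-thumb mapping from workload type to EC2 GPU instance family.

    Encodes the guidance above: P instances for HPC and demanding
    training, G instances for inference and lighter training.
    Illustrative only; the categories are hypothetical, not an AWS API.
    """
    p_family = {"hpc", "large-scale-training", "distributed-training"}
    g_family = {"inference", "prototyping", "small-scale-training"}
    if workload in p_family:
        return "P"
    if workload in g_family:
        return "G"
    raise ValueError(f"unknown workload type: {workload}")

print(recommend_family("distributed-training"))  # P
print(recommend_family("inference"))             # G
```

The later sections refine this one step further, down to specific instance generations and sizes within each family.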
Each instance size has a certain vCPU count, GPU memory, system memory, number of GPUs per instance, and network bandwidth. The number next to the letter (P3, G5) represents the instance generation; the higher the number, the newer the instance type. Each instance generation can have GPUs with a different architecture, and the timeline image below shows the NVIDIA GPU architecture generations, GPU types and the corresponding EC2 instance generations. Now let's take a look at each of these instances by family, generation and size, in the order listed below.

We'll discuss each GPU instance type in the order shown here

Amazon EC2 P4: Highest-performing deep learning training GPU instance type on AWS

P4 instances provide access to NVIDIA A100 GPUs based on the NVIDIA Ampere architecture. They come in only one size: a multi-GPU instance with 8 A100 GPUs with 40 GB of GPU memory per GPU, 96 vCPUs, and 400 Gbps network bandwidth for record-setting training performance.

P4 instance features at a glance:
What's new in the NVIDIA Ampere-based NVIDIA A100 GPU on P4 instances?

Every new GPU generation is faster than the previous generation, and there's no exception here. The NVIDIA A100 is significantly faster than the NVIDIA V100 (found on P3 instances, discussed later), but also includes newer precision types suited for deep learning, particularly BF16 and TF32. Deep learning training is typically done in single precision, or FP32. The choice of the FP32 IEEE standard format pre-dates deep learning, so hardware and chip manufacturers have started to support newer precision types that work better for deep learning. This is a perfect example of hardware evolving to suit the needs of the application, vs. developers having to change applications to work on existing hardware. The NVIDIA A100 includes special cores for deep learning called Tensor Cores to run mixed-precision training, first introduced in the Volta architecture. Rather than training the model in single precision (FP32), your deep learning framework can use Tensor Cores to perform matrix multiplication in half precision (FP16) and accumulate in single precision (FP32). This often requires updating your training scripts, but can lead to much higher training performance. Each framework handles this differently, so refer to your framework's official guides (TensorFlow, PyTorch and MXNet) for using mixed precision. The NVIDIA A100 GPU supports two new precision formats — BF16 and TensorFloat-32 (TF32). The advantage of TF32 is that the TF32 Tensor Cores on the NVIDIA A100 can read FP32 data from the deep learning framework and produce a standard FP32 output, but internally use reduced precision. This means that, unlike mixed-precision training, which often requires code changes to your training scripts, frameworks like TensorFlow and PyTorch can support TF32 out of the box. BF16 is an alternative to the IEEE FP16 standard that has a higher dynamic range, better suited for processing gradients without loss in accuracy. TensorFlow has supported BF16 for a while, and you can now take advantage of BF16 precision on the NVIDIA A100 GPU when using P4 instances. P4 instances come in only one size.

p4d.24xlarge: Fastest GPU instance in the cloud

If you need the absolutely fastest training GPU instance in the cloud, then look no further than the p4d.24xlarge.
You get access to 8 NVIDIA A100 GPUs with 40 GB of GPU memory each, interconnected with 3rd-generation NVLink that theoretically doubles the inter-GPU bandwidth compared to the 2nd-generation NVLink on the NVIDIA V100, available on the P3 instance type we'll discuss in a later section.

Amazon EC2 G5: Best performance-per-cost single-GPU instance, and multi-GPU instance options for inference deployment

G5 instances are interesting, as there are two types of NVIDIA GPUs under this instance type. This is a departure from all other instance types, which have a 1:1 relationship between EC2 instance type and GPU architecture type. The G5 instance type has two different sub-categories with different CPU and GPU types, and they each come in different instance sizes that include single- and multi-GPU instances. First, let's take a look at the G5 instances.

G5 instances: Best performance-per-cost single-GPU instances on AWS

G5 instance features at a glance:
What do you get with G5?

If you take a look at the output of nvidia-smi on a g5.xlarge instance, shown below, you'll see the single GPU available to you. This makes G5 instances perfect for single-GPU training, and for migrating your training workload to P4 if your model and data size grow and you need to do distributed training, or if you want to run multiple parallel training experiments on a faster GPU.

Output of nvidia-smi on g5.xlarge

Although you get access to multi-GPU instance sizes, I do not recommend them for multi-GPU distributed training, since there is no high-bandwidth NVIDIA NVLink GPU interconnect, and communication will fall back to PCIe, which is significantly slower. The multi-GPU options on G5 are meant for hosting multiple models, one per GPU, for inference deployment use cases.

G5g instances: Good performance and cost-effective GPU instance, if you're OK with an ARM CPU

G5g instance features at a glance:
What do you get with G5g?

Unlike G5 instances, G5g instances offer NVIDIA T4G GPUs, which are based on the older NVIDIA Turing architecture. The NVIDIA T4G GPU's closest cousin is the NVIDIA T4 GPU, available on the Amazon EC2 G4 instance that I'll discuss in a later section. The key difference between the G5g instance and the G4 instance is, interestingly, the choice of CPU.
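Since the GPUs are so similar, the practical difference shows up in the CPU: G5g pairs the T4G with an ARM-based Graviton2 CPU, while G4 pairs the T4 with an x86_64 CPU. A tiny, illustrative sketch of checking which world you're in (the note strings and helper are my own, not an AWS utility):

```python
import platform

# Hypothetical helper: map the CPU architecture reported by the OS to a
# note about framework availability. G4 instances report x86_64; G5g
# instances (Graviton2) report aarch64. The wording is illustrative --
# always check your framework's docs for the builds it actually ships.
ARCH_NOTES = {
    "x86_64": "G4-style (x86_64 CPU): most prebuilt framework wheels work out of the box",
    "aarch64": "G5g-style (Graviton2 ARM CPU): verify your framework ships aarch64 builds",
}

def build_note(machine: str) -> str:
    return ARCH_NOTES.get(machine, f"unrecognized architecture: {machine}")

print(build_note(platform.machine()))
```

Running this on the instance itself tells you immediately whether a stock pip wheel is likely to work or whether you need an ARM-specific build.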
Your choice between these two should come down to the CPU architecture you prefer. My personal preference for machine learning today would be the G4 instance over the G5g instance, since more open-source frameworks are designed to run on Intel CPUs vs. ARM-based CPUs.

Amazon EC2 P3: Highest-performance single-GPU instance and cost-effective multi-GPU instance options on AWS

P3 instances provide access to NVIDIA V100 GPUs based on the NVIDIA Volta architecture, and you can launch a single GPU per instance or multiple GPUs per instance (4 GPUs, 8 GPUs).

P3 instance features at a glance:
The NVIDIA V100 also includes Tensor Cores to run mixed-precision training, but doesn't offer the TF32 and BF16 precision types introduced in the NVIDIA A100 offered on the P4 instance. P3 instances, however, come in 4 different sizes, from a single-GPU instance size up to an 8-GPU instance size, making them the ideal choice for flexible training workloads. Let's take a look at each of the instance sizes.

p3.2xlarge: Best GPU instance for single-GPU training

This should be your go-to instance for most of your deep learning training work if you need a single GPU and performance is a priority. G5 instances are more cost-effective, for slightly lower performance than P3.

p3.8xlarge and p3.16xlarge: Ideal GPU instances for small-scale multi-GPU training and running parallel experiments

If you need more GPUs for experimenting, more vCPUs for data pre-processing and data augmentation, or higher network bandwidth, consider these sizes. Multi-GPU training jobs: If you're just getting started with multi-GPU training, 4 GPUs on a single p3.8xlarge are a good place to start.
Parallel experiments: Multi-GPU instances also come in handy when you have to run variations of your model architecture and hyperparameters in parallel, to experiment faster.

p3dn.24xlarge: High-performance and cost-effective training

This instance previously held the "fastest GPU instance in the cloud" title, which now belongs to the p4d.24xlarge.

Amazon EC2 G4: High-performance single-GPU instances for training and multi-GPU options for cost-effective inference

G4 instances provide access to NVIDIA T4 GPUs based on the NVIDIA Turing architecture. You can launch a single GPU per instance or multiple GPUs per instance (4 GPUs, 8 GPUs). In the timeline diagram below, you'll see that right below the G4 instance is the G5g instance; both are based on GPUs with the NVIDIA Turing architecture. We already discussed the G5g instance type in the earlier section, and the GPUs in G4 (NVIDIA T4) and G5g (NVIDIA T4G) are very similar in performance. Your choice will come down to the choice of CPU type on these instances.
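A big part of what makes these T4-class GPUs cost-effective for inference is reduced-precision (INT8) execution. As a stdlib-only illustration of the underlying idea — not NVIDIA's implementation, which lives in TensorRT and the frameworks' quantization tools — here is a sketch of symmetric per-tensor INT8 quantization of FP32 weights:

```python
# Minimal sketch of the idea behind INT8 inference: FP32 weights are
# mapped to 8-bit integers with a per-tensor scale factor, trading a
# small amount of accuracy for higher throughput. Illustrative only.

def quantize_int8(weights):
    """Symmetric per-tensor quantization of FP32 values to INT8."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.52, -1.73, 0.004, 1.1]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is close to, but not exactly, the original:
# that rounding error is the accuracy/throughput trade-off at work.
```

The largest-magnitude weight pins the scale, and every other weight is rounded onto a grid of 255 levels — which is why the accuracy drop depends so heavily on the weight distribution of your particular model.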
In the GPU timeline diagram, you can see that the NVIDIA Turing architecture came after the NVIDIA Volta architecture and introduced several new features for machine learning, like next-generation Tensor Cores and integer precision support, which make it ideal for cost-effective inference deployments and graphics.

G4 instance features at a glance:
What's new in the NVIDIA T4 GPU on G4 instances?

NVIDIA Turing was the first architecture to introduce support for the integer precision (INT8) data type, which can significantly accelerate inference throughput. During training, model weights and gradients are typically stored in single precision (FP32). As it turns out, to run predictions on a trained model you don't actually need full precision, and you can get away with reduced-precision calculations in either half precision (FP16) or 8-bit integer precision (INT8). Doing so gives you a boost in throughput without sacrificing too much accuracy. There will be some drop in accuracy, and how much depends on various factors specific to your model and training. Overall, you get the best inference performance/cost with G4 instances compared to other GPU instances. NVIDIA's support matrix shows which neural network layers and GPU types support INT8 and other precisions for inference. The NVIDIA T4 (and NVIDIA T4G) are the lowest-powered GPUs on any EC2 instance on AWS. The following instance sizes all give you access to a single NVIDIA T4 GPU, with increasing numbers of vCPUs, system memory, storage and network bandwidth. G4 instance sizes also include two multi-GPU configurations.

Amazon EC2 P2: Cost-effective for HPC workloads, NO LONGER recommended for ML-only workloads

P2 instances give you access to NVIDIA K80 GPUs based on the NVIDIA Kepler architecture. The Kepler architecture is a few generations old (Kepler -> Maxwell -> Pascal -> Volta -> Turing), so these are not the fastest GPUs around. They do have some specific features, such as double-precision (FP64) support, that make them attractive and cost-effective for high-performance computing (HPC) workloads that rely on the extra precision. P2 instances come in 3 different sizes: p2.xlarge (1 GPU), p2.8xlarge (8 GPUs), p2.16xlarge (16 GPUs). The NVIDIA K80 is an interesting GPU: a single NVIDIA K80 is actually two GPUs on one physical board, which NVIDIA calls a dual-GPU design. What this means is that when you launch a p2.xlarge instance, you get just one of the two GPUs on the K80 board, i.e. "half" a K80.
P2 instance features at a glance:
So, should I even use P2 instances for deep learning?

No, there are better options, discussed above. Prior to the launch of Amazon EC2 G4 and G5 instances, P2 instances were the recommended cost-effective deep learning training instance type. Since the launch of G4 instances, I recommend G4 as the go-to cost-effective training and prototyping GPU instance for deep learning. P2 continues to be cost-effective for HPC workloads in scientific computing, but you'll miss out on several new features, such as support for mixed-precision training (Tensor Cores) and reduced-precision inference, which have become standard on newer generations.

Amazon EC2 G3: NO LONGER recommended for ML-only workloads

G3 instances give you access to NVIDIA M60 GPUs based on the NVIDIA Maxwell architecture. NVIDIA refers to the M60 GPUs as virtual workstations and positions them for professional graphics. However, with much more powerful and cost-effective options for deep learning in the P3, G4, G5 and G5g instances, G3 is not a recommended option for deep learning. I've only included it here for some history and for the sake of completeness.

G3 instance features at a glance:
Should you consider G3 instances for deep learning?

Prior to the launch of Amazon EC2 G4 instances, single-GPU G3 instances were a cost-effective way to develop, test and prototype. And although the Maxwell architecture is more recent than the NVIDIA K80's Kepler architecture found on P2 instances, you should still consider P2 instances before G3 for deep learning. Your choice order should be P3 > G4 > P2 > G3. G3 instances come in 4 sizes.

Other machine learning instance options on AWS

NVIDIA GPUs are no doubt a staple for deep learning, but there are other instance options and accelerators on AWS that may be the better option for your training and inference workloads.
For a detailed discussion of inference deployment options, please refer to the blog post on choosing the right AI accelerators for inference.

Cost optimization tips when using GPU instances for ML

You have a few different options to optimize the cost of your training and inference workloads.

Spot Instances: Spot-Instance pricing makes high-performance GPUs much more affordable, and allows you to access spare Amazon EC2 compute capacity at a steep discount compared to on-demand rates. For an up-to-date list of prices by instance and Region, visit the Spot Instance Advisor. In some cases you can save over 90% on your training costs, but your instances can be preempted and terminated with just 2 minutes' notice. Your training scripts must implement frequent checkpointing and the ability to resume training once Spot capacity is restored.

Amazon SageMaker managed training: During the development phase, much of your time is spent prototyping, tweaking code and trying different options in your favorite editor or IDE (which is obviously VIM) — all of which don't need a GPU. You can save costs by simply decoupling your development and training resources, and Amazon SageMaker lets you do this easily. Using the Amazon SageMaker Python SDK, you can test your scripts locally on your laptop, desktop, EC2 instance or SageMaker notebook instance. When you're ready to train, specify what GPU instance type you want to train on, and SageMaker will provision the instances, copy the dataset to the instance, train your model, copy the results back to Amazon S3, and tear down the instance. You are only billed for the exact duration of training. Amazon SageMaker also supports managed Spot Training for additional convenience and cost savings.
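The checkpoint-and-resume requirement for Spot training can be sketched with nothing but the standard library. The file name and state layout below are my own choices, and the loop body is a stand-in for a real training step; the pattern — save atomically and often, resume from the last checkpoint on restart — is what matters:

```python
import json
import os

# Illustrative checkpoint/resume pattern for Spot training. Save
# progress frequently so a 2-minute interruption notice loses at most
# one recent step, and resume from the last checkpoint when Spot
# capacity comes back. File name and state layout are assumptions.
CKPT = "checkpoint.json"

def save_checkpoint(epoch, state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, CKPT)  # atomic rename: never leaves a partial file

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "state": {}}

ckpt = load_checkpoint()  # starts at epoch 0 on a fresh run
for epoch in range(ckpt["epoch"], 5):
    ckpt["state"]["loss"] = 1.0 / (epoch + 1)  # stand-in for a real training epoch
    save_checkpoint(epoch + 1, ckpt["state"])  # an interrupted run resumes here
```

If the instance is reclaimed mid-run, relaunching the script picks up at the last saved epoch instead of epoch 0. In a real framework you'd serialize model and optimizer state (e.g. to S3) rather than a JSON dict, but the control flow is the same.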
Use just the required amount of GPU with Amazon Elastic Inference: Save costs for inference workloads by leveraging Elastic Inference to add just the right amount of GPU acceleration to your CPU instances, as discussed in this blog post: A complete guide to AI accelerators for deep learning inference — GPUs, AWS Inferentia and Amazon Elastic Inference.

Optimize for cost by improving utilization
What software to use on Amazon EC2 GPU instances?

Downloading your favorite deep learning framework is easy, right? Not quite: deep learning frameworks have upstream and downstream dependencies on higher-level schedulers and orchestrators and lower-level infrastructure services. For this reason, I highly recommend using the AWS Deep Learning AMIs or AWS Deep Learning Containers (DLCs) instead. AWS qualifies and tests them on all Amazon EC2 GPU instances, and they include AWS optimizations for networking and storage access, plus the latest NVIDIA and Intel drivers and libraries. By using the AWS AMIs and AWS DLCs, you know everything has been tested end-to-end and is guaranteed to give you the best performance.

Which GPUs to consider for HPC use cases?

High-performance computing (HPC) is another scientific domain that relies on GPUs to speed up computation for simulation, data processing and visualization. While deep learning training can be done with lower-precision arithmetic, from FP32 (single precision) down to FP16 (half precision) and variations such as BF16 and TF32, HPC applications need high-precision arithmetic up to FP64 (double precision). The NVIDIA A100, V100 and K80 GPUs support FP64 precision, and these are available on P4, P3 and P2 instances, respectively.

A complete and unapologetically detailed spreadsheet of all GPU instances and their features

In today's "I put this together because I couldn't find one already" contribution, I present to you a GPUs-on-AWS feature list. I often want to know how much memory is on a specific GPU, whether a specific precision type is supported on a GPU, or whether the instance has an Intel, AMD or Graviton CPU, etc., before I launch a GPU instance. To avoid having to go through various webpages and NVIDIA white papers, I've painstakingly compiled all the information into a table. You can use the image below or go right to the markdown table embedded at the end of the post and hosted on GitHub, your choice. Enjoy!
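The precision spectrum above — FP64 for HPC, FP32 for training, FP16/BF16/TF32 for acceleration — can be made tangible with the standard library alone. Python floats are FP64 (double precision); round-tripping a value through struct's 32-bit 'f' format shows what survives at FP32:

```python
import struct

# Stdlib-only illustration of why precision matters. Python floats are
# FP64 (double precision); packing and unpacking through struct's
# 32-bit 'f' format rounds a value to the nearest representable FP32.
def to_fp32(x: float) -> float:
    """Round an FP64 value to the nearest representable FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 1.0 + 1e-9            # a tiny increment on top of 1.0
print(x == 1.0)           # False: FP64 resolves the increment
print(to_fp32(x) == 1.0)  # True: FP32 (~7 decimal digits) rounds it away
```

The increment is far above FP64's resolution but below FP32's, so it vanishes at single precision — exactly the kind of accumulated error that pushes HPC simulations toward FP64, while deep learning training tolerates (and exploits) even coarser formats.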
GPU features at a glance

Do you prefer consuming content in a graphical format? I've got you covered there too! The following image shows all the GPU instance types and sizes on AWS. There isn't enough space for all the features; for those, I still recommend the spreadsheet.

Hey there, thanks for reading!

If you found this article interesting, consider giving it an applause and following me on Medium. Please also check out my other blog posts on Medium, or follow me on Twitter (@shshnkp) or LinkedIn, or leave a comment below. Want me to write on a specific machine learning topic? I'd love to hear from you!