Optimizing system throughput, latency, and power consumption

Abstract
This documentation collection provides instructions on how to monitor and optimize the throughput, latency, and power consumption of Red Hat Enterprise Linux 8 in different scenarios.

Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.

Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Please let us know how we could make it better.
Chapter 1. Overview of performance monitoring options
The following are some of the performance monitoring and configuration tools available in Red Hat Enterprise Linux 8:
Additional resources
Chapter 2. Getting started with TuneD
As a system administrator, you can use the TuneD application to optimize the performance profile of your system for a variety of use cases.

2.1. The purpose of TuneD
TuneD is a service that monitors your system and optimizes its performance under certain workloads. At the core of TuneD are profiles, which tune your system for different use cases. TuneD is distributed with a number of predefined profiles for use cases such as:
It is possible to modify the rules defined for each profile and customize how to tune a particular device. When you switch to another profile or deactivate TuneD, all changes made to the system settings by the previous profile revert to their original state. You can also configure TuneD to react to changes in device usage and adjust settings to improve the performance of active devices and reduce the power consumption of inactive devices.

2.2. TuneD profiles
A detailed analysis of a system can be very time-consuming. TuneD provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.

The profiles provided with TuneD are divided into the following categories:
The performance-boosting profiles include profiles that focus on the following aspects:
Syntax of profile configuration
The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional. Lines starting with the hash sign (#) are comments.

Additional resources
2.3. The default TuneD profile
During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:
Additional resources
2.4. Merged TuneD profiles
As an experimental feature, it is possible to select more profiles at once. TuneD will try to merge them during the load. If there are conflicts, the settings from the last specified profile take precedence.

Example 2.1. Low power consumption in a virtual guest
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, while the low power consumption is the priority:

# tuned-adm profile virtual-guest powersave

Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to the low value by the spindown-disk profile.

Additional resources
* tuned-adm man page
* tuned.conf(5) man page

2.5. The location of TuneD profiles
TuneD stores profiles in the following directories:

/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.

/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.

Additional resources
2.6. TuneD profiles distributed with RHEL
The following is a list of profiles that are installed with TuneD on Red Hat Enterprise Linux. There might be more product-specific or third-party TuneD profiles available. Such profiles are usually provided by separate RPM packages.

balanced
The default power-saving profile. It is intended to be a compromise between performance and power consumption. It uses auto-scaling and auto-tuning whenever possible. The only drawback is the increased latency. In the current TuneD release, it enables the CPU, disk, audio, and video plugins, and activates the conservative CPU governor. The radeon_powersave option uses the dpm-balanced value if it is supported, otherwise it is set to auto. It changes the energy_performance_preference attribute to the normal energy setting. It also changes the scaling_governor policy attribute to either the conservative or powersave CPU governor.

powersave
A profile for maximum power-saving performance. It can throttle performance in order to minimize the actual power consumption. In the current TuneD release, it enables USB autosuspend, WiFi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host adapters. It also schedules multi-core power savings for systems with a low wakeup rate and activates the ondemand governor. It enables AC97 audio power saving or, depending on your system, HDA-Intel power savings with a 10-second timeout. If your system contains a supported Radeon graphics card with KMS enabled, the profile configures it to automatic power saving. On ASUS Eee PCs, a dynamic Super Hybrid Engine is enabled. It changes the energy_performance_preference attribute to the powersave or power energy setting. It also changes the scaling_governor policy attribute to either the ondemand or powersave CPU governor.

In certain cases, the balanced profile is more efficient compared to the powersave profile. Consider a defined amount of work that needs to be done, for example a video file that needs to be transcoded. Your machine might consume less energy if the transcoding is done at full power, because the task is finished quickly, the machine starts to idle, and it can automatically step down to very efficient power-save modes. On the other hand, if you transcode the file on a throttled machine, the machine consumes less power during the transcoding, but the process takes longer and the overall consumed energy can be higher. That is why the balanced profile can generally be a better option.

throughput-performance
A server profile optimized for high throughput. It disables power-saving mechanisms and enables sysctl settings that improve the throughput performance of disk and network IO. The CPU governor is set to performance. It changes the energy_performance_preference and scaling_governor attributes to the performance profile.

accelerator-performance
The accelerator-performance profile contains the same tuning as the throughput-performance profile. Additionally, it locks the CPU to low C states so that the latency is less than 100us. This improves the performance of certain accelerators, such as GPUs.

latency-performance
A server profile optimized for low latency. It disables power-saving mechanisms and enables sysctl settings that improve latency. The CPU governor is set to performance and the CPU is locked to the low C states (by PM QoS). It changes the energy_performance_preference and scaling_governor attributes to the performance profile.
network-latency
A profile for low latency network tuning. It is based on the latency-performance profile. It additionally disables transparent huge pages and NUMA balancing, and tunes several other network-related sysctl parameters. It inherits the latency-performance profile, which changes the energy_performance_preference and scaling_governor attributes to the performance profile.

hpc-compute
A profile optimized for high-performance computing. It is based on the latency-performance profile.

network-throughput
A profile for throughput network tuning. It is based on the throughput-performance profile. It additionally increases kernel network buffers. It inherits either the latency-performance or throughput-performance profile, and changes the energy_performance_preference and scaling_governor attributes to the performance profile.

virtual-guest
A profile designed for Red Hat Enterprise Linux 8 virtual machines and VMWare guests based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness and increases disk readahead values. It does not disable disk barriers. It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.

virtual-host
A profile designed for virtual hosts based on the throughput-performance profile that, among other tasks, decreases virtual memory swappiness, increases disk readahead values, and enables a more aggressive value of dirty pages writeback. It inherits the throughput-performance profile and changes the energy_performance_preference and scaling_governor attributes to the performance profile.

oracle
A profile optimized for Oracle database loads based on the throughput-performance profile. It additionally disables transparent huge pages and modifies other performance-related kernel parameters. This profile is provided by the tuned-profiles-oracle package.

desktop
A profile optimized for desktops, based on the balanced profile. It additionally enables scheduler autogroups for better response of interactive applications.

optimize-serial-console
A profile that tunes down I/O activity to the serial console by reducing the printk value. This should make the serial console more responsive. This profile is intended to be used as an overlay on other profiles. For example:

# tuned-adm profile throughput-performance optimize-serial-console

mssql
A profile provided for Microsoft SQL Server. It is based on the throughput-performance profile.

intel-sst
A profile optimized for systems with user-defined Intel Speed Select Technology configurations. This profile is intended to be used as an overlay on other profiles. For example:

# tuned-adm profile cpu-partitioning intel-sst

2.7. TuneD cpu-partitioning profile
For tuning Red Hat Enterprise Linux 8 for latency-sensitive workloads, Red Hat recommends using the cpu-partitioning TuneD profile. Prior to Red Hat Enterprise Linux 8, the low-latency Red Hat documentation described the numerous low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 8, you can perform low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily customizable according to the requirements for individual low-latency applications.

The following figure is an example to demonstrate how to use the cpu-partitioning profile. This example uses the CPU and node layout.

Figure 2.1. Figure cpu-partitioning
You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the following configuration options:

Isolated CPUs with load balancing
In the cpu-partitioning figure, the blocks numbered from 4 to 23 are the default isolated CPUs. The kernel scheduler’s process load balancing is enabled on these CPUs. It is designed for low-latency processes with multiple threads that need the kernel scheduler load balancing. You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the isolated_cores=cpu-list option, which lists CPUs to isolate that will use the kernel scheduler load balancing. The list of isolated CPUs is comma-separated, or you can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.

Isolated CPUs without load balancing
In the cpu-partitioning figure, the blocks numbered 2 and 3 are the isolated CPUs that do not provide any additional kernel scheduler process load balancing. You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the no_balance_cores=cpu-list option, which lists CPUs to isolate that will not use the kernel scheduler load balancing. Specifying the no_balance_cores option is optional; however, any CPUs in this list must be a subset of the CPUs listed in isolated_cores. Application threads using these CPUs need to be pinned individually to each CPU.

Housekeeping CPUs
Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable kernel threads, interrupt handlers, and kernel timers are permitted to execute.

Additional resources
2.8. Using the TuneD cpu-partitioning profile for low-latency tuning
This procedure describes how to tune a system for low latency using the TuneD cpu-partitioning profile. It uses the example of a low-latency application that can use cpu-partitioning and the CPU layout as mentioned in the cpu-partitioning figure. The application in this case uses:
Prerequisites
Procedure
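A minimal sketch of the steps, assuming the CPU numbering from the cpu-partitioning figure; the echo commands simply append the variables described in the previous section:

# echo "isolated_cores=2-23" >> /etc/tuned/cpu-partitioning-variables.conf
# echo "no_balance_cores=2,3" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# reboot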
Additional resources
2.9. Customizing the cpu-partitioning TuneD profile
You can extend a TuneD profile to make additional tuning changes. For example, the cpu-partitioning profile sets the CPUs to use cstate=1. In order to use the cpu-partitioning profile but additionally change the CPU cstate from cstate1 to cstate0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile and then sets C state 0.

Procedure
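A minimal sketch of such a child profile; it assumes the force_latency=cstate.id:<id>|<fallback-latency> form accepted by the cpu plug-in, so adjust if your TuneD version differs:

# mkdir /etc/tuned/my_profile
# cat /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized tuning on top of cpu-partitioning
include=cpu-partitioning

[cpu]
force_latency=cstate.id:0|1

# tuned-adm profile my_profile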
In this example, a reboot is not required. However, if the changes in the my_profile profile require a reboot to take effect, then reboot your machine.

Additional resources
2.10. Real-time TuneD profiles distributed with RHEL
Real-time profiles are intended for systems running the real-time kernel. Without a special kernel build, they do not configure the system to be real-time. On RHEL, the profiles are available from additional repositories. The following real-time profiles are available:

realtime
Use on bare-metal real-time systems. Provided by the tuned-profiles-realtime package, which is available from the RT or NFV repositories.

realtime-virtual-host
Use in a virtualization host configured for real-time. Provided by the tuned-profiles-nfv-host package, which is available from the NFV repository.

realtime-virtual-guest
Use in a virtualization guest configured for real-time. Provided by the tuned-profiles-nfv-guest package, which is available from the NFV repository.

2.11. Static and dynamic tuning in TuneD
This section explains the difference between the two categories of system tuning that TuneD applies: static and dynamic.

Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.

Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD adjusts system settings dynamically based on that monitoring information. For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. TuneD monitors the activity of these components and reacts to the changes in their use.

By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option. Currently implemented dynamic tuning algorithms try to balance performance and power saving, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the TuneD profiles.

Example 2.2. Static and dynamic tuning on a workstation
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded. For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in lower power usage. If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, TuneD detects this and sets the interface speed to maximum to offer the best performance while the activity level is high. This principle is used for other plug-ins for CPU and disks as well.

2.12. TuneD no-daemon mode
You can run TuneD in no-daemon mode, which does not require any resident memory. In this mode, TuneD applies the settings and exits. By default, no-daemon mode is disabled because a lot of TuneD functionality is missing in this mode, including:
To enable no-daemon mode, include the following line in the /etc/tuned/tuned-main.conf file:

daemon = 0

2.13. Installing and enabling TuneD
This procedure installs and enables the TuneD application, installs TuneD profiles, and presets a default TuneD profile for your system.

Procedure
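For example, a minimal sketch of the installation, assuming the tuned package from the default repositories:

# dnf install tuned
# systemctl enable --now tuned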
2.14. Listing available TuneD profiles
This procedure lists all TuneD profiles that are currently available on your system.

Procedure
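For example, to list all available profiles and show the currently active one:

# tuned-adm list
# tuned-adm active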
Additional resources
2.15. Setting a TuneD profile
This procedure activates a selected TuneD profile on your system.

Procedure
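For example, to activate the throughput-performance profile; any other available profile name works the same way:

# tuned-adm profile throughput-performance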
Verification steps
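For example, to confirm which profile is active and that the system settings still match it:

# tuned-adm active
# tuned-adm verify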
Additional resources
2.16. Disabling TuneD
This procedure disables TuneD and resets all affected system settings to their original state before TuneD modified them.

Procedure
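For example, to turn off all tuning temporarily, or to disable the service permanently:

# tuned-adm off
# systemctl disable --now tuned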
Additional resources
Chapter 3. Customizing TuneD profiles
You can create or modify TuneD profiles to optimize system performance for your intended use case.

3.1. TuneD profiles
A detailed analysis of a system can be very time-consuming. TuneD provides a number of predefined profiles for typical use cases. You can also create, modify, and delete profiles.

The profiles provided with TuneD are divided into the following categories:
The performance-boosting profiles include profiles that focus on the following aspects:
Syntax of profile configuration
The tuned.conf file can contain one [main] section and other sections for configuring plug-in instances. However, all sections are optional. Lines starting with the hash sign (#) are comments.

Additional resources
3.2. The default TuneD profile
During the installation, the best profile for your system is selected automatically. Currently, the default profile is selected according to the following customizable rules:
Additional resources
3.3. Merged TuneD profiles
As an experimental feature, it is possible to select more profiles at once. TuneD will try to merge them during the load. If there are conflicts, the settings from the last specified profile take precedence.

Example 3.1. Low power consumption in a virtual guest
The following example optimizes the system to run in a virtual machine for the best performance and concurrently tunes it for low power consumption, while the low power consumption is the priority:

# tuned-adm profile virtual-guest powersave

Merging is done automatically without checking whether the resulting combination of parameters makes sense. Consequently, the feature might tune some parameters the opposite way, which might be counterproductive: for example, setting the disk for high throughput by using the throughput-performance profile and concurrently setting the disk spindown to the low value by the spindown-disk profile.

Additional resources
* tuned-adm man page
* tuned.conf(5) man page

3.4. The location of TuneD profiles
TuneD stores profiles in the following directories:

/usr/lib/tuned/
Distribution-specific profiles are stored in this directory. Each profile has its own directory. The profile consists of the main configuration file called tuned.conf, and optionally other files, for example helper scripts.

/etc/tuned/
If you need to customize a profile, copy the profile directory into this directory, which is used for custom profiles. If there are two profiles of the same name, the custom profile located in /etc/tuned/ is used.

Additional resources
3.5. Inheritance between TuneD profiles
TuneD profiles can be based on other profiles and modify only certain aspects of their parent profile. The [main] section of TuneD profiles recognizes the include option:

[main]
include=parent

All settings from the parent profile are loaded in this child profile. In the following sections, the child profile can override certain settings inherited from the parent profile or add new settings not present in the parent profile.

You can create your own child profile in the /etc/tuned/ directory based on a pre-installed profile in /usr/lib/tuned/ with only some parameters adjusted. If the parent profile is updated, such as after a TuneD upgrade, the changes are reflected in the child profile.

Example 3.2. A power-saving profile based on balanced
The following is an example of a custom profile that extends the balanced profile and sets Aggressive Link Power Management (ALPM) for all devices to the maximum powersaving.

[main]
include=balanced

[scsi_host]
alpm=min_power

Additional resources
3.6. Static and dynamic tuning in TuneD
This section explains the difference between the two categories of system tuning that TuneD applies: static and dynamic.

Static tuning
Mainly consists of the application of predefined sysctl and sysfs settings and one-shot activation of several configuration tools such as ethtool.

Dynamic tuning
Watches how various system components are used throughout the uptime of your system. TuneD adjusts system settings dynamically based on that monitoring information. For example, the hard drive is used heavily during startup and login, but is barely used later when the user might mainly work with applications such as web browsers or email clients. Similarly, the CPU and network devices are used differently at different times. TuneD monitors the activity of these components and reacts to the changes in their use.

By default, dynamic tuning is disabled. To enable it, edit the /etc/tuned/tuned-main.conf file and change the dynamic_tuning option to 1. TuneD then periodically analyzes system statistics and uses them to update your system tuning settings. To configure the time interval in seconds between these updates, use the update_interval option. Currently implemented dynamic tuning algorithms try to balance performance and power saving, and are therefore disabled in the performance profiles. Dynamic tuning for individual plug-ins can be enabled or disabled in the TuneD profiles.

Example 3.3. Static and dynamic tuning on a workstation
On a typical office workstation, the Ethernet network interface is inactive most of the time. Only a few emails go in and out or some web pages might be loaded. For those kinds of loads, the network interface does not have to run at full speed all the time, as it does by default. TuneD has a monitoring and tuning plug-in for network devices that can detect this low activity and then automatically lower the speed of that interface, typically resulting in lower power usage. If the activity on the interface increases for a longer period of time, for example because a DVD image is being downloaded or an email with a large attachment is opened, TuneD detects this and sets the interface speed to maximum to offer the best performance while the activity level is high. This principle is used for other plug-ins for CPU and disks as well.

3.7. TuneD plug-ins
Plug-ins are modules in TuneD profiles that TuneD uses to monitor or optimize different devices on the system. TuneD uses two types of plug-ins:

Monitoring plug-ins
Monitoring plug-ins are used to get information from a running system. The output of the monitoring plug-ins can be used by tuning plug-ins for dynamic tuning. Monitoring plug-ins are automatically instantiated whenever their metrics are needed by any of the enabled tuning plug-ins. If two tuning plug-ins require the same data, only one instance of the monitoring plug-in is created and the data is shared.

Tuning plug-ins
Each tuning plug-in tunes an individual subsystem and takes several parameters that are populated from the TuneD profiles. Each subsystem can have multiple devices, such as multiple CPUs or network cards, that are handled by individual instances of the tuning plug-ins. Specific settings for individual devices are also supported.

Syntax for plug-ins in TuneD profiles
Sections describing plug-in instances are formatted in the following way:

[NAME]
type=TYPE
devices=DEVICES

NAME is the name of the plug-in instance as it is used in the logs. It can be an arbitrary string. TYPE is the type of the tuning plug-in.
DEVICES is the list of devices that this plug-in instance handles. The devices line can contain a list, a wildcard (*), and negation (!). If there is no devices line, all devices of the TYPE that are present or later attached on the system are handled by the plug-in instance. This is the same as using the devices=* option.

Example 3.4. Matching block devices with a plug-in
The following example matches all block devices starting with sd, such as sda or sdb, and does not disable barriers on them:

[data_disk]
type=disk
devices=sd*
disable_barriers=false

The following example matches all block devices except sda1 and sda2:

[data_disk]
type=disk
devices=!sda1, !sda2
disable_barriers=false

If no instance of a plug-in is specified, the plug-in is not enabled. If the plug-in supports more options, they can also be specified in the plug-in section. If an option is not specified and it was not previously specified in the included plug-in, the default value is used.

Short plug-in syntax
If you do not need custom names for the plug-in instance and there is only one definition of the instance in your configuration file, TuneD supports the following short syntax:

[TYPE]
devices=DEVICES

In this case, it is possible to omit the type line. The instance is then referred to with a name that is the same as the type. The previous example could then be rewritten into:

Example 3.5. Matching block devices using the short syntax
[disk]
devices=sdb*
disable_barriers=false

Conflicting plug-in definitions in a profile
If the same section is specified more than once using the include option, the settings are merged. If they cannot be merged due to a conflict, the last conflicting definition overrides the previous settings. If you do not know what was previously defined, you can use the replace Boolean option and set it to true. This causes all the previous definitions with the same name to be overwritten and the merge does not happen.

You can also disable the plug-in by specifying the enabled=false option. This has the same effect as if the instance was never defined. Disabling the plug-in is useful if you are redefining the previous definition from the include option and do not want the plug-in to be active in your custom profile.

NOTE
TuneD includes the ability to run any shell command as part of enabling or disabling a tuning profile. This enables you to extend TuneD profiles with functionality that has not been integrated into TuneD yet. You can specify arbitrary shell commands using the script plug-in.

Additional resources
3.8. Available TuneD plug-ins
This section lists all monitoring and tuning plug-ins currently available in TuneD.

Monitoring plug-ins
Currently, the following monitoring plug-ins are implemented:

disk
Gets disk load (number of IO operations) per device and measurement interval.

net
Gets network load (number of transferred packets) per network card and measurement interval.

load
Gets CPU load per CPU and measurement interval.

Tuning plug-ins
Currently, the following tuning plug-ins are implemented. Only some of these plug-ins implement dynamic tuning. Options supported by the plug-ins are also listed:

cpu
Sets the CPU governor to the value specified by the governor option and dynamically changes the Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency according to the CPU load. If the CPU load is lower than the value specified by the load_threshold option, the latency is set to the value specified by the latency_high option, otherwise it is set to the value specified by latency_low. You can also force the latency to a specific value and prevent it from dynamically changing further. To do so, set the force_latency option to the required latency value.

eeepc_she
Dynamically sets the front-side bus (FSB) speed according to the CPU load. This feature can be found on some netbooks and is also known as the ASUS Super Hybrid Engine (SHE). If the CPU load is lower than or equal to the value specified by the load_threshold_powersave option, the plug-in sets the FSB speed to the value specified by the she_powersave option. If the CPU load is higher than or equal to the value specified by the load_threshold_normal option, it sets the FSB speed to the value specified by the she_normal option. Static tuning is not supported and the plug-in is transparently disabled if TuneD does not detect the hardware support for this feature.

net
Configures the Wake-on-LAN functionality to the values specified by the wake_on_lan option. It uses the same syntax as the ethtool utility. It also dynamically changes the interface speed according to the interface utilization.

sysctl
Sets various sysctl settings specified by the plug-in options. The syntax is name=value, where name is the same as the name provided by the sysctl utility. Use the sysctl plug-in if you need to change system settings that are not covered by other plug-ins available in TuneD. If the settings are covered by some specific plug-ins, prefer these plug-ins.

usb
Sets the autosuspend timeout of USB devices to the value specified by the autosuspend parameter. The value 0 means that autosuspend is disabled.

vm
Enables or disables transparent huge pages depending on the value of the transparent_hugepages option. Valid values of the transparent_hugepages option are:
audio
Sets the autosuspend timeout for audio codecs to the value specified by the timeout option. Currently, the snd_hda_intel and snd_ac97_codec codecs are supported. The value 0 means that the autosuspend is disabled. You can also enforce the controller reset by setting the Boolean option reset_controller to true.

disk
Sets the disk elevator to the value specified by the elevator option. It also sets:
In addition, this plug-in dynamically changes the advanced power management and spindown timeout setting for the drive according to the current drive utilization. The dynamic tuning can be controlled by the Boolean option dynamic and is enabled by default.

scsi_host
Tunes options for SCSI hosts. It sets Aggressive Link Power Management (ALPM) to the value specified by the alpm option.

mounts
Enables or disables barriers for mounts according to the Boolean value of the disable_barriers option.

script
Executes an external script or binary when the profile is loaded or unloaded. You can choose an arbitrary executable. The script plug-in is provided mainly for compatibility with earlier releases. Prefer other TuneD plug-ins if they cover the required functionality. TuneD calls the executable with one of the following arguments:
You need to correctly implement the stop action in your executable and revert all settings that you changed during the start action. Otherwise, the roll-back step after changing your TuneD profile will not work.

Bash scripts can import the /usr/lib/tuned/functions Bash library and use the functions defined there. Use these functions only for functionality that is not natively provided by TuneD. If a function name starts with an underscore, such as _wifi_set_power_level, consider the function private and do not use it in your scripts, because it might change in the future.

Specify the path to the executable using the script parameter in the plug-in configuration.

Example 3.6. Running a Bash script from a profile
To run a Bash script named script.sh that is located in the profile directory, use:

[script]
script=${i:PROFILE_DIR}/script.sh

sysfs
Sets various sysfs settings specified by the plug-in options. The syntax is name=value, where name is the sysfs path to use. Use this plug-in if you need to change some settings that are not covered by other plug-ins. Prefer specific plug-ins if they cover the required settings.

video
Sets various powersave levels on video cards. Currently, only Radeon cards are supported. The powersave level can be specified by using the radeon_powersave option. Supported values are:
For details, see www.x.org. Note that this plug-in is experimental and the option might change in future releases.

bootloader
Adds options to the kernel command line. This plug-in supports only the GRUB 2 boot loader. A customized, non-standard location of the GRUB 2 configuration file can be specified by the grub2_cfg_file option. The kernel options are added to the current GRUB configuration and its templates. The system needs to be rebooted for the kernel options to take effect. Switching to another profile or manually stopping the tuned service removes the additional options. If you shut down or reboot the system, the kernel options persist in the grub.cfg file. The kernel options can be specified by the following syntax:

cmdline=arg1 arg2 ... argN

Example 3.7. Modifying the kernel command line
For example, to add the quiet kernel option to a TuneD profile, include the following lines in the tuned.conf file:

[bootloader]
cmdline=quiet

The following is an example of a custom profile that adds the isolcpus=2 option to the kernel command line:

[bootloader]
cmdline=isolcpus=2

3.9. Variables in TuneD profiles
Variables expand at run time when a TuneD profile is activated. Using TuneD variables reduces the amount of necessary typing in TuneD profiles. There are no predefined variables in TuneD profiles. You can define your own variables by creating the [variables] section in a profile and using the following syntax:

[variables]
variable_name=value

To expand the value of a variable in a profile, use the following syntax:

${variable_name}

Example 3.8. Isolating CPU cores using variables
In the following example, the ${isolated_cores} variable expands to 1,2; hence the kernel boots with the isolcpus=1,2 option:

[variables]
isolated_cores=1,2

[bootloader]
cmdline=isolcpus=${isolated_cores}

The variables can be specified in a separate file. For example, you can add the following lines to tuned.conf:

[variables]
include=/etc/tuned/my-variables.conf

[bootloader]
cmdline=isolcpus=${isolated_cores}

If you add the isolated_cores=1,2 option to the /etc/tuned/my-variables.conf file, the kernel boots with the isolcpus=1,2 option.

Additional resources
3.10. Built-in functions in TuneD profiles
Built-in functions expand at run time when a TuneD profile is activated. You can:
To call a function, use the following syntax:

${f:function_name:argument_1:argument_2}

To expand the directory path where the profile and the tuned.conf file are located, use the PROFILE_DIR function, which requires special syntax:

${i:PROFILE_DIR}

Example 3.9. Isolating CPU cores using variables and built-in functions
In the following example, the ${non_isolated_cores} variable expands to 0,3-5, and the cpulist_invert built-in function is called with the 0,3-5 argument:

[variables]
non_isolated_cores=0,3-5

[bootloader]
cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}

The cpulist_invert function inverts the list of CPUs. For a 6-CPU machine, the inversion is 1,2, and the kernel boots with the isolcpus=1,2 command-line option.

Additional resources
3.11. Built-in functions available in TuneD profiles
The following built-in functions are available in all TuneD profiles:

PROFILE_DIR
Returns the directory path where the profile and the tuned.conf file are located.

exec
Executes a process and returns its output.

assertion
Compares two arguments. If they do not match, the function logs text from the first argument and aborts profile loading.

assertion_non_equal
Compares two arguments. If they match, the function logs text from the first argument and aborts profile loading.

kb2s
Converts kilobytes to disk sectors.

s2kb
Converts disk sectors to kilobytes.

strip
Creates a string from all passed arguments and deletes both leading and trailing white space.

virt_check
Checks whether TuneD is running inside a virtual machine (VM) or on bare metal:
3.12. Creating new TuneD profiles
This procedure creates a new TuneD profile with custom performance rules.

Procedure
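A minimal sketch of creating and activating a new profile; the profile name, summary text, and sysctl setting are illustrative assumptions:

# mkdir /etc/tuned/my-profile
# cat /etc/tuned/my-profile/tuned.conf
[main]
summary=My custom profile

[sysctl]
vm.swappiness=10

# tuned-adm profile my-profile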
Additional resources
3.13. Modifying existing TuneD profiles
This procedure creates a modified child profile based on an existing TuneD profile.

Procedure
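A minimal sketch, assuming you want a child of the throughput-performance profile that only overrides the swappiness value:

# mkdir /etc/tuned/modified-profile
# cat /etc/tuned/modified-profile/tuned.conf
[main]
include=throughput-performance

[sysctl]
vm.swappiness=5

# tuned-adm profile modified-profile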
Additional resources
3.14. Setting the disk scheduler using TuneD
This procedure creates and enables a TuneD profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots.

In the following commands and configuration, replace:
Procedure
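A minimal sketch of such a profile, assuming a block device named sda and the bfq scheduler; adjust both to your system:

# mkdir /etc/tuned/my-disk-profile
# cat /etc/tuned/my-disk-profile/tuned.conf
[main]
include=throughput-performance

[disk]
devices=sda
elevator=bfq

# tuned-adm profile my-disk-profile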
Verification steps
Chapter 4. Reviewing a system using tuna interface
Use the tuna tool to adjust scheduler tunables, tune thread priorities and IRQ handlers, and isolate CPU cores and sockets. Tuna reduces the complexity of performing tuning tasks.

The tuna tool performs the following operations:
4.1. Installing tuna tool
The tuna tool is designed to be used on a running system. This allows application-specific measurement tools to see and analyze system performance immediately after changes have been made. This procedure describes how to install the tuna tool.

Procedure
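For example:

# dnf install tuna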
Verification steps
Additional resources
4.2. Viewing the system status using tuna tool
This procedure describes how to view the system status using the tuna command-line interface (CLI) tool.

Procedure
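A short sketch, assuming the --show_threads and --show_irqs options of the tuna CLI:

# tuna --show_threads
# tuna --show_irqs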
Additional resources
4.3. Tuning CPUs using tuna tool
The tuna tool commands can target individual CPUs. Using the tuna tool, you can:

Isolate CPUs
All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes it unavailable by removing it from the affinity mask of all threads.

Include CPUs
Allows tasks to run on the specified CPU.

Restore CPUs
Restores the specified CPU to its previous configuration.

This procedure describes how to tune CPUs using the tuna CLI.

Procedure
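A short sketch, assuming CPUs 0 and 1 and the --isolate and --include options of the tuna CLI:

# tuna --cpus=0,1 --isolate
# tuna --cpus=0,1 --include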
Verification steps
Additional resources
4.4. Tuning IRQs using tuna tool
The /proc/interrupts file records the number of interrupts per IRQ, the type of interrupt, and the name of the device that is located at that IRQ. This procedure describes how to tune the IRQs using the tuna tool.

Procedure
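A short sketch, assuming IRQ 128 as the interrupt to move, CPU 3 as the target, and the --irqs, --cpus, and --move options of the tuna CLI:

# tuna --irqs=128 --show_irqs
# tuna --irqs=128 --cpus=3 --move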
Verification steps
Additional resources
Chapter 5. Monitoring performance using RHEL System Roles
As a system administrator, you can use the Metrics RHEL System Role with any Ansible Automation Platform control node to monitor the performance of a system.

5.1. Introduction to RHEL System Roles
RHEL System Roles is a collection of Ansible roles and modules. RHEL System Roles provide a configuration interface to remotely manage multiple RHEL systems. The interface enables managing system configurations across multiple versions of RHEL, as well as adopting new major releases. On Red Hat Enterprise Linux 8, the interface currently consists of the following roles:
All these roles are provided by the rhel-system-roles package available in the AppStream repository.

5.2. RHEL System Roles terminology
You can find the following terms across this documentation:

Ansible playbook
Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.

Control node
Any machine with Ansible installed. You can run commands and playbooks, invoking /usr/bin/ansible or /usr/bin/ansible-playbook, from any control node. You can use any computer that has Python installed on it as a control node - laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.

Inventory
A list of managed nodes. An inventory file is also sometimes called a “hostfile”. Your inventory can specify information like the IP address for each managed node. An inventory can also organize managed nodes, creating and nesting groups for easier scaling. To learn more about inventory, see the Working with Inventory section.

Managed nodes
The network devices, servers, or both that you manage with Ansible. Managed nodes are also sometimes called “hosts”. Ansible is not installed on managed nodes.

5.3. Installing RHEL System Roles in your system
To use the RHEL System Roles, install the required packages in your system.

Prerequisites
Procedure
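For example, on the control node:

# dnf install rhel-system-roles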
The Ansible Core package provides the ansible-playbook CLI, the Ansible Vault functionality, and the basic modules and filters required by RHEL Ansible content. As a result, you are able to create an Ansible playbook.

5.4. Applying a role
The following procedure describes how to apply a particular role.

Prerequisites
Procedure
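A minimal sketch of a playbook that applies the Metrics role; the host group name webservers, the playbook file name, and the inventory file name are illustrative assumptions:

# cat web.yml
---
- hosts: webservers
  roles:
    - rhel-system-roles.metrics

# ansible-playbook -i inventory.file web.yml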
5.5. Introduction to the Metrics System Role
RHEL System Roles is a collection of Ansible roles and modules that provide a consistent configuration interface to remotely manage multiple RHEL systems. The Metrics System Role configures performance analysis services for the local system and, optionally, includes a list of remote systems to be monitored by the local system. The Metrics System Role enables you to use pcp to monitor your system’s performance without having to configure pcp separately, as the set-up and deployment of pcp is handled by the playbook.

Table 5.1. Metrics system role variables
For details about the parameters used in metrics_connections and additional information about the Metrics System Role, see the /usr/share/ansible/roles/rhel-system-roles.metrics/README.md file.

5.6. Using the Metrics System Role to monitor your local system with visualization
This procedure describes how to use the Metrics RHEL System Role to monitor your local system while simultaneously provisioning data visualization via Grafana.

Prerequisites
Procedure
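A minimal sketch of such a playbook; the metrics_graph_service variable is one of the role variables documented in the README.md mentioned above, and the file name is an assumption:

# cat local-visualization.yml
---
- name: Manage metrics
  hosts: localhost
  vars:
    metrics_graph_service: yes
  roles:
    - rhel-system-roles.metrics

# ansible-playbook local-visualization.yml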
5.7. Using the Metrics System Role to set up a fleet of individual systems to monitor themselves
This procedure describes how to use the Metrics System Role to set up a fleet of machines to monitor themselves.

Prerequisites
Procedure
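A minimal sketch, assuming the managed hosts are listed in your inventory and that SSH password authentication is used, hence the -k option explained below:

# cat fleet.yml
---
- name: Configure a fleet of machines to monitor themselves
  hosts: all
  roles:
    - rhel-system-roles.metrics

# ansible-playbook -k -i inventory.file fleet.yml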
Here, -k prompts for a password to connect to the remote system.

5.8. Using the Metrics System Role to monitor a fleet of machines centrally via your local machine
This procedure describes how to use the Metrics System Role to set up your local machine to centrally monitor a fleet of machines while also provisioning visualization of the data via Grafana and querying of the data via Redis.

Prerequisites
Procedure
5.9. Setting up authentication while monitoring a system using the Metrics System Role
PCP supports the scram-sha-256 authentication mechanism through the Simple Authentication Security Layer (SASL) framework. The Metrics RHEL System Role automates the steps to set up authentication using the scram-sha-256 authentication mechanism. This procedure describes how to set up authentication using the Metrics RHEL System Role.

Prerequisites
Procedure
Verification steps
5.10. Using the Metrics System Role to configure and enable metrics collection for SQL Server
This procedure describes how to use the Metrics RHEL System Role to automate the configuration and enabling of metrics collection for Microsoft SQL Server via pcp on your local system.

Prerequisites
Procedure
Verification steps
Chapter 6. Setting up PCP
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements. This section describes how to install and enable PCP on your system.

6.1. Overview of PCP
You can add performance metrics using Python, Perl, C++, and C interfaces. Analysis tools can use the Python, C++, and C client APIs directly, and rich web applications can explore all available performance data using a JSON interface. You can analyze data patterns by comparing live results with archived data.

Features of PCP:
PCP has the following components:
Additional resources
6.2. Installing and enabling PCP
To begin using PCP, install all the required packages and enable the PCP monitoring services. This procedure describes how to install PCP using the pcp package. If you want to automate the PCP installation, install it using the pcp-zeroconf package. For more information on installing PCP by using pcp-zeroconf, see Setting up PCP with pcp-zeroconf.

Procedure
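For example:

# dnf install pcp
# systemctl enable --now pmcd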
Verification steps
6.3. Deploying a minimal PCP setup
The minimal PCP setup collects performance statistics on Red Hat Enterprise Linux. The setup involves adding the minimum number of packages on a production system needed to gather data for further analysis. You can analyze the resulting tar.gz file and the archive of the pmlogger output using various PCP tools and compare them with other sources of performance information.

Procedure
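A minimal sketch of the steps; the run time, archive name, and output file name are illustrative assumptions:

# dnf install pcp
# systemctl start pmcd
# pmlogger -c config.default -T 10minutes /tmp/minimal-archive
# tar czf pcp-data.tar.gz /tmp/minimal-archive*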
6.4. System services distributed with PCP
The following table describes roles of various system services, which are distributed with PCP.

Table 6.1. Roles of system services distributed with PCP
6.5. Tools distributed with PCP
The following table describes usage of various tools, which are distributed with PCP.

Table 6.2. Usage of tools distributed with PCP
6.6. PCP deployment architectures
Performance Co-Pilot (PCP) offers many options to accomplish advanced setups. From the huge variety of possible architectures, this section describes how to scale your PCP deployment based on the recommended deployment set up by Red Hat, sizing factors, and configuration options. PCP supports multiple deployment architectures, based on the scale of the PCP deployment.

Available scaling deployment setup variants:

Since PCP version 5.3.0 is unavailable in Red Hat Enterprise Linux 8.4 and the prior minor versions of Red Hat Enterprise Linux 8, Red Hat recommends the localhost and pmlogger farm architectures. For more information about known memory leaks in pmproxy in PCP versions before 5.3.0, see Memory leaks in pmproxy in PCP.

Localhost
Each service runs locally on the monitored machine. When you start a service without any configuration changes, this is the default deployment. Scaling beyond the individual node is not possible in this case. By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed Redis cluster from a cloud vendor.

Decentralized
The only difference between the localhost and decentralized setup is the centralized Redis service. In this model, each monitored host runs the pmlogger service and retrieves metrics from a local pmcd instance. A local pmproxy service then exports the performance metrics to a central Redis instance.

Figure 6.1. Decentralized logging

Centralized logging - pmlogger farm
When the resource usage on the monitored hosts is constrained, another deployment option is a pmlogger farm, which is also known as centralized logging. In this setup, a single logger host executes multiple pmlogger processes, and each is configured to retrieve performance metrics from a different remote pmcd host. The centralized logger host is also configured to execute the pmproxy service, which discovers the resulting PCP archive logs and loads the metric data into a Redis instance.

Figure 6.2. Centralized logging - pmlogger farm

Federated - multiple pmlogger farms
For large scale deployments, Red Hat recommends deploying multiple pmlogger farms in a federated fashion. For example, one pmlogger farm per rack or data center. Each pmlogger farm loads the metrics into a central Redis instance.

Figure 6.3. Federated - multiple pmlogger farms

By default, the deployment setup for Redis is standalone, localhost. However, Redis can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another viable option is to deploy a Redis cluster in the cloud, or to utilize a managed Redis cluster from a cloud vendor.

6.7. Recommended deployment architecture
The following table describes the recommended deployment architectures based on the number of monitored hosts.

Table 6.3. Recommended deployment architecture
6.8. Sizing factors
The following are the sizing factors required for scaling:

Remote system size
The number of CPUs, disks, network interfaces, and other hardware resources affects the amount of data collected by each pmlogger on the centralized logging host.

Logged metrics
The number and types of logged metrics play an important role. In particular, the per-process proc.* metrics require a large amount of disk space; for example, with the standard pcp-zeroconf setup and a 10s logging interval, the archives are 11 MB without proc metrics versus 155 MB with proc metrics - a factor of 10 times more. Additionally, the number of instances for each metric, for example the number of CPUs, block devices, and network interfaces, also impacts the required storage capacity.

Logging interval
How often metrics are logged affects the storage requirements. The expected daily PCP archive file sizes are written to the pmlogger.log file for each pmlogger instance. These values are uncompressed estimates. Since PCP archives compress very well, approximately 10:1, the actual long-term disk space requirements can be determined for a particular site.

pmlogrewrite
After every PCP upgrade, the pmlogrewrite tool is executed and rewrites old archives if there were changes in the metric metadata between the previous version and the new version of PCP. The duration of this process scales linearly with the number of archives stored.

Additional resources
6.9. Configuration options for PCP scaling
The following configuration options are required for scaling:

sysctl and rlimit settings
When archive discovery is enabled, pmproxy requires four file descriptors for every pmlogger that it is monitoring or log-tailing, along with additional file descriptors for the service logs and pmproxy client sockets, if any. Each pmlogger process uses about 20 file descriptors for the remote pmcd socket, archive files, service logs, and others. In total, this can exceed the default 1024 soft limit on a system running around 200 pmlogger processes. The pmproxy service in pcp-5.3.0 and later automatically increases the soft limit to the hard limit. On earlier versions of PCP, tuning is required if a high number of pmlogger processes are to be deployed, and this can be accomplished by increasing the soft or hard limits for pmlogger. For more information, see How to set limits (ulimit) for services run by systemd.

Local archives
The pmlogger service stores metrics of local and remote pmcds in the /var/log/pcp/pmlogger/ directory. To control the logging interval of the local system, update the /etc/pcp/pmlogger/control.d/configfile file and add -t X in the arguments, where X is the logging interval in seconds. To configure which metrics should be logged, execute pmlogconf /var/lib/pcp/config/pmlogger/config.clienthostname. This command deploys a configuration file with a default set of metrics, which can optionally be further customized. To specify retention settings, that is, when to purge old PCP archives, update the /etc/sysconfig/pmlogger_timers file and specify PMLOGGER_DAILY_PARAMS="-E -k X", where X is the number of days to keep PCP archives.

Redis
The pmproxy service sends logged metrics from pmlogger to a Redis instance. The following two options are available to specify the retention settings in the /etc/pcp/pmproxy/pmproxy.conf configuration file:
Additional resources
6.10. Example: Analyzing the centralized logging deployment
The following results were gathered on a centralized logging setup, also known as a pmlogger farm deployment, with a default pcp-zeroconf 5.3.0 installation, where each remote host is an identical container instance running pmcd on a server with 64 CPU cores, 376 GB RAM, and one disk attached. The logging interval is 10s, proc metrics of remote nodes are not included, and the memory values refer to the Resident Set Size (RSS) value.

Table 6.4. Detailed utilization statistics for 10s logging interval
Table 6.5. Used resources depending on monitored hosts for 60s logging interval
The pmproxy service queues Redis requests and employs Redis pipelining to speed up Redis queries. This can result in high memory usage. For troubleshooting this issue, see Troubleshooting high memory usage.

6.11. Example: Analyzing the federated setup deployment
The following results were observed on a federated setup, also known as multiple pmlogger farms, consisting of three centralized logging (pmlogger farm) setups, where each pmlogger farm was monitoring 100 remote hosts, that is 300 hosts in total. This setup of the pmlogger farms is identical to the configuration mentioned in Example: Analyzing the centralized logging deployment for the 60s logging interval, except that the Redis servers were operating in cluster mode.

Table 6.6. Used resources depending on federated hosts for 60s logging interval
Here, all values are per host. The network bandwidth is higher due to the inter-node communication of the Redis cluster.

6.12. Troubleshooting high memory usage
The following scenarios can result in high memory usage:
The pmproxy service daemon uses Redis streams and supports configuration parameters, which are PCP tuning parameters that affect Redis memory usage and key retention. The /etc/pcp/pmproxy/pmproxy.conf file lists the available configuration options for pmproxy and the associated APIs. This section describes how to troubleshoot the high memory usage issue.

Prerequisites
Procedure
Chapter 7. Logging performance data with pmlogger
With the PCP tool, you can log performance metric values and replay them later. This allows you to perform a retrospective performance analysis.

Using the pmlogger tool, you can:
7.1. Modifying the pmlogger configuration file with pmlogconf
When the pmlogger service is running, PCP logs a default set of metrics on the host. Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does not exist, pmlogconf creates it with default metric values.

Procedure
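For example, to interactively review and update the default configuration; this sketch assumes the -r option, which reconsiders all configuration groups:

# pmlogconf -r /var/lib/pcp/config/pmlogger/config.default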
7.2. Editing the pmlogger configuration file manually
To create a tailored logging configuration with specific metrics and given intervals, edit the pmlogger configuration file manually. The default pmlogger configuration file is /var/lib/pcp/config/pmlogger/config.default. The configuration file specifies which metrics are logged by the primary logging instance.

In manual configuration, you can:
Procedure
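A minimal sketch of a specification that could be added to the configuration file; the metric names and the interval are illustrative assumptions:

log mandatory on every 5 seconds {
    kernel.all.load
    mem.util.free
}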
7.3. Enabling the pmlogger service
The pmlogger service must be started and enabled to log the metric values on the local machine. This procedure describes how to enable the pmlogger service.

Procedure
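For example:

# systemctl enable --now pmlogger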
Verification steps
7.4. Setting up a client system for metrics collection
This procedure describes how to set up a client system so that a central server can collect metrics from clients running PCP.

Procedure
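A minimal sketch of the client-side steps, assuming the default pmcd port 44321 and that pmcd is switched from local-only mode to listening on all interfaces with PMCD_LOCAL=0:

# dnf install pcp
# echo "PMCD_LOCAL=0" >> /etc/sysconfig/pmcd
# firewall-cmd --permanent --add-port=44321/tcp
# firewall-cmd --reload
# systemctl restart pmcd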
Verification steps
7.5. Setting up a central server to collect data
This procedure describes how to create a central server to collect metrics from clients running PCP.

Prerequisites
Procedure
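A minimal sketch of the server-side step, assuming a client named rhel8-client.example.com and the pmlogger control-file format; add a line like the following to a file in /etc/pcp/pmlogger/control.d/ and restart pmlogger:

rhel8-client.example.com n n PCP_ARCHIVE_DIR/rhel8-client -r -T24h10m -c config.rhel8-client

# systemctl restart pmlogger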
Verification steps
7.6. Replaying the PCP log archives with pmrep
After recording the metric data, you can replay the PCP log archives. To export the logs to text files and import them into spreadsheets, use PCP utilities such as pcp2csv, pcp2xml, pmrep or pmlogsummary.

Using the pmrep tool, you can:
Prerequisites
Procedure
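A minimal sketch, assuming an archive under /var/log/pcp/pmlogger/ and the disk.dev.write metric; the archive path, the 5-second interval (-t), and the 10 samples (-s) are illustrative assumptions:

# pmrep -a /var/log/pcp/pmlogger/rhel8-server/20230101 -t 5 -s 10 disk.dev.write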
Chapter 8. Monitoring performance with Performance Co-Pilot
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements. As a system administrator, you can monitor the system’s performance using the PCP application in Red Hat Enterprise Linux 8.

8.1. Monitoring postfix with pmda-postfix
This procedure describes how to monitor performance metrics of the postfix mail server with pmda-postfix. It helps to check how many emails are received per second.

Prerequisites
Procedure
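A hedged sketch of the usual PMDA installation pattern, assuming the pcp-pmda-postfix package name:
# yum install postfix pcp-pmda-postfix
# cd /var/lib/pcp/pmdas/postfix/
# ./Install
# pminfo postfix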
Verification steps
8.2. Visually tracing PCP log archives with the PCP Charts applicationAfter recording metric data, you can replay the PCP log archives as graphs. The metrics can be sourced from one or more live hosts, or alternatively from PCP log archives as a source of historical data. To customize the PCP Charts application interface to display the data from the performance metrics, you can use line plots, bar graphs, or utilization graphs. Using the PCP Charts application, you can:
Prerequisites
Procedure
8.3. Collecting data from SQL server using PCPWith Red Hat Enterprise Linux 8.2 or later, the SQL Server agent is available in Performance Co-Pilot (PCP), which helps you to monitor and analyze database performance issues. This procedure describes how to collect data for Microsoft SQL Server via pcp on your system. Prerequisites
Procedure
Verification steps
Chapter 9. Performance analysis of XFS with PCPThe XFS PMDA ships as part of the pcp package and is enabled by default during the installation. It is used to gather performance metric data of XFS file systems in Performance Co-Pilot (PCP). This section describes how to analyze XFS file system’s performance using PCP. 9.1. Installing XFS PMDA manuallyIf the XFS PMDA is not listed in the pcp configuration output, install the PMDA agent manually. This procedure describes how to manually install the PMDA agent. Procedure
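A minimal sketch of the standard PMDA installation steps:
# cd /var/lib/pcp/pmdas/xfs/
# ./Install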
Verification steps
9.2. Examining XFS performance metrics with pminfoPCP enables XFS PMDA to allow the reporting of certain XFS metrics for each of the mounted XFS file systems. This makes it easier to pinpoint specific mounted file system issues and evaluate performance. The pminfo command provides per-device XFS metrics for each mounted XFS file system. This procedure displays a list of all available metrics provided by the XFS PMDA. Procedure
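For example; the individual metric names shown with -t and -f are illustrative:
$ pminfo xfs
$ pminfo -t xfs.write_bytes
$ pminfo -f xfs.perdev.write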
9.3. Resetting XFS performance metrics with pmstoreWith PCP, you can modify the values of certain metrics, especially if the metric acts as a control variable, such as the xfs.control.reset metric. To modify a metric value, use the pmstore tool. This procedure describes how to reset XFS metrics using the pmstore tool. Procedure
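For example, a minimal sketch that checks a counter and then resets the XFS metrics:
# pminfo -f xfs.write
# pmstore xfs.control.reset 1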
Verification steps
9.4. PCP metric groups for XFSThe following table describes the available PCP metric groups for XFS. Table 9.1. Metric groups for XFS
9.5. Per-device PCP metric groups for XFSThe following table describes the available per-device PCP metric group for XFS. Table 9.2. Per-device PCP metric groups for XFS
Chapter 10. Setting up graphical representation of PCP metricsA combination of pcp, grafana, pcp redis, pcp bpftrace, and pcp vector provides graphs based on live data or on data collected by Performance Co-Pilot (PCP). This section describes how to set up and access the graphical representation of PCP metrics. 10.1. Setting up PCP with pcp-zeroconfThis procedure describes how to set up PCP on a system with the pcp-zeroconf package. Once the pcp-zeroconf package is installed, the system records the default set of metrics into archived files. Procedure
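A minimal sketch:
# yum install pcp-zeroconf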
Verification steps
10.2. Setting up a grafana-serverGrafana generates graphs that are accessible from a browser. The grafana-server is a back-end server for the Grafana dashboard. It listens, by default, on all interfaces, and provides web services accessed through the web browser. The grafana-pcp plugin interacts with the pmproxy protocol in the backend. This procedure describes how to set up a grafana-server. Procedure
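A hedged sketch of the typical steps; 3000/tcp is Grafana's default listening port and may need to be opened if clients connect remotely:
# yum install grafana grafana-pcp
# systemctl enable --now grafana-server
# firewall-cmd --permanent --add-port=3000/tcp
# firewall-cmd --reload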
Verification steps
Additional resources
10.3. Accessing the Grafana web UIThis procedure describes how to access the Grafana web interface. Using the Grafana web interface, you can:
Prerequisites
Procedure
Additional resources
10.4. Configuring PCP RedisThis section provides information about configuring the PCP Redis data source. Use the PCP Redis data source to:
Prerequisites
Procedure
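A hedged sketch of the supporting services that the data source relies on; the data source itself is then added in the Grafana web UI:
# yum install redis
# systemctl enable --now redis pmproxy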
Verification steps
Additional resources
10.5. Creating panels and alerts in the PCP Redis data sourceAfter adding the PCP Redis data source, you can view the dashboard with an overview of useful metrics, add a query to visualize the load graph, and create alerts that help you to view the system issues after they occur. Prerequisites
Procedure
10.6. Adding notification channels for alertsBy adding notification channels, you can receive an alert notification from Grafana whenever the alert rule conditions are met and the system needs further monitoring. You can receive these alerts after selecting any one type from the supported list of notifiers, which includes DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE, Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack, Telegram, Threema Gateway, VictorOps, and webhook. Prerequisites
Procedure
10.7. Setting up authentication between PCP componentsYou can set up authentication using the scram-sha-256 authentication mechanism, which is supported by PCP through the Simple Authentication Security Layer (SASL) framework. From Red Hat Enterprise Linux 8.3, PCP supports the scram-sha-256 authentication mechanism. Procedure
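A hedged sketch, assuming the cyrus-sasl-scram package and the /etc/sasl2/pmcd.conf configuration file; the user name metrics is a placeholder:
# yum install cyrus-sasl-scram
# vi /etc/sasl2/pmcd.conf
mech_list: scram-sha-256
# saslpasswd2 -a pmcd metrics
# systemctl restart pmcd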
Verification steps
10.8. Installing PCP bpftraceInstall the PCP bpftrace agent to introspect a system and to gather metrics from the kernel and user-space tracepoints. The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced Berkeley Packet Filter (eBPF). This procedure describes how to install PCP bpftrace. Prerequisites
Procedure
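A hedged sketch, assuming the pcp-pmda-bpftrace package name and the standard PMDA installation pattern:
# yum install pcp-pmda-bpftrace
# cd /var/lib/pcp/pmdas/bpftrace/
# ./Install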
Additional resources
10.9. Viewing the PCP bpftrace System Analysis dashboardUsing the PCP bpftrace data source, you can access live data from sources which are not available as normal data from the pmlogger or archives. In the PCP bpftrace data source, you can view the dashboard with an overview of useful metrics. Prerequisites
Procedure
10.10. Installing PCP VectorThis procedure describes how to install PCP Vector. Prerequisites
Procedure
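A hedged sketch, assuming the pcp-pmda-bcc package provides the metrics that the PCP Vector data source consumes:
# yum install pcp-pmda-bcc
# cd /var/lib/pcp/pmdas/bcc/
# ./Install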
Additional resources
10.11. Viewing the PCP Vector ChecklistThe PCP Vector data source displays live metrics and uses the pcp metrics. It analyzes data for individual hosts. After adding the PCP Vector data source, you can view the dashboard with an overview of useful metrics and view the related troubleshooting or reference links in the checklist. Prerequisites
Procedure
10.12. Troubleshooting Grafana issuesThis section describes how to troubleshoot Grafana issues, such as Grafana not displaying any data, the dashboard appearing black, or similar issues. Procedure
Chapter 11. Optimizing the system performance using the web consoleLearn how to set a performance profile in the RHEL web console to optimize the performance of the system for a selected task. 11.1. Performance tuning options in the web consoleRed Hat Enterprise Linux 8 provides several performance profiles that optimize the system for the following tasks:
The tuned service optimizes system options to match the selected profile. In the web console, you can set which performance profile your system uses. 11.2. Setting a performance profile in the web consoleThis procedure uses the web console to optimize the system performance for a selected task. Procedure
Verification steps
11.3. Monitoring performance using the web consoleRed Hat’s web console uses the Utilization Saturation and Errors (USE) Method for troubleshooting. The new performance metrics page has a historical view of your data organized chronologically with the newest data at the top. Here, you can view the events, errors, and graphical representation for resource utilization and saturation. Prerequisites
Procedure
Chapter 12. Setting the disk schedulerThe disk scheduler is responsible for ordering the I/O requests submitted to a storage device. You can configure the scheduler in several different ways:
In Red Hat Enterprise Linux 8, block devices support only multi-queue scheduling. This enables the block layer performance to scale well with fast solid-state drives (SSDs) and multi-core systems. The traditional, single-queue schedulers, which were available in Red Hat Enterprise Linux 7 and earlier versions, have been removed. 12.1. Available disk schedulersThe following multi-queue disk schedulers are supported in Red Hat Enterprise Linux 8: none Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block layer through a simple last-hit cache. mq-deadline Attempts to provide a guaranteed latency for requests from the point at which requests reach the scheduler. The mq-deadline scheduler sorts queued I/O requests into a read or write batch and then schedules them for execution in increasing logical block addressing (LBA) order. By default, read batches take precedence over write batches, because applications are more likely to block on read I/O operations. After mq-deadline processes a batch, it checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate. This scheduler is suitable for most use cases, but particularly those in which the write operations are mostly asynchronous. bfq Targets desktop systems and interactive tasks. The bfq scheduler ensures that a single application is never using all of the bandwidth. In effect, the storage device is always as responsive as if it were idle. In its default configuration, bfq focuses on delivering the lowest latency rather than achieving the maximum throughput. bfq is based on cfq code. It does not grant the disk to each process for a fixed time slice but assigns a budget measured in number of sectors to the process. This scheduler is suitable for copying large files, because the system does not become unresponsive in this case. kyber The scheduler tunes itself to achieve a latency goal by calculating the latencies of every I/O request submitted to the block I/O layer. You can configure the target latencies for read requests (in the case of cache misses) and synchronous write requests. This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices. 12.2. Different disk schedulers for different use casesDepending on the task that your system performs, the following disk schedulers are recommended as a baseline prior to any analysis and tuning tasks: Table 12.1. Disk schedulers for different use cases
12.3. The default disk schedulerBlock devices use the default disk scheduler unless you specify another scheduler. For Non-Volatile Memory Express (NVMe) block devices specifically, the default scheduler is none and Red Hat recommends not changing this. The kernel selects a default disk scheduler based on the type of device. The automatically selected scheduler is typically the optimal setting. If you require a different scheduler, Red Hat recommends using udev rules or the TuneD application to configure it. Match the selected devices and switch the scheduler only for those devices. 12.4. Determining the active disk schedulerThis procedure determines which disk scheduler is currently active on a given block device. Procedure
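For example, for a hypothetical device sdb; the active scheduler is shown in square brackets:
$ cat /sys/block/sdb/queue/scheduler
[mq-deadline] kyber bfq none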
12.5. Setting the disk scheduler using TuneDThis procedure creates and enables a TuneD profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots. In the following commands and configuration, replace:
Procedure
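A hedged sketch of such a profile, assuming a custom profile named my-profile, a device sdb, and the bfq scheduler; the disk plug-in's elevator option is used here, but verify the option names against the tuned.conf(5) man page:
# mkdir /etc/tuned/my-profile
# vi /etc/tuned/my-profile/tuned.conf
[main]
include=throughput-performance

[disk]
devices=sdb
elevator=bfq
# tuned-adm profile my-profile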
Verification steps
12.6. Setting the disk scheduler using udev rulesThis procedure sets a given disk scheduler for specific block devices using udev rules. The setting persists across system reboots. In the following commands and configuration, replace:
Procedure
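A hedged sketch of such a rule, using a hypothetical device sdb and the bfq scheduler:
# vi /etc/udev/rules.d/99-scheduler.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sdb", ATTR{queue/scheduler}="bfq"
# udevadm control --reload-rules
# udevadm trigger --type=devices --action=change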
Verification steps
12.7. Temporarily setting a scheduler for a specific diskThis procedure sets a given disk scheduler for specific block devices. The setting does not persist across system reboots. Procedure
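For example, for a hypothetical device sdb:
# echo bfq > /sys/block/sdb/queue/scheduler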
Verification steps
Chapter 13. Tuning the performance of a Samba serverThis chapter describes what settings can improve the performance of Samba in certain situations, and which settings can have a negative performance impact. Parts of this section were adapted from the Performance Tuning documentation published in the Samba Wiki. License: CC BY 4.0. Authors and contributors: See the history tab on the Wiki page. 13.1. Setting the SMB protocol versionEach new SMB version adds features and improves the performance of the protocol. Recent Windows and Windows Server operating systems always support the latest protocol version. If Samba also uses the latest protocol version, Windows clients connecting to Samba benefit from the performance improvements. In Samba, the default value of the server max protocol is set to the latest supported stable SMB protocol version. To always have the latest stable SMB protocol version enabled, do not set the server max protocol parameter. If you set the parameter manually, you will need to modify the setting with each new version of the SMB protocol, to have the latest protocol version enabled. The following procedure explains how to use the default value in the server max protocol parameter. Procedure
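A hedged sketch: check whether the parameter is set, remove it from the [global] section if it is, and reload the configuration:
# grep -i "server max protocol" /etc/samba/smb.conf
# smbcontrol all reload-config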
13.2. Tuning shares with directories that contain a large number of filesLinux supports case-sensitive file names. For this reason, Samba needs to scan directories for uppercase and lowercase file names when searching or accessing a file. You can configure a share to create new files only in lowercase or uppercase, which improves the performance. Prerequisites
Procedure
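For illustration, share settings of this kind (the share name is a placeholder) force lowercase names for newly created files:
[share]
case sensitive = true
default case = lower
preserve case = no
short preserve case = no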
After you apply these settings, the names of all newly created files on this share use lowercase. Because of these settings, Samba no longer needs to scan the directory for uppercase and lowercase file names, which improves performance. 13.3. Settings that can have a negative performance impact
By default, the kernel in Red Hat Enterprise Linux is tuned for high network performance. For example, the kernel uses an auto-tuning mechanism for buffer sizes. Setting the socket options parameter in the /etc/samba/smb.conf file overrides these kernel settings. As a result, setting this parameter decreases the Samba network performance in most cases. To use the optimized settings from the Kernel, remove the socket options parameter from the [global] section in the /etc/samba/smb.conf. Chapter 14. Optimizing virtual machine performanceVirtual machines (VMs) always experience some degree of performance deterioration in comparison to the host. The following sections explain the reasons for this deterioration and provide instructions on how to minimize the performance impact of virtualization in RHEL 8, so that your hardware infrastructure resources can be used as efficiently as possible. 14.1. What influences virtual machine performanceVMs are run as user-space processes on the host. The hypervisor therefore needs to convert the host’s system resources so that the VMs can use them. As a consequence, a portion of the resources is consumed by the conversion, and the VM therefore cannot achieve the same performance efficiency as the host. The impact of virtualization on system performanceMore specific reasons for VM performance loss include:
The severity of the virtualization impact on the VM performance is influenced by a variety of factors, which include:
Reducing VM performance lossRHEL 8 provides a number of features you can use to reduce the negative performance effects of virtualization. Notably:
Tuning VM performance can have adverse effects on other virtualization functions. For example, it can make migrating the modified VM more difficult. 14.2. Optimizing virtual machine performance using tunedThe tuned utility is a tuning profile delivery mechanism that adapts RHEL for certain workload characteristics, such as requirements for CPU-intensive tasks or storage-network throughput responsiveness. It provides a number of tuning profiles that are pre-configured to enhance performance and reduce power consumption in a number of specific use cases. You can edit these profiles or create new profiles to create performance solutions tailored to your environment, including virtualized environments. To optimize RHEL 8 for virtualization, use the following profiles:
Procedure To enable a specific tuned profile:
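For example, on a virtualization host:
# tuned-adm profile virtual-host
# tuned-adm active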
14.3. Configuring virtual machine memoryTo improve the performance of a virtual machine (VM), you can assign additional host RAM to the VM. Similarly, you can decrease the amount of memory allocated to a VM so the host memory can be allocated to other VMs or tasks. To perform these actions, you can use the web console or the command-line interface. 14.3.1. Adding and removing virtual machine memory using the web consoleTo improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the web console to adjust amount of memory allocated to the VM. Prerequisites
Procedure
14.3.2. Adding and removing virtual machine memory using the command-line interfaceTo improve the performance of a virtual machine (VM) or to free up the host resources it is using, you can use the CLI to adjust the amount of memory allocated to the VM. Prerequisites
Procedure
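A hedged sketch, using a hypothetical VM named testguest; the sizes are given with explicit units and are examples only:
# virsh setmaxmem testguest 8192M --config
# virsh setmem testguest 4096M --live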
Verification
14.3.3. Additional resources
14.4. Optimizing virtual machine I/O performanceThe input and output (I/O) capabilities of a virtual machine (VM) can significantly limit the VM’s overall efficiency. To address this, you can optimize a VM’s I/O by configuring block I/O parameters. 14.4.1. Tuning block I/O in virtual machinesWhen multiple block devices are being used by one or more VMs, it might be important to adjust the I/O priority of specific virtual devices by modifying their I/O weights. Increasing the I/O weight of a device increases its priority for I/O bandwidth, and therefore provides it with more host resources. Similarly, reducing a device’s weight makes it consume less host resources. Each device’s weight value must be within the 100 to 1000 range. Alternatively, the value can be 0, which removes that device from per-device listings. Procedure To display and set a VM’s block I/O parameters:
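A hedged sketch for a hypothetical VM testguest and a host device /dev/sdb; the first command displays the current settings, the second sets a weight of 500:
# virsh blkiotune testguest
# virsh blkiotune testguest --device-weights /dev/sdb,500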
14.4.2. Disk I/O throttling in virtual machinesWhen several VMs are running simultaneously, they can interfere with system performance by using excessive disk I/O. Disk I/O throttling in KVM virtualization provides the ability to set a limit on disk I/O requests sent from the VMs to the host machine. This can prevent a VM from over-utilizing shared resources and impacting the performance of other VMs. To enable disk I/O throttling, set a limit on disk I/O requests sent from each block device attached to VMs to the host machine. Procedure
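A hedged sketch for a hypothetical VM testguest and its vdb disk, limiting the disk to 1000 I/O operations per second:
# virsh domblklist testguest
# virsh blkdeviotune testguest vdb --total-iops-sec 1000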
Additional information
14.4.3. Enabling multi-queue virtio-scsiWhen using virtio-scsi storage devices in your virtual machines (VMs), the multi-queue virtio-scsi feature provides improved storage performance and scalability. It enables each virtual CPU (vCPU) to have a separate queue and interrupt to use without affecting other vCPUs. Procedure
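A hedged sketch of the kind of XML change made with virsh edit; the number of queues (4) is an example and is typically matched to the vCPU count:
<controller type='scsi' model='virtio-scsi'>
  <driver queues='4'/>
</controller>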
14.5. Optimizing virtual machine CPU performanceMuch like physical CPUs in host machines, vCPUs are critical to virtual machine (VM) performance. As a result, optimizing vCPUs can have a significant impact on the resource efficiency of your VMs. To optimize your vCPU:
14.5.1. Adding and removing virtual CPUs using the command-line interfaceTo increase or optimize the CPU performance of a virtual machine (VM), you can add or remove virtual CPUs (vCPUs) assigned to the VM. When performed on a running VM, this is also referred to as vCPU hot plugging and hot unplugging. However, note that vCPU hot unplug is not supported in RHEL 8, and Red Hat highly discourages its use. Prerequisites
Procedure
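A hedged sketch for a hypothetical VM testguest, raising the maximum vCPU count in the persistent configuration and then hot plugging vCPUs on the running VM:
# virsh setvcpus testguest 8 --maximum --config
# virsh setvcpus testguest 4 --live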
Verification
14.5.2. Managing virtual CPUs using the web consoleUsing the RHEL 8 web console, you can review and configure virtual CPUs used by virtual machines (VMs) to which the web console is connected. Procedure
14.5.3. Configuring NUMA in a virtual machineThe following methods can be used to configure Non-Uniform Memory Access (NUMA) settings of a virtual machine (VM) on a RHEL 8 host. Prerequisites
Procedure For ease of use, you can set up a VM’s NUMA configuration using automated utilities and services. However, manual NUMA setup is more likely to yield a significant performance improvement. Automatic methods
Manual methods
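A hedged sketch of the virsh commands typically involved, for a hypothetical VM testguest constrained to host NUMA node 1 and pinned to host CPUs 4-7:
# virsh numatune testguest --mode strict --nodeset 1 --config
# virsh vcpupin testguest 0 4
# virsh vcpupin testguest 1 5
# virsh emulatorpin testguest 6-7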
For best performance results, it is recommended to use all of the manual tuning methods listed above. 14.5.4. Sample vCPU performance tuning scenario
To obtain the best vCPU performance possible, Red Hat recommends using manual vcpupin, emulatorpin, and numatune settings together, for example like in the following scenario. Starting scenario
Solution
Verification
Sample topology
14.5.5. Deactivating kernel same-page mergingAlthough kernel same-page merging (KSM) improves memory density, it increases CPU utilization, and might adversely affect overall performance depending on the workload. In such cases, you can improve the virtual machine (VM) performance by deactivating KSM. Depending on your requirements, you can either deactivate KSM for a single session or persistently. Procedure
Memory pages shared between VMs before deactivating KSM will remain shared. To stop sharing, delete all the PageKSM pages in the system using the following command: # echo 2 > /sys/kernel/mm/ksm/run After anonymous pages replace the KSM pages, the khugepaged kernel service will rebuild transparent hugepages on the VM’s physical memory. 14.6. Optimizing virtual machine network performanceDue to the virtual nature of a VM’s network interface card (NIC), the VM loses a portion of its allocated host network bandwidth, which can reduce the overall workload efficiency of the VM. The following tips can minimize the negative impact of virtualization on the virtual NIC (vNIC) throughput. Procedure Use any of the following methods and observe if it has a beneficial effect on your VM network performance: Enable the vhost_net module On the host, ensure the vhost_net kernel feature is enabled: # lsmod | grep vhost vhost_net 32768 1 vhost 53248 1 vhost_net tap 24576 1 vhost_net tun 57344 6 vhost_net If the output of this command is blank, enable the vhost_net kernel module: # modprobe vhost_net Set up multi-queue virtio-net To set up the multi-queue virtio-net feature for a VM, use the virsh edit command to edit the XML configuration of the VM and add a multi-queue setting to the VM’s virtio network interface. If the VM is running, restart it for the changes to take effect. Batching network packets In Linux VM configurations with a long transmission path, batching packets before submitting them to the kernel may improve cache utilization. To set up packet batching, use the following command on the host, and replace tap0 with the name of the network interface that the VMs use: # ethtool -C tap0 rx-frames 64 SR-IOV If your host NIC supports SR-IOV, use SR-IOV device assignment for your vNICs. For more information, see Managing SR-IOV devices. 14.7. Virtual machine performance monitoring toolsTo identify what consumes the most VM resources and which aspect of VM performance needs optimization, performance diagnostic tools, both general and VM-specific, can be used. Default OS performance monitoring tools For standard performance evaluation, you can use the utilities provided by default by your host and guest operating systems:
perf kvm You can use the perf utility to collect and analyze virtualization-specific statistics about the performance of your RHEL 8 host. To do so:
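A hedged sketch; the process ID is a placeholder for a running qemu-kvm process, and recording is stopped with Ctrl+C:
# perf kvm stat record -p 51722
# perf kvm stat report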
For more information on using perf to monitor virtualization performance, see the perf-kvm man page. numastat To see the current NUMA configuration of your system, you can use the numastat utility, which is provided by installing the numactl package. The following shows a host with 4 running VMs, each obtaining memory from multiple NUMA nodes. This is not optimal for vCPU performance, and warrants adjusting: # numastat -c qemu-kvm Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total --------------- ------ ------ ------ ------ ------ ------ ------ ------ ----- 51722 (qemu-kvm) 68 16 357 6936 2 3 147 598 8128 51747 (qemu-kvm) 245 11 5 18 5172 2532 1 92 8076 53736 (qemu-kvm) 62 432 1661 506 4851 136 22 445 8116 53773 (qemu-kvm) 1393 3 1 2 12 0 0 6702 8114 --------------- ------ ------ ------ ------ ------ ------ ------ ------ ----- Total 1769 463 2024 7462 10037 2672 169 7837 32434In contrast, the following shows memory being provided to each VM by a single node, which is significantly more efficient. # numastat -c qemu-kvm Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total --------------- ------ ------ ------ ------ ------ ------ ------ ------ ----- 51747 (qemu-kvm) 0 0 7 0 8072 0 1 0 8080 53736 (qemu-kvm) 0 0 7 0 0 0 8113 0 8120 53773 (qemu-kvm) 0 0 7 0 0 0 1 8110 8118 59065 (qemu-kvm) 0 0 8050 0 0 0 0 0 8051 --------------- ------ ------ ------ ------ ------ ------ ------ ------ ----- Total 0 0 8072 0 8072 0 8114 8110 32368Chapter 15. Importance of power managementReducing the overall power consumption of computer systems helps to save cost. Effectively optimizing energy consumption of each system component includes studying different tasks that your system performs, and configuring each component to ensure that its performance is correct for that job. Lowering the power consumption of a specific component or of the system as a whole leads to lower heat and performance. Proper power management results in:
This section describes information about power management of your Red Hat Enterprise Linux systems. 15.1. Power management basicsEffective power management is built on the following principles: An idle CPU should only wake up when needed Since Red Hat Enterprise Linux 6, the kernel runs tickless, which means the previous periodic timer interrupts have been replaced with on-demand interrupts. Therefore, idle CPUs are allowed to remain idle until a new task is queued for processing, and CPUs that have entered lower power states can remain in these states longer. However, benefits from this feature can be offset if your system has applications that create unnecessary timer events. Polling events, such as checks for volume changes or mouse movement, are examples of such events. Red Hat Enterprise Linux includes tools that you can use to identify and audit applications on the basis of their CPU usage. For more information, see Audit and analysis overview and Tools for auditing. Unused hardware and devices should be disabled completely This is true for devices that have moving parts, for example, hard disks. In addition to this, some applications may leave an unused but enabled device "open"; when this occurs, the kernel assumes that the device is in use, which can prevent the device from going into a power saving state. Low activity should translate to low wattageIn many cases, however, this depends on modern hardware and correct BIOS configuration or UEFI on modern systems, including non-x86 architectures. Make sure that you are using the latest official firmware for your systems and that in the power management or device configuration sections of the BIOS the power management features are enabled. Some features to look for include:
Modern CPUs together with Advanced Configuration and Power Interface (ACPI) provide different power states. The three different states are:
15.2. Audit and analysis overviewThe detailed manual audit, analysis, and tuning of a single system is usually the exception because the time and cost spent to do so typically outweighs the benefits gained from these last pieces of system tuning. However, performing these tasks once for a large number of nearly identical systems where you can reuse the same settings for all systems can be very useful. For example, consider the deployment of thousands of desktop systems, or an HPC cluster where the machines are nearly identical. Another reason to do auditing and analysis is to provide a basis for comparison against which you can identify regressions or changes in system behavior in the future. The results of this analysis can be very helpful in cases where hardware, BIOS, or software updates happen regularly and you want to avoid any surprises with regard to power consumption. Generally, a thorough audit and analysis gives you a much better idea of what is really happening on a particular system. Auditing and analyzing a system with regard to power consumption is relatively hard, even with the most modern systems available. Most systems do not provide the necessary means to measure power use via software. Exceptions exist though:
Other vendors are likely to offer similar capabilities for their server platforms, but as can be seen there is no single solution available that is supported by all vendors. Direct measurements of power consumption are often only necessary to maximize savings as far as possible. 15.3. Tools for auditingRed Hat Enterprise Linux 8 offers tools using which you can perform system auditing and analysis. Most of them can be used as supplementary sources of information in case you want to verify what you have discovered already or in case you need more in-depth information on certain parts. Many of these tools are used for performance tuning as well, which include: PowerTOP It identifies specific components of kernel and user-space applications that frequently wake up the CPU. Use the powertop command as root to start the PowerTop tool and powertop --calibrate to calibrate the power estimation engine. For more information on PowerTop, see Managing power consumption with PowerTOP. Diskdevstat and netdevstat They are SystemTap tools that collect detailed information about the disk activity and network activity of all applications running on a system. Using the collected statistics by these tools, you can identify applications that waste power with many small I/O operations rather than fewer, larger operations. Using the yum install tuned-utils-systemtap kernel-debuginfo command as root, install the diskdevstat and netdevstat tool. To view the detailed information about the disk and network activity, use: # diskdevstat PID UID DEV WRITE_CNT WRITE_MIN WRITE_MAX WRITE_AVG READ_CNT READ_MIN READ_MAX READ_AVG COMMAND 3575 1000 dm-2 59 0.000 0.365 0.006 5 0.000 0.000 0.000 mozStorage #5 3575 1000 dm-2 7 0.000 0.000 0.000 0 0.000 0.000 0.000 localStorage DB [...] # netdevstat PID UID DEV XMIT_CNT XMIT_MIN XMIT_MAX XMIT_AVG RECV_CNT RECV_MIN RECV_MAX RECV_AVG COMMAND 3572 991 enp0s31f6 40 0.000 0.882 0.108 0 0.000 0.000 0.000 openvpn 3575 1000 enp0s31f6 27 0.000 1.363 0.160 0 0.000 0.000 0.000 Socket Thread [...]With these commands, you can specify three parameters: update_interval, total_duration, and display_histogram. TuneD It is a profile-based system tuning tool that uses the udev device manager to monitor connected devices, and enables both static and dynamic tuning of system settings. You can use the tuned-adm recommend command to determine which profile Red Hat recommends as the most suitable for a particular product. For more information on TuneD, see Getting started with TuneD and Customizing TuneD profiles. Using the powertop2tuned utility, you can create custom TuneD profiles from PowerTOP suggestions. For information on the powertop2tuned utility, see Optimizing power consumption. Virtual memory statistics (vmstat)It is provided by the procps-ng package. Using this tool, you can view the detailed information about processes, memory, paging, block I/O, traps, and CPU activity. To view this information, use: $ vmstat procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 5805576 380856 4852848 0 0 119 73 814 640 2 2 96 0 0Using the vmstat -a command, you can display active and inactive memory. For more information on other vmstat options, see the vmstat man page. iostatIt is provided by the sysstat package. This tool is similar to vmstat, but only for monitoring I/O on block devices. It also provides more verbose output and statistics. 
To monitor the system I/O, use: $ iostat avg-cpu: %user %nice %system %iowait %steal %idle 2.05 0.46 1.55 0.26 0.00 95.67 Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn nvme0n1 53.54 899.48 616.99 3445229 2363196 dm-0 42.84 753.72 238.71 2886921 914296 dm-1 0.03 0.60 0.00 2292 0 dm-2 24.15 143.12 379.80 548193 1454712blktraceIt provides detailed information about how time is spent in the I/O subsystem. To view this information in human readable format, use: # blktrace -d /dev/dm-0 -o - | blkparse -i - 253,0 1 1 0.000000000 17694 Q W 76423384 + 8 [kworker/u16:1] 253,0 2 1 0.001926913 0 C W 76423384 + 8 [0] [...]Here, The first column, 253,0 is the device major and minor tuple. The second column, 1, gives information about the CPU, followed by columns for timestamps and PID of the process issuing the IO process. The sixth column, Q, shows the event type, the 7th column, W for write operation, the 8th column, 76423384, is the block number, and the + 8 is the number of requested blocks. The last field, [kworker/u16:1], is the process name. By default, the blktrace command runs forever until the process is explicitly killed. Use the -w option to specify the run-time duration. turbostatIt is provided by the kernel-tools package. It reports on processor topology, frequency, idle power-state statistics, temperature, and power usage on x86-64 processors. To view this summary, use: # turbostat CPUID(0): GenuineIntel 0x16 CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:8e:a (6:142:10) CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM CPUID(6): APERF, TURBO, DTS, PTM, HWP, HWPnotify, HWPwindow, HWPepp, No-HWPpkg, EPB [...]By default, turbostat prints a summary of counter results for the entire screen, followed by counter results every 5 seconds. Specify a different period between counter results with the -i option, for example, execute turbostat -i 10 to print results every 10 seconds instead. Turbostat is also useful for identifying servers that are inefficient in terms of power usage or idle time. It also helps to identify the rate of system management interrupts (SMIs) occurring on the system. It can also be used to verify the effects of power management tuning. cpupowerIT is a collection of tools to examine and tune power saving related features of processors. Use the cpupower command with the frequency-info, frequency-set, idle-info, idle-set, set, info, and monitor options to display and set processor related values. For example, to view available cpufreq governors, use: $ cpupower frequency-info --governors analyzing CPU 0: available cpufreq governors: performance powersaveFor more information about cpupower, see Viewing CPU related information. GNOME Power Manager It is a daemon that is installed as part of the GNOME desktop environment. GNOME Power Manager notifies you of changes in your system’s power status; for example, a change from battery to AC power. It also reports battery status, and warns you when battery power is low.Additional resources
Chapter 16. Managing power consumption with PowerTOPAs a system administrator, you can use the PowerTOP tool to analyze and manage power consumption. 16.1. The purpose of PowerTOPPowerTOP is a program that diagnoses issues related to power consumption and provides suggestions on how to extend battery lifetime. The PowerTOP tool can provide an estimate of the total power usage of the system and also individual power usage for each process, device, kernel worker, timer, and interrupt handler. The tool can also identify specific components of kernel and user-space applications that frequently wake up the CPU. Red Hat Enterprise Linux 8 uses version 2.x of PowerTOP. 16.2. Using PowerTOPPrerequisites
16.2.1. Starting PowerTOPProcedure
Laptops should run on battery power when running the powertop command. 16.2.2. Calibrating PowerTOPProcedure
Note that powertop --calibrate can only be used on laptops. 16.2.3. Setting the measuring intervalBy default, PowerTOP takes measurements in 20 seconds intervals. If you want to change this measuring frequency, use the following procedure: Procedure
16.3. PowerTOP statisticsWhile it runs, PowerTOP gathers statistics from the system. PowerTOP's output provides multiple tabs:
You can use the Tab and Shift+Tab keys to cycle through these tabs. 16.3.1. The Overview tabIn the Overview tab, you can view a list of the components that either send wakeups to the CPU most frequently or consume the most power. The items within the Overview tab, including processes, interrupts, devices, and other resources, are sorted according to their utilization. The adjacent columns within the Overview tab provide the following pieces of information: Usage Power estimation of how the resource is being used. Events/s Wakeups per second. The number of wakeups per second indicates how efficiently the services or the devices and drivers of the kernel are performing. Less wakeups means that less power is consumed. Components are ordered by how much further their power usage can be optimized. Category Classification of the component; such as process, device, or timer. Description Description of the component. If properly calibrated, a power consumption estimation for every listed item in the first column is shown as well. Apart from this, the Overview tab includes the line with summary statistics such as:
16.3.2. The Idle stats tabThe Idle stats tab shows usage of C-states for all processors and cores, while the Frequency stats tab shows usage of P-states including the Turbo mode, if applicable, for all processors and cores. The duration of C- or P-states is an indication of how well the CPU usage has been optimized. The longer the CPU stays in the higher C- or P-states (for example C4 is higher than C3), the better the CPU usage optimization is. Ideally, residency is 90% or more in the highest C- or P-state when the system is idle. 16.3.3. The Device stats tabThe Device stats tab provides similar information to the Overview tab but only for devices. 16.3.4. The Tunables tabThe Tunables tab contains PowerTOP's suggestions for optimizing the system for lower power consumption. Use the up and down keys to move through suggestions, and the enter key to toggle the suggestion on or off. 16.3.5. The WakeUp tabThe WakeUp tab displays the device wakeup settings available for users to change as and when required. Use the up and down keys to move through the available settings, and the enter key to enable or disable a setting. Figure 16.1. PowerTOP output 16.4. Why Powertop does not display Frequency stats values in some instancesWhile using the Intel P-State driver, PowerTOP only displays values in the Frequency Stats tab if the driver is in passive mode. But, even in this case, the values may be incomplete. In total, there are three possible modes of the Intel P-State driver:
Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP. However, it is recommended to keep your system on the default settings. To see what driver is loaded and in what mode, run: # cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
While using the Intel P-State driver, add the following argument to the kernel boot command line to force the driver to run in passive mode: intel_pstate=passiveTo disable the Intel P-State driver and use the ACPI CPUfreq driver instead, add the following argument to the kernel boot command line: intel_pstate=disable16.5. Generating an HTML outputApart from powertop’s output in the terminal, you can also generate an HTML report. Procedure
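For example; the report file name is a placeholder:
# powertop --html=powertop-report.html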
16.6. Optimizing power consumptionTo optimize power consumption, you can use either the powertop service or the powertop2tuned utility. 16.6.1. Optimizing power consumption using the powertop serviceYou can use the powertop service to automatically enable all PowerTOP's suggestions from the Tunables tab on the boot: Procedure
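A minimal sketch:
# systemctl enable --now powertop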
16.6.2. The powertop2tuned utilityThe powertop2tuned utility allows you to create custom TuneD profiles from PowerTOP suggestions. By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile on the currently selected TuneD profile. For safety reasons, all PowerTOP tunings are initially disabled in the new profile. To enable the tunings, you can:
16.6.3. Optimizing power consumption using the powertop2tuned utilityPrerequisites
Procedure
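A hedged sketch, using a hypothetical profile name:
# yum install tuned-utils
# powertop2tuned my-powertop-profile
# tuned-adm profile my-powertop-profile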
Additional information
16.6.4. Comparison of powertop.service and powertop2tunedOptimizing power consumption with powertop2tuned is preferred over powertop.service for the following reasons:
Chapter 17. Tuning CPU frequency to optimize energy consumptionThis section describes how to optimize the power consumption of your system by using the available cpupower commands to set the CPU speed according to your requirements, after setting up the required CPUfreq governor. 17.1. Supported cpupower tool commands
The cpupower tool is a collection of tools to examine and tune power saving related features of processors. The cpupower tool supports the following commands: idle-info Displays the available idle states and other statistics for the CPU idle driver using the cpupower idle-info command. For more information, see CPU Idle States. idle-set Enables or disables specific CPU idle state using the cpupower idle-set command as root. Use -d to disable and -e to enable a specific CPU idle state. frequency-info Displays the current cpufreq driver and available cpufreq governors using the cpupower frequency-info command. For more information, see CPUfreq drivers, Core CPUfreq Governors, and Intel P-state CPUfreq governors. frequency-set Sets the cpufreq and governors using the cpupower frequency-set command as root. For more information, see Setting up CPUfreq governor. set Sets processor power saving policies using the cpupower set command as root. Using the --perf-bias option, you can enable software on supported Intel processors to determine the balance between optimum performance and saving power. Assigned values range from 0 to 15, where 0 is optimum performance and 15 is optimum power efficiency. By default, the --perf-bias option applies to all cores. To apply it only to individual cores, add the --cpu cpulist option. infoDisplays processor power related and hardware configurations, which you have enabled using the cpupower set command. For example, if you assign the --perf-bias value as 5: # cpupower set --perf-bias 5 # cpupower info analyzing CPU 0: perf-bias: 5monitorDisplays the idle statistics and CPU demands using the cpupower monitor command. # cpupower monitor | Nehalem || Mperf ||Idle_Stats CPU| C3 | C6 | PC3 | PC6 || C0 | Cx | Freq || POLL | C1 | C1E | C3 | C6 | C7s | C8 | C9 | C10 0| 1.95| 55.12| 0.00| 0.00|| 4.21| 95.79| 3875|| 0.00| 0.68| 2.07| 3.39| 88.77| 0.00| 0.00| 0.00| 0.00 [...]Using the -l option, you can list all available monitors on your system and the -m option to display information related to specific monitors. For example, to monitor information related to the Mperf monitor, use the cpupower monitor -m Mperf command as root. Additional resources
17.2. CPU Idle StatesCPUs with the x86 architecture support various states in which parts of the CPU are deactivated or run at lower performance settings. These states, known as C-states, allow you to save power by partially deactivating CPUs that are not in use. There is no need to configure the C-state, unlike P-states that require a governor and potentially some setup to avoid undesirable power or performance issues. C-states are numbered from C0 upwards, with higher numbers representing decreased CPU functionality and greater power saving. C-states of a given number are broadly similar across processors, although the exact details of the specific feature sets of the state may vary between processor families. C-states 0–3 are defined as follows: C0 In this state, the CPU is working and not idle at all. C1, Halt In this state, the processor is not executing any instructions but is typically not in a lower power state. The CPU can continue processing with practically no delay. All processors offering C-states need to support this state. Pentium 4 processors support an enhanced C1 state called C1E that actually is a state for lower power consumption. C2, Stop-Clock In this state, the clock is frozen for this processor but it keeps the complete state for its registers and caches, so after starting the clock again it can immediately start processing again. This is an optional state. C3, Sleep In this state, the processor goes to sleep and does not need to keep its cache up to date. Due to this reason, waking up from this state needs considerably more time than from the C2 state. This is an optional state. You can view the available idle states and other statistics for the CPUidle driver using the following command: $ cpupower idle-info CPUidle governor: menu analyzing CPU 0: Number of idle states: 9 Available idle states: POLL C1 C1E C3 C6 C7s C8 C9 C10 [...]Intel CPUs with the "Nehalem" microarchitecture feature a C6 state, which can reduce the voltage supply of a CPU to zero, but typically reduces power consumption by between 80% and 90%. The kernel in Red Hat Enterprise Linux 8 includes optimizations for this new C-state.
17.3. Overview of CPUfreqOne of the most effective ways to reduce power consumption and heat output on your system is CPUfreq, which is supported by x86 and ARM64 architectures in Red Hat Enterprise Linux 8. CPUfreq, also referred to as CPU speed scaling, is the infrastructure in the Linux kernel that enables it to scale the CPU frequency in order to save power. CPU scaling can be done automatically depending on the system load, in response to Advanced Configuration and Power Interface (ACPI) events, or manually by user-space programs, and it allows the clock speed of the processor to be adjusted on the fly. This enables the system to run at a reduced clock speed to save power. The rules for shifting frequencies, whether to a faster or slower clock speed and when to shift frequencies, are defined by the CPUfreq governor. You can view the cpufreq information using the cpupower frequency-info command as root. 17.3.1. CPUfreq driversUsing the cpupower frequency-info --driver command as root, you can view the current CPUfreq driver. The following are the two available drivers for CPUfreq that can be used: ACPI CPUfreq Advanced Configuration and Power Interface (ACPI) CPUfreq driver is a kernel driver that controls the frequency of a particular CPU through ACPI, which ensures the communication between the kernel and the hardware. Intel P-state In Red Hat Enterprise Linux 8, Intel P-state driver is supported. The driver provides an interface for controlling the P-state selection on processors based on the Intel Xeon E series architecture or newer architectures. Currently, Intel P-state is used by default for supported CPUs. You can switch to using ACPI CPUfreq by adding the intel_pstate=disable command to the kernel command line. Intel P-state implements the setpolicy() callback. The driver decides what P-state to use based on the policy requested from the cpufreq core. If the processor is capable of selecting its next P-state internally, the driver offloads this responsibility to the processor. If not, the driver implements algorithms to select the next P-state. Intel P-state provides its own sysfs files to control the P-state selection. These files are located in the /sys/devices/system/cpu/intel_pstate/ directory. Any changes made to the files are applicable to all CPUs. This directory contains the following files that are used for setting P-state parameters:
Additional resources
17.3.2. Core CPUfreq governorsA CPUfreq governor defines the power characteristics of the system CPU, which in turn affects the CPU performance. Each governor has its own unique behavior, purpose, and suitability in terms of workload. Using the cpupower frequency-info --governor command as root, you can view the available CPUfreq governors. Red Hat Enterprise Linux 8 includes multiple core CPUfreq governors: cpufreq_performance It forces the CPU to use the highest possible clock frequency. This frequency is statically set and does not change. As such, this particular governor offers no power saving benefit. It is only suitable for hours of a heavy workload, and only during times wherein the CPU is rarely or never idle. cpufreq_powersave It forces the CPU to use the lowest possible clock frequency. This frequency is statically set and does not change. This governor offers maximum power savings, but at the cost of the lowest CPU performance. The term "powersave" can sometimes be deceiving though, since in principle a slow CPU on full load consumes more power than a fast CPU that is not loaded. As such, while it may be advisable to set the CPU to use the powersave governor during times of expected low activity, any unexpected high loads during that time can cause the system to actually consume more power. The Powersave governor is more of a speed limiter for the CPU than a power saver. It is most useful in systems and environments where overheating can be a problem. cpufreq_ondemand It is a dynamic governor, using which you can enable the CPU to achieve maximum clock frequency when the system load is high, and also minimum clock frequency when the system is idle. While this allows the system to adjust power consumption accordingly with respect to system load, it does so at the expense of latency between frequency switching. As such, latency can offset any performance or power saving benefits offered by the ondemand governor if the system switches between idle and heavy workloads too often. For most systems, the ondemand governor can provide the best compromise between heat emission, power consumption, performance, and manageability. When the system is only busy at specific times of the day, the ondemand governor automatically switches between maximum and minimum frequency depending on the load without any further intervention. cpufreq_userspace It allows user-space programs, or any process running as root, to set the frequency. Of all the governors, userspace is the most customizable and depending on how it is configured, it can offer the best balance between performance and consumption for your system. cpufreq_conservative Similar to the ondemand governor, the conservative governor also adjusts the clock frequency according to usage. However, the conservative governor switches between frequencies more gradually. This means that the conservative governor adjusts to a clock frequency that it considers best for the load, rather than simply choosing between maximum and minimum. While this can possibly provide significant savings in power consumption, it does so at an ever greater latency than the ondemand governor. You can enable a governor using cron jobs. This allows you to automatically set specific governors during specific times of the day. As such, you can specify a low-frequency governor during idle times, for example, after work hours, and return to a higher-frequency governor during hours of heavy workload. 
For instructions on how to enable a specific governor, see Setting up CPUfreq governor. 17.3.3. Intel P-state CPUfreq governorsBy default, the Intel P-state driver operates in active mode with or without Hardware p-state (HWP) depending on whether the CPU supports HWP. Using the cpupower frequency-info --governor command as root, you can view the available CPUfreq governors. The functionality of performance and powersave Intel P-state CPUfreq governors is different compared to core CPUfreq governors of the same names. The Intel P-state driver can operate in the following three different modes: Active mode with hardware-managed P-states When active mode with HWP is used, the Intel P-state driver instructs the CPU to perform the P-state selection. The driver can provide frequency hints. However, the final selection depends on CPU internal logic. In active mode with HWP, the Intel P-state driver provides two P-state selection algorithms:
When active mode without HWP is used, the Intel P-state driver provides two P-state selection algorithms:
17.3.4. Setting up CPUfreq governorAll CPUfreq drivers are built in as part of the kernel-tools package, and selected automatically. To set up CPUfreq, you need to select a governor. Prerequisites
Procedure
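For example, a minimal sketch that selects the performance governor and confirms the resulting policy:
# cpupower frequency-set --governor performance
# cpupower frequency-info --policy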
If the kernel-tools package is not installed, the CPUfreq settings can be viewed in the /sys/devices/system/cpu/cpuid/cpufreq/ directory. Settings and values can be changed by writing to these tunables. For example, to set the minimum clock speed of cpu0 to 360 MHz, use: # echo 360000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freqVerification
Additional resources
Chapter 18. Getting started with perfAs a system administrator, you can use the perf tool to collect and analyze performance data of your system. 18.1. Introduction to perfThe perf user-space tool interfaces with the kernel-based subsystem Performance Counters for Linux (PCL). perf is a powerful tool that uses the Performance Monitoring Unit (PMU) to measure, record, and monitor a variety of hardware and software events. perf also supports tracepoints, kprobes, and uprobes. 18.2. Installing perfThis procedure installs the perf user-space tool. Procedure
18.3. Common perf commandsThis section provides an overview of commonly used perf commands. Commonly used perf commands perf stat This command provides overall statistics for common performance events, including instructions executed and clock cycles consumed. Options allow for selection of events other than the default measurement events. perf record This command records performance data into a file, perf.data, which can be later analyzed using the perf report command. perf report This command reads and displays the performance data from the perf.data file created by perf record. perf list This command lists the events available on a particular machine. These events will vary based on performance monitoring hardware and software configuration of the system. perf top This command performs a similar function to the top utility. It generates and displays a performance counter profile in realtime. perf trace This command performs a similar function to the strace tool. It monitors the system calls used by a specified thread or process and all signals received by that application. perf help This command displays a complete list of perf commands.Additional resources
Chapter 19. Profiling CPU usage in real time with perf topYou can use the perf top command to measure CPU usage of different functions in real time. Prerequisites
19.1. The purpose of perf topThe perf top command is used for real time system profiling and functions similarly to the top utility. However, where the top utility generally shows you how much CPU time a given process or thread is using, perf top shows you how much CPU time each specific function uses. In its default state, perf top tells you about functions being used across all CPUs in both the user-space and the kernel-space. To use perf top you need root access.
19.2. Profiling CPU usage with perf topThis procedure activates perf top and profiles CPU usage in real time. Prerequisites
Procedure
Additional resources
19.3. Interpretation of perf top outputThe perf top monitoring interface displays the data in several columns: The "Overhead" column Displays the percent of CPU a given function is using. The "Shared Object" column Displays name of the program or library which is using the function. The "Symbol" column Displays the function name or symbol. Functions executed in the kernel-space are identified by [k] and functions executed in the user-space are identified by [.]. 19.4. Why perf displays some function names as raw function addressesFor kernel functions, perf uses the information from the /proc/kallsyms file to map the samples to their respective function names or symbols. For functions executed in the user space, however, you might see raw function addresses because the binary is stripped. The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC) to display the function names or symbols in such a situation. It is not necessary to re-run the perf record command after installing the debuginfo associated with an executable. Simply re-run the perf report command. 19.5. Enabling debug and source repositoriesA standard installation of Red Hat Enterprise Linux does not enable the debug and source repositories. These repositories contain information needed to debug the system components and measure their performance. Procedure
19.6. Getting debuginfo packages for an application or library using GDBDebugging information is required to debug code. For code that is installed from a package, the GNU Debugger (GDB) automatically recognizes missing debug information, resolves the package name and provides concrete advice on how to get the package. Prerequisites
Procedure
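A hedged sketch: start GDB on the binary of interest and, if debug information is missing, GDB prints the exact debuginfo-install command to run. The coreutils package is used here only as an illustration:
$ gdb /bin/ls
Missing separate debuginfos, use: dnf debuginfo-install coreutils-<version>
(gdb) q
# dnf debuginfo-install coreutils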
Chapter 20. Counting events during process execution with perf statYou can use the perf stat command to count hardware and software events during process execution. Prerequisites
20.1. The purpose of perf stat
The perf stat command executes a specified command, keeps a running count of hardware and software event occurrences during the command’s execution, and generates statistics of these counts. If you do not specify any events, then perf stat counts a set of common hardware and software events. 20.2. Counting events with perf stat
You can use perf stat to count hardware and software event occurrences during command execution and generate statistics of these counts. By default, perf stat operates in per-thread mode. Prerequisites
Procedure
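For example, a minimal sketch that counts the default event set while a command runs; replace sleep 5 with your own workload:
# perf stat sleep 5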
Additional resources
20.3. Interpretation of perf stat output
perf stat executes a specified command, counts event occurrences during the command’s execution, and displays statistics of these counts in three columns:
20.4. Attaching perf stat to a running processYou can attach perf stat to a running process. This will instruct perf stat to count event occurrences only in the specified processes during the execution of a command. Prerequisites
Procedure
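For example (a sketch; ID1,ID2 are placeholders for real process IDs, and the sleep command bounds how long events are counted):
# perf stat -p ID1,ID2 sleep 10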
Additional resources
Chapter 21. Recording and analyzing performance profiles with perfThe perf tool allows you to record performance data and analyze it at a later time. Prerequisites
21.1. The purpose of perf record
The perf record command samples performance data and stores it in a file, perf.data, which can be read and visualized with other perf commands. perf.data is generated in the current directory and can be accessed at a later time, possibly on a different machine. If you do not specify a command for perf record to profile, it records until you manually stop the process by pressing Ctrl+C. You can attach perf record to specific processes by passing the -p option followed by one or more process IDs. You can run perf record without root access; however, doing so only samples performance data in the user space. In the default mode, perf record uses CPU cycles as the sampling event and operates in per-thread mode with inherit mode enabled. 21.2. Recording a performance profile without root access
You can use perf record without root access to sample and record performance data in the user-space only. Prerequisites
Procedure
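A minimal sketch, run as a regular user so that only user-space samples are collected; replace ./my_command with your workload:
$ perf record ./my_command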
Additional resources
21.3. Recording a performance profile with root accessYou can use perf record with root access to sample and record performance data in both the user-space and the kernel-space simultaneously. Prerequisites
Procedure
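For example, a sketch run as root; sampling ends when the specified command finishes or when you press Ctrl+C:
# perf record ./my_command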
Additional resources
21.4. Recording a performance profile in per-CPU mode
You can use perf record in per-CPU mode to sample and record performance data in both the user-space and the kernel-space simultaneously across all threads on a monitored CPU. By default, per-CPU mode monitors all online CPUs. Prerequisites
Procedure
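For example, a minimal sketch using the -a option to sample all online CPUs for the duration of the sleep command:
# perf record -a sleep 10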
Additional resources
21.5. Capturing call graph data with perf recordYou can configure the perf record tool so that it records which function is calling other functions in the performance profile. This helps to identify a bottleneck if several processes are calling the same function. Prerequisites
Procedure
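For example (a sketch; fp selects frame-pointer based call-graph collection, and dwarf or lbr can be used instead where supported):
# perf record --call-graph fp ./my_command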
Additional resources
21.6. Analyzing perf.data with perf reportYou can use perf report to display and analyze a perf.data file. Prerequisites
Procedure
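For example, run in the directory that contains the perf.data file:
# perf report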
Additional resources
21.7. Interpretation of perf report outputThe table displayed by running the perf report command sorts the data into several columns: The 'Overhead' column Indicates what percentage of overall samples were collected in that particular function. The 'Command' column Tells you which process the samples were collected from. The 'Shared Object' column Displays the name of the ELF image where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel). The 'Symbol' column Displays the function name or symbol. In default mode, the functions are sorted in descending order with those with the highest overhead displayed first. 21.8. Generating a perf.data file that is readable on a different deviceYou can use the perf tool to record performance data into a perf.data file to be analyzed on a different device. Prerequisites
Procedure
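An illustrative sketch: record system-wide data, then bundle the debug objects referenced by perf.data into an archive that can be copied along with it:
# perf record -a sleep 10
# perf archive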
Verification steps
21.9. Analyzing a perf.data file that was created on a different deviceYou can use the perf tool to analyze a perf.data file that was generated on a different device. Prerequisites
Procedure
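An illustrative sketch, assuming both perf.data and the archive produced by perf archive were copied into the current directory on the analysis machine:
$ mkdir -p ~/.debug
$ tar xf perf.data.tar.bz2 -C ~/.debug
$ perf report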
21.10. Why perf displays some function names as raw function addressesFor kernel functions, perf uses the information from the /proc/kallsyms file to map the samples to their respective function names or symbols. For functions executed in the user space, however, you might see raw function addresses because the binary is stripped. The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC) to display the function names or symbols in such a situation. It is not necessary to re-run the perf record command after installing the debuginfo associated with an executable. Simply re-run the perf report command. 21.11. Enabling debug and source repositoriesA standard installation of Red Hat Enterprise Linux does not enable the debug and source repositories. These repositories contain information needed to debug the system components and measure their performance. Procedure
21.12. Getting debuginfo packages for an application or library using GDBDebugging information is required to debug code. For code that is installed from a package, the GNU Debugger (GDB) automatically recognizes missing debug information, resolves the package name and provides concrete advice on how to get the package. Prerequisites
Procedure
Chapter 22. Investigating busy CPUs with perfWhen investigating performance issues on a system, you can use the perf tool to identify and monitor the busiest CPUs in order to focus your efforts. 22.1. Displaying which CPU events were counted on with perf statYou can use perf stat to display which CPU events were counted on by disabling CPU count aggregation. You must count events in system-wide mode by using the -a flag in order to use this functionality. Prerequisites
Procedure
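For example (a sketch; -a counts in system-wide mode and -A disables aggregation so each CPU is reported separately):
# perf stat -a -A sleep 5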
22.2. Displaying which CPU samples were taken on with perf reportThe perf record command samples performance data and stores this data in a perf.data file which can be read with the perf report command. The perf record command always records which CPU samples were taken on. You can configure perf report to display this information. Prerequisites
Procedure
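For example, a sketch that adds a CPU column to the report by using the cpu sort key:
# perf report --sort cpu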
22.3. Displaying specific CPUs during profiling with perf topYou can configure perf top to display specific CPUs and their relative usage while profiling your system in real time. Prerequisites
Procedure
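For example (a sketch; the -C option restricts profiling to the listed CPUs):
# perf top -C 0,1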
22.4. Monitoring specific CPUs with perf record and perf reportYou can configure perf record to only sample specific CPUs of interest and analyze the generated perf.data file with perf report for further analysis. Prerequisites
Procedure
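For example, a sketch in which only CPUs 0 and 1 are sampled while the sleep command runs:
# perf record -C 0,1 sleep 10
# perf report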
Chapter 23. Monitoring application performance with perfThis section describes how to use the perf tool to monitor application performance. 23.1. Attaching perf record to a running processYou can attach perf record to a running process. This will instruct perf record to only sample and record performance data in the specified processes. Prerequisites
Procedure
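For example (a sketch; ID1,ID2 are placeholder process IDs, and the sleep command bounds the recording time):
# perf record -p ID1,ID2 sleep 10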
23.2. Capturing call graph data with perf recordYou can configure the perf record tool so that it records which function is calling other functions in the performance profile. This helps to identify a bottleneck if several processes are calling the same function. Prerequisites
Procedure
Additional resources
23.3. Analyzing perf.data with perf reportYou can use perf report to display and analyze a perf.data file. Prerequisites
Procedure
Additional resources
Chapter 24. Creating uprobes with perf
24.1. Creating uprobes at the function level with perf
You can use the perf tool to create dynamic tracepoints at arbitrary points in a process or application. These tracepoints can then be used in conjunction with other perf tools such as perf stat and perf record to better understand the behavior of the process or application. Prerequisites
Procedure
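An illustrative sketch, assuming a locally built executable my_prog that contains a function isprime(): create the uprobe, record over it while running the program, and remove it when finished. Paths and names are placeholders:
# perf probe -x /path/to/my_prog isprime
# perf record -e probe_my_prog:isprime ./my_prog
# perf probe -d probe_my_prog:isprime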
24.2. Creating uprobes on lines within a function with perf
You can use the perf tool to create dynamic tracepoints on individual lines within a function. These tracepoints can then be used in conjunction with other perf tools such as perf stat and perf record to better understand the behavior of the process or application. Prerequisites
Procedure
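An illustrative sketch, assuming debugging information is available for my_prog; the probe is placed on a line within the isprime() function, and the line number is only a placeholder:
# perf probe -x /path/to/my_prog isprime:11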
24.3. Perf script output of data recorded over uprobes
A common method to analyze data collected using uprobes is using the perf script command to read a perf.data file and display a detailed trace of the recorded workload. In the perf script example output:
* A uprobe is added to the function isprime() in a program called my_prog
* a is a function argument added to the uprobe. Alternatively, a could be an arbitrary variable visible in the code scope of where you add your uprobe:
# perf script
    my_prog  1367 [007] 10802159.906593: probe_my_prog:isprime: (400551) a=2
    my_prog  1367 [007] 10802159.906623: probe_my_prog:isprime: (400551) a=3
    my_prog  1367 [007] 10802159.906625: probe_my_prog:isprime: (400551) a=4
    my_prog  1367 [007] 10802159.906627: probe_my_prog:isprime: (400551) a=5
    my_prog  1367 [007] 10802159.906629: probe_my_prog:isprime: (400551) a=6
    my_prog  1367 [007] 10802159.906631: probe_my_prog:isprime: (400551) a=7
    my_prog  1367 [007] 10802159.906633: probe_my_prog:isprime: (400551) a=13
    my_prog  1367 [007] 10802159.906635: probe_my_prog:isprime: (400551) a=17
    my_prog  1367 [007] 10802159.906637: probe_my_prog:isprime: (400551) a=19
Chapter 25. Profiling memory accesses with perf mem
You can use the perf mem command to sample memory accesses on your system. 25.1. The purpose of perf mem
The mem subcommand of the perf tool enables the sampling of memory accesses (loads and stores). The perf mem command provides information about memory latency, types of memory accesses, functions causing cache hits and misses, and, by recording the data symbol, the memory locations where these hits and misses occur. 25.2. Sampling memory access with perf mem
This procedure describes how to use the perf mem command to sample memory accesses on your system. The command takes the same options as perf record and perf report as well as some options exclusive to the mem subcommand. The recorded data is stored in a perf.data file in the current directory for later analysis. Prerequisites
Procedure
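For example, a sketch that records memory access samples system-wide while the sleep command runs, then displays them:
# perf mem record -a sleep 10
# perf mem report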
Additional resources
25.3. Interpretation of perf mem report outputThe table displayed by running the perf mem report command without any modifiers sorts the data into several columns: The 'Overhead' column Indicates percentage of overall samples collected in that particular function. The 'Samples' column Displays the number of samples accounted for by that row. The 'Local Weight' column Displays the access latency in processor core cycles. The 'Memory Access' column Displays the type of memory access that occurred. The 'Symbol' column Displays the function name or symbol. The 'Shared Object' column Displays the name of the ELF image where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel). The 'Data Symbol' column Displays the address of the memory location that row was targeting. Oftentimes, due to dynamic allocation of memory or stack memory being accessed, the 'Data Symbol' column will display a raw address. The "Snoop" column Displays bus transactions. The 'TLB Access' column Displays TLB memory accesses. The 'Locked' column Indicates if a function was or was not memory locked. In default mode, the functions are sorted in descending order with those with the highest overhead displayed first. Chapter 26. Detecting false sharingFalse sharing occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors to access other data items that are not being shared between the processors. This initial modification requires that the other processors using the cache line invalidate their copy and request an updated one despite the processors not needing, or even necessarily having access to, an updated version of the modified data item. You can use the perf c2c command to detect false sharing. 26.1. The purpose of perf c2cThe c2c subcommand of the perf tool enables Shared Data Cache-to-Cache (C2C) analysis. You can use the perf c2c command to inspect cache-line contention to detect both true and false sharing. Cache-line contention occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies data items on the same cache line that is in use by other processors. All other processors using this cache-line must then invalidate their copy and request an updated one. This can lead to degraded performance. The perf c2c command provides the following information:
26.2. Detecting cache-line contention with perf c2cUse the perf c2c command to detect cache-line contention in a system. The perf c2c command supports the same options as perf record as well as some options exclusive to the c2c subcommand. The recorded data is stored in a perf.data file in the current directory for later analysis. Prerequisites
Procedure
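For example (a sketch; -a records system-wide while the sleep command bounds the collection time):
# perf c2c record -a sleep 10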
Additional resources
26.3. Visualizing a perf.data file recorded with perf c2c recordThis procedure describes how to visualize the perf.data file, which is recorded using the perf c2c command. Prerequisites
Procedure
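For example, to generate the text-based visualization in the directory containing the perf.data file:
# perf c2c report --stdio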
26.4. Interpretation of perf c2c report output
This section describes how to interpret the output of the perf c2c report command. The visualization displayed by running the perf c2c report --stdio command sorts the data into several tables: Trace Events Information This table provides a high level summary of all the load and store samples, which are collected by the perf c2c record command. Global Shared Cache Line Event Information This table provides statistics over the shared cache lines. c2c Details This table provides information about what events were sampled and how the perf c2c report data is organized within the visualization. Shared Data Cache Line Table This table provides a one line summary for the hottest cache lines where false sharing is detected and is sorted in descending order by the amount of remote Hitm detected per cache line by default. Shared Cache Line Distribution Pareto This table provides a variety of information about each cache line experiencing contention:
26.5. Detecting false sharing with perf c2cThis procedure describes how to detect false sharing using the perf c2c command. Prerequisites
Procedure
Chapter 27. Getting started with flamegraphs
As a system administrator, you can use flamegraphs to create visualizations of system performance data recorded with the perf tool. As a software developer, you can use flamegraphs to create visualizations of application performance data recorded with the perf tool. Sampling stack traces is a common technique for profiling CPU performance with the perf tool. Unfortunately, the results of profiling stack traces with perf can be extremely verbose and labor-intensive to analyze. Flamegraphs are visualizations created from data recorded with perf to make identifying hot code-paths faster and easier. 27.1. Installing flamegraphs
To begin using flamegraphs, install the required package. Procedure
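A minimal sketch, assuming the js-d3-flame-graph package that Red Hat Enterprise Linux 8 provides for this purpose:
# yum install js-d3-flame-graph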
27.2. Creating flamegraphs over the entire systemThis procedure describes how to visualize performance data recorded over an entire system using flamegraphs. Procedure
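An illustrative sketch, assuming the perf script flamegraph integration provided with the js-d3-flame-graph package; it samples all CPUs at 99 Hz for 60 seconds and writes a flamegraph.html file in the current directory:
# perf script flamegraph -a -F 99 sleep 60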
Verification steps
27.3. Creating flamegraphs over specific processesYou can use flamegraphs to visualize performance data recorded over specific running processes. Procedure
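An illustrative sketch using the same perf script flamegraph integration; ID1,ID2 are placeholder process IDs:
# perf script flamegraph -a -F 99 -p ID1,ID2 sleep 60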
Verification steps
27.4. Interpreting flamegraphsEach box in the flamegraph represents a different function in the stack. The y-axis shows the depth of the stack with the topmost box in each stack being the function that was actually on-CPU and everything below it being ancestry. The x-axis displays the population of the sampled call-graph data. The children of a stack in a given row are displayed based on the number of samples taken of each respective function in descending order along the x-axis; the x-axis does not represent the passing of time. The wider an individual box is, the more frequent it was on-CPU or part of an on-CPU ancestry at the time the data was being sampled. Procedure
Boxes representing user-space functions may be labeled as Unknown in flamegraphs because the binary of the function is stripped. The debuginfo package of the executable must be installed or, if the executable is a locally developed application, the application must be compiled with debugging information turned on (the -g option in GCC) to display the function names or symbols in such a situation.
Chapter 28. Monitoring processes for performance bottlenecks using perf circular buffers
You can create circular buffers that take event-specific snapshots of data with the perf tool in order to monitor performance bottlenecks in specific processes or parts of applications running on your system. In such cases, perf only writes data to a perf.data file for later analysis if a specified event is detected. 28.1. Circular buffers and event-specific snapshots with perf
When investigating performance issues in a process or application with perf, it is often not practical or necessary to record data for the hours preceding a specific event of interest. In such cases, you can use perf record to create custom circular buffers that take snapshots after specific events. The --overwrite option makes perf record store all data in an overwritable circular buffer. When the buffer gets full, perf record automatically overwrites the oldest records which, therefore, never get written to a perf.data file. Using the --overwrite and --switch-output-event options together configures a circular buffer that records and dumps data continuously until it detects the --switch-output-event trigger event. The trigger event signals to perf record that something of interest to the user has occurred and to write the data in the circular buffer to a perf.data file. This collects specific data you are interested in while simultaneously reducing the overhead of the running perf process by not writing data you do not want to a perf.data file. 28.2. Collecting specific data to monitor for performance bottlenecks using perf circular buffers
With the perf tool, you can create circular buffers that are triggered by events you specify in order to only collect data you are interested in. To create circular buffers that collect event-specific data, use the --overwrite and --switch-output-event options for perf. Prerequisites
Procedure
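An illustrative sketch with placeholder event names: the circular buffer holds the most recent samples and is flushed to a perf.data file each time the trigger tracepoint fires. The syscalls:sys_enter_openat tracepoint is used here purely as an example trigger:
# perf record --overwrite -e cycles --switch-output-event syscalls:sys_enter_openat -a -- sleep 600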
Chapter 29. Adding and removing tracepoints from a running perf collector without stopping or restarting perfBy using the control pipe interface to enable and disable different tracepoints in a running perf collector, you can dynamically adjust what data you are collecting without having to stop or restart perf. This ensures you do not lose performance data that would have otherwise been recorded during the stopping or restarting process. 29.1. Adding tracepoints to a running perf collector without stopping or restarting perfAdd tracepoints to a running perf collector using the control pipe interface to adjust the data you are recording without having to stop perf and losing performance data. Prerequisites
Procedure
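An illustrative sketch, assuming a perf build with control-pipe support; the FIFO names and the sched:sched_process_fork tracepoint are placeholders:
# mkfifo control ack
# perf record --control=fifo:control,ack -e 'sched:*' -a &
# echo 'enable sched:sched_process_fork' > control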
29.2. Removing tracepoints from a running perf collector without stopping or restarting perfRemove tracepoints from a running perf collector using the control pipe interface to reduce the scope of data you are collecting without having to stop perf and losing performance data. Prerequisites
Procedure
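A matching sketch for the same control FIFO; the event name is again a placeholder:
# echo 'disable sched:sched_process_fork' > control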
Chapter 30. Profiling memory allocation with numastatWith the numastat tool, you can display statistics over memory allocations in a system. The numastat tool displays data for each NUMA node separately. You can use this information to investigate memory performance of your system or the effectiveness of different memory policies on your system. 30.1. Default numastat statisticsBy default, the numastat tool displays statistics over these categories of data for each NUMA node: numa_hit The number of pages that were successfully allocated to this node. numa_miss The number of pages that were allocated on this node because of low memory on the intended node. Each numa_miss event has a corresponding numa_foreign event on another node. numa_foreign The number of pages initially intended for this node that were allocated to another node instead. Each numa_foreign event has a corresponding numa_miss event on another node. interleave_hit The number of interleave policy pages successfully allocated to this node. local_node The number of pages successfully allocated on this node by a process on this node. other_node The number of pages allocated on this node by a process on another node. High numa_hit values and low numa_miss values (relative to each other) indicate optimal performance. 30.2. Viewing memory allocation with numastatYou can view the memory allocation of the system by using the numastat tool. Prerequisites
Procedure
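For example, run the tool without options to display the default per-node counters:
# numastat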
Additional resources
Chapter 31. Configuring an operating system to optimize CPU utilization
This section describes how to configure the operating system to optimize CPU utilization across your workloads. 31.1. Tools for monitoring and diagnosing processor issues
The following are the tools available in Red Hat Enterprise Linux 8 to monitor and diagnose processor-related performance issues:
Additional resources
31.2. Types of system topologyIn modern computing, the idea of a CPU is a misleading one, as most modern systems have multiple processors. The topology of the system is the way these processors are connected to each other and to other system resources. This can affect system and application performance, and the tuning considerations for a system. The following are the two primary types of topology used in modern computing: Symmetric Multi-Processor (SMP) topology SMP topology allows all processors to access memory in the same amount of time. However, because shared and equal memory access inherently forces serialized memory accesses from all the CPUs, SMP system scaling constraints are now generally viewed as unacceptable. For this reason, practically all modern server systems are NUMA machines. Non-Uniform Memory Access (NUMA) topology NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory and processors that have local access to that memory, these are referred to collectively as a node. Processors on the same node have high speed access to that node’s memory bank, and slower access to memory banks not on their node. Therefore, there is a performance penalty when accessing non-local memory. Thus, performance sensitive applications on a system with NUMA topology should access memory that is on the same node as the processor executing the application, and should avoid accessing remote memory wherever possible. Multi-threaded applications that are sensitive to performance may benefit from being configured to execute on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application. If multiple application threads access the same cached data, then configuring those threads to execute on the same processor may be suitable. However, if multiple threads that access and cache different data execute on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in the cache. Use the perf tool to check for an excessive number of cache misses. 31.2.1. Displaying system topologiesThere are a number of commands that help understand the topology of a system. This procedure describes how to determine the system topology. Procedure
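For example, a sketch using utilities from the util-linux, numactl, and hwloc packages:
$ lscpu
$ numactl --hardware
$ lstopo-no-graphics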
Additional resources
31.3. Configuring kernel tick time
By default, Red Hat Enterprise Linux 8 uses a tickless kernel, which does not interrupt idle CPUs in order to reduce power usage and allow new processors to take advantage of deep sleep states. Red Hat Enterprise Linux 8 also offers a dynamic tickless option, which is useful for latency-sensitive workloads, such as high performance computing or realtime computing. By default, the dynamic tickless option is disabled. Red Hat recommends using the cpu-partitioning TuneD profile to enable the dynamic tickless option for cores specified as isolated_cores. This procedure describes how to manually and persistently enable dynamic tickless behavior. Procedure
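An illustrative sketch: add nohz_full= to the kernel command line for the CPUs you want to run tickless (CPUs 1-4 here are placeholders), then reboot:
# grubby --update-kernel=ALL --args="nohz_full=1-4"
# reboot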
Verification steps
Additional resources
31.4. Overview of an interrupt request
An interrupt request or IRQ is a signal for immediate attention sent from a piece of hardware to a processor. Each device in a system is assigned one or more IRQ numbers which allow it to send unique interrupts. When interrupts are enabled, a processor that receives an interrupt request immediately pauses execution of the current application thread in order to address the interrupt request. Because an interrupt halts normal operation, high interrupt rates can severely degrade system performance. It is possible to reduce the amount of time taken by interrupts by configuring interrupt affinity or by sending a number of lower priority interrupts in a batch (coalescing a number of interrupts). Interrupt requests have an associated affinity property, smp_affinity, which defines the processors that handle the interrupt request. To improve application performance, assign interrupt affinity and process affinity to the same processor, or processors on the same core. This allows the specified interrupt and application threads to share cache lines. On systems that support interrupt steering, modifying the smp_affinity property of an interrupt request sets up the hardware so that the decision to service an interrupt with a particular processor is made at the hardware level with no intervention from the kernel. 31.4.1. Balancing interrupts manually
If your BIOS exports its NUMA topology, the irqbalance service can automatically serve interrupt requests on the node that is local to the hardware requesting service. Procedure
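For example, a sketch: ensure the irqbalance service is running, or exclude CPUs from automatic balancing by setting IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance before restarting the service:
# systemctl enable --now irqbalance
# systemctl restart irqbalance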
Additional resources
31.4.2. Setting the smp_affinity maskThe smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0. The default value of the mask is f, which means that an interrupt request can be handled on any processor in the system. Setting this value to 1 means that only processor 0 can handle the interrupt. Procedure
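For example, a sketch that pins IRQ 32 (a placeholder IRQ number) to processor 0:
# echo 1 > /proc/irq/32/smp_affinity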
Additional resources
Chapter 32. Tuning scheduling policyIn Red Hat Enterprise Linux, the smallest unit of process execution is called a thread. The system scheduler determines which processor runs a thread, and for how long the thread runs. However, because the scheduler’s primary concern is to keep the system busy, it may not schedule threads optimally for application performance. For example, say an application on a NUMA system is running on Node A when a processor on Node B becomes available. To keep the processor on Node B busy, the scheduler moves one of the application’s threads to Node B. However, the application thread still requires access to memory on Node A. But, this memory will take longer to access because the thread is now running on Node B and Node A memory is no longer local to the thread. Thus, it may take longer for the thread to finish running on Node B than it would have taken to wait for a processor on Node A to become available, and then to execute the thread on the original node with local memory access. 32.1. Categories of scheduling policiesPerformance sensitive applications often benefit from the designer or administrator determining where threads are run. The Linux scheduler implements a number of scheduling policies which determine where and for how long a thread runs. The following are the two major categories of scheduling policies: Normal policies Normal threads are used for tasks of normal priority. Realtime policies Realtime policies are used for time-sensitive tasks that must complete without interruptions. Realtime threads are not subject to time slicing. This means the thread runs until they block, exit, voluntarily yield, or are preempted by a higher priority thread. The lowest priority realtime thread is scheduled before any thread with a normal policy. For more information, see Static priority scheduling with SCHED_FIFO and Round robin priority scheduling with SCHED_RR. Additional resources
32.2. Static priority scheduling with SCHED_FIFOThe SCHED_FIFO, also called static priority scheduling, is a realtime policy that defines a fixed priority for each thread. This policy allows administrators to improve event response time and reduce latency. It is recommended to not execute this policy for an extended period of time for time sensitive tasks. When SCHED_FIFO is in use, the scheduler scans the list of all the SCHED_FIFO threads in order of priority and schedules the highest priority thread that is ready to run. The priority level of a SCHED_FIFO thread can be any integer from 1 to 99, where 99 is treated as the highest priority. Red Hat recommends starting with a lower number and increasing priority only when you identify latency issues. Because realtime threads are not subject to time slicing, Red Hat does not recommend setting a priority as 99. This keeps your process at the same priority level as migration and watchdog threads; if your thread goes into a computational loop and these threads are blocked, they will not be able to run. Systems with a single processor will eventually hang in this situation. Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor. The following are some of the parameters used in this policy: /proc/sys/kernel/sched_rt_period_us This parameter defines the time period, in microseconds, that is considered to be one hundred percent of the processor bandwidth. The default value is 1000000 μs, or 1 second. /proc/sys/kernel/sched_rt_runtime_us This parameter defines the time period, in microseconds, that is devoted to running real-time threads. The default value is 950000 μs, or 0.95 seconds. 32.3. Round robin priority scheduling with SCHED_RRThe SCHED_RR is a round-robin variant of the SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level. Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in order of priority and schedules the highest priority thread that is ready to run. However, unlike SCHED_FIFO, threads that have the same priority are scheduled in a round-robin style within a certain time slice. You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel parameter in the /proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is 1 millisecond. 32.4. Normal scheduling with SCHED_OTHERThe SCHED_OTHER is the default scheduling policy in Red Hat Enterprise Linux 8. This policy uses the Completely Fair Scheduler (CFS) to allow fair processor access to all threads scheduled with this policy. This policy is most useful when there are a large number of threads or when data throughput is a priority, as it allows more efficient scheduling of threads over time. When this policy is in use, the scheduler creates a dynamic priority list based partly on the niceness value of each process thread. Administrators can change the niceness value of a process, but cannot change the scheduler’s dynamic priority list directly. 32.5. Setting scheduler policiesCheck and adjust scheduler policies and priorities by using the chrt command line tool. It can start new processes with the desired properties, or change the properties of a running process. It can also be used for setting the policy at runtime. Procedure
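For example, a sketch (my_app and the PID are placeholders): start a process under SCHED_FIFO with priority 1, change a running process to SCHED_RR with priority 20, or display the current policy and priority of a process:
# chrt -f 1 ./my_app
# chrt -r -p 20 4823
# chrt -p 4823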
32.6. Policy options for the chrt commandUsing the chrt command, you can view and set the scheduling policy of a process. The following table describes the appropriate policy options, which can be used to set the scheduling policy of a process. Table 32.1. Policy Options for the chrt Command
32.7. Changing the priority of services during the boot processUsing the systemd service, it is possible to set up real-time priorities for services launched during the boot process. The unit configuration directives are used to change the priority of a service during the boot process. The boot process priority change is done by using the following directives in the service section: CPUSchedulingPolicy= Sets the CPU scheduling policy for executed processes. It is used to set other, fifo, and rr policies. CPUSchedulingPriority= Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used. The following procedure describes how to change the priority of a service, during the boot process, using the mcelog service. Prerequisites
Procedure
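An illustrative sketch using a systemd drop-in file for the mcelog service; the file name and priority value are only examples:
# cat /etc/systemd/system/mcelog.service.d/priority.conf
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=20
# systemctl daemon-reload
# systemctl restart mcelog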
Verification steps
32.8. Priority mapPriorities are defined in groups, with some groups dedicated to certain kernel functions. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used. The following table describes the priority range, which can be used while setting the scheduling policy of a process. Table 32.2. Description of the priority range
32.9. TuneD cpu-partitioning profileFor tuning Red Hat Enterprise Linux 8 for latency-sensitive workloads, Red Hat recommends to use the cpu-partitioning TuneD profile. Prior to Red Hat Enterprise Linux 8, the low-latency Red Hat documentation described the numerous low-level steps needed to achieve low-latency tuning. In Red Hat Enterprise Linux 8, you can perform low-latency tuning more efficiently by using the cpu-partitioning TuneD profile. This profile is easily customizable according to the requirements for individual low-latency applications. The following figure is an example to demonstrate how to use the cpu-partitioning profile. This example uses the CPU and node layout. Figure 32.1. Figure cpu-partitioning You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the following configuration options: Isolated CPUs with load balancing In the cpu-partitioning figure, the blocks numbered from 4 to 23, are the default isolated CPUs. The kernel scheduler’s process load balancing is enabled on these CPUs. It is designed for low-latency processes with multiple threads that need the kernel scheduler load balancing. You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the isolated_cores=cpu-list option, which lists CPUs to isolate that will use the kernel scheduler load balancing. The list of isolated CPUs is comma-separated or you can specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU. Isolated CPUs without load balancingIn the cpu-partitioning figure, the blocks numbered 2 and 3, are the isolated CPUs that do not provide any additional kernel scheduler process load balancing. You can configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file using the no_balance_cores=cpu-list option, which lists CPUs to isolate that will not use the kernel scheduler load balancing. Specifying the no_balance_cores option is optional, however any CPUs in this list must be a subset of the CPUs listed in the isolated_cores list. Application threads using these CPUs need to be pinned individually to each CPU. Housekeeping CPUs Any CPU not isolated in the cpu-partitioning-variables.conf file is automatically considered a housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable kernel threads, interrupt handlers, and kernel timers are permitted to execute.Additional resources
32.10. Using the TuneD cpu-partitioning profile for low-latency tuningThis procedure describes how to tune a system for low-latency using the TuneD’s cpu-partitioning profile. It uses the example of a low-latency application that can use cpu-partitioning and the CPU layout as mentioned in the cpu-partitioning figure. The application in this case uses:
Prerequisites
Procedure
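An illustrative sketch that matches the CPU layout in the cpu-partitioning figure: declare the isolated CPUs, apply the profile, and reboot:
# cat /etc/tuned/cpu-partitioning-variables.conf
isolated_cores=4-23
no_balance_cores=2,3
# tuned-adm profile cpu-partitioning
# reboot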
Additional resources
32.11. Customizing the cpu-partitioning TuneD profile
You can extend the TuneD profile to make additional tuning changes. For example, the cpu-partitioning profile sets the CPUs to use cstate=1. In order to use the cpu-partitioning profile but to additionally change the CPU cstate from cstate1 to cstate0, the following procedure describes a new TuneD profile named my_profile, which inherits the cpu-partitioning profile and then sets C state 0. Procedure
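An illustrative sketch of the new profile; the [cpu] section value is an assumption, shown only to illustrate overriding the inherited C-state setting:
# mkdir /etc/tuned/my_profile
# cat /etc/tuned/my_profile/tuned.conf
[main]
summary=Customized cpu-partitioning profile
include=cpu-partitioning
[cpu]
force_latency=cstate.id:0|1
# tuned-adm profile my_profile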
In the shared example, a reboot is not required. However, if the changes in the my_profile profile require a reboot to take effect, then reboot your machine. Additional resources
Chapter 33. Factors affecting I/O and file system performanceThe appropriate settings for storage and file system performance are highly dependent on the storage purpose. I/O and file system performance can be affected by any of the following factors:
33.1. Tools for monitoring and diagnosing I/O and file system issuesThe following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to I/O, file systems, and their configuration:
33.2. Available tuning options for formatting a file systemSome file system configuration decisions cannot be changed after the device is formatted. The following are the options available before formatting a storage device: Size Create an appropriately-sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation. Block size The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time. The default block size is appropriate for most use cases. However, your file system performs better and stores data more efficiently if the block size or the size of multiple blocks is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead. Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system. GeometryFile system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device. Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with the mkfs command. The parameters that specify file system geometry vary with the file system. External journals Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.Red Hat does not recommend using the external journals option. Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage. Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time. 33.3. Available tuning options for mounting a file systemThe following are the options available to most file systems and can be specified as the device is mounted: Access Time Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. The relatime is the default atime setting for most file systems. However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. 
This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read. Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs. Read-ahead Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data. Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array. Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads. Additional resources
33.4. Types of discarding unused blocksRegularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage. The following are the two methods of discarding unused blocks: Batch discard This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator. Red Hat Enterprise Linux 8 supports batch discard on XFS and ext4 formatted devices that support physical discard operations. Online discard This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, it only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 8 supports online discard on XFS and ext4 formatted devices. Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system’s workload. Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 8 supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from pre-allocating space by using the fallocate(2) glibc call. Additional resources
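For example, a sketch (the mount point and size are placeholders): run a batch discard with fstrim, or pre-allocate space for a file with the fallocate utility:
# fstrim -v /mount/point
$ fallocate -l 1G /mount/point/preallocated_file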
33.5. Solid-state disks tuning considerations
Solid-state disks (SSD) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSDs provide a constant access time for data across their full Logical Block Address range, and do not incur measurable seek costs like their rotating counterparts. They are more expensive per gigabyte of storage space and have a lower storage density, but they also have lower latency and greater throughput than HDDs. Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation. For more information, see Types of discarding unused blocks. The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance: I/O Scheduler Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file. For single queue HBA, the default I/O scheduler is deadline. For multiple queue HBA, the default I/O scheduler is none. For information on how to set the I/O scheduler, see Setting the disk scheduler. Virtual Memory Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSD, try turning down the vm.dirty_background_ratio and vm.dirty_ratio settings, as increased write-out activity does not usually have a negative impact on the latency of other operations on the disk. However, this tuning can generate more overall I/O, and is therefore not generally recommended without workload-specific testing. Swap An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.
33.6. Generic block device tuning parameters
The generic tuning parameters listed in this section are available in the /sys/block/sdX/queue/ directory. The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers: add_random Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable. iostats By default, iostats is enabled and the default value is 1. Setting the iostats value to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead with the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor. If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The /proc/diskstats file is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools. max_sectors_kb Specifies the maximum size of an I/O request in kilobytes. 
The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb. Red Hat recommends max_sectors_kb to always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if they are zero or not specified by the storage device. nomerges Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2. nr_requests The maximum allowed number of queued I/O requests. If the current I/O scheduler is none, this number can only be reduced; otherwise the number can be increased or reduced. optimal_io_size Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible. read_ahead_kb Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read, which improves read I/O performance. Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to the request queue’s max_sectors_kb of the disk might improve performance in application environments where sequential reading of large files takes place. rotational Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler. rq_affinity The default value of rq_affinity is 1, which completes I/O operations on a CPU core in the same CPU group as the core that issued the request. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both of these behaviors, set it to 0. scheduler To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.
Chapter 34. Configuring an operating system to optimize access to network resources
This section describes how to configure the operating system to optimize access to network resources across your workloads. Network performance problems are sometimes the result of hardware malfunction or faulty infrastructure. Resolving these issues is beyond the scope of this document. The TuneD service provides a number of different profiles to improve performance in a number of specific use cases:
34.1. Tools for monitoring and diagnosing performance issuesThe following are the available tools in Red Hat Enterprise Linux 8, which are used for monitoring system performance and diagnosing performance problems related to the networking subsystem:
Additional resources
34.2. Bottlenecks in a packet reception
While the network stack is largely self-optimizing, there are a number of points during network packet processing that can become bottlenecks and reduce performance. The following are the issues that can cause a bottleneck: The buffer or ring buffer of the network card The hardware buffer can be a bottleneck if the kernel drops a large number of packets. Use the ethtool utility for monitoring a system for dropped packets. The hardware or software interrupt queues Interrupts can increase latency and processor contention. For information on how the processor handles interrupts, see Overview of an interrupt request, Balancing interrupts manually, and Setting the smp_affinity mask. The socket receive queue of the application A large number of packets that are not copied, or an increase in the UDP input errors (InErrors) in the /proc/net/snmp file, indicates a bottleneck in an application’s receive queue. If the hardware buffer drops a large number of packets, the following are a few potential solutions: Slow the input traffic Filter the incoming traffic, reduce the number of joined multicast groups, or reduce the amount of broadcast traffic to decrease the rate at which the queue fills. Resize the hardware buffer queue Reduce the number of packets being dropped by increasing the size of the queue so that it does not overflow as easily. You can modify the rx/tx parameters of the network device with the ethtool command: ethtool --set-ring device-name value Change the drain rate of the queue
Altering the drain rate of a queue is usually the simplest way to mitigate poor network performance. However, increasing the number of packets that a device can receive at one time uses additional processor time, during which no other processes can be scheduled, so this can cause other performance problems. Additional resources
34.3. Busy polling
If analysis reveals high latency, your system may benefit from poll-based rather than interrupt-based packet receipt. Busy polling helps to reduce latency in the network receive path by allowing socket layer code to poll the receive queue of a network device, and disables network interrupts. This removes delays caused by the interrupt and the resultant context switch. However, it also increases CPU utilization. Busy polling also prevents the CPU from sleeping, which can incur additional power consumption. Busy polling behavior is not supported by all device drivers. 34.3.1. Enabling busy polling
By default, busy polling is disabled. This procedure describes how to enable busy polling. Procedure
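For example, a sketch; the values are in microseconds, and 50 is a commonly suggested starting point:
# sysctl -w net.core.busy_read=50
# sysctl -w net.core.busy_poll=50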
34.4. Receive-Side ScalingReceive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency. By default, RSS is enabled. The number of queues or the CPUs that should process network activity for RSS are configured in the appropriate network device driver:
Regardless, it is typically configured in the /sys/class/net/device/queues/rx-queue/ directory, where device is the name of the network device (such as enp1s0) and rx-queue is the name of the appropriate receive queue. The irqbalance daemon can be used in conjunction with RSS to reduce the likelihood of cross-node memory transfers and cache line bouncing. This lowers the latency of processing network packets. 34.4.1. Viewing the interrupt request queuesWhen configuring Receive-Side Scaling (RSS), Red Hat recommends limiting the number of queues to one per physical CPU core. Hyper-threads are often represented as separate cores in analysis tools, but configuring queues for all cores including logical cores such as hyper-threads has not proven beneficial to network performance. When enabled, RSS distributes network processing equally between available CPUs based on the amount of processing each CPU has queued. However, use the --show-rxfh-indir and --set-rxfh-indir parameters of the ethtool utility, to modify how RHEL distributes network activity, and weigh certain types of network activity as more important than others. This procedure describes how to view the interrupt request queues. Procedure
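For example, a sketch that lists the interrupt queues of a network device (enp1s0 is a placeholder interface name):
# egrep 'CPU|enp1s0' /proc/interrupts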
34.5. Receive Packet SteeringReceive Packet Steering (RPS) is similar to Receive-Side Scaling (RSS) in that it is used to direct packets to specific CPUs for processing. However, RPS is implemented at the software level, and helps to prevent the hardware queue of a single network interface card from becoming a bottleneck in network traffic. RPS has several advantages over hardware-based RSS:
RPS is configured per network device and receive queue, in the /sys/class/net/device/queues/rx-queue/rps_cpus file, where device is the name of the network device, such as enp1s0 and rx-queue is the name of the appropriate receive queue, such as rx-0. The default value of the rps_cpus file is 0. This disables RPS, and the CPU handles the network interrupt and also processes the packet. To enable RPS, configure the appropriate rps_cpus file with the CPUs that should process packets from the specified network device and receive queue. The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of the rps_cpus to f, which is the hexadecimal value for 15. In binary representation, 15 is 00001111 (1+2+4+8). For network devices with single transmit queues, best performance can be achieved by configuring RPS to use CPUs in the same memory domain. On non-NUMA systems, this means that all available CPUs can be used. If the network interrupt rate is extremely high, excluding the CPU that handles network interrupts may also improve performance. For network devices with multiple queues, there is typically no benefit to configure both RPS and RSS, as RSS is configured to map a CPU to each receive queue by default. However, RPS can still be beneficial if there are fewer hardware queues than CPUs, and RPS is configured to use CPUs in the same memory domain. 34.6. Receive Flow SteeringReceive Flow Steering (RFS) extends Receive Packet Steering (RPS) behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS back end to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency. Data received from a single sender is not sent to more than one CPU. If the amount of data received from a single sender is greater than a single CPU can handle, configure a larger frame size to reduce the number of interrupts and therefore the amount of processing work for the CPU. Alternatively, consider NIC offload options or faster CPUs. Consider using numactl or taskset in conjunction with RFS to pin applications to specific cores, sockets, or NUMA nodes. This can help prevent packets from being processed out of order. 34.6.1. Enabling Receive Flow SteeringBy default, Receive Flow Steering (RFS) is disabled. This procedure describes how to enable RFS. Procedure
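An illustrative sketch: size the global flow table, then set the per-queue flow count for each receive queue of the device (the interface name and values are placeholders):
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# echo 2048 > /sys/class/net/enp1s0/queues/rx-0/rps_flow_cnt
To persist the global value across reboots, set net.core.rps_sock_flow_entries in /etc/sysctl.conf or in a file under /etc/sysctl.d/.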
Verification steps
Additional resources
34.7. Accelerated RFS
Accelerated RFS boosts the speed of Receive Flow Steering (RFS) by adding hardware assistance. Like RFS, packets are forwarded based on the location of the application consuming the packet. Unlike traditional RFS, however, packets are sent directly to a CPU that is local to the thread consuming the data:
Accelerated RFS is only available if the following conditions are met:
Once these conditions are met, CPU to queue mapping is deduced automatically based on the traditional RFS configuration. That is, CPU to queue mapping is deduced based on the IRQ affinities configured by the driver for each receive queue. For more information on enabling traditional RFS, see Enabling Receive Flow Steering.
34.7.1. Enabling the ntuple filters
The ntuple filtering must be enabled. Use the ethtool -k command to check whether ntuple filtering is enabled, and the ethtool -K command to enable it.
Procedure
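A minimal sketch of switching the filters on, assuming the interface is enp1s0; the uppercase -K option changes device features, while the lowercase -k option only displays them:
# ethtool -K enp1s0 ntuple on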
If the output is ntuple-filters: off [fixed], then the ntuple filtering is disabled and you cannot configure it:
# ethtool -k enp1s0 | grep ntuple-filters
ntuple-filters: off [fixed]
Verification steps
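As a quick check, re-running the display command should now report the feature as on, for example:
# ethtool -k enp1s0 | grep ntuple-filters
ntuple-filters: on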
Additional resources
Chapter 35. Configuring an operating system to optimize memory access
This section describes how to configure the operating system to optimize memory access across workloads, and the tools you can use to do so.
35.1. Tools for monitoring and diagnosing system memory issues
The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to system memory:
Additional resources
35.2. Overview of a system's memory
The Linux kernel is designed to maximize the utilization of a system's memory resources (RAM). Due to these design characteristics, and depending on the memory requirements of the workload, part of the system's memory is in use within the kernel on behalf of the workload, while a small part of the memory is free. This free memory is reserved for special system allocations, and for other low or high priority system services. The rest of the system's memory is dedicated to the workload itself, and divided into the following two categories:
File memory
Pages added in this category represent parts of files in permanent storage. These pages, from the page cache, can be mapped or unmapped in an application's address space. Applications can map files into their address space using the mmap system call, or operate on files through the buffered I/O read or write system calls. Buffered I/O system calls, as well as applications that map pages directly, can re-utilize unmapped pages. As a result, these pages are stored in the cache by the kernel, especially when the system is not running any memory intensive tasks, to avoid re-issuing costly I/O operations over the same set of pages.
Anonymous memory
Pages in this category are dynamically allocated by a process and are not related to files in permanent storage. This set of pages backs the in-memory control structures of each task, such as the application stack and heap areas.
Figure 35.1. Memory usage patterns
35.3. Virtual memory parameters
The virtual memory parameters are listed in the /proc/sys/vm directory. The following are the available virtual memory parameters:
vm.dirty_ratio
A percentage value. When this percentage of the total system memory is modified, the system begins writing the modifications to the disk with the pdflush operation. The default value is 20 percent.
vm.dirty_background_ratio
A percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to the disk in the background. The default value is 10 percent.
vm.overcommit_memory
Defines the conditions that determine whether a large memory request is accepted or denied. The default value is 0. By default, the kernel performs heuristic memory overcommit handling by estimating the amount of memory available and failing requests that are too large. However, since memory is allocated using a heuristic rather than a precise algorithm, overloading memory is possible with this setting. Setting the overcommit_memory parameter's value:
vm.min_free_kbytes
Sets the size of the reserved free pages pool. It is also responsible for setting the min_page, low_page, and high_page thresholds that govern the behavior of the Linux kernel's page reclaim algorithms. It also specifies the minimum number of kilobytes to keep free across the system. This calculates a specific value for each low memory zone, each of which is assigned a number of reserved free pages in proportion to their size. Setting the vm.min_free_kbytes parameter's value:
oom_adj
In the event that a system runs out of memory, and the panic_on_oom parameter is set to 0, the oom_killer function kills processes, starting with the process that has the highest oom_score, until the system recovers. The oom_adj parameter determines the oom_score of a process. This parameter is set per process identifier. A value of -17 disables the oom_killer for that process. Other valid values range from -16 to 15. Processes created by an adjusted process inherit the oom_score of that process.
vm.swappiness
The swappiness value, ranging from 0 to 100, controls the degree to which the system favors reclaiming memory from the anonymous memory pool, or the page cache memory pool. Setting the swappiness parameter's value:
35.4. File system parameters
The file system parameters are listed in the /proc/sys/fs directory. The following are the available file system parameters:
aio-max-nr
Defines the maximum allowed number of events in all active asynchronous input/output contexts. The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data structures.
file-max
Determines the maximum number of file handles for the entire system. The default value on Red Hat Enterprise Linux 8 is either 8192 or one tenth of the free memory pages available at the time the kernel starts, whichever is higher. Raising this value can resolve errors caused by a lack of available file handles.
Additional resources
35.5. Kernel parameters
The default values for the kernel parameters are located in the /proc/sys/kernel/ directory. These are the default values provided by the kernel, or values specified by a user via sysctl. The following are the available kernel parameters used to set up limits for the msg* and shm* System V IPC (sysvipc) system calls:
msgmax
Defines the maximum allowed size in bytes of any single message in a message queue. This value must not exceed the size of the queue (msgmnb). Use the sysctl msgmax command to determine the current msgmax value on your system.
msgmnb
Defines the maximum size in bytes of a single message queue. Use the sysctl msgmnb command to determine the current msgmnb value on your system.
msgmni
Defines the maximum number of message queue identifiers, and therefore the maximum number of queues. Use the sysctl msgmni command to determine the current msgmni value on your system.
shmall
Defines the total amount of shared memory pages that can be used on the system at one time. For example, a page is 4096 bytes on the AMD64 and Intel 64 architecture. Use the sysctl shmall command to determine the current shmall value on your system.
shmmax
Defines the maximum size in bytes of a single shared memory segment allowed by the kernel. Use the sysctl shmmax command to determine the current shmmax value on your system.
shmmni
Defines the system-wide maximum number of shared memory segments. The default value is 4096 on all systems.
Additional resources
Chapter 36. Configuring huge pages
Physical memory is managed in fixed-size chunks called pages. On the x86_64 architecture, supported by Red Hat Enterprise Linux 8, the default size of a memory page is 4 KB. This default page size has proved to be suitable for general-purpose operating systems, such as Red Hat Enterprise Linux, which supports many different kinds of workloads. However, specific applications can benefit from using larger page sizes in certain cases. For example, an application that works with a large and relatively fixed data set of hundreds of megabytes or even dozens of gigabytes can have performance issues when using 4 KB pages. Such data sets can require a huge number of 4 KB pages, which can lead to overhead in the operating system and the CPU. This section provides information about huge pages available in RHEL 8 and how you can configure them.
36.1. Available huge page features
With Red Hat Enterprise Linux 8, you can use huge pages for applications that work with big data sets, and improve the performance of such applications. The following are the huge page methods, which are supported in RHEL 8:
HugeTLB pages
HugeTLB pages are also called static huge pages. There are two ways of reserving HugeTLB pages:
With THP, the kernel automatically assigns huge pages to processes, and therefore there is no need to manually reserve the static huge pages. The following are the two modes of operation in THP:
36.2. Parameters for reserving HugeTLB pages at boot time
Use the following parameters to influence HugeTLB page behavior at boot time. For more information on how to use these parameters to configure HugeTLB pages at boot time, see Configuring HugeTLB at boot time.
Table 36.1. Parameters used to configure HugeTLB pages at boot time
36.3. Configuring HugeTLB at boot time
The page size that the HugeTLB subsystem supports depends on the architecture. The x86_64 architecture supports 2 MB huge pages and 1 GB gigantic pages. This procedure describes how to reserve a 1 GB page at boot time.
Procedure
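A minimal sketch, assuming the boot loader is managed with grubby and that reserving four 1 GB pages is acceptable on your system; adjust the page count to your needs and reboot for the change to take effect:
# grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G hugepages=4"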
Additional resources
36.4. Parameters for reserving HugeTLB pages at run time
Use the following parameters to influence HugeTLB page behavior at run time. For more information on how to use these parameters to configure HugeTLB pages at run time, see Configuring HugeTLB at run time.
Table 36.2. Parameters used to configure HugeTLB pages at run time
36.5. Configuring HugeTLB at run time
This procedure describes how to add 20 2048 kB huge pages to node2. To reserve pages based on your requirements, replace:
Procedure
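A minimal sketch matching the example above (20 pages of 2048 kB on NUMA node node2); adjust the count, page size, and node for your system:
# echo 20 > /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages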
Verification steps
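As a quick check, reading the same file back should report the number of pages that were actually reserved, for example:
# cat /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
20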
Additional resources
36.6. Enabling transparent hugepages
THP is enabled by default in Red Hat Enterprise Linux 8. However, you can enable or disable THP. This procedure describes how to enable THP.
Procedure
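A minimal sketch; writing always enables THP system-wide, whereas writing madvise restricts it to memory regions that applications explicitly request with madvise():
# echo always > /sys/kernel/mm/transparent_hugepage/enabled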
Sometimes, providing low latency to short-lived allocations has higher priority than immediately achieving the best performance with long-lived allocations. In such cases, you can disable direct compaction while leaving THP enabled. Direct compaction is a synchronous memory compaction during the huge page allocation. Disabling direct compaction provides no guarantee of saving memory, but can decrease the risk of higher latencies during frequent page faults. Note that if the workload benefits significantly from THP, disabling direct compaction can decrease performance. Disable direct compaction:
# echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
36.7. Disabling transparent hugepages
THP is enabled by default in Red Hat Enterprise Linux 8. However, you can enable or disable THP. This procedure describes how to disable THP.
Procedure
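A minimal sketch of disabling THP at run time; the setting does not persist across reboots unless it is also applied through the kernel command line or an equivalent boot-time mechanism:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled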
36.8. Impact of page size on translation lookaside buffer size
Reading address mappings from the page table is time-consuming and resource-expensive, so CPUs are built with a cache for recently-used addresses, called the Translation Lookaside Buffer (TLB). However, the default TLB can only cache a certain number of address mappings. If a requested address mapping is not in the TLB, called a TLB miss, the system still needs to read the page table to determine the virtual-to-physical address mapping. Because of the relationship between application memory requirements and the size of pages used to cache address mappings, applications with large memory requirements are more likely to suffer performance degradation from TLB misses than applications with minimal memory requirements. It is therefore important to avoid TLB misses wherever possible. Both HugeTLB and Transparent Huge Page features allow applications to use pages larger than 4 KB. This allows addresses stored in the TLB to reference more memory, which reduces TLB misses and improves application performance.
Chapter 37. Getting started with SystemTap
As a system administrator, you can use SystemTap to identify underlying causes of a bug or performance problem on a running Linux system. As an application developer, you can use SystemTap to monitor in fine detail how your application behaves within the Linux system.
37.1. The purpose of SystemTap
SystemTap is a tracing and probing tool that you can use to study and monitor the activities of your operating system (particularly, the kernel) in fine detail. SystemTap provides information similar to the output of tools such as netstat, ps, top, and iostat. However, SystemTap provides more filtering and analysis options for collected information. In SystemTap scripts, you specify the information that SystemTap gathers. SystemTap aims to supplement the existing suite of Linux monitoring tools by providing users with the infrastructure to track kernel activity and combining this capability with two attributes:
Flexibility
The SystemTap framework enables you to develop simple scripts for investigating and monitoring a wide variety of kernel functions, system calls, and other events that occur in kernel space. With this, SystemTap is not so much a tool as it is a system that allows you to develop your own kernel-specific forensic and monitoring tools.
Ease-of-Use
SystemTap enables you to monitor kernel activity without having to recompile the kernel or reboot the system.
37.2. Installing SystemTap
To begin using SystemTap, install the required packages. To use SystemTap on more than one kernel where a system has multiple kernels installed, install the corresponding required kernel packages for each kernel version.
Procedure
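A minimal sketch of one common approach; the stap-prep helper attempts to install the kernel-debuginfo, kernel-devel, and related packages that match the running kernel:
# yum install systemtap
# stap-prep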
Verification steps
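As a quick check that SystemTap can compile, load, and run a module, a trivial one-line probe can be used; it should print its message and exit:
# stap -v -e 'probe begin { printf("Hello, World!\n"); exit() }'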
37.3. Privileges to run SystemTap
Running SystemTap scripts requires elevated system privileges but, in some instances, non-privileged users might need to run SystemTap instrumentation on their machine. To allow users to run SystemTap without root access, add users to both of these user groups:
stapdev
Members of this group can use stap to run SystemTap scripts, or staprun to run SystemTap instrumentation modules. Running stap involves compiling SystemTap scripts into kernel modules and loading them into the kernel. This requires elevated privileges to the system, which are granted to stapdev members. Unfortunately, such privileges also grant effective root access to stapdev members. As such, only grant stapdev group membership to users who can be trusted with root access.
stapusr
Members of this group can only use staprun to run SystemTap instrumentation modules. In addition, they can only run those modules from the /lib/modules/kernel_version/systemtap/ directory. This directory must be owned only by the root user, and must only be writable by the root user.
37.4. Running SystemTap scripts
You can run SystemTap scripts from standard input or from a file. Sample scripts that are distributed with the installation of SystemTap can be found in the /usr/share/systemtap/examples directory.
Prerequisites
Procedure
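A minimal sketch of both ways of running a script; the one-line probe is only an illustration, and the helloworld.stp path assumes the sample scripts shipped with SystemTap are installed:
# echo 'probe begin { printf("hello world\n"); exit() }' | stap -
# stap /usr/share/systemtap/examples/general/helloworld.stp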
Chapter 38. Cross-instrumentation of SystemTap
Cross-instrumentation of SystemTap is the creation of SystemTap instrumentation modules from a SystemTap script on one system, to be used on another system that does not have SystemTap fully deployed.
38.1. SystemTap cross-instrumentation
When you run a SystemTap script, a kernel module is built out of that script. SystemTap then loads the module into the kernel. Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be neither feasible nor desired. For example, corporate policy might prohibit you from installing packages that provide compilers or debug information on specific machines, which will prevent the deployment of SystemTap. To work around this, use cross-instrumentation. Cross-instrumentation is the process of generating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system. This process offers the following benefits:
instrumentation module
The kernel module built from a SystemTap script; the SystemTap module is built on the host system, and will be loaded on the target kernel of the target system.
host system
The system on which the instrumentation modules (from SystemTap scripts) are compiled, to be loaded on target systems.
target system
The system on which the instrumentation module is loaded and run; this is the system for which the module is built.
target kernel
The kernel of the target system. This is the kernel that loads and runs the instrumentation module.
38.2. Initializing cross-instrumentation of SystemTap
Initialize cross-instrumentation of SystemTap to build SystemTap instrumentation modules from a SystemTap script on one system and use them on another system that does not have SystemTap fully deployed.
Prerequisites
Kernel packaging bugs may prevent multiple kernel-debuginfo and kernel-devel packages from being installed on one system. In such cases, the minor version for the host system and target system must match. If a bug occurs, report it at https://bugzilla.redhat.com/. Procedure
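A minimal sketch of the flow, using hypothetical script and module names; kernel_version must match the kernel of the target system (as reported there by uname -r). On the host system, compile the module for the target kernel and stop after the build pass:
# stap -p 4 -r kernel_version example.stp -m example_module
Copy the resulting example_module.ko to the target system, and run it there with the SystemTap runtime:
# staprun example_module.ko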
Chapter 39. Monitoring network activity with SystemTap
You can use helpful example SystemTap scripts available in the /usr/share/systemtap/testsuite/systemtap.examples/ directory, upon installing the systemtap-testsuite package, to monitor and investigate the network activity of your system.
39.1. Profiling network activity with SystemTap
You can use the nettop.stp example SystemTap script to profile network activity. The script tracks which processes are generating network traffic on the system, and provides the following information about each process:
PID
The ID of the listed process.
UID
User ID. A user ID of 0 refers to the root user.
DEV
The ethernet device the process used to send or receive data (for example, eth0, eth2).
XMIT_PK
The number of packets transmitted by the process.
RECV_PK
The number of packets received by the process.
XMIT_KB
The amount of data sent by the process, in kilobytes.
RECV_KB
The amount of data received by the process, in kilobytes.
Procedure
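A minimal sketch; the --example option lets stap locate the script by name in the installed examples directory:
# stap --example nettop.stp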
39.2. Tracing functions called in network socket code with SystemTap
You can use the socket-trace.stp example SystemTap script to trace functions called from the kernel's net/socket.c file. This helps you identify, in finer detail, how each process interacts with the network at the kernel level.
Procedure
39.3. Monitoring network packet drops with SystemTap
The network stack in Linux can discard packets for various reasons. Some Linux kernels include a tracepoint, kernel.trace("kfree_skb"), which tracks where packets are discarded. The dropwatch.stp SystemTap script uses kernel.trace("kfree_skb") to trace packet discards; the script summarizes the locations that discard packets in every 5-second interval.
Procedure
Chapter 40. Profiling kernel activity with SystemTap
The following sections showcase scripts that profile kernel activity by monitoring function calls.
40.1. Counting function calls with SystemTap
You can use the functioncallcount.stp SystemTap script to count specific kernel function calls. You can also use this script to target multiple kernel functions.
Procedure
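A minimal sketch; the script takes a wildcard describing which kernel functions to count, and the mm/*.c pattern here is only an example target:
# stap --example functioncallcount.stp "*@mm/*.c"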
40.2. Tracing function calls with SystemTap
You can use the para-callgraph.stp SystemTap script to trace function calls and function returns.
Procedure
The script para-callgraph.stp takes two command-line arguments:
Consider the following example:
# stap -wv --example para-callgraph.stp 'kernel.function("*@fs/proc.c*")' 'kernel.function("vfs_read")' -c "cat /proc/sys/vm/* || true"
where:
The output should look similar to the following:
[...]
267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
0 gnome-terminal(2921):->fput file=0xffff880111eebbc0
2 gnome-terminal(2921):<-fput
0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54
3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980
0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48
4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
7 gnome-terminal(2921): <-rw_verify_area return=0x1000
12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
0 gnome-terminal(2921):->fput file=0xffff8801116ce980
40.3. Determining time spent in kernel and user space with SystemTap
You can use the thread-times.stp SystemTap script to determine the amount of time any given thread is spending in either the kernel or user space.
Procedure
40.4. Monitoring polling applications with SystemTap
You can use the timeout.stp SystemTap script to identify and monitor which applications are polling. Doing so allows you to track unnecessary or excessive polling, which helps you pinpoint areas for improvement in terms of CPU usage and power savings.
Procedure
In this example output you can see which process used which system call and how many times.
uid | poll select epoll itimer futex nanosle signal| process
28937 | 148793 0 0 4727 37288 0 0| firefox
22945 | 0 56949 0 1 0 0 0| scim-bridge
0 | 0 0 0 36414 0 0 0| swapper
4275 | 23140 0 0 1 0 0 0| mixer_applet2
4191 | 0 14405 0 0 0 0 0| scim-launcher
22941 | 7908 1 0 62 0 0 0| gnome-terminal
4261 | 0 0 0 2 0 7622 0| escd
3695 | 0 0 0 0 0 7622 0| gdm-binary
3483 | 0 7206 0 0 0 0 0| dhcdbd
4189 | 6916 0 0 2 0 0 0| scim-panel-gtk
1863 | 5767 0 0 0 0 0 0| iscsid
40.5. Tracking most frequently used system calls with SystemTap
You can use the topsys.stp SystemTap script to list the top 20 system calls used by the system per 5-second interval. It also lists how many times each system call was used during that period.
Procedure
40.6. Tracking system call volume per process with SystemTap
You can use the syscalls_by_proc.stp SystemTap script to see which processes are performing the highest volume of system calls. It displays the 20 processes performing the most system calls.
Procedure
Chapter 41. Monitoring disk and I/O activity with SystemTap
The following sections showcase scripts that monitor disk and I/O activity.
41.1. Summarizing disk read/write traffic with SystemTap
You can use the disktop.stp SystemTap script to identify which processes are performing the heaviest disk reads and writes to the system.
Procedure
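A minimal sketch; the script keeps printing a fresh summary of the heaviest readers and writers until it is interrupted:
# stap --example disktop.stp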
Output of the disktop.stp script looks similar to the following:
[...]
Mon Sep 29 03:38:28 2008 , Average: 19Kb/sec, Read: 7Kb, Write: 89Kb
UID PID PPID CMD DEVICE T BYTES
0 26319 26294 firefox sda5 W 90229
0 2758 2757 pam_timestamp_c sda5 R 8064
0 2885 1 cupsd sda5 W 1678
Mon Sep 29 03:38:38 2008 , Average: 1Kb/sec, Read: 7Kb, Write: 1Kb
UID PID PPID CMD DEVICE T BYTES
0 2758 2757 pam_timestamp_c sda5 R 8064
0 2885 1 cupsd sda5 W 1678
41.2. Tracking I/O time for each file read or write with SystemTap
You can use the iotime.stp SystemTap script to monitor the amount of time it takes for each process to read from or write to any file. This helps you to determine what files are slow to load on a system.
Procedure
Output of the iotime.stp script looks similar to the following:
[...]
825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
[...]
117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
[...]
3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
3973744 2886 (sendmail) iotime /proc/loadavg time: 11
[...]
41.3. Tracking cumulative I/O with SystemTap
You can use the traceio.stp SystemTap script to track the cumulative amount of I/O to the system.
Procedure
41.4. Monitoring I/O activity on a specific device with SystemTap
You can use the traceio2.stp SystemTap script to monitor I/O activity on a specific device.
Procedure
This script takes the whole device number as an argument. To find this number you can use:
# stat -c "0x%D" directory
where directory is located on the device you want to monitor. The output contains the following:
Consider the following output of # stap traceio2.stp 0x805:
[...]
synergyc(3722) vfs_read 0x800005
synergyc(3722) vfs_read 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
cupsd(2889) vfs_write 0x800005
[...]
41.5. Monitoring reads and writes to a file with SystemTap
You can use the inodewatch.stp SystemTap script to monitor reads from and writes to a file in real time.
Procedure
The script inodewatch.stp takes three command-line arguments:
You can get these numbers using:
# stat -c '%D %i' filename
where filename is an absolute path. Consider the following example:
# stat -c '%D %i' /etc/crontab
The output should look like:
805 1078319
where:
To start monitoring /etc/crontab, run:
# stap inodewatch.stp 0x8 0x05 1078319
In the first two arguments you must use 0x prefixes for base-16 numbers. The output contains the following:
The output of this example should look like:
cat(16437) vfs_read 0x800005/1078319
cat(16437) vfs_read 0x800005/1078319
Chapter 42. Analyzing system performance with BPF Compiler Collection
As a system administrator, you can use the BPF Compiler Collection (BCC) library to create tools for analyzing the performance of your Linux operating system and gathering information, which could be difficult to obtain through other interfaces.
42.1. An introduction to BCC
BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. The main utility of eBPF programs is analyzing OS performance and network performance without experiencing overhead or security issues. BCC removes the need for users to know deep technical details of eBPF, and provides many out-of-the-box starting points, such as the bcc-tools package with pre-created eBPF programs. The eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. The programs are unlikely to cause the kernel to crash, loop, or become unresponsive because they run in a safe virtual machine in the kernel.
42.2. Installing the bcc-tools package
This section describes how to install the bcc-tools package, which also installs the BPF Compiler Collection (BCC) library as a dependency.
Procedure
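A minimal sketch of the installation; listing the tools directory afterwards is a quick way to confirm what was installed:
# yum install bcc-tools
# ls /usr/share/bcc/tools/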
42.3. Using selected bcc-tools for performance analyses
This section describes how to use certain pre-created programs from the BPF Compiler Collection (BCC) library to efficiently and securely analyze the system performance on a per-event basis. The set of pre-created programs in the BCC library can serve as examples for the creation of additional programs.
Using execsnoop to examine the system processes
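As a sketch, running the tool from its installed location prints a line for each new process that the system executes; start it in one terminal, then run a short-lived command such as ls in another terminal to generate output:
# /usr/share/bcc/tools/execsnoop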
To see more details, examples, and options for execsnoop, refer to the /usr/share/bcc/tools/doc/execsnoop_example.txt file. For more information about exec(), see exec(3) manual pages. Using opensnoop to track what files a command opens
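As a sketch, the -n option limits the trace to processes with a given name, so the following reports only the files opened by a uname process:
# /usr/share/bcc/tools/opensnoop -n uname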
To see more details, examples, and options for opensnoop, refer to the /usr/share/bcc/tools/doc/opensnoop_example.txt file. For more information about open(), see open(2) manual pages. Using biotop to examine the I/O operations on the disk
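As a sketch, the numeric argument sets the refresh interval in seconds; the following summarizes disk I/O by process every 30 seconds until interrupted:
# /usr/share/bcc/tools/biotop 30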
To see more details, examples, and options for biotop, refer to the /usr/share/bcc/tools/doc/biotop_example.txt file. For more information about dd, see dd(1) manual pages. Using xfsslower to expose unexpectedly slow file system operations
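As a sketch, the numeric argument is the latency threshold in milliseconds; the following reports XFS operations that take longer than 1 millisecond (with no argument, the tool typically uses a 10 ms threshold):
# /usr/share/bcc/tools/xfsslower 1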
To see more details, examples, and options for xfsslower, refer to the /usr/share/bcc/tools/doc/xfsslower_example.txt file. For more information about fsync, see fsync(2) manual pages.
Legal Notice
Copyright © 2022 Red Hat, Inc. The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version. Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries. Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project. The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community. All other trademarks are the property of their respective owners.
Which of the following commands displays information about the account issuing the command? (Choose all that apply.)
A, B, E. The whoami, who am i, and id commands will all display information about the current account that is issuing the commands.
Which of the following Linux commands is used to determine the type of processor?
Get CPU Info in Linux
The simplest way to determine what type of CPU you have is by displaying the contents of the /proc/cpuinfo virtual file. Identifying the type of processor using the /proc/cpuinfo file does not require installing any additional programs. It will work no matter what Linux distribution you are using.
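For example, the model of each processor can be pulled out of that file with a simple grep; this works from an unprivileged shell:
$ grep 'model name' /proc/cpuinfo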
What folder contains the time zone template files in Linux?
The timezone information files used by tzset(3) are typically found under a directory with a name like /usr/share/zoneinfo. These files use the format described in Internet RFC 8536.
Which of the following are fields within an /etc/passwd file record?
The /etc/passwd file is a colon-separated file that contains the following information: user name, encrypted password field, user ID number (UID), group ID number (GID), comment (GECOS) field, home directory, and login shell.
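A hypothetical record illustrating those fields (the user name and paths are invented for the example):
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
Here alice is the user name, x indicates that the encrypted password is stored in /etc/shadow, the first 1000 is the UID, the second 1000 is the GID, Alice Example is the comment field, /home/alice is the home directory, and /bin/bash is the login shell.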