Thursday, 1 November 2018

Linux cgroups Resource Constraints

Linux cgroups (control groups) have been around for a long time, providing various functions for resource management and the ability to segregate workloads with their own constraints. cgroups provide the basis for container engines such as Docker, which has been so aggressively adopted in enterprises over the last few years.

cgroups are enabled on all contemporary Linux distributions, with systemd providing the API to manage them; the system organises the various cgroups into:
  • slices - see systemd-cgls (example below), where system services typically sit under the system slice and user sessions are segregated into the user slice; slices encapsulate scopes and services
  • scopes - a logical grouping of processes that can be managed together (stopped, killed, resource-constrained)
  • services - a logical grouping of processes that provides a service, such as sshd, usually started from configuration in unit files
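For example, the hierarchy (and where a given process sits in it) can be inspected directly:
$ systemd-cgls --no-pager     # dump the whole cgroup tree: slices, scopes and services
$ systemd-cgls /user.slice    # restrict the view to the user slice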

But how can cgroups and resource controls be useful to a developer?

It is not uncommon, when debugging, for devs to find themselves with a runaway process that consumes so much CPU and RAM that the machine becomes unresponsive. Fortunately in these situations we can normally ssh in from another machine and kill the rogue process - sshd is protected from a user's runaway process because it lives in a different cgroup with its own resource management structure.

In such a dev scenario it would be very useful to resource-constrain the debugged process, and cgroups are perfect for this: it is of course possible to put constraints on the process via ulimit, but that does not offer the same level of fine-grained control (see the comparison below).
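For comparison, a minimal ulimit sketch (./my-hungry-process is a hypothetical stand-in): the limit applies per process, only at launch, and cannot be adjusted afterwards:
$ (ulimit -v $((128 * 1024)); ./my-hungry-process)   # cap virtual memory at 128MB (ulimit -v takes KB) for this one launch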

We can control the memory/swap/CPU/IO limits etc. that our process receives, so let's go with that.

Before we show how to impose the constraints, we need to understand that there are two versions of cgroups available in the mainline kernel: v1 and v2. Fedora 28 ships with both but defaults to v1 - this has a subtle issue, shown below.

cgroups v1

What we'd like to do as a non-privileged user is issue:
$ systemd-run --user --unit=dev-limit.service -t -p MemoryMax=64M -p MemorySwapMax=32M -p CPUQuota=50% ...
which would run our process in the user scope with an identifiable name (the unit name) and set memory/swap/CPU limits, letting the system kill the job if it exceeds these constraints.

BUT this does NOT work: cgroups v1 cannot safely delegate controllers to unprivileged processes, so the user-level limits are refused.

Instead, we need to do the following, which explicitly runs the job as the calling user (the dev) and places it in the user's slice.
# remove any reference to previously failed item
sudo systemctl reset-failed dev-limit

sudo systemd-run -t \
--slice=user-$(id -u).slice --unit=dev-limit.service \
-p MemoryMax=128M -p MemorySwapMax=32M -p CPUQuota=50% \
--uid $(id -nu) \
"$@"
It'll work, but now we need to give the dev escalated access. cgroups v2 will let us achieve our aim without sudo.

cgroups v2

Enabling cgroups v2 requires the kernel boot-time flag systemd.unified_cgroup_hierarchy=1
NB: when enabled, current versions of Docker will fail to run as Docker is built on cgroups v1.

Fedora
Edit /etc/default/grub and append the flag to the GRUB_CMDLINE_LINUX=... line, after which run (BIOS) grub2-mkconfig -o /boot/grub2/grub.cfg or (EFI) grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg. If this fails with grub2-editenv: error: environment block too small, simply rm /boot/grub2/grubenv and try again - the file will be regenerated.
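After rebooting, a quick way to confirm the unified hierarchy is active is to check the filesystem type mounted at /sys/fs/cgroup:
$ stat -fc %T /sys/fs/cgroup   # prints cgroup2fs when cgroups v2 is active, tmpfs under v1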
Raspberry Pi
The additional flags cgroup_enable=memory cgroup_memory=1 need to be added to /boot/cmdline.txt to enable memory resource management, which can be confirmed by looking at /proc/cgroups.
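For example, the enabled column in /proc/cgroups shows whether the memory controller is available:
$ grep memory /proc/cgroups    # columns: subsys_name hierarchy num_cgroups enabled - look for a trailing 1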

There are a couple of ways to solve our original problem of resource-constraining our process:

within an explicit slice

$ systemd-run --user --slice=foo --scope -p MemoryMax=128M -p MemorySwapMax=64M -p CPUQuota=50% ...
This creates a dynamic slice and scope whose names you can track using:
$ systemd-cgtop /user.slice/user-$(id -u).slice/user@$(id -u).service/foo.slice
The slice remains after the process has gone and we can add other processes to it (see below). To remove the slice we can issue systemctl --user stop foo.slice
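For example, a second process launched into the same slice shares its constraints (./another-task is a hypothetical stand-in):
$ systemd-run --user --slice=foo --scope ./another-task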

within an explicit unit

Without --slice, we can instead name the unit explicitly so we know what to track:
$ systemctl --user reset-failed dev-limit
$ systemd-run --user -t --unit=dev-limit.service -p MemoryMax=128M -p MemorySwapMax=64M -p CPUQuota=50% ...

and it is trackable via the explicitly named unit:
$ systemd-cgtop /user.slice/user-$(id -u).slice/user@$(id -u).service/dev-limit.service

Without a named unit, the system generates a transient name that we can use to track it:
$ systemctl --user reset-failed dev-limit
$ systemd-run --user -p MemoryMax=128M -p MemorySwapMax=64M -p CPUQuota=50% ...
Running as unit: run-r45edf2288f1b4a5caae55057343decc9.service
$ systemd-cgtop /user.slice/user-$(id -u).slice/user@$(id -u).service/run-r45edf2288f1b4a5caae55057343decc9.service

Verifying constraints within your launched slices/scopes/units can be done via systemctl --user show foo.slice (or dev-limit.service).
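For example, grepping the memory properties out of the (long) show output confirms the limits took effect:
$ systemctl --user show dev-limit.service | grep -E '^Memory(Max|SwapMax)='   # values are reported in bytes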

Finally, once you have your constrained process running inside a scope or a unit, we can further tweak the resource constraints by explicitly referencing the unit the process runs in:
$ systemctl --user set-property run-r45edf2288f1b4a5caae55057343decc9.service MemoryHigh=32M
$ systemctl --user set-property run-rabadb09ecbdc47e18fa64ed504869134.scope MemoryHigh=32M

# adjusting at the scope level affects all its children
$ systemctl --user set-property foo.scope MemoryHigh=32M
It is worthwhile restating that constraints are top-down: meaning if a parent slice (for example) has been constrained, then any child/grandchild etc. will be similarly constrained even if it requests a higher limit.

This leads nicely to the superuser's ability to dynamically limit hungry user processes using the same mechanisms.
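For example, root can cap an entire user's slice on the fly (uid 1000 assumed here for illustration):
$ sudo systemctl set-property user-1000.slice MemoryMax=2G CPUQuota=80%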

Limiting services

The above discussion is also applicable to system-wide services: assume we want to add memory limits to the forked-daapd service.
$ systemctl status forked-daapd
● forked-daapd.service - DAAP/DACP (iTunes) and MPD server, supporting AirPlay and Spotify
   Loaded: loaded (/lib/systemd/system/forked-daapd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-10-23 23:42:29 BST; 1 weeks 1 days ago
     Docs: man:forked-daapd(8)
 Main PID: 2876 (forked-daapd)
   CGroup: /system.slice/forked-daapd.service
           └─2876 /usr/sbin/forked-daapd -f
To enable this we have to add an override service file:
$ vi /etc/systemd/system/forked-daapd.service
.include /lib/systemd/system/forked-daapd.service
[Service]
MemoryMax=256M
MemorySwapMax=8M
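Alternatively, systemctl edit achieves the same thing with a drop-in (it creates /etc/systemd/system/forked-daapd.service.d/override.conf for you), avoiding the .include approach:
$ sudo systemctl edit forked-daapd
# then add in the editor it opens:
[Service]
MemoryMax=256M
MemorySwapMax=8M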

# restart
$ systemctl daemon-reload && \
systemctl restart forked-daapd && \
systemctl status forked-daapd
● forked-daapd.service - DAAP/DACP (iTunes) and MPD server, supporting AirPlay and Spotify
   Loaded: loaded (/etc/systemd/system/forked-daapd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-11-01 20:08:37 GMT; 10s ago
     Docs: man:forked-daapd(8)
 Main PID: 28080 (forked-daapd)
   Memory: 28.0M (max: 256.0M swap max: 8.0M)
   CGroup: /system.slice/forked-daapd.service
           └─28080 /usr/sbin/forked-daapd -f
And we're done.
