CAP_NET_ADMIN for non-root user in Docker container
I spent some time trying to get capabilities work in Docker in non-root containers, and it wasn’t a smooth journey. I either stumbled across documentation that would only cover basic use cases or documentation that was outdated and misleading. One possible feedback I would feel like addressing to Docker ecosystem is that it tries to be excessively easy for the end user, hiding any possible source of complexity. Sometimes you do need to be exposed to that complexity, and you are completely on your own, with the codebase being the only source to refer to. In my case, I did have to look into Moby’s codebase to understand how permitted, effective, and inheritable capabilities were managed. This post is an attempt to summarize what I essentially wished I had known when trying to build a non-root container with minimal privileges in which additional network interfaces had to be created.
Docker documentation
Starting from docs.docker.com,
one is pointed to --cap-add
and --cap-drop
to implement fined grain control over which capabilities are given to the
container:
In addition to –privileged, the operator can have fine grain control over the capabilities using –cap-add and –cap-drop. By default, Docker has a default list of capabilities that are kept. The following table lists the Linux capability options which are allowed by default and can be dropped.
By itself, I find this already very confusing. There are multiple set of capabilities assigned to a process, i.e.
permitted
, effective
, inheritable
. These are not mentioned anywhere in Docker documentation. From the docker/labs
repo, there seems to be additional documentation on capabilities, that however appears outdated. It’s useful to try rebuilding the history of capabilities management in
Docker to see how support has evolved over time.
root vs non-root containers
One of the use cases I was working with required adding network interfaces inside the container network namespace. In particular, I needed to add a bridge:
$ sudo docker run -it debian ip link add name br0 type bridge
RTNETLINK answers: Operation not permitted
The failure was expected. The container is running as root and by default docker daemon drops all capabilities, except a default set. The documentation is clear about this and it can be easily verified:
$ sudo docker run -it debian capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
As CAP_NET_ADMIN
is necessary to manipulate network interfaces, ip
command fails. Adding
that specific capability seems to be sufficient for the command to succeed:
$ sudo docker run --cap-add CAP_NET_ADMIN -it debian ip link add name br0 type bridge
$
We can double check that the capability is added to the “current” (i.e. effective) set:
$ sudo docker run --cap-add CAP_NET_ADMIN -it debian capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+ep
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
and that the same command runs successfully from interactive shell:
$ sudo docker run --cap-add CAP_NET_ADMIN -it debian /bin/bash
root@ee513136aea5:/# ip link add name br0 type bridge
root@ee513136aea5:/#
The behavior however is different when the container does not run as root:
$ sudo docker run --user 1000:100 --cap-add CAP_NET_ADMIN -it debian ip link add name br0 type bridge
RTNETLINK answers: Operation not permitted
A comparison of the capabilities configuration as root and non-root yields the following:
$ diff <(sudo docker run --cap-add CAP_NET_ADMIN -t debian capsh --print ) <(sudo docker run --user 1000:100 --cap-add CAP_NET_ADMIN -t debian capsh --print)
1c1
< Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+ep
---
> Current: =
7,9c7,9
< uid=0(root)
< gid=0(root)
< groups=0(root)
---
> uid=1000(???)
> gid=100(users)
> groups=100(users)
and more in details via /proc/self/status
:
$ sudo docker run --cap-add CAP_NET_ADMIN -t debian cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 00000000a80435fb
CapEff: 00000000a80435fb
CapBnd: 00000000a80435fb
CapAmb: 0000000000000000
$ sudo docker run --user 1000:100 --cap-add CAP_NET_ADMIN -t debian cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80435fb
CapAmb: 0000000000000000
When the container does not run as root, effective and permitted capabilities are all cleared, while there is no difference in the binding set. docker/labs repo, mentions the following:
The above command fails because Docker does not yet support adding capabilities to non-root users.
This seems to be coherent with the output above. This specific behavior for non-root users was introduced by moby/15ff0939, i.e. “If container will run as non root user, drop permitted, effective caps early”. In particular:
s.Process.Capabilities.Bounding = caplist
s.Process.Capabilities.Permitted = caplist
s.Process.Capabilities.Inheritable = caplist
+ // setUser has already been executed here
+ // if non root drop capabilities in the way execve does
+ if s.Process.User.UID != 0 {
+ s.Process.Capabilities.Effective = []string{}
+ s.Process.Capabilities.Permitted = []string{}
+ }
Something further to notice is that CapInh
is also cleared in both cases. This comes instead from moby/dd38613d, which essentially makes inheritable capabilities irrelevant in all cases.
if ec.Privileged {
- if p.Capabilities == nil {
- p.Capabilities = &specs.LinuxCapabilities{}
+ p.Capabilities = &specs.LinuxCapabilities{
+ Bounding: caps.GetAllCapabilities(),
+ Permitted: caps.GetAllCapabilities(),
+ Effective: caps.GetAllCapabilities(),
}
- p.Capabilities.Bounding = caps.GetAllCapabilities()
- p.Capabilities.Permitted = p.Capabilities.Bounding
- p.Capabilities.Inheritable = p.Capabilities.Bounding
- p.Capabilities.Effective = p.Capabilities.Bounding
}
Dropping support for inheritable capabilities is a fix for CVE-2022-24769
.
Event though non-root containers have only the bounding set configured, it should be possible for processes
within the container to acquire effective capabilities, by setting <CAP>+ep
on the executable file:
- Effective bit (
e
is just a single bit on files) is set, duringexecve
all of the permitted capabilities for the thread are also mirrored in the effective set. -
CAP
is set as permitted capability
Capabilities transformation rules during execve are the following:
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
and permitted capabilities are regulated as follows:
P'(permitted) = (P(inheritable) & F(inheritable)) |
(F(permitted) & P(bounding)) | P'(ambient)
If we had F(effective)
set, then P'(effective)
would become P'(permitted)
, as just mentioned above.
The content of P'(permitted)
effectively depends on (F(permitted) & P(bounding)) | P'(ambient
, given
inheritable
capabilities are always cleared, If CAP
is set on the file as permitted, given P(bounding)
is already set to caps.GetAllCapabilities()
,
then CAP
should be acquired as P'(permitted)
and consequentely as `P’(effective).
Documentation here becomes misleading, in particular docker/labs/security/capabilities mentions the following:
Docker imposes certain limitations that make working with capabilities much simpler. For example, file capabilities are stored within a file’s extended attributes, and extended attributes are stripped out when Docker images are built. This means you will not normally have to concern yourself too much with file capabilities in containers.
A good historical source for xattr support in Docker is issues/35699. xattr were initially not implemented because AUFS storage layer did not support them. Regardless of AUFS limitation, there were concerns on how to support heterogenous systems
that might not all support xattr. In pull/3845, support for
xattr security.capabilities
is added to storage layers. The quote above from docker/labs seems to have
been committed in Oct 2016 with d9273d2c. This is two year after pull/3845. Anyways, security.capabilities
are indeed preserved at least with overlay2
storage engine.
ip
and CAP_NET_ADMIN
I dived into capability support to configure a container for building Openembedded images. One of the requirements
I had was the ability to create a qemu bridge networking setup. According to the research presented above,
adding cap_net_admin+ep
to /bin/ip
(effective bit set, capability set as permitted on the file) should have been
sufficient to manipulate network interfaces without being root. Unfortunately, I would still get a permission denied error:
$ sudo docker run --user 1000:100 --cap-add CAP_NET_ADMIN -it oe_build /bin/sh
$ getcap /bin/ip
/bin/ip = cap_net_admin+ep
$ whoami
dev
$ ip link add name br0 type bridge
RTNETLINK answers: Operation not permitted
While trying to assess where exactly the “Operation not permitted” was coming from, the following caught my
attention in the strace
output:
getuid() = 1000
geteuid() = 1000
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
This looks a lot like an attempt to assess if the process is running as root, followed by a drop of all
capabilities. So, even if cap_net_admin+ep
might be working, the capset
call above makes it a no-op. In fact,
in iproute2/ip/ip.c
(> v4.16
) one can see the following excerpt:
if (argc < 3 || strcmp(argv[1], "vrf") != 0 ||
strcmp(argv[2], "exec") != 0)
drop_cap();
Certainly in this case strcmp(argv[1], "vrf") != 0
is true, so we end up dropping all capabilities.
drop_cap
is implemented as follows:
void drop_cap(void)
{
#ifdef HAVE_LIBCAP
/* don't harmstring root/sudo */
if (getuid() != 0 && geteuid() != 0) {
cap_t capabilities;
cap_value_t net_admin = CAP_NET_ADMIN;
cap_flag_t inheritable = CAP_INHERITABLE;
cap_flag_value_t is_set;
capabilities = cap_get_proc();
if (!capabilities)
exit(EXIT_FAILURE);
if (cap_get_flag(capabilities, net_admin, inheritable,
&is_set) != 0)
exit(EXIT_FAILURE);
/* apps with ambient caps can fork and call ip */
if (is_set == CAP_CLEAR) {
if (cap_clear(capabilities) != 0)
exit(EXIT_FAILURE);
if (cap_set_proc(capabilities) != 0)
exit(EXIT_FAILURE);
}
cap_free(capabilities);
}
#endif
}
This checks if we are running as normal user (user and effective user id are != 0) and whether the process
has CAP_NET_ADMIN
set in the inheritable set, which is the case. If so, all capabilities are dropped,
hence setting cap_net_admin+ep
on ip
becomes a no-op.
fakeroot
and LD_PRELOAD
My first attempt to bypass drop_cap
consisted in running ip
under fakeroot
, an LD_PRELOAD
shared library which overwrites some libc
calls to either make userspace believe we are running as root
(e.g. by overwriting getuid
, geteuid
to return 0) or record that some operations (e.g. open
+ O_CREAT
) should
look like as they have been performed as root to other userspace tools such as tar
. The process that is being
fakeroot-ed remains effectively unprivileged. I did not have much success:
$ fakeroot ip link add name br0 type bridge
ERROR: ld.so: object 'libfakeroot-sysv.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
RTNETLINK answers: Operation not permitted
The error is definitely obscure, and LD_DEBUG=all
doesn’t provide much more information. ld.so
code
itself is not incredibly eloquent:
unsigned int old_nloaded = GL(dl_ns)[LM_ID_BASE]._ns_nloaded;
(void) _dl_catch_error (&objname, &err_str, &malloced, map_doit, &args);
if (__glibc_unlikely (err_str != NULL))
{
_dl_error_printf("\
ERROR: ld.so: object '%s' from %s cannot be preloaded (%s): ignored.\n",
fname, where, err_str);
ld.so
documentation explains why the dynamic linker is failing, even
though the error which is surfaced is aboslutely ambiguous.
Secure-execution mode
For security reasons, if the dynamic linker determines that a binary
should be run in secure-execution mode, the effects of some environment
variables are voided or modified, and furthermore those environment
variables are stripped from the environment, so that the program does
not even see the definitions. Some of these environment variables affect
the operation of the dynamic linker itself, and are described below.
Other environment variables treated in this way include: GCONV_PATH,
GETCONF_DIR, HOSTALIASES, LOCALDOMAIN, LOCPATH, MALLOC_TRACE, NIS_PATH,
NLSPATH, RESOLV_HOST_CONF, RES_OPTIONS, TMPDIR, and TZDIR.
We are in secure mode if the AT_SECURE
entry in the auxiliary vector has a nonzero value.
This might happen in one of the following scenario:
- The process’s real and effective user IDs differ, or the real and effective group IDs differ. This typically occurs as a result of executing set-user-ID or set-group-ID program.
- A process with a non-root user ID executed a binary that conferred capabilities to the process.
- A nonzero value may have been set by a Linux Security Module
We are trying to assign capabilities to the process, so we fall within the the second use case.
For LD_PRELOAD
, which is effectly what fakeroot
uses, documentation further explains the
limitations in secure-execution mode:
In secure-execution mode, preload pathnames containing slashes are ignored.
Furthermore, shared objects are preloaded only from the standard search
directories and only if they have set-user-ID mode bit enabled (which is
not typical).
fakeroot
lib happens to be in a non-standard path in /usr/lib/x86_64-linux-gnu/libfakeroot/libfakeroot-sysv.so
,
neither does it have SUID
set, so ld.so
will refuse to preload it. Note also that LD_DEBUG
won’t work in
secure-execution mode unless /etc/suid-debug
is present on the filesystem.
Alternatives to fakeroot
We could force ip
not to clear capabilities by starting the container as root, retain CAP_NET_ADMIN
as inheritable
through capsh
and drop privileges ourselves instead of asking Docker to do it.
capsh --keep=1 --user=dev --inh=cap_net_admin=i --
This works, but would be a regression with respect to CVE-2022-24769, as the container would not start with empty inheritable capabilities. It would also result in dropping privileges relatively late, compared to starting the container as unprivileged user.
Preferred method
According to the commit which introduced drop_cap
in iproute2
(ba2fc55b), capabilities are dropped so that users can safely
add caps to the binary for ip vrf exec
. I am unclear why only the vrf
use case would be considered as requiring
CAP_NET_ADMIN
, CAP_SYS_ADMIN
and CAP_DAC_OVERRIDE
, while forcing everything else to use root, modulo the check on
the inheritable set).
Starting the container with CAP_SYS_ADMIN
as inheritable capability is a regression with respect to
CVE-2022-24769, but I still consider it preferable compared
to the fragile LD_PRELOAD
approach. Based on my current understanding, the risks coming from a binary
having a file inheritable capability set and acquiring it in the process permitted set is equivalent to the
binary having the same capability set as permitted.