Bug - Docker container eats all resources (CPU & Memory)

I’ve had two incidents in which the Datagerry Docker compose stack suddenly started consuming all CPU and RAM on the host.

Relevant system info

  • CPU: 8 vCPUs
  • RAM: 12 GB
  • OS: Clear Linux, kernel 5.7 (virtualized with QEMU)
  • Docker version: 19.03.8

Changes to the original docker compose file

  • Limited memory of datagerry to 512M (see Incident 1 > Resolution; sketched below);
  • Removed nginx proxy;
  • Added vhost with TLS to existing nginx reverse-proxy.
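
For reference, the memory limit from the first bullet was added roughly like this (a minimal sketch; the datagerry service name comes from the stock compose file, but the compose file format version is an assumption on my side):

```yaml
# docker-compose.yml (excerpt) - only the added key is shown,
# the rest of the datagerry service definition stays as shipped
version: "2.2"        # assuming a 2.x compose file; a 3.x file under swarm would use deploy.resources instead
services:
  datagerry:
    mem_limit: 512m   # hard memory cap, see Incident 1 > Resolution
```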

Test log

  • created category;
  • created type;
  • created object;
  • added exportd PULL job;
  • renamed exportd PULL job (?) => incident 1;
  • discovered the REST API;
  • deleted exportd PULL job;
  • made some requests to the REST API endpoints;
  • created object;
  • created category;
  • created category;
  • created category;
  • created type;
  • created object;
  • removed category;
  • ? => incident 2.

Incident 1

This incident happened while I was renaming an exportd job, as I advanced to the second step. The web app became unresponsive, and so did my other Docker services. The hypervisor reported 100% CPU usage. I couldn’t check the memory, since the hypervisor doesn’t distinguish cache from used memory.

Resolution: I still had an open SSH session (extremely laggy by that point) from which I could stop the datagerry Docker container. This immediately stabilized the host and the other containers (which had stopped responding). Because of this incident I added a 512M memory limit to the container. Interestingly, after restarting, memory usage quickly rose to that limit, yet no swapping occurred and the web application stayed responsive.

Incident 2

While the stack was running idle, the whole Docker host suddenly hung and all sibling Docker services became unreachable.

I could neither SSH nor VNC to a terminal. I tried to shut the host down cleanly via the hypervisor, but after waiting half an hour I decided to force a reboot. This incident had considerably more impact.

Here’s a screenshot from the hypervisor showing the CPU usage. The Docker container logs showed nothing notable. TTY1 of the Docker host was being flooded with out_of_memory messages.

I’m now going to (additionally) limit the db and mq containers and see if it happens again. In the meantime, I have no lead on what action could have caused this.
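
Concretely, I have something along these lines in mind (just a sketch; the 256M values are a first guess and I’m again assuming 2.x compose syntax):

```yaml
# docker-compose.yml (excerpt) - extending the caps to the
# database and message queue services
version: "2.2"
services:
  db:
    mem_limit: 256m   # first guess, to be tuned
  mq:
    mem_limit: 256m   # first guess, to be tuned
```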

Is there some documentation on the different processes that run within the scope of Datagerry? If so, I could monitor these and see when they blow up. Since the last incident happened while I wasn’t working with Datagerry (note: it was open in a browser tab), I suspect some background job could be related.

Other, maybe related, issues

  • Sometimes the UI doesn’t display data at all, or only does so very slowly. I haven’t really inspected the browser console logs yet, since this wasn’t blocking me from trying out functionality. I’ll keep an eye on it next time.
  • The REST API is also extremely slow (10-15 seconds per request). My current dataset consists of 3 categories, 2 types and 3 objects. I was querying the object list and specific objects.

Perhaps I misinterpreted the graph in the hypervisor. I’ve allocated more RAM to the host and things look stable:

As shown in the graph, memory usage is now ~300M above the previously allocated amount. I think it’s safe to assume the incidents were caused by the host swapping RAM pages and having little to no space left for buffer/cache, which made it extremely slow and eventually caused it to hang.

TL;DR: This issue can be closed, as the most probable cause is not related to Datagerry.

After I scaled up the RAM, I limited each container to 2 vCPUs. I then monitored the 3 containers (datagerry, mq and db). Here I noticed that after browsing the UI and loading the icon set in a dropdown, memory usage of the datagerry container climbed above 93%. CPU usage on all containers remained relatively low: mostly (far) below 10%. Memory usage of the message queue and database remained below 25%.

I then doubled the memory limit on the Datagerry container (to 1G). After doing the same UI browsing, memory usage stayed below 55%. CPU usage remained stable as well.

Perhaps these numbers can be useful for the installation requirements documentation (which currently lacks system resource advice).

Based on my idle test, I’d say the minimum requirements for the Docker compose stack at the moment are:

  • [datagerry] Memory: 1G; vCPU: 2
  • [mq] Memory: 256M; vCPU: 1
  • [db] Memory: 256M; vCPU: 1

With these limits the stack boots up with about 50% of the memory consumed, and CPU usage stays at around 10% or below.
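
Expressed as a compose override, those minimums would look roughly like this (a sketch, assuming compose file format 2.2+ where mem_limit and cpus are plain service-level keys; a 3.x file under swarm would use deploy.resources.limits instead):

```yaml
# docker-compose.override.yml - proposed minimum resource limits
# for the three services of the stack
version: "2.2"
services:
  datagerry:
    mem_limit: 1g
    cpus: 2
  mq:
    mem_limit: 256m
    cpus: 1
  db:
    mem_limit: 256m
    cpus: 1
```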

I’ll do some further tuning as I go and report resource utilization during realistic usage (adding actual data, API automation). Perhaps those figures can then serve as recommended system requirements.

Hi @dennisd,

thanks for sharing your performance data. I think we should add some information about that topic to our documentation.

Here are some performance metrics from our own system, which we operate at NETHINKS. At the moment, our setup has around 3500 objects, 74 types and 23 export jobs:

Memory (8GB, around 2GB used):

CPU usage (2 vCPUs):
