I’ve had two incidents in which the Datagerry Docker Compose stack suddenly starts consuming all CPU and RAM on the host.
Relevant system info
- CPU: 8vCPU
- RAM: 12 GB
- OS: Clear Linux - kernel 5.7 (virtualized with qemu)
- Docker version: 19.03.8
Changes to the original docker compose file
- Limited the memory of the datagerry container to 512M (see Incident 1 > Resolution; a compose sketch follows this list);
- Removed nginx proxy;
- Added vhost with TLS to existing nginx reverse-proxy.
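For reference, this is roughly what the adjusted compose file looks like. It is a minimal sketch only: the service name, image name and compose file format version are assumptions based on the upstream example, so they may not match other setups exactly.

```yaml
# Minimal sketch of the adjusted docker-compose.yml (not the full upstream file).
# mem_limit is honoured by plain docker-compose with file format 2.x;
# a 3.x file would need deploy.resources.limits plus the --compatibility flag.
version: "2.4"

services:
  datagerry:
    image: nethinks/datagerry    # image name assumed from the upstream compose file
    mem_limit: 512m              # hard cap added after incident 1
    restart: unless-stopped
```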
Test log
- created category;
- created type;
- created object;
- added exportd PULL job;
- renamed exportd PULL job (?) => incident 1;
- discovered the REST API;
- deleted exportd PULL job;
- made some requests to the REST API endpoints;
- created object;
- created category;
- created category;
- created category;
- created type;
- created object;
- removed category;
- ? => incident 2.
Incident 1
This incident happened while I was renaming an exportd job, right when moving to the 2nd step. The web app became unresponsive, and so did my other Docker services. The hypervisor reported 100% CPU usage. I couldn’t check the memory usage, since the hypervisor doesn’t distinguish cache from used memory.
Resolution: I still had an open SSH session (although it became extremely laggy) in which I could stop the datagerry Docker container. That immediately stabilized the host and the other containers (which had stopped responding). Because of this incident I added a 512M memory limit to the container. Interestingly, after restarting, memory usage quickly rose to the limit, yet no swapping occurred and the web application stayed responsive.
Incident 2
While the stack was idle, the whole Docker host suddenly hung and all the sibling Docker services became unreachable.
I couldn’t SSH or VNC into a terminal. I tried to shut the host down cleanly via the hypervisor, but after waiting for half an hour I decided to force a reboot. This incident had considerably more impact.
Here’s a screenshot from the hypervisor showing the CPU usage. The Docker container logs show nothing notable. TTY1 of the Docker host was flooded with out_of_memory messages.
I’m now going to (additionally) limit the memory of the db and mq containers and see if it happens again (a sketch of those limits follows below). In the meantime, I have no lead on what action could have caused this.
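As a sketch of what I mean by limiting the other containers, assuming the database and message-queue services are a MongoDB and a RabbitMQ instance named roughly as in the upstream compose file (the service names, image tags and limit values here are my own guesses):

```yaml
# Hypothetical additional limits for the supporting services; service and image
# names are assumptions based on the upstream compose file, adjust as needed.
services:
  db:
    image: mongo:4.2
    mem_limit: 1g        # cap MongoDB so its cache cannot eat the whole host
  broker:
    image: rabbitmq:3
    mem_limit: 512m      # cap the message queue as well
```

Once all three containers have a cap, `docker stats` should at least show which one hits its limit first when the problem reappears.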
Is there some documentation on the different processes that run within the scope of Datagerry? If so, I could monitor these and see when they blow up. Since the last incident happened while I wasn’t working with Datagerry (note: it was open in a browser tab), I suspect some background job could be related.
Other, maybe related, issues
- Sometimes the UI doesn’t display data, or only displays it very slowly. I haven’t really inspected the browser console logs, since it wasn’t blocking me from trying out the functionality. I’ll keep an eye on it next time.
- The REST API is also extremely slow (10-15 seconds per request). My current dataset consists of 3 categories, 2 types and 3 objects. I was querying the object list and individual objects.