03.10.2023 - A bad day in the history of vint.ee (postmortem).

Meikop, Founder of Vint.ee, 2023-10-09T14:18:03+03:00
The Vint.ee environment has been very stable in recent years. The last major incidents were caused by network problems at our service provider (TransIP, Netherlands); they were short-lived and happened years ago.

But on October 3, 2023, something happened that had not happened in the last 10 years - the vint.ee server became completely unavailable due to overload.

Background
On the vint.ee portal you can play chess against the computer - the "Computer" is a Stockfish program running on the vint.ee server. Each chess game against the computer starts one Stockfish process on the server. This process uses server resources (CPU, memory), but at the end of each game the process is closed and its resources are released.

What went wrong?
I made a small addition to chess for testing - a Stockfish process is now also started at the beginning of each game, and for each move the user makes, Stockfish is asked for its recommendation in that position. The user's "accuracy" is then calculated by comparing the user's moves with the computer's recommendations.
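The post does not say how "accuracy" is computed, so the following is only a minimal sketch of one possible definition: the share of user moves that match the engine's recommendation for the same position. The function name `move_accuracy` and the sample moves are hypothetical, and Python is used purely for illustration.

```python
def move_accuracy(user_moves, engine_moves):
    """Return the fraction of user moves that match the engine's recommendation.

    Assumption: accuracy is a simple match rate. The real vint.ee formula
    is not described in the post and may differ.
    """
    if len(user_moves) != len(engine_moves):
        raise ValueError("expected one engine recommendation per user move")
    if not user_moves:
        return 0.0
    matches = sum(1 for played, best in zip(user_moves, engine_moves) if played == best)
    return matches / len(user_moves)


# Hypothetical game: three user moves with the engine's suggestion for each position.
print(move_accuracy(["e2e4", "g1f3", "f1c4"], ["e2e4", "g1f3", "f1b5"]))
```

A match rate is the simplest choice; a finer-grained measure would compare how much worse the played move is than the recommended one.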
What went wrong was that this Stockfish process wasn't closed when the game ended. Just 1 missing line in the code. As a result, the server's memory usage grew with every chess game played.
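To illustrate the kind of cleanup that was missing, here is a minimal Python sketch (not the actual vint.ee code, which is not shown in the post). It starts one child process per game and shuts it down in a `finally` block, so the process is released even if the game ends with an error. The `ENGINE_CMD` stand-in and the `play_game` function are assumptions made so the example runs without Stockfish installed.

```python
import subprocess
import sys

# Stand-in for the engine binary so the sketch runs without Stockfish installed.
# A real server would launch the Stockfish executable here instead (assumption).
ENGINE_CMD = [sys.executable, "-c", "import sys; sys.stdin.read()"]


def play_game(moves):
    """Start one engine process for a game and always shut it down afterwards."""
    engine = subprocess.Popen(ENGINE_CMD, stdin=subprocess.PIPE)
    try:
        for move in moves:
            # Placeholder for asking the engine about the position after each move.
            engine.stdin.write((move + "\n").encode())
            engine.stdin.flush()
    finally:
        # The cleanup step: without it, one process is leaked per game.
        engine.stdin.close()
        try:
            engine.wait(timeout=5)
        except subprocess.TimeoutExpired:
            engine.kill()
            engine.wait()
    return engine


finished = play_game(["e2e4", "e7e5"])
print(finished.poll() is not None)
```

Putting the shutdown in `finally` (or in a context manager) ties the lifetime of the process to the lifetime of the game, so a forgotten call on one code path cannot leak it.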
We have decent graphs in Grafana that show server resource usage in real time, but I don't check them daily (precisely because everything has been fine for years), and no alerts are configured there.
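As a rough illustration of what an automatic check could look like, the sketch below computes the used-memory fraction from text in the format of Linux's `/proc/meminfo` and compares it with a threshold. The function names, the 90% threshold and the sample values are all assumptions for the example, not data from the vint.ee server. Grafana also has its own alerting, which would normally be the more natural place for such a rule.

```python
def memory_used_fraction(meminfo_text):
    """Compute the used-memory fraction from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return (total - available) / total


def should_alert(meminfo_text, threshold=0.9):
    """Return True when memory usage reaches the (hypothetical) alert threshold."""
    return memory_used_fraction(meminfo_text) >= threshold


# Illustrative sample, not data from the vint.ee server.
SAMPLE = "MemTotal:       8000000 kB\nMemAvailable:    400000 kB\n"
print(should_alert(SAMPLE))
```

On a real Linux host the input would come from reading `/proc/meminfo`, and the result would be sent to whatever notification channel is in use.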

03.10.2023
I had deployed the aforementioned update a week earlier, around September 25, 2023. Memory usage grew so slowly that it took a week for all the memory to be used up.
And what happens when the server runs out of memory? The server's operating system (Linux) starts killing processes to free up memory: the Java game server, the database, etc.
On October 3, 2023, at lunchtime, people started writing to me that vint.ee was down.
Being used to a stable server, I naturally assumed at first that it was a network problem.
The fact that I was in Prague for work and the computer used to develop vint.ee was located in Estonia didn't make things any easier.
Fortunately, I got access to another computer and the vint.ee server graphs, and then I realized that a real disaster had happened - the server's memory was completely used up.
The server was in such a deep coma that it couldn't even be accessed via the terminal.

Something positive
Since the server could not be accessed via the terminal, the only option was to go to the service provider's (TransIP) self-service, and there was a positive surprise there - the server could be restarted with one click, and normal operation (the portal fully working) was restored in less than a minute.

All's well that ends well.
The server was back up and running, so I had some time to analyze the issue and find the root cause. Naturally, I assumed that the error must have been related to the latest updates, and when I flew to Estonia a few days later, I immediately found the problematic line of code and fixed it.
Additionally: from now on, I will monitor the server graphs at least once a day to see if everything is still working.

All the best,
Marten Meikop
Vint.ee
