Imagine deploying a change that looks a simple fix modifying a db create to update_or_create functionality.
But then that results in 🥵🥵 api timeouts resulting in blocking the daily data pipeline.
Reverting the PR ⏪ is the first step in such cases, let’s go back to where we were.
Once the revert was done ✅ 😌
Started to look into the issue in depth
It was weird 😳🤔 cause we had tested for time taken to return results on Dev environments and there the time taken was around 15-20 seconds for the update_or_create API we did expect it to be few seconds more in prod, owing to more data but definitely not 3x more (1 min is the api timeout).
So what had gone wrong?
Why were we seeing issue on production when the api timeout on production is set to 1 minute?
Then why?
Did retests and scanned through logs for some time and found something. 🕵️♂️🔍
The logs showed three instances of same
Database query retrying each query timing out in around 20 seconds time.
But why was the query timing out so quickly ⌛️ and it was retrying again a total of 3 times?
why?
On Further digging we found out the cause
TRANSACTION.
Yes the api was executing a transaction and transactions acquire a lock on the db rows.
however any transaction acquiring a lock needs to release the lock in 20 seconds or else transaction will timeout.
This resulted in retrying of transaction 3 times and finally when the 1 minute threshold was hit we saw api timeout.
SOLUTION 💡🌟
The solution was the most basic and first approach considered in case of db read optimisation, as we knew finding which record to update was taking most of the time.
We just added a database index to the search field (date).
Sure enough this worked and reduced the time to around 1-2 seconds. A 10X improvement.
And the transaction lock timeout limit was more than enough.
Happy ending 😊
Thank you 🙏,
Akash Deep
Kommentare