Incident reports
Incident reports describe operations-related incidents on the eSeller Cloud platform and related services. Reports are published at the discretion of the eSeller Cloud team, at times chosen by the team.
Incident: January 19th 2018
Published January 22nd 2018
All times in UTC+1
At approximately 13:30 a problem was identified on the eSeller Cloud platform, resulting in increased response times and timeouts. A thorough investigation established that the problem was caused by a misconfiguration of the caching server, which caused it to become overloaded and unresponsive.
At approximately 14:42 the configuration was corrected and the problem was resolved.
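The report does not identify the caching product or the exact configuration change, but the failure pattern above (an overloaded cache turning into site-wide timeouts) is a well-known one. The following is only a minimal, hypothetical Python sketch of one mitigation for that pattern: bounding how long a page request waits for the cache and falling back to the underlying data source, so an unresponsive cache behaves like a cache miss rather than a timeout. The names cache_get and load_from_database and the timeout value are assumptions, not eSeller Cloud code.

```python
import concurrent.futures

# Hypothetical sketch: never let an unresponsive cache stall a page request.
# cache_get and load_from_database stand in for whatever clients the platform
# actually uses; the 50 ms budget is an arbitrary example value.

CACHE_TIMEOUT_SECONDS = 0.05

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def get_with_fallback(key, cache_get, load_from_database):
    """Try the cache briefly; on timeout or error, read from the database instead."""
    future = _executor.submit(cache_get, key)
    try:
        value = future.result(timeout=CACHE_TIMEOUT_SECONDS)
    except (concurrent.futures.TimeoutError, OSError):
        value = None  # treat an unresponsive cache as a miss, not a fatal error
    if value is not None:
        return value
    return load_from_database(key)
```

The design intent of such a fallback is only to limit the blast radius of a degraded cache; it does not replace fixing the cache configuration itself.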
Incident: November 28th 2017
Published November 28th 2017
All times in UTC+1
At approximately 01:57 an error occurred on the eSeller Cloud platform, affecting most shops. The affected shops showed an HTTP 500 error page. The cause of the problem was initially unclear, but after a prolonged investigation it was found that a record in the database contained malformed data.
Due to the nature of the malformed data, this faulty record was being loaded by all shops, causing a fatal error.
A fix for the issue was immediately developed, tested and released. The fix prevents a similar issue from occurring again. The root cause of the malformed database record is still being investigated.
Preventive actions
The code will be rewritten to check for malformed records (a general sketch of this kind of check follows below).
The operations team will update Twitter in a more timely fashion.
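The actual code change is not published, so the sketch below only illustrates the general technique named in the first preventive action: validating each record as it is loaded and skipping (and logging) malformed ones, so a single bad row affects one record instead of returning an HTTP 500 page on every shop. All names (ShopRecord, parse_record, the JSON payload) are invented for illustration and are not the platform's real data model.

```python
import json
import logging
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

log = logging.getLogger("record-loader")


@dataclass
class ShopRecord:
    # Hypothetical shape of a loaded record; not the real schema.
    record_id: int
    payload: dict


def parse_record(record_id: int, raw: str) -> Optional[ShopRecord]:
    """Return a ShopRecord, or None if the stored data is malformed."""
    try:
        payload = json.loads(raw)
        if not isinstance(payload, dict):
            raise ValueError("payload is not a JSON object")
        return ShopRecord(record_id=record_id, payload=payload)
    except (ValueError, TypeError) as exc:
        # A malformed record is logged and skipped instead of propagating
        # a fatal error to every shop that loads it.
        log.error("Skipping malformed record %s: %s", record_id, exc)
        return None


def load_records(rows: Iterable[Tuple[int, str]]) -> List[ShopRecord]:
    """Load all rows, dropping (but logging) any malformed ones."""
    return [rec for rec in (parse_record(rid, raw) for rid, raw in rows) if rec is not None]
```

The key design choice in this kind of fix is that validation failures are contained and reported rather than allowed to fail the shared loading path for every shop.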
Timeline
01:57 Problem started
02:15 Operations team was alerted
03:04 Operations team failed to resolve the issue and alerted the development team
04:19 The cause of the problem was identified
05:37 An emergency code change had been developed and was ready for test
07:15 The test did not identify any unforeseen problems and the release procedure was initiated
07:40 The problem was resolved
Incident: June 20th 2016
Published June 21st 2016
All times in UTC+2
At approximately 17:45 the operations team noticed increased load on the main database node. While this was being investigated, the load rose to levels that prevented the web servers from responding in a timely fashion.
The top priority in any incident is to preserve data and data integrity, so actions are taken in a cautious manner to prevent loss of data. In effect this means the operations team must work fast to resolve the incident, but not so fast that new problems are introduced or data is lost.
The first course of action was to verify that backups were available in case of emergency. Backup data was available.
It was then noted that the database cluster was fully synchronized and that the secondary node was ready to take over.
Our Microsoft Hosting partner did not report any issues.
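The report does not state which database product or high-availability feature the cluster runs on. Purely as an illustration, and assuming SQL Server with an Availability Group (an assumption, suggested only by the Microsoft hosting mentioned above), the synchronization state referred to in this report can be checked with a query along the following lines; the connection string is a placeholder.

```python
import pyodbc

# Assumed setup: SQL Server Availability Group. The connection string is a
# placeholder, not the platform's real configuration.
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=primary;Trusted_Connection=yes"

QUERY = """
SELECT dbs.name,
       rs.is_primary_replica,
       rs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS rs
JOIN sys.databases AS dbs ON dbs.database_id = rs.database_id
"""

with pyodbc.connect(CONN_STR) as conn:
    for name, is_primary, state in conn.execute(QUERY):
        # A secondary that is not SYNCHRONIZED cannot be failed over to
        # without the risk of data loss.
        role = "primary" if is_primary else "secondary"
        print(f"{name}: {role} replica is {state}")
```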
The symptoms exhibited by the system were very slow response times, to the point of timeouts. As mentioned previously, the primary database node showed increased load. The initial investigation focused on identifying the SQL queries that caused the high load. This proved difficult and inconclusive, as all queries seemed to generate extreme load.
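The queries the team ran are not included in the report. As a hedged sketch of the kind of analysis described above, and under the same SQL Server assumption, the most expensive cached statements can be ranked from the plan-cache statistics; the ordering, limits and connection details are illustrative only.

```python
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=primary;Trusted_Connection=yes"

# Rank cached statements by total CPU time to find candidates for the load.
QUERY = """
SELECT TOP 20
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(st.text, 1, 200)  AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC
"""

with pyodbc.connect(CONN_STR) as conn:
    for cpu_ms, count, text in conn.execute(QUERY):
        print(f"{cpu_ms:>12,} ms  executed {count} times  {text!r}")
```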
It was decided to upgrade the secondary node to more powerful hardware in preparation for a fail over. The upgrade was started at approximately 18:20.
While the secondary node was being upgraded, the investigation turned to possible bottlenecks on the primary node. Analysis of the data showed that I/O, memory and network were not bottlenecks.
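The monitoring tooling used for this analysis is not described in the report. The snippet below is only a minimal sketch of how host-level CPU, memory, disk I/O and network counters can be sampled on a node to rule them out as bottlenecks, using the third-party psutil package; it is not the operations team's actual procedure.

```python
import psutil

# Sample disk and network counters around a 5-second CPU measurement and
# report the deltas, as a crude host-level bottleneck check.
disk_before = psutil.disk_io_counters()
net_before = psutil.net_io_counters()

cpu_percent = psutil.cpu_percent(interval=5)  # blocks for 5 seconds

disk_after = psutil.disk_io_counters()
net_after = psutil.net_io_counters()
mem = psutil.virtual_memory()

print(f"CPU:     {cpu_percent:.1f} %")
print(f"Memory:  {mem.percent:.1f} % used of {mem.total / 2**30:.1f} GiB")
print(f"Disk:    {(disk_after.read_bytes - disk_before.read_bytes) / 2**20:.1f} MiB read, "
      f"{(disk_after.write_bytes - disk_before.write_bytes) / 2**20:.1f} MiB written")
print(f"Network: {(net_after.bytes_recv - net_before.bytes_recv) / 2**20:.1f} MiB received, "
      f"{(net_after.bytes_sent - net_before.bytes_sent) / 2**20:.1f} MiB sent")
```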
Traces in the event log indicated that the file system might have become corrupted, and it was decided to fail over to the secondary node once it became available.
At 18:35 the secondary node became available with 8 cores and 56 GB of RAM. The secondary node did not, however, synchronize with the primary node, rendering a fail over without data loss impossible.
It was decided to minimize the load on the primary node by shutting down services, in order to bring the load down to a level where a fail over was possible.
This process took until 20:40, when both nodes were synchronized and the fail over was performed.
Fail over completed without data loss.
The integrity of the data was verified and the data was found to be complete.
The secondary node showed no signs of high load.
Services were started and the system was monitored. The system exhibited no signs of high load or other problems.