Technical issue detected with our SMS provider & sender. They are aware of the issue and are investigating.
It impacts the delivery of SMS sent through campaigns and loyalty messages (Welcome, ...).
14:50 : Back to a normal state, issue closed by the provider.
07th of December
Slowness and errors experienced when accessing the Loyaltyoperator application.
17:45 : Progressive return to a normal state. The incident was caused by connection pool exhaustion triggered by the rich campaign statistics process. We temporarily blocked this ack process to prevent the incident from recurring, and started working on a fix.
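For illustration, a minimal sketch of the kind of guard that limits this class of incident, assuming a SQLAlchemy-style connection pool (the URL, table and pool sizes below are hypothetical, not our production values):

    # Sketch: cap a SQLAlchemy connection pool so a runaway batch job
    # fails fast instead of exhausting connections for the whole app.
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "postgresql://user:pass@db-host/loyalty",  # placeholder URL
        pool_size=10,        # steady-state connections for this worker
        max_overflow=5,      # hard cap: at most 15 connections in total
        pool_timeout=10,     # raise after 10s instead of queuing forever
        pool_pre_ping=True,  # drop dead connections before reuse
    )

    def fetch_campaign_stats(campaign_id: int):
        # The connection is returned to the pool when the block exits,
        # so a burst of statistics jobs cannot starve the rest of the app.
        with engine.connect() as conn:
            result = conn.execute(
                text("SELECT count(*) FROM stats WHERE campaign_id = :id"),
                {"id": campaign_id},
            )
            return result.scalar()

With pool_timeout set, exhaustion surfaces as an immediate error in the batch job rather than as platform-wide slowness.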
23rd of November
Slowness and timeouts experienced on various services in production environments (API, Manager).
17:30 : Our investigation points to an issue (CPU usage & load) on the Ingress. Our partner (in charge of operational maintenance) is also investigating the issue.
19:12 : Back to normal on the servers. The incident is not closed yet but is being monitored.
November 22nd : Working on a fallback solution in case the issue happens again.
November 22nd 17:00 : Incident closed. The problem did not occur again.
As mentioned, we added a fallback server for API calls in case the issue occurs again. We are also still trying to find the root cause, and monitoring continues.
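As an illustration of the fallback idea, a minimal client-side sketch, assuming two interchangeable API endpoints (both hostnames are hypothetical): try the primary first, and fall back on timeout or connection error.

    # Sketch: client-side failover between a primary and a fallback API.
    import requests

    ENDPOINTS = [
        "https://api.example.com",           # primary
        "https://api-fallback.example.com",  # fallback server
    ]

    def api_get(path: str, timeout: float = 5.0) -> requests.Response:
        last_error = None
        for base in ENDPOINTS:
            try:
                resp = requests.get(base + path, timeout=timeout)
                resp.raise_for_status()
                return resp  # first endpoint that answers wins
            except requests.RequestException as exc:
                last_error = exc  # remember the failure, try the next one
        raise last_error  # every endpoint failed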
General issue on the platform. Mismatch between the database model & the platform model.
07:15 : Building a backup image to deploy.
07:40 : Deployment of the backup version.
07:50 : Progressive return to normal.
November 21st
Maintenance planned on the database servers (OS security patches). Service shutdown for 1h maximum.
The service will not respond => HTTP 503
3:45 : Maintenance completed.
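For context, during such a window every call is typically answered with an HTTP 503 plus a Retry-After hint so clients know when to come back; a minimal stdlib sketch (the port and delay are illustrative, not our actual setup):

    # Sketch: a maintenance responder that returns 503 for every request.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MaintenanceHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(503)                  # Service Unavailable
            self.send_header("Retry-After", "3600")  # suggest retrying in 1h
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Maintenance in progress\n")

        do_POST = do_GET  # same answer whatever the method

    if __name__ == "__main__":
        HTTPServer(("", 8080), MaintenanceHandler).serve_forever()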
7th of November
Two service outages, of about 10 minutes each, detected by our monitoring.
The ingress controller inside the Kubernetes cluster has been identified as faulty (out of memory). We plan a version upgrade and will add redundancy.
Intervention planned tonight and next Monday night, without service shutdown.
Redundancy of the ingress increased.
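For illustration, raising the replica count of an ingress controller looks roughly like this with the official Kubernetes Python client (the Deployment name, namespace and replica count are hypothetical):

    # Sketch: scale an ingress controller Deployment for redundancy.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside a pod
    apps = client.AppsV1Api()

    apps.patch_namespaced_deployment(
        name="ingress-nginx-controller",  # example Deployment name
        namespace="ingress-nginx",        # example namespace
        body={"spec": {"replicas": 3}},   # several replicas: no single point of failure
    )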
October 12th
General slowness
It stopped as quickly as it came (less than 5 minutes), which did not leave us enough time to investigate.
Log analysis is nevertheless in progress.
October 5th
Maintenance planned on the database servers (OS security patches). Service shutdown for 1h maximum.
The service will not respond => HTTP 503
October 3rd >> 7th of November
Issue detected on our API: the use of dates leads to an error during execution (HTTP 500).
11:40 - Deployment was successful, back to normal.
10:40 - Starting the deployment of the new version. This may cause some slowness & concurrent access issues.
09:25 - Building a version that fixes the issue.
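Illustrative only (not the actual fix): validating date parameters at the API boundary turns a malformed value into a clean client error instead of a 500 during execution.

    # Sketch: strict date parsing at the edge of the API.
    from datetime import date

    def parse_iso_date(raw: str) -> date:
        try:
            return date.fromisoformat(raw)  # expects YYYY-MM-DD
        except ValueError:
            # Report a 400-style error to the caller rather than letting
            # the exception surface later as an HTTP 500.
            raise ValueError(f"invalid date parameter: {raw!r}")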
September 19
Issue detected in the communication between our Azure cluster and our Docker image repository. As a result, our pods cannot pull new images and the cache is not refreshed.
13:58 - The issue has been confirmed on Azure's side on their status page: https://azure.status.microsoft/en-us/status, under the item "Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors".
08:35 - Back to normal and set to a monitoring state.
07:55 - Brief outage on our applications during a command to manually attach images to the cluster.
August 31
Slowness and timeouts experienced on various services in production environments.
15:01 - Network traffic is now entirely split and everything is back to normal.
12:38 - Mitigation launched.
09:38 - The issue is under investigation; we will conduct a brief restart of the main LB.
July 1st
Maintenance planned on the database servers. Service shutdown for 1 hour and 30 minutes.
The service will not respond => HTTP 503
June 13th
Due to an issue in the latest deployment, Loyaltyoperator was using too many connections; this also briefly affected other apps.
The system team restarted the database server to mitigate the problem and our team rolled back to the previous version.
June 2nd
Slowness and timeouts experienced on various services in production and other environments.
The issue is under investigation; we will conduct a brief upgrade of the main LB around 22:00.
Update May 19th : The LB is under DoS from remote IPs. No data was compromised, but our main LB was losing some HTTP calls and responding with timeouts. A new blacklist is in place, with more supervision tools and monitoring.
WAF rules updated
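The blacklist boils down to refusing traffic from known-bad networks before it reaches the application; a minimal sketch (the addresses are examples from documentation ranges, not the real offenders):

    # Sketch: reject requests whose source IP is on a blacklist.
    import ipaddress

    BLACKLIST = [
        ipaddress.ip_network("203.0.113.0/24"),   # example network
        ipaddress.ip_network("198.51.100.7/32"),  # example single host
    ]

    def is_blocked(remote_ip: str) -> bool:
        addr = ipaddress.ip_address(remote_ip)
        return any(addr in net for net in BLACKLIST)

    # e.g. in a request hook: if is_blocked(client_ip): return HTTP 403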
May 18th - May 20th
Maintenance planned on the database servers. Service shutdown for 1 hour and 30 minutes.
The service will not respond => HTTP 503
Rescheduled for June 13th, from 3am to 5am.
May 2nd
Due to a desynchronized DB, some delays in new group creation, document preview and message sending may be experienced.
We will do a short reboot between 22:00 and 22:10 to create a new snapshot.
Due to the re-synchronization, some delays may have been experienced until Thursday morning, 10:00 am.
March 23
Following the OS upgrade last Monday, some ZFS parameters need an update and a reboot.
Some page loads or connections may have failed during the reboot.
Jan 5th
Maintenance planned in the main datacenter. Network shutdown for 15 minutes somewhere between midnight and 1 o'clock.
The service will not respond => expect some HTTP 503
Dec 15th
A schema update on the main account table (aka Groups in the API) created a deadlock and led to a shortage of database connections on the first replica.
Our roadmap toward service isolation should remove the impact of such schema updates in the future.
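One common guard against this failure mode, sketched below assuming a PostgreSQL-style database and psycopg2 (the DSN and column are placeholders): bound how long the schema change may wait on locks, so it aborts cleanly instead of piling up blocked connections behind it.

    # Sketch: run a schema change with a lock timeout.
    import psycopg2

    conn = psycopg2.connect("dbname=loyalty user=migrator")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("SET lock_timeout = '5s'")  # give up after 5 seconds
        cur.execute("ALTER TABLE groups ADD COLUMN created_by bigint")  # example change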
Oct 13th
An internal tool, used by our team to copy data from one place to another, created thousands of API calls in a short amount of time. This led to a major slowdown followed by database saturation. After the core process was cancelled, we restarted and increased the main API capacity, to be able to process all the remaining calls plus the external (client) calls.
Changes to our process and to this tool will be scheduled quickly to avoid such issues in the future (see the rate-limiting sketch below).
Note : Web2Store components were affected a little longer, until 17:15.
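A client-side rate limit on the copy tool is the kind of change meant here; a minimal token-bucket sketch (the rate, burst size and copy_one_record call are hypothetical):

    # Sketch: a token bucket that smooths bursts of API calls.
    import time

    class TokenBucket:
        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens added per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self) -> None:
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)  # wait for a token

    bucket = TokenBucket(rate=20, capacity=40)  # at most ~20 calls/s sustained
    # for record in records_to_copy:
    #     bucket.acquire()
    #     copy_one_record(record)  # hypothetical API call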
July 13th
We are experiencing some delays in email and SMS delivery. Our team has identified the issue and a fix will be rolled out in the next few hours; all emails and SMS on stand-by will then be resumed.
This issue does not affect transactional emails, only scheduled campaigns.
July 6th
Domain adelya.com unreachable. A destroyed Azure resource was referenced in a network configuration file and blocked the start-up of the main load balancer.
Scheduled processes and campaigns were not impacted
Actions taken : a pre-check on the configuration is now enforced.
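The pre-check amounts to refusing to apply a configuration that references a resource which no longer exists; a minimal sketch, assuming one backend hostname per line (the file name and format are hypothetical):

    # Sketch: verify every referenced backend still resolves before a reload.
    import socket
    import sys

    def precheck(config_path: str) -> None:
        with open(config_path) as f:
            backends = [line.strip() for line in f if line.strip()]
        for host in backends:
            try:
                socket.getaddrinfo(host, None)  # does the resource still exist?
            except socket.gaierror:
                sys.exit(f"pre-check failed: {host} does not resolve")

    # precheck("lb-backends.conf")  # run before reloading the load balancer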
Mar. 11th
Loyaltyoperator and API unavailable, probably due to a software defect in the promotion code. Our team is still investigating. Deactivating the feature seems to have solved the problem. We are deeply sorry for the inconvenience and are working hard to isolate the problem.
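Deactivating a feature at runtime is typically done through a kill switch; a minimal sketch (the flag name and storage are hypothetical, not how our platform actually does it):

    # Sketch: an environment-variable kill switch for one feature.
    import os

    def promotions_enabled() -> bool:
        # flipped to "0" during the incident, back to "1" once fixed
        return os.environ.get("FEATURE_PROMOTIONS", "1") == "1"

    def apply_promotion(order):
        if not promotions_enabled():
            return order  # feature disabled: skip the promotion logic entirely
        ...  # normal promotion computation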
Feb 12th
The main shard has grown too large for some DB parameters and needs an update. We will keep you informed; a restart may occur later today.
Finally, it was something else: a booster for the opening discount period was incorrectly defined and too database-intensive. We have isolated the problem. Performance is back to normal, and we are now investigating the 'why'.
January 20th
Christmas is coming, and a lot of requests are reaching our services, for campaigns and loyalty alike. Everything at the same time.
We have had Azure VMs ready for this situation for a few days. We have activated this scalable solution to allow more requests to be fulfilled.
As API calls are used both by mobile devices and by integrations, we have prioritized them.
The main datacenter experienced an external network overload. Mitigation is now in place.
October 15th
A switch failed in the datacenter; a spare was ready, but some routes are not functional.
Note : the old front end asp.adelya.com is impacted; the new front end (TLS 1.2) is functional.
July 20th
A heavy data export is in progress, draining database and server capacity. Our team is working on mitigating its effects by throttling the calls (see the sketch below).
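Throttling here means capping how much of the database the export may use at once; a minimal sketch (the slot count and execute_query call are hypothetical):

    # Sketch: a semaphore that caps concurrent export queries.
    import threading

    EXPORT_SLOTS = threading.BoundedSemaphore(2)  # at most 2 exports at a time

    def run_export(query):
        with EXPORT_SLOTS:  # blocks while two exports are already running
            return execute_query(query)  # hypothetical DB call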
March 09th