Adelya : Technical Information and Incident Reports

Subscribe to the Adelya Partner technical newsletter:

Technical issue at our SMS sending provider

7th of December - 13:10 to 14:50

Status : closed (Provider side)

Technical issue detected at our SMS sending provider. They are aware of the issue and are investigating.

It impacts the delivery of SMS sent through campaigns and loyalty scenarios (Welcome, ...).

14:50: Back to normal; issue closed by the provider.

7th of December

Slowness reaching the LO application (Manager & Card)

23rd of November - 13:00 to 17:45

Status : monitoring

Slowness and errors experienced when accessing the Loyaltyoperator application.

17:45: Progressive return to normal. The incident was caused by connection-pool exhaustion from the processing of rich campaign statistics. We have temporarily blocked this ack process to prevent the incident from recurring and have started working on a fix.
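
For context, one common safeguard against this kind of pool exhaustion is to give the background statistics processing its own bounded connection pool with a short checkout timeout, so a backlog fails fast instead of starving the rest of the platform. A minimal sketch with SQLAlchemy, using a hypothetical DSN; it illustrates the principle and is not our production configuration:

    from sqlalchemy import create_engine

    # Hypothetical DSN for illustration only.
    stats_engine = create_engine(
        "postgresql://stats_user:secret@stats-db.example/loyalty",
        pool_size=5,         # dedicated connections for the statistics job
        max_overflow=0,      # never borrow beyond that cap
        pool_timeout=10,     # seconds to wait for a connection before failing fast
        pool_pre_ping=True,  # discard dead connections before use
    )

    with stats_engine.connect() as conn:
        conn.exec_driver_sql("SELECT 1")  # placeholder for the statistics queries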

23rd of November

General Network slowness

21st of November - 14:10 to 19:12

Status : Closed

Slowness and timeouts experienced on various services in the production environment (API, Manager)

17:30: Our investigation points to an issue (CPU usage & load) on the ingress. Our partner (in charge of operational maintenance) is also investigating the issue.

19:12: Back to normal on the servers. The incident is not closed yet but is being monitored.

November 22nd: Working on a fallback solution in case it happens again.

November 22nd 17:00: Incident closed. The problem did not occur again.
As mentioned, we added a fallback server for API calls in case it occurs again. We are also still looking for the root cause and monitoring.
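
As a rough illustration of what such a fallback looks like from the caller's side, a client can retry a request against a secondary endpoint when the primary one fails or times out. A minimal sketch with hypothetical host names; it is not a description of our actual routing:

    import requests

    # Hypothetical endpoints for illustration only.
    ENDPOINTS = ["https://api-primary.example.com", "https://api-fallback.example.com"]

    def call_api(path, timeout=5):
        """Try each endpoint in order and return the first successful response."""
        last_error = None
        for base in ENDPOINTS:
            try:
                response = requests.get(base + path, timeout=timeout)
                response.raise_for_status()
                return response
            except requests.RequestException as exc:
                last_error = exc  # remember the failure and try the next endpoint
        raise RuntimeError("all endpoints failed") from last_error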

21st of November

General Issue

21st of November - 02:30 to 07:50

Status : closed

General issue on the platform. Mismatch between the database model and the platform model.

07:15: Building a backup image to deploy.

07:40: Deployment of the backup version.

07:50: Progressive return to normal.

November 21st

Planned Maintenance

03:05 - 03:45 (Paris time)

Status: Done

Maintenance planned on the database servers (OS security patches). Service shutdown for 1 hour maximum.

The service will not respond => HTTP 503

03:45: Maintenance completed

7th of November

Service outage

12th of October - 18:55 to 19:34

Status : closed

Two service outages of about 10 minutes each, detected by our monitoring.

The ingress controller inside the Kubernetes cluster has been identified as faulty (out of memory). We plan a version upgrade and will add redundancy.
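
For reference, adding redundancy to the ingress usually means running more than one controller replica, so a single pod hitting its memory limit does not take routing down. A minimal sketch with the official Kubernetes Python client, assuming the default ingress-nginx deployment name and namespace (ours may differ):

    from kubernetes import client, config

    config.load_kube_config()  # use the local kubeconfig / current context
    apps = client.AppsV1Api()

    # Deployment name and namespace are the ingress-nginx defaults; adjust as needed.
    apps.patch_namespaced_deployment(
        name="ingress-nginx-controller",
        namespace="ingress-nginx",
        body={"spec": {"replicas": 2}},  # keep at least two controller pods running
    )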

Intervention planned tonight and next Monday night, without service shutdown.

Redundancy of the ingress increased

October 12th


Slowness

5th of October - 10:59 to 11:04

Status : Closed

General slowness

It stopped as quickly as it came (less than 5 minutes), which did not leave us enough time to investigate.

Some log analysis is nevertheless in progress.

October 5th

Planned Maintenance

03:00 - 05:00 (Paris time)

Status: re-scheduled

Maintenance planned on the database servers (OS security patches). Service shutdown for 1 hour maximum.

The service will not respond => HTTP 503

October 3rd >> 7th of November

API issue

07:10 September 19

Status : Closed

Issue detected on our API: the use of dates leads to an error during execution (code 500).
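
In general terms, this class of bug comes from a date parameter that fails to parse and surfaces as an unhandled exception. A minimal, generic sketch of the defensive pattern in Python (not our actual API code), where bad input is turned into a client-side validation error instead of a 500:

    from datetime import datetime

    def parse_date_param(value):
        """Parse an ISO-style date parameter, rejecting bad input explicitly."""
        try:
            return datetime.strptime(value, "%Y-%m-%d")
        except (TypeError, ValueError) as exc:
            # Report a 400-style validation error instead of letting a 500 escape.
            raise ValueError(f"invalid date parameter: {value!r}") from exc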

11:40 - Deployment was successful; back to normal.

10:40 - Starting the deployment of the new version. This may cause some slowness and concurrent-access issues.

09:25 - Compiling a version that fixes the issue.

September 19

Azure cluster issue

07:10 August 31 -

Status : Monitoring

Issue detected in the communication between our Azure cluster and our Docker image repository. As a result, our pods cannot pull new images and the cache is not refreshed.

13:58 - The issue has been confirmed on the Azure side on their status page: https://azure.status.microsoft/en-us/status, under the item "Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors".

08:35 - Back to normal and set to a monitoring state.

07:55 - Brief outage on our applications while running a command to manually attach images to the cluster.

August 31 -

General Network slowness

07:10 - 15:01, July 1st

Status : Mitigated

Slowness and timeouts experienced on various services in the production environment.

15:01 - Network traffic is now entirely split and everything is back to normal.

12:38 - Mitigation has been launched.

09:38 - The issue is under investigation; we will conduct a brief restart of the main LB.

July 1st

Planned Maintenance

03:00 - 05:00

Status: Success

Maintenance planned on the database servers. Service shutdown for 1 hour and 30 minutes.

The service will not respond => HTTP 503

June 13th

Loyaltyoperator not available

6:30 - 7:11

Status : closed

Due to an issue in the latest deployment, Loyaltyoperator was using too many connections; this issue also briefly affected other apps.

The system team restarted the database server to mitigate the problem, and our team rolled back to the previous version.

June 2nd

General Network slowness

16:00 May 18 - 14:30 May 20

Status : closed

Slowness and timeouts experienced on various services in production and in other environments.

The issue is under investigation; we will conduct a brief upgrade of the main LB around 22:00.

Update May 19th: the LB is under a DoS attack from remote IPs. No data was compromised, but our main LB was dropping some HTTP calls and responding with timeouts. A new blacklist is in place, with more supervision tools and monitoring.

WAF rules updated
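
The blacklist works on the same principle as the sketch below: traffic from known offending addresses is rejected before it reaches the application. The sample network is a documentation range and purely illustrative; the real rules live in the load balancer / WAF configuration:

    import ipaddress

    # Hypothetical denylist; the real one is maintained on the LB / WAF.
    DENYLIST = [ipaddress.ip_network("203.0.113.0/24")]  # TEST-NET-3 documentation range

    def is_blocked(client_ip: str) -> bool:
        """Return True if the client address falls inside a denied network."""
        address = ipaddress.ip_address(client_ip)
        return any(address in network for network in DENYLIST)

    # Example: is_blocked("203.0.113.7") returns True.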

May 18th - May 20th

Planned Maintenance

05:00 - 06:30

Status: Canceled

Maintenance planned on the database servers. Service shutdown for 1 hour and 30 minutes.

The service will not respond => HTTP 503

Rescheduled for June 13th, from 3 am to 5 am.

May 2nd

Statistics database desynchronized

22:00 - 22:10

Due to the DB desynchronization, some delay in new group creation, document preview, and message sending may be experienced.

We will do a short reboot between 22:00 and 22:10 to create a new snapshot.

Due to the re-synchronization, some delay may have been experienced until Thursday morning, 10:00 am.

March 23

Short reboot of the statistics database server

10:18 - 10:22 Done

Following the OS upgrade last Monday, some ZFS parameters needed an update and a reboot.

Some page loads or connections may have failed during the reboot.

Jan 5th

Planned Maintenance

00:13 - 00:24 - Done

Maintenance planned in the main datacenter. Network shutdown for 15 minutes somewhere between midnight and 1 o'clock.

The service will not respond => expect some HTTP 503 errors

Dec 15th

Platform issue

21:51 - 22:13 - solved

A schema update on the main account table (aka Groups in the API) created a deadlock and led to a shortage of database connections on the first replica.

Our roadmap toward service isolation should, in the future, remove the impact of such schema updates.

Oct 13th

Platform issue

15:41 - 16:45 - solved

An internal tool, used by our team to copy data from one place to another, created around a thousand API calls in a short amount of time. This led to a major slowdown followed by database saturation. After the core process was cancelled, we restarted and increased the main API capacity in order to process all the remaining calls plus the external (client) calls.

Changes to our process and to this tool will be scheduled quickly to avoid such issues in the future.

Note: Web2Store components were affected a little longer, until 17:15.

July 13th

Message delivery delayed

solved

We are experiencing some delays in email and SMS delivery. Our team has identified the issue and a fix will be rolled out in the next few hours; all emails and SMS on standby will then be resumed.

This issue does not affect transactional emails, only scheduled campaigns.

July 6th

Network issue

4:00 - 7:42

The domain adelya.com was unreachable. A destroyed Azure resource was referenced in a network configuration file and blocked the start-up of the main load balancer.

Scheduled processes and campaigns were not impacted

Actions taken: a pre-check on the configuration is now enforced.
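
The pre-check boils down to verifying, before a network configuration is applied, that every Azure resource it references still exists. A minimal sketch driving the az CLI from Python, with a hypothetical list of resource IDs; the actual check is wired into our deployment process:

    import subprocess

    def resource_exists(resource_id: str) -> bool:
        """Return True if the Azure resource ID still resolves via the az CLI."""
        result = subprocess.run(
            ["az", "resource", "show", "--ids", resource_id],
            capture_output=True,
        )
        return result.returncode == 0

    def precheck_configuration(referenced_ids):
        """Refuse to apply a configuration that points at deleted resources."""
        missing = [rid for rid in referenced_ids if not resource_exists(rid)]
        if missing:
            raise RuntimeError(f"configuration references deleted resources: {missing}")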

Mar. 11th

Platform issue

11:08 - 12:20

Loyaltyoperator and the API were unavailable, probably due to a software defect in the promotion code. Our team is still investigating. Deactivating the feature seems to have solved the problem. We are deeply sorry for the inconvenience and are working hard to isolate the problem.

Feb 12th

Slowness Detected

9:30 - 13:30

The main shard has grown too large for some DB parameters and needs an update. We'll keep you informed. A restart may occur at some point today.

Finally it was something else: a booster for the opening discount period was incorrectly defined and too database-intensive. We have isolated the problem. Performance is back to normal, and we are now investigating the 'why'.

January 20th

Slowness Detected

15:30 - 16:00

Christmas is coming, and a lot of requests are reaching our services, for campaigns and loyalty alike, all at the same time.

We have had Azure VMs ready for this situation for a few days. We have activated this scalable solution to allow more requests to be fulfilled.
As API calls are used both by mobile devices and integrations, we have prioritized them.

December 10th

Network overload

11:45 - 12:04

The main datacenter experienced an external network overload. Mitigation is now in place.

October 15th

Switch failure in the datacenter - impact on the TLS 1.0 frontend

16:03 - 16:54

A switch failed in the datacenter; a spare was ready, but some routes were not functional.

Note: the old front end asp.adelya.com is impacted; the new front end (TLS 1.2) is functional.

July 20th

Slowness detected

14:22 - 15:23

A heavy data export was in progress, draining database and server capacity. Our team worked on mitigating its effects by throttling the calls.
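
Throttling here simply means spacing the export's calls out so they cannot saturate the database. A minimal client-side sketch of the idea; the pacing value and the call generator are hypothetical:

    import time

    def throttled(calls, max_per_second=10):
        """Yield items from `calls` no faster than `max_per_second`."""
        interval = 1.0 / max_per_second
        for call in calls:
            started = time.monotonic()
            yield call
            elapsed = time.monotonic() - started
            if elapsed < interval:
                time.sleep(interval - elapsed)

    # Example: for query in throttled(export_queries, max_per_second=10): run(query)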

March 09th