On April 15, 2024, between 2:00 and 3:00 AM UTC, Atlassian customers using Atlassian Analytics in the us-east-1 region encountered inconsistencies when querying jira_issue data. Subsequent investigation revealed that a bug in our workflow during an internal data migration on the Data Lake caused a backlog of migrated data to accumulate, leading to incomplete data being presented to customers.
Our Data Lake monitoring system can usually identify such issues within 30 minutes, but the high volume of migrations in a short timeframe prolonged the processing time. This led to an incident escalation. The issue was resolved by scaling up infrastructure to process the backlog of data.
Subsequently, additional inconsistencies in data were discovered across Account, DevOps, and Jira tables. The impacted data was reinstated from the source and reprocessed into the Data Lake. Despite our team's diligent efforts, resolution took longer than expected because of the extensive scale of the affected data.
We are taking remedial actions to enhance data quality checks, improve tooling for quicker recovery, and adopt progressive deployments to minimize widespread impact.
Between April 15, 2024, at 2:00 AM UTC and April 20 at 9:00 PM UTC, some Atlassian Analytics customers experienced service degradation. This led to inconsistencies and incomplete data for the Account, DevOps, and Jira tables in their Atlassian Analytics dashboards.
The problem arose due to a change made to enhance the internal data partitioning structure, aimed at improving performance for upcoming features. The fundamental issue stemmed from a bug in the workflows, causing data to be presented to customers before the accumulated backlog had been processed. Consequently, users of Atlassian Analytics encountered incomplete or missing data in their dashboards.
Although comprehensive testing was conducted before deployment to the production environment, the bug surfaced only with significantly larger volumes of migrated data than those tested. As a result, customers saw the migrated data before data processing was complete.
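As an illustration of the kind of safeguard this failure mode calls for, the following is a minimal sketch of a publish gate that keeps a migrated partition hidden until its backlog is fully processed. It is not our actual workflow code; the `PartitionStatus` type, its fields, and the function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class PartitionStatus:
    """Hypothetical progress record for one migrated partition."""
    name: str
    expected_batches: int   # batches the migration planned to write
    processed_batches: int  # batches fully processed so far


def is_safe_to_publish(status: PartitionStatus) -> bool:
    # Expose a partition only once its whole backlog has been processed.
    return status.processed_batches >= status.expected_batches


def publishable(statuses: list[PartitionStatus]) -> list[str]:
    # Names of partitions that may be shown to customers right now.
    return [s.name for s in statuses if is_safe_to_publish(s)]


statuses = [
    PartitionStatus("jira_issue", expected_batches=10, processed_batches=10),
    PartitionStatus("account", expected_batches=8, processed_batches=5),
]
ready = publishable(statuses)  # only fully processed partitions are listed
```

In this sketch, a partition with an outstanding backlog is simply withheld, so readers continue to see the previous complete version rather than partially migrated data.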
Subsequent validation revealed that further inconsistencies were caused by another bug in the compaction process of raw data, resulting in the selection of incorrect versions of records. This erroneous data was utilized in the aforementioned partition update process, contributing to the inconsistencies.
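To make the compaction issue concrete, here is a minimal sketch of what correct version selection looks like: for each record key, compaction should keep the highest version regardless of the order in which raw records arrive. This is an illustrative example only, not our production compaction logic; the `id` and `version` field names are assumptions made for the sketch.

```python
def compact(records: list[dict]) -> list[dict]:
    """Keep only the highest-version record for each id.

    The 'id' and 'version' keys are hypothetical field names.
    """
    latest: dict[str, dict] = {}
    for rec in records:
        key = rec["id"]
        # Compare versions explicitly; never rely on arrival order.
        if key not in latest or rec["version"] > latest[key]["version"]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r["id"])


raw = [
    {"id": "ISSUE-1", "version": 2, "status": "Done"},
    {"id": "ISSUE-1", "version": 1, "status": "Open"},   # older, arrives late
    {"id": "ISSUE-2", "version": 1, "status": "Open"},
]
compacted = compact(raw)
```

A compaction step that instead kept the last record seen would select version 1 of `ISSUE-1` here, which is the kind of incorrect version selection described above.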
The solution involved replicating all affected data from the source and reprocessing it within the Data Lake.
We understand that data inconsistencies can significantly affect your productivity.
To prevent this kind of incident from happening again, we plan to focus on the following measures:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s reliability and data accuracy.
Thanks,
Atlassian Customer Support