Missing data for some tables in Atlassian Analytics

Incident Report for Atlassian Analytics

Postmortem

Summary

On April 15, 2024, between 2:00 and 3:00 AM UTC, Atlassian customers using Atlassian Analytics in the us-east-1 region encountered inconsistencies when querying jira_issue data. Subsequent investigation revealed that a bug in our workflow during an internal data migration on the Data Lake caused a backlog of migrated data to accumulate, leading to incomplete data being presented to customers.

Our Data Lake monitoring system can usually identify such issues within 30 minutes, but the high volume of migrations in a short timeframe prolonged the processing time. This led to an incident escalation. The issue was resolved by scaling up infrastructure to process the backlog of data.

Subsequently, additional data inconsistencies were discovered across the Account, DevOps, and Jira tables. The impacted data was restored from the source and reprocessed into the Data Lake. Despite our team's diligent efforts, resolution took longer because of the extensive scale of the affected data.

We are taking remedial actions to enhance data quality checks, improve tooling for quicker recovery, and adopt progressive deployments to minimize widespread impact.

IMPACT

Between April 15, 2024, at 02:00 UTC and April 20, 2024, at 21:00 UTC, some Atlassian Analytics customers experienced service degradation. This led to inconsistent and incomplete data for the Account, DevOps, and Jira tables in their Atlassian Analytics dashboards.

ROOT CAUSE

The problem arose due to a change made to enhance the internal data partitioning structure, aimed at improving performance for upcoming features. The fundamental issue stemmed from a bug in the workflows, causing data to be presented to customers before the accumulated backlog had been processed. Consequently, users of Atlassian Analytics encountered incomplete or missing data in their dashboards.

Although comprehensive testing was conducted prior to deployment in the production environment, this problem arose as an exceptional scenario when dealing with significantly larger volumes of migrated data, resulting in customers seeing the migrated data prior to the completion of data processing.
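The failure mode described above (migrated data becoming visible before its backlog was processed) can be illustrated with a minimal sketch of a publish gate. This is not the actual Data Lake workflow: `MigrationStatus`, its fields, and the "migrated"/"previous" view names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    """Hypothetical progress record for one migrated table."""
    table: str
    records_expected: int   # records the migration should produce
    records_processed: int  # records processed so far


def backlog(status: MigrationStatus) -> int:
    """Number of migrated records still waiting to be processed."""
    return max(status.records_expected - status.records_processed, 0)


def can_publish(status: MigrationStatus) -> bool:
    """Allow the migrated data to be shown only once the backlog is empty."""
    return backlog(status) == 0


def select_view(status: MigrationStatus) -> str:
    """Choose which copy of the table customer queries should read."""
    return "migrated" if can_publish(status) else "previous"
```

With a gate of this kind, a table whose migration is still in progress keeps serving the previous copy, so a larger-than-tested backlog delays the switch instead of exposing partial results.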

Subsequent validation revealed that further inconsistencies were caused by another bug in the compaction process of raw data, resulting in the selection of incorrect versions of records. This erroneous data was utilized in the aforementioned partition update process, contributing to the inconsistencies.
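To make the compaction issue concrete, the sketch below shows what selecting the correct version of each record could look like. The report does not describe how record versions are ordered, so the `RawRecord` type and the assumption that the highest `version` number is the correct one are purely illustrative.

```python
from typing import Dict, Iterable, NamedTuple


class RawRecord(NamedTuple):
    """Hypothetical raw record; field names are assumptions."""
    key: str      # identifier of the logical record
    version: int  # assumed to increase with every update
    payload: str


def compact(records: Iterable[RawRecord]) -> Dict[str, RawRecord]:
    """Keep only the highest-version record for each key."""
    latest: Dict[str, RawRecord] = {}
    for record in records:
        current = latest.get(record.key)
        # Replace the stored record only when a strictly newer version arrives,
        # so the result does not depend on the order of the input.
        if current is None or record.version > current.version:
            latest[record.key] = record
    return latest
```

A compaction step that instead kept, for example, the last record it happened to read would return stale versions whenever the raw data arrived out of order, which is one way incorrect versions could be selected.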

The solution involved replicating all affected data from the source and reprocessing it within the Data Lake.

REMEDIAL ACTION PLAN & NEXT STEPS

We understand that data inconsistencies can significantly affect your productivity.

To prevent this kind of incident from happening again, we plan to focus on the following measures:

  • Enhance our existing data quality tests and expand their scope to identify these issues earlier.
  • Enhance our tools to ensure quicker recovery in similar future incidents.
  • Progressively deploy (by cloud region) our changes to minimize widespread impact.
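As a rough illustration of how the first and third measures could work together, the sketch below deploys a change one region at a time and stops at the first region that fails a data quality check. It is a simplified assumption, not a description of our tooling: the row-count comparison, the function names, and the callback signatures are all hypothetical.

```python
from typing import Callable, Dict, List


def mismatched_tables(source_counts: Dict[str, int],
                      lake_counts: Dict[str, int]) -> List[str]:
    """Return the tables whose Data Lake row count differs from the source."""
    return sorted(
        table for table, expected in source_counts.items()
        if lake_counts.get(table, 0) != expected
    )


def progressive_rollout(regions: List[str],
                        deploy: Callable[[str], None],
                        failing_tables: Callable[[str], List[str]]) -> List[str]:
    """Deploy region by region and halt at the first failed quality check.

    Returns the regions that were deployed and passed the check.
    """
    completed: List[str] = []
    for region in regions:
        deploy(region)
        if failing_tables(region):
            # Stop here so the remaining regions never receive the change.
            break
        completed.append(region)
    return completed
```

In this sketch, a discrepancy found in the second region stops the rollout there, so later regions keep their existing data until the problem is understood.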

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s reliability and data accuracy.

Thanks,

Atlassian Customer Support

Posted May 14, 2024 - 08:33 UTC

Resolved

This incident has been resolved.
Posted Apr 20, 2024 - 21:14 UTC

Update

All data issues have been resolved and validated. Closing the updates.
Posted Apr 20, 2024 - 21:14 UTC

Monitoring

Most of the DevOps data has been restored. We are continuing to monitor the pipelines. The next update will be in 12 hours.
Posted Apr 20, 2024 - 09:01 UTC

Update

We are continuing to monitor restoration of data in the DevOps tables. The next update will be in 8 hours.
Posted Apr 19, 2024 - 13:25 UTC

Update

The account, jira_issue, jira_sprint, teams, and group tables are fixed. We are continuing to address issues with the DevOps tables. The next update will be in 8 hours.
Posted Apr 19, 2024 - 06:20 UTC

Update

The account, jira_issue, and jira_sprint tables are fixed. We are continuing to address issues with the DevOps tables, teams table, and group tables. The next update will be in 8 hours.
Posted Apr 19, 2024 - 00:53 UTC

Update

The account and jira_issue tables are fixed. We are also addressing partial data in a few other tables including the DevOps tables and Jira Sprint table. The next update will be in 8 hours.
Posted Apr 18, 2024 - 17:06 UTC

Update

Most of the data for the jira_issue table is fixed in all regions. We are finalising the verifications and continuing to monitor the pipelines as an additional measure. The next update will be in 8 hours.
Posted Apr 18, 2024 - 10:13 UTC

Update

We are continuing to process data to fix the jira_issue table. The next update will be in 8 hours or when the job completes, whichever comes first.
Posted Apr 17, 2024 - 22:12 UTC

Update

We have fixed the issue with the account table. The mitigation for the jira_issue table is still in progress. We will share the next update in 8 hours.
Posted Apr 17, 2024 - 13:15 UTC

Update

We have identified the root cause of the new issue and started the mitigation process. We will share a progress update within the next 6 hours.
Posted Apr 17, 2024 - 01:44 UTC

Update

The data processing has completed; however, we have identified an additional issue and are working on a mitigation. We will share a progress update in 6 hours.
Posted Apr 16, 2024 - 17:25 UTC

Update

We have scaled up the infrastructure to process the backlog of data and are continuing to monitor the progress of the data processing job. We will share a progress update in 12 hours.
Posted Apr 16, 2024 - 05:05 UTC

Update

We are continuing to monitor the progress of the data processing job and expect to have a better estimate of its completion in 4 hours. At this time, the primary impacted tables are jira_issue and account, which are showing partial results for some customers.
Posted Apr 15, 2024 - 20:57 UTC

Update

We have applied a fix and have scaled our infrastructure to help process the backlog of data.
Posted Apr 15, 2024 - 19:43 UTC

Identified

We have identified an issue stemming from a maintenance job that is causing issues for some tables and are working on a fix. While the jira_issue table appears the most impacted, some other tables may experience similar issues.
Posted Apr 15, 2024 - 18:39 UTC

Investigating

We are currently investigating an issue with the jira_issue table where some customers are seeing partial results when querying data via Atlassian Analytics.
Posted Apr 15, 2024 - 17:24 UTC
This incident affected: Atlassian Data Lake.