Cloud migrations can be as unpredictable as the rain in London. You can perform 4 successful test migrations, only to be faced with a totally random error on production day.
With the somewhat predictable yet still surprising announcement of the discontinuation of Atlassian Server, there has been much chatter in the Atlassian ecosystem from customers who are now starting to consider moving their instances to Cloud. Many will be tempted to attempt the migration on their own at first, but after running into a few of the obstacles I discuss in this article, they will soon come running to us, looking for some sort of deliverance from this nightmare. And I mean nightmare in the literal sense: I’ve woken up from so many of them over the course of my Cloud migrations so far. Not the Halloween you’ve been waiting for.
The size of attachments can really affect the time it takes to complete your Cloud migration. Atlassian recommends that we first split the attachments into 5GB chunks for the Media upload. This can be really time-consuming, especially for instances with 100+ GB of media. And that’s not counting the individual upload times, which are at the mercy of your unreliable home Wi-Fi.
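If you do end up splitting attachments by hand, the grouping step at least is easy to script. The sketch below is not an Atlassian tool: it only plans which files go into which chunk, assuming the attachments sit under a local directory (the path shown is a hypothetical, typical-looking Jira Server location; adjust it to your own install) and reading "5GB" as 5 GiB. Packaging each chunk in whatever form the Media upload expects is deliberately left out.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterator

# Assumption: "5GB" is read as 5 GiB; lower this if your upload limit is decimal GB.
CHUNK_LIMIT_BYTES = 5 * 1024**3


def iter_files(root: Path) -> Iterator[Path]:
    """Yield every regular file under root in a stable, sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path


def plan_chunks(files: list[tuple[str, int]], limit: int = CHUNK_LIMIT_BYTES) -> list[list[str]]:
    """Greedily group (name, size_in_bytes) pairs into chunks of at most `limit` bytes.

    Files keep their input order. A single file larger than `limit` ends up
    alone in its own (oversized) chunk, since files are never split here.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in files:
        # Close the current chunk if adding this file would exceed the limit.
        if current and current_size + size > limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical attachments directory; adjust to your own Jira home.
    root = Path("/var/atlassian/application-data/jira/data/attachments")
    if root.is_dir():
        files = [(str(p), p.stat().st_size) for p in iter_files(root)]
        for i, chunk in enumerate(plan_chunks(files), start=1):
            print(f"chunk {i}: {len(chunk)} files")
```

A simple in-order greedy fill keeps related attachments (which live in per-project folders) close together; it does not try to minimise the number of chunks.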
Pro-tip: If your on-premise instance runs on a PostgreSQL database, ask Atlassian for the Cloud Uploader tool. This app lets you synchronize all attachments from your on-premise instance directly to Cloud at the click of a button. It’s really reliable, doesn’t require you to split attachments manually, and you don’t need to download the attachments to your work laptop first. Happy days.
This can be caused by a variety of factors, one of them being this bug (which is strangely resolved as Fixed, but is actually still affecting Cloud migrations involving fixed-tier licensed instances). The cause of the bug itself is not clear, though the workaround is straightforward:
Suggestion to Atlassian: Fix simple bugs such as this one, as the licensed seats were clearly sufficient once all users were reactivated. Also, if the licensed seats genuinely are insufficient, don’t fail the migration; instead, throw a warning and disable site access for the remaining affected users. Customers usually have a short production migration window, and getting the error only after 3 hours of import can be quite frustrating.
According to Atlassian Support, the error I’m talking about here doesn’t actually have a root cause and is supposed to be “intermittent”. I’ve found a way to reproduce it reliably, though, even with an empty backup (a backup taken from a freshly installed Jira Server instance, with just one user and no projects).
My theory is that this is caused by the amount of data already present in the Cloud instance prior to the restore. When restoring an instance, you usually have the option to either Merge or Overwrite the users in the instance.
When we choose to overwrite (which is what we usually do in our migration work), the Cloud import process has the extra burden of removing all existing users in the instance. This can be a really time-consuming process, and it often results in this error.
This error only occurs at the end of the import, so for a particularly large import it may not surface until 8 hours into the migration. Not very good news for those with a short migration window. The only way to recover at that point? Re-migrate.
Pro-tip: Trigger this error early by first importing an empty Server backup, so that the empty import takes on the work of removing the existing Cloud users and you hit the error long before the 8-hour mark of a full migration. The subsequent full import will then have a much higher chance of success. I can’t guarantee 100% success because, well, it’s Cloud. Also, plan for an outage window that is at least twice the migration time you measured during testing (if your test import took 4 hours, book at least 8). This gives you extra time to re-import should this method fail.
Cloud migration can be as unpredictable as the rain in London. You can perform 4 successful test migrations, only to be faced with a totally random error on production day. This is one of them. It’s exactly why we usually add an extra line item for “unpredictable issues” to our Cloud migration work estimates.
The Dead state is, as the name suggests, even deadlier than the error I mentioned earlier in Issue 3. This one does not allow you to re-migrate at all. At this point the instance is essentially “dead”, with an import thread stuck in a dead state. At the time of publishing, this can only be corrected by the Atlassian Support team in Sydney. So if you are in the EMEA region, well, boo-hoo: you will need to wait until Sydney comes online to reset the instance before you can re-migrate.
The silver lining is that this issue is really rare, and 99% of you will probably never face it. Turns out I’m the lucky 1%.
The following is relevant only to those performing a “merge + migrate” (essentially merging an on-premise instance with a Cloud instance). Unless you use a tool such as the Jira Cloud Migration Assistant, this process usually overwrites the existing Cloud data, and that tool currently does not support Jira ServiceDesk projects.
The Cloud backup isn’t a full backup. It does not contain any app data, so if you treat it as one, you will probably lose a lot of configuration and customizations.
Atlassian requires that you check with the vendors of every app you currently have installed in Cloud to find out whether their data will remain intact after you restore this backup. This is time-consuming, and I can tell you from first-hand experience: most data will be lost.
Pro-tip: If your on-premise instance only has Jira Software/Core projects, use the Jira Cloud Migration Assistant app to migrate. Otherwise, if a full XML restore is the only option, ask Atlassian to have a support engineer on standby to restore the site to the last known good snapshot if things really do go south. This needs to be planned well in advance of the production migration. Atlassian cannot guarantee exactly when the snapshots are taken, but they can guarantee that one is taken daily.
I’ve never dealt with anything more sensitive than Cloud user checks in my life. Your users have a “+” symbol in their email address? Rejected. More than one user shares the same email address? Rejected.
Suggestion to Atlassian: Warn users early if such problems exist in the import file. Don’t let the import proceed only to fail 3 hours later with user-related issues. The fix also shouldn’t require admins to correct the data in the source instance, re-export, and re-import; that is very time-consuming. It’s high time we had a feature to merge users on the fly during import.
Also, fix the email address regex. The “+” symbol, for example, is used in a lot of email scenarios (email aliases are usually created with a “+” symbol). Why is it allowed on-premise but not in Cloud? The regex should be more forgiving.
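Until these checks are relaxed, a quick pre-flight scan of your user list can catch both rejections before you start a multi-hour import. This is a minimal sketch, assuming you can export usernames and emails from the Server instance to CSV with `username` and `email` columns (a hypothetical format; produce it however suits you, e.g. from a user report or database query). It checks only the two rules described above, not Atlassian’s full set of Cloud validations, and it compares emails case-insensitively, which is our assumption rather than a documented rule.

```python
from __future__ import annotations

import csv
import io
from collections import defaultdict


def load_users(csv_text: str) -> list[tuple[str, str]]:
    """Parse a hypothetical user export with 'username' and 'email' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["username"], row["email"]) for row in reader]


def find_email_issues(users: list[tuple[str, str]]) -> tuple[list[str], dict[str, list[str]]]:
    """Flag the two user problems described above, before you export.

    Returns (usernames whose email contains "+",
             {normalized email: usernames sharing it} for emails used more than once).
    Emails are compared case-insensitively (an assumption, not a documented rule).
    """
    plus_users = [name for name, email in users if "+" in email]
    by_email: dict[str, list[str]] = defaultdict(list)
    for name, email in users:
        by_email[email.strip().lower()].append(name)
    duplicates = {email: names for email, names in by_email.items() if len(names) > 1}
    return plus_users, duplicates


if __name__ == "__main__":
    sample = (
        "username,email\n"
        "jane,jane+jira@example.com\n"
        "bob,bob@example.com\n"
        "rob,Bob@example.com\n"
    )
    plus, dupes = find_email_issues(load_users(sample))
    print(plus)   # ['jane']
    print(dupes)  # {'bob@example.com': ['bob', 'rob']}
```

Fixing the flagged accounts in the source instance before exporting should at least spare you these two particular failures three hours into the import.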
You perform an import, and 4 hours later it fails with a generic error:
“The upgrade manager returns a non-successful result from running the upgrades”
What do you do?
Sadly, the only way to find out what actually happened is to raise a ticket with Atlassian Support and have them identify the root cause in the logs.
Suggestion to Atlassian: Perhaps letting end users download the import logs isn’t such a bad idea after all? It would definitely reduce the number of tickets raised, too.
Pro-tip: Raise a MOVE ticket early, even when you are just planning for a migration. Atlassian will assign a dedicated engineer to help you navigate through issues along the way, including identifying errors in the logs quickly.
Atlassian Cloud migrations are not as straightforward and easy as you might imagine. The issues mentioned here only scratch the surface of the many problems you may encounter. We can take that off your plate: just contact us at ServiceRocket, and we will help you navigate what is perhaps the riskiest task to perform on an Atlassian product.
Check out our webinar on the challenges of migrating to cloud and how to ensure a successful migration.