A fair while ago I went through a data migration project and a few days ago I chatted with one of our architects to go through the various steps and the solution that I had picked up and
developed further. I think he needed the information for another data migration project he’ll be on.
As I went through it with him, I brought up the various data migration gotchas and issues we ran into. I now realise these can be useful to a larger number of people working on data migration projects. I heard these were quite common, mostly for companies that start a digital transformation. A data migration project will generally be about building a solution that will allow you to extract your legacy data, to transform it and eventually load it into your new system storage.
Surprisingly, I only worked on one data migration project since I jumped in the software world twelve years ago. It feels like yesterday that I was struggling to learn SQL. This one quite interesting you revolve exclusively around using procedural SQL on both Oracle and MariaDB. You will guess for yourself which one was the legacy system and which was the new shiny target. But this is not what today is about, today is about the biggest gotchas that I believe delayed our delivery. Views are my own, facts are for everyone, yada yada yada.
Using SQL scripts as a primary tool
I am actually starting with one gotcha that I didn’t highlight to my colleague yesterday that came back to my mind this morning. Don’t get me wrong, SQL is a powerful tool to retrieve and display data. However, when you have a team of several developers working on a codebase, making sure your changes integrate well with others is key.
The problem is that to validate different scenarios, we couldn’t just run classic unit tests for a few seconds or a couple of minutes. We could only run the actual migration, I wouldn’t call these integration tests because the environment would be different from the actual one (more on that later).
We had to spin up docker images then load dummy data for each scenario we prepared to face with the real data. I think our Jenkins builds at one point took up to 2-3 hours to complete. This makes local development harder as nobody would want to apply changes for 5 minutes to run tests for 2-3 hours, at first we moved towards running only the tests we were interested in. That painfully slow CI even inspired a post of mine from last year. Eventually, we got it down to 40ish minutes, still slow but probably the best we could do considering what we were dealing with.
Now, I am not speaking from experience but a couple of architects I discussed with while and even after that project told me that using an actual programming language would have saved us from that pain. There you can test any component for every bit of retrieval, transformation and load in seconds. Then have a single integration test with your happy path scenario. We could have had CI builds running in under a minute. Lots of time that could have been saved.
Mismatching source and target fields
Mismatches were such a big pain point. I do not mean mistakenly pushing data from a legacy field to the wrong target field. I mean pushing data exactly where it should be but the target field has the wrong type. Due to the sensitive nature of the data, there was no getting our hands on the real data while developing the solution.
These issues only exposed themselves during production runs. You can have source fields that are strings matched with target fields that were integers. When all the test data contains numbers for that field it’s okay. But when you get even a couple of entries out of millions that turn out to have letters, all hell breaks loose. Other times data would be truncated because the target database would have target fields smaller than the ones there are meant to replicate from the source system. This one is not so much a data migration issue as we were not responsible for the target system design but that actually forced us to go and fix it as we were delivering that data migration solution. So yeah, not optimal.
The takeaway here would be, if you’re building a newer version of a system, make sure that the new database fields will actually match the format and type of the source data. We can’t afford to truncate addresses or phone numbers. Especially when the system requires that information.
Lack of boundaries with other teams
My team focus at that time is data migration. We build a solution to move data from one place to another. However, as mentioned above, we occasionally had to fix a target database built by other teams for various features. On top of that, I’m not sure how nor why we became the test data providers for other teams. Basically, instead of putting together test data to test their features, these teams would contact us to give them random database dumps.
It is quite silly in retrospect. So we had a class in the testing framework we built that we used to generate data. Data that we pushed to the source databases while we developed, then ran through the migration process to extract, transform and load it to the target database. Then extract a data dump from that target system to send it to them. It made sense to us because we didn’t want to craft target data when it wasn’t our role. However, I believe our boundary was too liberal. We should have put the bound at “you should craft your test data” stop there.
While it is good to help people where you can, you can’t do that to the point where you can’t do your own work. In the end, we were responsible for three massive parts of that project: data migration, fixing the target database and generating test data for everyone.
Different environment setups
I remember that back then, I didn’t think much of different setups between the various deployed environments. From dev to staging to production there were quite a few differences, which obviously we paid a price for down the line. Having different versions of Oracle DB or MariaDB shouldn’t be a big deal right? What about when the difference between a version and the next breaks all your SQL scripts. Like having to replace VALUES to VALUE.
Imagine the pain of getting your migration tool working locally, then push it through a slow CI process. Next, you deploy it to an environment, running your migration process and checks. All this for it to break in production because the MariaDB instance in production is older. Additionally, it was set up on an EC2 instance when the one in the staging environment was an RDS.
Working on a project where all the various environments you use for development are identical to production the output difference screams at me. You avoid a host of pains due to fix your code to work with all your integration environments. There is something wrong if the environments that should prove your solution works in production are nothing like production. Definitely the biggest learning I got from that experience.
I will carry the lessons I learned through that old project for the rest of my life. I might even revisit this post to make sure I do not forget as these will be very useful the next time I work on data migration. Better yet, some of these can apply to more than just data migrations.
Even though I did not get to pick the tool in this case, this experience enforced my belief in picking the best tool for a job. While it is good to trust the information you get, it won’t hurt to have a look and confirm yourself. Sometimes it is as quick as a SQL query.
Next, make sure that wherever possible the environments you use to develop a solution match the production environments. You will avoid a host of integration issues.
And finally, when you have clearly defined responsibilities, avoid picking up new ones that will become a liability for your core work. Sergio Ramos isn’t the best defender in the world because he scored more than Firmino this season. He is the best because he actually does his defending work first which he excels at, then occasional goes and score a goal.