How To Migrate Millions of Records to CRM in Hours (Rather Than Days) Using Scribe
One of the criticisms that Scribe Insight sometimes gets is that it's slow. While it's true that Scribe is not the speediest ETL tool out there, much of the blame for sluggishness can be attributed to how the DTS (Data Translation Specification) is designed.
One rule of thumb is that every trip to the source or target slows down the process. That means that dblookups and seek steps, while useful tools, slow the process down. Likewise, updates and update/inserts hit the target two to three times per record (first to check whether the record exists, then to perform the write), so updates are slower than inserts.
Here's an example of what I'm talking about: I integrated a prospect database with Microsoft Dynamics CRM. The ongoing integration would insert or update a couple hundred records per day, but to get started, we first had to load the existing prospect data: roughly 2.5 million records.
I wrote the DTS for the ongoing integration with an update/insert step, since the client would be receiving updated information for existing prospects, and we wanted the process to update a record if it exists and insert it if it doesn't. This DTS also contained a seek step and two dblookups to populate lookup fields in CRM from other CRM entities.
When I tested the DTS, it imported 10 records every 3 seconds. While that performance would be satisfactory for the ongoing integration volume, it would be far too slow for the initial migration of the legacy data: at roughly 3.3 records per second, 2.5 million records works out to about 750,000 seconds, or more than 8 days.
Here’s what I did to cut the migration of the legacy data down to a reasonable amount of time:
1. I copied the legacy data to a table in the ScribeInternal database, and made ScribeInternal the source for my migration DTS. This made it a shorter trip to get the data (see the first sketch after this list).
2. I moved the lookups and seeks to the source query. Instead of doing a lookup against the account table in the DTS, I joined the filteredaccount view from CRM into my source query, and removed the dblookup and seek steps from my DTS. Now, instead of having to seek for the ID of a related record, I have the GUID in my source data and write it directly to the accountid field (see the second sketch after this list).
3. Since the legacy data contains only unique records that do not yet exist in CRM, I made the migration DTS a plain insert step rather than an update/insert.
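Here's a minimal sketch of step 1. The linked server, source database, and table names (LEGACYSRV, ProspectDB, Prospects, LegacyProspects) are hypothetical; the point is just to stage the data on the same SQL Server that Scribe runs against:

```sql
-- Stage the legacy data in SCRIBEINTERNAL so the migration DTS reads
-- locally instead of reaching across the network for every batch.
-- LEGACYSRV, ProspectDB, dbo.Prospects, and dbo.LegacyProspects are
-- hypothetical names for this illustration.
SELECT *
INTO   SCRIBEINTERNAL.dbo.LegacyProspects
FROM   [LEGACYSRV].ProspectDB.dbo.Prospects;
```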
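And a sketch of the source query from step 2. FilteredAccount is the standard CRM filtered view; the CRM database name (Org_MSCRM) and the join key (an AccountNumber column in the legacy data matching CRM's accountnumber field) are assumptions for illustration:

```sql
-- Resolve the related account's GUID in the source query, replacing the
-- dblookup and seek steps in the DTS. Org_MSCRM and the AccountNumber
-- join key are assumptions for this illustration.
SELECT p.*,
       fa.accountid  -- GUID mapped straight to the CRM lookup field
FROM   SCRIBEINTERNAL.dbo.LegacyProspects AS p
LEFT JOIN Org_MSCRM.dbo.FilteredAccount AS fa
       ON fa.accountnumber = p.AccountNumber;
```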
The result is that the DTS now hits the source and the target only once apiece. The difference in performance is pretty astounding: the migration DTS now moves 50 records per second, turning a job that would otherwise have taken days into one that takes hours (2.5 million records at 50 records per second is roughly 14 hours).
The final thing you can do is thread your job. As you may know, jobs run through XML message queues are threaded, meaning multiple operations happen simultaneously, but jobs run through the Workbench are not. If your ongoing integration will carry a high data volume, using message queues is a good way to go.
However, setting up an XML-based DTS for a one-time migration adds a level of complexity. Another option for threading your jobs is to break your migration DTS into multiple DTSs and run multiple instances of the Workbench simultaneously. In this case, I made a copy of my migration DTS: in the first DTS, I filtered the source query to records created before 1/1/2008; in the second, to records created on or after 1/1/2008. I then ran them at the same time.
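As a sketch, the only difference between the two copies is the date filter on the source query (the CreatedDate column name is an assumption):

```sql
-- Source query for DTS #1: records created before 1/1/2008.
SELECT * FROM SCRIBEINTERNAL.dbo.LegacyProspects
WHERE  CreatedDate < '2008-01-01';

-- Source query for DTS #2: records created on or after 1/1/2008.
SELECT * FROM SCRIBEINTERNAL.dbo.LegacyProspects
WHERE  CreatedDate >= '2008-01-01';
```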
You are still constrained by the resource limits of your server and the CRM web services, but by following these steps you can get your migration processes to run in a reasonable amount of time without having to resort to unsupported methods like direct table updates.