The primary caveat we have found with such systems is that we must provide a significant amount of very specific data with our deployments...

Managing Data During Deployments

As part of our consulting services we develop applications that depend on data to perform their duties. These applications are commonly data driven. We do that to help the systems we build be flexible in their capabilities. In general, this is a preferred way to build systems in that they become powerful and more useful for our clients. The primary caveat we have found with such systems is that we must provide a significant amount of very specific data with our deployments in order to have the systems behave as expected. Over time and with growing systems this has become a time consuming chore. For example, we have one database that is over one hundred and fifty (150) tables. About eighty (80) of those tables require data we as a development team provide. Including that data with each deployment has become a difficult task for which we have attempted various solutions. We have finally settled on one we think will serve us well for some time to come. I’d like to share the process with you.

Let me start by giving credit where credit is due. We tried a couple of third party tools to do this without ultimate success. After a few tries I asked one of our consultants, Richie Rump, to help find a solution that would work in our environment. We track our database definitions in Visual Studio using the SQL Server Data Tools. These are kept in a Visual Studio project. In addition, for some of our projects we generate objects that represent and manage our data for each table in our system. Those objects are generated with SQL Server Stored Procedures as part of their data manipulation definition. We use the Post Deployment script capabilities of the database project to update the stored procedures each time we make changes to the database definition. The solution we came up with had to run after the stored procedure generation had completed. Our solution runs as a set of T4 templates that are managed within the database project and run after the stored procedures are added to the database.

We have three additional requirements for this process:

  1. Gather Existing Data
  2. Deploy Prime Data
  3. Deploy Seeded Data

Gathering Existing Data

Usually, while doing development, we will add new reference data to the database directly. A developer would add the data, make the functionality changes and test his work. Once satisfied (s)he would need to package the data as part of the code check in so that our QA team and clients can see the changes, test them and approve the work. Given that the data would need to be stored in some text file format for tracking purposes in our code repositories and that a human might want to make changes in a data file directly we settled on xml as the file format to keep the data in. That gives us a source and a target for the data we need to track. We then defined a separate xml file that serves as a dictionary of tables we need to gather at any one time. This also allows us to set some options for the various aspects of the work. Each table in the database that we need to track has an entry in this dictionary file. The dictionary file looks like this:

The listing of the table is the important piece when it comes to gathering the data. Currently, the only attributes we are using are the table name and the index name. The other options are used later in the process. For simplification we copy the lines from one dictionary file to the other. More on that later. Once a table is listed in this file, if we run our template we will have the data we seek in an xml file named after our table. The files are stored in a folder named "Reference Data".


A common lookup table's reference xml file may look like this:


It would have been gathered by running the "1-Gather Reference Data" T4 template through the Run Custom Tool option when you right click the template file in Visual Studio. All tables listed in the Gather Data Metadata.xml file will be gathered when the process is run.

Once we have gathered the data we need we are ready to deploy it. One thing to note, we treat Prime data different than Seed data during our deployment process but we make no distinction in the gathering process.

For clarity, we define Prime data as data that the development team controls and will not be modified by anyone else.
Conversely, Seed data may be modified by someone outside of the development team.

Preparing for Deployment

Due to the reality that Seed data may be modified by someone outside of our control we needed to figure out how to get any new data merged properly with any existing data. This creates a number of challenges. In our situation, we have some tables that will need to be treated differently than other tables. With that in mind we created a few options that help us define how to treat those tables. The management of all of that happens when we generate the SQL script that will be used to deploy the data. The options are defined in a parallel data dictionary we keep as an xml file named "Create Merge Metadata.xml". The format of the file is the same as the one listed above. We keep two copies in different folders because they are used differently. With the "Gather Data Metadata.xml" file we only uncomment any table we want to gather from and keep the others commented out. With the "Create Merge Metadata" file we keep all of the tables uncommented so they are included in the generated SQL script file each time we run the "1-Run to Generate Merge Statements" T4 template using the Run Custom Tool right menu option.

Merge Statement Options

Options are tracked as attributes of each table entry in the data dictionaries. Here is a breakdown of what they are and how they are used.

name The name of the table. Defines the table to gather or merge data for. Must match the actual table name.
type "prime", "seeded" Defines data treatment. Used to manage which process to use for the current table.
forcepk "true", "false" Will the Primary Key be forced on update? Always true for Prime data. Usually false for Seeded data unless we must keep certain values.
indexkey The name of the index to use for matching rows. Defines how to match a row. Usually the Primary Key for a Prime table and an Alternate Key for Seeded data.
skipupdate "true", "false" Do you want to update all columns for Seeded data? There are situations when we could not update existing rows of data.

The templates will do the right thing as each of these options are set for the tables. For Prime data the forcepk, and skipupdate options are ignored. For Prime data we automatically overwrite the whole table. For Seed data we update the existing data. In the case of skipupdate being true all we really do is add new rows.

The Generated Output

As mentioned, the output of this process is an SQL script that provides the ability to deploy the data as defined by the options for each table listed above. In order to make that happen they use a SQL Server MERGE statement for each table in the Deployment data dictionary. The file may take a few seconds to generate. In our case it takes up to 30 seconds. We have the file automatically placed in the Scripts/Post-Deployment folder. From there we call the file using the :r option from the Script.PostDeploymnet.sql file. Another way to do that would be to create a task in your build process that would automatically run the file.

Running this script is what gets the reference data automatically deployed into each database instance on every deployment of the database. For us this has worked really well.

Project File Organization

All of the templates used are in two folders. For further organization we place them in a Templates folder. As you can see from the image below some of the templates mirror each other. We tried to keep as much of the process consistent with their counterparts. The files in the 1-Gather Reference Data folder should be run first to gather the data into the xml files. The files in the 2-Create Merge Statement folder are used to create the Merge SQL statements for each table.

Data Merge Project Files Image

File NameFolderPurpose
1-Run to Gather XML 1-Gather Reference Data The template to execute when you want to gather reference data.
Create XML Data 1-Gather Reference Data Support file for gathering data. Not to be run independently.
Create XML 1-Gather Reference Data Gathers the data and prepares it for SQL generation. Not to be run independently
Gather Data Metadata.xml 1-Gather Reference Data The data dictionary for the data to gather when 1-Run to Gather XML Data is executed. 1-Gather Reference Data This file should be modified for your environment. It defines the data connection to use when running the 1-Run to Gather XML template.
1-Run to Generate Merge 2-Create Merge Statement The template to execute when you want to generate the Merge statements from the reference data.
Create Merge Metadata.xml 2-Create Merge Statement The xml data dictionary that defines which tables to merge with what options.
Create Prime Merge 2-Create Merge Statement The template that creates the SQL for each Prime table. Not to be run independently
Create Seeded Merge 2-Create Merge Statement The template that create the SQL for each Seeded table. Not to be run independetly. 2-Create Merge Statement This file should be modified for your environment. It defines the data connection to use when running the 1-Run to Generate Merge template.

The Code

The Code is available on our VDS Foundry GitHub account. You can access it from here.

Wrap Up

If you have suggestions feel free to do a pull request on the github link above.

We've been running this setup about six months. We have thought of a couple of refinements to it but none of them have been painful enough to implement. The biggest one is to use something like negative primary key values for seeded data rows we are providing so we are always sure of their uniqueness regardless of what may happen elsewhere in the system.

So far we estimate this setup saves us an hour or two per day during days we do development that adds new data and countless hours in testing since all of our data is now consistent on every deployment. For live deployments we estimate this saves us between 1-4 hours of manual labor. All in all it has been a successful solution for us.