It's one thing to access SQL Server data from the System i, but it's another thing to access System i data from SQL Server.
Heterogeneous data is one of the hallmarks of today's IT. Whether due to disparities in disk price or processing power, data in many shops has become fragmented across not only multiple physical machines, but multiple platforms and databases. One of the primary jobs of any IT department is allowing seamless access to those different databases.
If you're a programmer, you can usually find the tools you need to access data on any machine in your network. It's a bit harder to do that without programming--that is, by using the user-oriented data access tools that are prevalent on today's desktops. In fact, as you'll read here, sometimes its just plain impossible.
Getting at Data Is Easy for a Programmer
Whether it's a Type 4 Java Database Connectivity (JDBC) driver or an Object Linking and Embedding Database (OLE DB) provider, objects typically exist to allow a programmer on one platform to access databases on another. This means that, with just a bit of effort, a competent programmer can access data from multiple data sources in the network and provide meaningful information to the user.
Perhaps the simplest route is to use Java and JDBC. JDBC has progressed significantly since its original incarnation. When JDBC first appeared, it was primarily as a wrapper around Open Database Connectivity (ODBC). ODBC was Microsoft's API for allowing standardized access to all databases, and it involved a "native" piece written for the operating system on which the client ran. In the Microsoft world, that meant primarily Windows, so Windows drivers for various databases soon became available. This in turn meant JDBC access for those databases.
This, however, didn't translate to access from the i. While there was plenty of support for ODBC drivers to access DB2 data, it was hard to find someone interested in writing OS/400- or i5/OS-specific ODBC drivers. Thus, it was easy to access DB2 data from Windows yet hard to access SQL Server data from the System i. This changed with JDBC Type 4 drivers, in which the entire driver is written in Java, with no native driver required. With this type of driver, all you need is a JVM, and the i has a very good one. So, now it's relatively easy to access SQL Server data and DB2 data using a Java program.
However, by definition this involves someone writing a program; it doesn't provide the sort of ad hoc access that many users have come to expect. The ability for a user to be able to easily write an SQL statement and import the results of that statement into a desktop document has become a commonplace requirement. I'll resist the urge here to expound on why I don't like direct user access to rows and columns in the database, because that's outside the scope of this article. Instead, I'll assume that you understand the risks, and I'll instead focus on how to make this happen.
Non-Programmatic Access
What is needed is a way for users to write simple queries that access data on two databases simultaneously. Two ways exists to do this: either use a tool that knows how to join multiple databases, or convince the SQL engine on one platform to talk with data on another platform.
Tools for the former exist, although they're usually relatively expensive. For example, IBM has its DB2 Information Integrator. This is a powerful tool that allows you to create queries over widely disparate data sources (not just RDBMS), but it's not cheap. The last pricing I saw was in the realm of $20K per server and $10K per database. That was a long time ago, but it gives you an idea of the value placed on these tools.
Another option is to get a Windows-based package that understands the various databases, such as UniDirect .NET. I haven't used this product; I only bring it up because it really highlights the biggest issue when dealing with System i data: most people in the "outside" world, if they've heard of DB2, only know of it from the small-server world (that would be DB2 LUW, the Linux/UNIX/Windows version). Occasionally, they know the mainframe version (DB2 for z/OS). In any case, they tend to provide support for "DB2," but that doesn't always mean DB2 for the i.
DB2 Data and SQL Server Data Side by Side
The other option is to somehow convince one SQL engine to look at data from another data source. For example, to get SQL Server and DB2 to work together at the SQL engine level, you have two choices: get DB2 to talk to SQL Server or get SQL Server to talk to DB2.
For the first option, IBM uses Distributed Relational Database Architecture (DRDA). DRDA was originally an IBM architecture. It's been around for a decade or so and is currently under the auspices of The Open Group. However, while a number of RDBMS products will participate in a DRDA environment, Microsoft SQL Server is not one of them (surprise, surprise).
Linked Servers in SQL
The SQL Server approach is called a linked server. Michael Sansoterra wrote about this in some detail, and while the article is a couple of years old now, it still has great information. But the basic concept can be boiled down pretty quickly: if your database has an ODBC or OLE DB connector, then it can (probably) be used as a linked server. The "probably" indicates one of two cautions: first, not every feature may be supported by your driver, and second, even if something is supported, the performance may be so bad that you need to find a different approach after all.
Ignoring those caveats for the time being, let's take a quick look at the linked server approach. As I said, Michael's article goes into this subject in depth, and it really covers the topic well from a technical perspective, so all I really have to do here is hit the high points. First, you define your server using some special Transact-SQL (T-SQL) statements that identify the server type and location.
Once that's done, you can use the new server (and its schemas and tables) in your basic SQL statements using a four-part naming syntax that includes the server definition name and schema in the table name. This isn't a perfect solution. The operations you can perform are pretty limited, especially by today's standards. Basically, you're stuck with what I consider to be first-generation SQL statements: SELECT, UPDATE, INSERT. You can also execute basic database management functions like DROP TABLE and SELECT INTO. The latter is a powerful way to move data from SQL Server to another database (or vice versa).
However, linked servers have no support for parameter markers, and stored procedure support is nonexistent for remote databases other than SQL Server. To use those advanced SQL features, you need to use either the EXEC AT feature or Data Transformation Services (DTS), which actually has been replaced by SQL Server Integration Services (SSIS).
DTS Is Itself Transformed, to SSIS
If your data needs include access to stored procedures, Data Transformation Services (DTS) was the route that had been available since version 7 of SQL Server. DTS basically allowed you to create sort of an SQL batch job that was designed to move data from one database to another. Because it was the only way to do certain things, DTS was often used for jobs other than data transformation, but the idea remained the same: you created a set of SQL statements and then invoked them as needed. Note that these statements depend on the capabilities of the driver used, but they usually provide enough capability to make DTS the tool of choice for transformations between different databases.
However, with the introduction of SQL Server 2005, DTA was replaced by SQL Server Integration Services (SSIS), which was available only in the enterprise edition of the product. With the release of SQL Server 2008, SSIS has been broken up into a variety of components with differing availabilities. So, if you plan to use SSIS, you need to take a careful look at your server edition to see if the required features are available.
It's not surprising that Microsoft doesn't include SSIS in all editions of its database (it seems that Redmond is taking a page from IBM and unbundling technology with new releases); the functionality is pretty extensive, including support for things like XML and allowing complex transformations and scheduled data conversions. It really is now much more of a data mining and warehousing utility, even a rudimentary form of business intelligence. However, that unbundling does mean that you have to budget for which components you will need in your environment.
EXEC AT
As of SQL Server 2005, the other option is to use the EXEC AT capability of T-SQL. This is sort of like RCMD in FTP; it passes a statement directly through to the host database and executes it there. In its base form, the EXEC (or EXECUTE) operation in T-SQL lets you execute an arbitrary character string. Similar to prepared statements in embedded SQL languages, this is the T-SQL method of executing dynamic SQL. You can build an entire SQL statement character by character and then execute it.
One of the lesser-known capabilities of this technique is to take that dynamic statement and transmit it for execution at a remote database that has been previously defined as a linked server. The data is literally sent to the other machine and executed there. This very powerful capability has some caveats, however. First, the syntax for the call has to be in the syntax of the remote database. While this makes sense, it can be somewhat confusing to see non-Microsoft syntax in a T-SQL program. Also, you have to take into account character-coding issues; character parameters are subject to translation.
Performance
So it's not impossible to manipulate databases remotely and to copy data from one database to another using simple SQL syntax. The linked server approach in SQL Server makes that syntax relatively easy, and the enhancements to the EXEC AT operation allow even more complex vendor-specific instructions to be passed through to the remote database.
The last issue is performing operations that actually join two databases and, specifically, the performance issue that can occur. One of the primary reasons that SQL is becoming so prevalent is the fact that the query optimization capabilities of the SQL engines make them very fast. Because an SQL engine has a deep understanding of its database, including statistics on the data itself, the engine has the ability to make decisions that improve performance.
Such understanding is lost when two databases are joined, especially two completely different databases such as SQL Server and DB2. The performance statistics gathered by the DB2 database are not available to SQL Server, so it must take the most pessimistic options for every decision. Add to that the pure overhead of having to go across the wire just to get a record, and it's clear that queries can quickly become very complex. As important as database and query design are to standard single-database applications, they are that much more important in multi-database environments.
Another hurdle to the multiple-database environment is the fact that it is often very hard to find appropriate test data. It's difficult enough to get a realistic snapshot of data on one machine, much less get snapshots on two machines, especially if the data needs to be somehow related. With cross-database queries, you need to test using data whose keys are going to coincide in order to properly exercise your join criteria. You have to understand both databases in order to extract the appropriate test information as well as have a place to put it. These issues are magnified in smaller shops with limited resources; it's hard enough finding extra disk space for Windows machines, much less extra room on an i.
Because of the difficulty of setting up proper test data, this is the type of situation where programmers may be less stringent in their testing procedures. However, testing is more rather than less important in this situation, especially load testing. More issues factor into the equation. In addition to raw data size, you have to take into account system load on both of the machines as well as network congestion. Before putting any query into production, it is essential to try it out on production quantities of data when network traffic will be the same as the time that the statements will be executed.
Multi-database queries are not to be taken lightly. For online transactions, they can add a measurable delay to response time. For batch operations, they can add an unwelcome level of overhead. Depending on the business requirement, you may actually be better off with an alternate approach, such as staging data from one machine to another during off-peak hours.
Final Note and Denouement
I noted above that both the EXEC AT and the DTS approaches require that you use DB2 syntax for your SQL code rather than SQL Server's T-SQL syntax. This allows me a quick reference to the fact that SQL dialects differ from one vendor to another; the differences can be subtle and a little tricky. Syntactical differences in stored-procedure syntax can be even trickier. So moving back and forth is not always easy.
Back in my early days of SQL, I used a tool called SwisSQL to convert between various flavors of SQL query syntax. The tool is one of the best I've ever used, although the pricing has risen considerably over the years. I think the basic query converter is $395, while the stored procedure converter is $995. For awhile, AdventNet had free versions of the tools available, but now I think they offer a 30-day, limited-use trial.
Still, you might get a kick out of looking at what happens. What's important is that you realize that while it's probably reasonable to think about allowing users to create ad hoc queries that join disparate databases, as soon as you start talking about more advanced features such as stored procedures, or even trying to do large-scale inserts or updates, you may be better off attacking the problem programmatically.
LATEST COMMENTS
MC Press Online