23
Thu, Jan
4 New Articles

Practical SQL: Selecting One of a Set of Duplicates

SQL
Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Today's article shows you how to take advantage of DB2 for i to fix an all-too-common problem.

 

Moving from traditional DDS environments to a pure SQL configuration requires some rethinking and sometimes a little re-engineering. One case has to do with duplicate keys. We sometimes end up with databases where we have "almost" duplicate records. That is, the same record (typically, some type of master data) gets written multiple times with nothing different except the timestamp or some other inconsequential field. It doesn't affect our application because when we CHAIN we get the first record of the set and since the meaningful data is the same for all records, we never notice. But then we go to SQL and we need to create a primary key over the file. We create our new DDL-defined table and attempt to copy our data, but the copy fails because of duplicate keys. Today I'll show you an easy way to clean that data up.

 

Problems with Selecting the Data

When moving from DDS to DDL, the best scenario is when all we have to do is create a new table with all the fields from the original file and then just copy the data in. So the basic process is pretty simple: we select all the fields from all the records from the first file and insert them into the second file. The problem arises when you have more than one record with the same key fields. And remember, I'm asserting that the rest of the important fields in the duplicate records are the same. If they're notif you have records with the same key but different datathen you have a different problem. But if the case is as I described and the only differences in the data are things like timestamps or change counts, then all you want is one record from each set; you don't really care which one.

 

If you're familiar with SQL, you're probably already saying to yourself, "Joe, all we need to do is use the DISTINCT keyword!" And yes, you would be right if all the fields were duplicated. But because the timestamps are different, you would get every record, including the "almost" duplicates. So DISTINCT doesn't really work very well. The next way to go would be to use GROUP BY. For that solution, you would have a GROUP BY clause on the key fields. However, as anyone who has worked with grouping in SQL knows, you can't then just select the non-key fields. Instead, you have to use some sort of aggregate (like MAX) on the other fields. You'll have to list every field in the file in the main SELECT clause with all non-key fields as aggregates (and then list all the important fields again in the GROUP BY). If it's a typical master record with dozens of fields, the result can get quite unwieldy. Imagine a customer master record with only one primary key field but dozens of data fields:

 

SELECT CMCUST, MAX(CMSTAT), MAX(CMNAME), MAX(CMADR1), (…), MAX(CMTIME)
FROM CUSTMAST GROUP BY CMCUST

 

As you can see, the more fields in the file, the longer and more difficult to maintain the statement becomes. Now, we RPG dinosaurs would like to point out that this is a perfect job for the RPG cycle. We'd simply put a good old L1 indicator on the key field(s) and then only do a write at L1 time, but we know that the cycle makes people's heads explode these days, so we need to figure out a more elegant solution. (And when we say "elegant," we really just mean one that doesn't use I-specs.)

 

The Solution

The genesis of this particular solution came from a discussion on the MIDRANGE-L mailing list. I watched as several people discussed the pros and cons of some of the approaches, and the answers became pretty convoluted. I recognized most of the approaches because I have run across this problem many times in the past and used nearly every technique imaginable, although I do admit that there were a couple of new ones in the list that I hadn't seen before. But generally speaking, most of the approaches used grouping in some way or another. One of the more creative approaches used FETCH FIRST ONLY in a subselect, but even that would eventually have required listing out all the fields.

 

I want to point out that if you're doing any sort of data conversion on some of the fields in the file as they move to the new table, such as converting numeric dates to date fields or other modernization techniques, or if you're dropping some fields, or if you're incorporating foreign data of some kind, then the massive SELECT statement may be the correct solution. Or better yet, an RPG program makes those conversions a bit easier to maintain. I try to avoid huge SQL statements that have bits of business logic strewn about throughout them; it's very hard to debug those things for the original programmer, much less for someone down the road who has to pick it up. But I'm going a bit far afield here; let's get back to the topic at hand.

 

So if you're not doing any conversions and if you're simply selecting all the fields from the old file to insert into the new table, I have a solution that works for every file, every time. Here's the technique:

 

SELECT * FROM CUSTMAST WHERE RRN(CUSTMAST) IN
(SELECT MAX(RRN(CUSTMAST)) FROM CUSTMAST GROUP BY CMCUST)

 

That's it! The only thing that changes is the file name, obviously, and the fields specified in the GROUP BY clause. Everything else stays the same. What you get is the data in the record that was most recently written, while MIN(RRN(CUSTMAST)) would get you the first record written. Well, we have to be a little bit careful with that: technically, you get the record with the highest relative record number. If your file has REUSEDLT specified, the relative record number (RRN) is not reliable as an indicator of position. But since we're declaring that the contents of the non-matching fields don't matter, it shouldn't matter which record you get.

 

What I think is neat about this solution is that it's unique to the IBM i, since other databases don't have a concept of a relative record number. That unique RRN lets us do a lot of interesting things. Take a look at Jim Staton's article on deleting duplicates using RRN and SQL. You'll see that the same concepts apply. I hope this helps you in some of your data cleanup and conversion tasks!

 

BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$

Book Reviews

Resource Center

  • SB Profound WC 5536 Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application. You can find Part 1 here. In Part 2 of our free Node.js Webinar Series, Brian May teaches you the different tooling options available for writing code, debugging, and using Git for version control. Brian will briefly discuss the different tools available, and demonstrate his preferred setup for Node development on IBM i or any platform. Attend this webinar to learn:

  • SB Profound WP 5539More than ever, there is a demand for IT to deliver innovation. Your IBM i has been an essential part of your business operations for years. However, your organization may struggle to maintain the current system and implement new projects. The thousands of customers we've worked with and surveyed state that expectations regarding the digital footprint and vision of the company are not aligned with the current IT environment.

  • SB HelpSystems ROBOT Generic IBM announced the E1080 servers using the latest Power10 processor in September 2021. The most powerful processor from IBM to date, Power10 is designed to handle the demands of doing business in today’s high-tech atmosphere, including running cloud applications, supporting big data, and managing AI workloads. But what does Power10 mean for your data center? In this recorded webinar, IBMers Dan Sundt and Dylan Boday join IBM Power Champion Tom Huntington for a discussion on why Power10 technology is the right strategic investment if you run IBM i, AIX, or Linux. In this action-packed hour, Tom will share trends from the IBM i and AIX user communities while Dan and Dylan dive into the tech specs for key hardware, including:

  • Magic MarkTRY the one package that solves all your document design and printing challenges on all your platforms. Produce bar code labels, electronic forms, ad hoc reports, and RFID tags – without programming! MarkMagic is the only document design and print solution that combines report writing, WYSIWYG label and forms design, and conditional printing in one integrated product. Make sure your data survives when catastrophe hits. Request your trial now!  Request Now.

  • SB HelpSystems ROBOT GenericForms of ransomware has been around for over 30 years, and with more and more organizations suffering attacks each year, it continues to endure. What has made ransomware such a durable threat and what is the best way to combat it? In order to prevent ransomware, organizations must first understand how it works.

  • SB HelpSystems ROBOT GenericIT security is a top priority for businesses around the world, but most IBM i pros don’t know where to begin—and most cybersecurity experts don’t know IBM i. In this session, Robin Tatam explores the business impact of lax IBM i security, the top vulnerabilities putting IBM i at risk, and the steps you can take to protect your organization. If you’re looking to avoid unexpected downtime or corrupted data, you don’t want to miss this session.

  • SB HelpSystems ROBOT GenericCan you trust all of your users all of the time? A typical end user receives 16 malicious emails each month, but only 17 percent of these phishing campaigns are reported to IT. Once an attack is underway, most organizations won’t discover the breach until six months later. A staggering amount of damage can occur in that time. Despite these risks, 93 percent of organizations are leaving their IBM i systems vulnerable to cybercrime. In this on-demand webinar, IBM i security experts Robin Tatam and Sandi Moore will reveal:

  • FORTRA Disaster protection is vital to every business. Yet, it often consists of patched together procedures that are prone to error. From automatic backups to data encryption to media management, Robot automates the routine (yet often complex) tasks of iSeries backup and recovery, saving you time and money and making the process safer and more reliable. Automate your backups with the Robot Backup and Recovery Solution. Key features include:

  • FORTRAManaging messages on your IBM i can be more than a full-time job if you have to do it manually. Messages need a response and resources must be monitored—often over multiple systems and across platforms. How can you be sure you won’t miss important system events? Automate your message center with the Robot Message Management Solution. Key features include:

  • FORTRAThe thought of printing, distributing, and storing iSeries reports manually may reduce you to tears. Paper and labor costs associated with report generation can spiral out of control. Mountains of paper threaten to swamp your files. Robot automates report bursting, distribution, bundling, and archiving, and offers secure, selective online report viewing. Manage your reports with the Robot Report Management Solution. Key features include:

  • FORTRAFor over 30 years, Robot has been a leader in systems management for IBM i. With batch job creation and scheduling at its core, the Robot Job Scheduling Solution reduces the opportunity for human error and helps you maintain service levels, automating even the biggest, most complex runbooks. Manage your job schedule with the Robot Job Scheduling Solution. Key features include:

  • LANSA Business users want new applications now. Market and regulatory pressures require faster application updates and delivery into production. Your IBM i developers may be approaching retirement, and you see no sure way to fill their positions with experienced developers. In addition, you may be caught between maintaining your existing applications and the uncertainty of moving to something new.

  • LANSAWhen it comes to creating your business applications, there are hundreds of coding platforms and programming languages to choose from. These options range from very complex traditional programming languages to Low-Code platforms where sometimes no traditional coding experience is needed. Download our whitepaper, The Power of Writing Code in a Low-Code Solution, and:

  • LANSASupply Chain is becoming increasingly complex and unpredictable. From raw materials for manufacturing to food supply chains, the journey from source to production to delivery to consumers is marred with inefficiencies, manual processes, shortages, recalls, counterfeits, and scandals. In this webinar, we discuss how:

  • The MC Resource Centers bring you the widest selection of white papers, trial software, and on-demand webcasts for you to choose from. >> Review the list of White Papers, Trial Software or On-Demand Webcast at the MC Press Resource Center. >> Add the items to yru Cart and complet he checkout process and submit

  • Profound Logic Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application.

  • SB Profound WC 5536Join us for this hour-long webcast that will explore:

  • Fortra IT managers hoping to find new IBM i talent are discovering that the pool of experienced RPG programmers and operators or administrators with intimate knowledge of the operating system and the applications that run on it is small. This begs the question: How will you manage the platform that supports such a big part of your business? This guide offers strategies and software suggestions to help you plan IT staffing and resources and smooth the transition after your AS/400 talent retires. Read on to learn: