24
Sun, Nov
1 New Articles

Sound Off for Soundex

General
Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Thank you, Mr. Smith. The telephone receiver goes down. Wait a minute. Was that Smith or Smyth? Maybe Smythe? Too late now; I'll guess at Smith.

The English language has many strange rules and exceptions. Start dealing with the spellings of names and there can be big trouble. The United States census takers had those problems. Imagine trying to get everyone's name recorded correctly for the entire country. In a task this daunting, there are bound to be many errors.

If, after taking this error-prone information, there is a desire to match data by family names, how is this done? The answer was provided by two people, Margaret Odell and Robert Russel, shortly before World War I. They developed a simple algorithm called Soundex to accomplish this task. Their original algorithm can be found in Appendix I, page 39, of The 1910 Federal Population Census: A Catalog of Microfilm Copies of the Schedules. The two patents that existed regarding the Soundex algorithms expired before World War II.

Odell and Russel examined the usage of letters as they were pronounced. They attempted to categorize letters into groups by their sounds and by their significance to the final pronunciation. Each letter of the alphabet was assigned a number. These numbers were then used to produce a Soundex code for a name. Names with the same Soundex code would be pulled out together in a search on a given name. In the Smith example, all three spellings would generate the same Soundex code.

When I was looking to build a Soundex algorithm for a recent project, I searched the Internet and a large, local BBS. What I discovered was that Soundex coding is an art form, with each person interpreting it somewhat differently and adding flourishes. If you are going to build your own Soundex routine, take a spin around cyberspace and see what others have done before you.

PC and UNIX resources have a plethora of information on Soundex algorithms. I found many Soundex routines, some in languages that I've never heard of, and thoughts from various Soundex authors. The grandfather of most modern Soundex algorithms seems to be from The Art of Computer Programming, Volume 3: Sorting and Searching by Donald Knuth. PC Magazine recently ran a small article on Soundex (May 30, 1995) which included C language source code.

How It Works

My Soundex algorithm will take a 50-character string and convert it into a four-position Soundex code. This code can then be stored in a record and used as a key field for a logical file when you want to search by Soundex.

My Soundex implementation is made up of one letter followed by three digits. This is the same method used by Soundex's originators. The letter in the code is the first letter of the word or name being converted to Soundex. If the name is Smith, the first position of the Soundex code will contain the letter S.

Bringing the first letter directly into the Soundex code has both advantages and disadvantages. The advantage is that it narrowly focuses the search for similar sounding names to names that start with a particular letter. The pool of possibilities should be reduced compared to the alternative method.

The alternative is to encode the first letter just like the rest of the name. So, instead of the original letter-digit-digit-digit Soundex code, a digit-digit-digit-digit Soundex code is produced. The advantage is that it will find matches for similar sounding first letters. For example, "staut" and "ztaut" would have equal Soundex codes under the four-digit method. I have seen implementations using both methods. I have also seen implementations that allow for a longer Soundex code.

By now, you are probably wondering about those mysterious digits in the Soundex code. Remember those old secret codes you used with your best friend when you were a kid (there were no PCs and PGP encryption software back then): you would write all of the letters to the alphabet and equate them to some other letter or symbol. Well, this is kind of like that, except all of the letters have codes based on their sound groups. In Soundex, all of the letters of the alphabet correspond to one of seven digits, 0 through 6. 1 contains the translation table my algorithm uses.

By now, you are probably wondering about those mysterious digits in the Soundex code. Remember those old secret codes you used with your best friend when you were a kid (there were no PCs and PGP encryption software back then): you would write all of the letters to the alphabet and equate them to some other letter or symbol. Well, this is kind of like that, except all of the letters have codes based on their sound groups. In Soundex, all of the letters of the alphabet correspond to one of seven digits, 0 through 6. Figure 1 contains the translation table my algorithm uses.

Group 0 is made up of letters considered to have little contribution to overall word sound. It includes all of the vowels and consonants that are often silent. The other groups create similar sounds when pronouncing a word. The greatest variation that I found in Soundex algorithms was in this translation table. The one used here is the most popular, but many programmers juggled the letters to fine-tune the process.

Let's go back to old Mr. Smith. If we wanted to translate his name into Soundex code, we would bring over the first letter of his name and translate the remaining letters based on the translation table in 1. The interim result would be S5030. The algorithm compresses out any irrelevant sounds (e.g., silent letters) that exist between the relevant sounds and shortens the result to the four positions required. For Smith, the final result is S530. The raw code for Smyth is S5030; for Smythe, it's S50300. When compressed and truncated, all three versions end up as Soundex code S530-a match.

Let's go back to old Mr. Smith. If we wanted to translate his name into Soundex code, we would bring over the first letter of his name and translate the remaining letters based on the translation table in Figure 1. The interim result would be S5030. The algorithm compresses out any irrelevant sounds (e.g., silent letters) that exist between the relevant sounds and shortens the result to the four positions required. For Smith, the final result is S530. The raw code for Smyth is S5030; for Smythe, it's S50300. When compressed and truncated, all three versions end up as Soundex code S530-a match.

If I needed to Soundex code the name Who, I would end up with a raw code of W00. Since the raw code is less than four positions long, it is padded to fill out all four positions by adding a 0. Who's Soundex code is W000 (and yes, he is on first).

There is another oft-changed element to Soundex algorithms: redundant sounds. Some versions strip all redundant, contiguous letters. Some versions go further and strip redundant contiguous sounds (i.e., letters in the same sound group). Other versions go even further and strip redundant, contiguous sounds in the compressed Soundex code. To demonstrate this, assume the contrived name Amemnierr. Ms. Amemnierr's raw code would be A50550066. If an algorithm strips only redundant, contiguous letters, the raw code would be changed to A5055006 (the second R was removed) and the resultant Soundex code would be A555. If an algorithm strips redundant contiguous sounds, the raw code would be changed to A505006 (the second R and the N are removed), and the resultant Soundex code would be A556. Another version strips redundant sounds after the compression. We ended up with four-position code in the last example (A556), but stripping the redundant 5 results in the final code A560. (A 0 is placed at the end to pad out to four positions if we run out of relevant sound codes.)

The algorithm in this article follows the last method, stripping all redundant sounds after compression.

The RPG Code

2 contains the ILE RPG program SDX001RG, which executes my Soundex algorithm. SDX001RG accepts a 50-character string and returns a four-position Soundex code. The *Entry parameters are listed in the initialization subroutine.

Figure 2 contains the ILE RPG program SDX001RG, which executes my Soundex algorithm. SDX001RG accepts a 50-character string and returns a four-position Soundex code. The *Entry parameters are listed in the initialization subroutine.

The input string is immediately converted to all uppercase letters. After this is accomplished, a small flourish of my own is executed. A call to subroutine Odd_Start examines the first two characters in the string. If they match any of the combinations that I am trapping, sound equivalencies are made in the starting letter. I did this in an attempt to compensate for words such as phone. This subroutine is a good one to play with and could easily be expanded or removed for your own purposes. As you can see by examining the Odd_Start subroutine in 2, I change PH to F, NM to M, PT to T, and KN to N.

The input string is immediately converted to all uppercase letters. After this is accomplished, a small flourish of my own is executed. A call to subroutine Odd_Start examines the first two characters in the string. If they match any of the combinations that I am trapping, sound equivalencies are made in the starting letter. I did this in an attempt to compensate for words such as phone. This subroutine is a good one to play with and could easily be expanded or removed for your own purposes. As you can see by examining the Odd_Start subroutine in Figure 2, I change PH to F, NM to M, PT to T, and KN to N.

SDX001RG also strips out all non-alphabetic characters. This is designed to clear any insignificant characters that may come along with a name, such as apostrophes (O'hare). The actual Soundex coding is quite simple and relies upon RPG's XLATE operation code to translate a given uppercase letter into its corresponding Soundex number code.

I've provided display file SDX002DF (see 3) and program SDX002RG (see 4) as a small application to test my Soundex program, SDX001RG. The SDX002RG program presents a pop-up window that accepts a character string and displays the resultant Soundex code for the string. This program can be especially helpful as you tweak the program to fit your own needs.

I've provided display file SDX002DF (see Figure 3) and program SDX002RG (see Figure 4) as a small application to test my Soundex program, SDX001RG. The SDX002RG program presents a pop-up window that accepts a character string and displays the resultant Soundex code for the string. This program can be especially helpful as you tweak the program to fit your own needs.

Many Uses

Soundex need not be used only for matching similar names. It can work for all sorts of words and applications. It is difficult to spell medical procedures or medicine names correctly. A database made up of this subject matter could easily be Soundex-coded. Other obvious usages can be found in human resource, customer, patient, or client databases.

What about street or city names? The opportunities to use this simple, malleable algorithm are endless. Now the next time Mr. Smyth, or Mr. Smythe (or was that Mr. Smith?) calls, you can look up his personal data quickly and easily.

Brian Kautz, an application development manager in a Pennsylvania manufacturing business, has worked in the IBM midrange environment for 10 years. He can be reached on the Internet at This email address is being protected from spambots. You need JavaScript enabled to view it..

REFERENCES

Knuth, Donald. The Art of Computer Programming, Volume 3: Sorting and Searching. Reading, Mass.: Addison-Wesley Publishing, 1973 (ISBN 0-201-13803-X).

The 1910 Federal Population Census: A Catalog of Microfilm Copies of the Schedules. Washington, D.C.: National Archives Trust Fund Board, 1982 (ISBN 0-911333-15-0).


Sound Off for Soundex

Figure 1: Soundex Code Translation Table

 Letters Soundex Code a,e,h,i,o,u,w,y 0 b,f,p,v 1 c,g,j,k,q,s,x,z 2 d,t 3 l 4 m,n 5 r 6 
Sound Off for Soundex

Figure 2: RPG Program SDX001RG

 *=============================================================== * To compile: * * CRTBNDRPG PGM(XXX/SDX001RG) SRCFILE(XXX/QRPGLESRC) * *=============================================================== * SDX001RG: Convert a given string to a Soundex code. * Input: Accepts a 50 character string. (In_String) * Output: Returns a 4 character soundex code. (Soundex) * Indicator Usage: * 90 - General Work Indicator * Variable Declarations: *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 DInput_Data S 1A Dim(50) DWork_Data S 1A Dim(50) DIn_String S 50A DSoundex S 4A DSave_Last S 1A INZ(*HIVAL) DFirst_Two S 2A DI1 S 2S 0 INZ(0) DI2 S 2S 0 INZ(0) DUpper C CONST('ABCDEFGHIJKLMNOPQRSTUVWXYZ') DLower C CONST('abcdefghijklmnopqrstuvwxyz') DCodes C CONST('01230120022455012623010202') ********************* * Mainline Program: * ********************* * Strip Non-Alphabet Characters from the input data C EXSR Strip_It * Convert input text to soundex coding C EXSR Convert_it * Strip out Repeat Sound Codes and Build Soundex Variable: C EXSR Build_it * End the program C EVAL *inLR = *ON * Subroutine Section: *********************** * Strip It Subroutine * *********************** C Strip_it BEGSR * Strip all non-alpabetic characters from the input data: *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C DOW I1 < %ELEM(Input_Data) C EVAL I1 = I1 + 1 * Check for non-alpha characters C Upper CHECK Input_Data(I1) 90 * Continue processing this value only if it was alphabetic C IF *in90<>*ON * Check for repeat letters and fill work file C IF Save_Last <> Input_Data(I1) C EVAL I2 = I2 + 1 C EVAL Work_Data(I2) = Input_Data(I1) C EVAL Save_Last = Input_Data(I1) C ENDIF C ENDIF C ENDDO * Move the Work Array Information back to the Input Data Array * and clear the work array. C EVAL Input_Data = Work_Data C CLEAR Work_Data C ENDSR ************************* * Convert it Subroutine * ************************* C Convert_It BEGSR * Convert input data to Soundex Coding * First element is moved without translation: C EVAL Work_Data(1) = Input_Data(1) C EVAL I1 = 1 C DOW I1 < %ELEM(Input_Data) C EVAL I1 = I1 + 1 C Upper:Codes XLATE Input_Data(I1)Work_Data(I1) C ENDDO * Move the Work Array Information back to the Input Data Array * and clear the work array. C EVAL Input_Data = Work_Data C CLEAR Work_Data *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C ENDSR *********************** * Build It Subroutine * *********************** C Build_it BEGSR * Build the Soundex Variable * First Character comes over with no further processing: C EVAL %SUBST(Soundex:1:1) = Input_Data(1) * Set up variables for processing loop: C EVAL Save_Last = *HIVAL C EVAL I1 = 1 C EVAL I2 = 1 * Do while index is less than number of input elements * and current element is not blank * and last Soundex element is not filled. C DOW I1 < %ELEM(Input_Data) C AND Input_Data(I1) > ' ' C AND I2 < 4 C EVAL I1 = I1 + 1 C IF Input_Data(I1) > ' ' * If code is not a dupe of the previous code, move it to Sound_X: C IF Save_Last <> Input_Data(I1) AND C Input_Data(I1) <> '0' C EVAL Save_Last = Input_Data(I1) C EVAL I2 = I2 +1 C EVAL %SUBST(Soundex:I2:1) = Input_Data(I1) C ENDIF C ENDIF C ENDDO * Convert any remaining blank values to zeros: C ' ':'0' XLATE Soundex Soundex C ENDSR ****************************** * Alter Odd Starting Letters * ****************************** *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 C Odd_Start BEGSR * Try to raise accuracy by substituting common weird two letter * word starting combinations. * Move the first two letters of our word to be converted: C EVAL First_Two = %SUBST(In_String:1:2) C SELECT C WHEN First_Two = 'PH' C EVAL %SUBST(In_String:1:2) = ' F' C WHEN First_Two = 'NM' C EVAL %SUBST(In_String:1:1) = ' ' C WHEN First_Two = 'PT' C EVAL %SUBST(In_String:1:1) = ' ' C WHEN First_Two = 'KN' C EVAL %SUBST(In_String:1:1) = ' ' C ENDSL C ENDSR ****************************** * Initialization Subroutine: * ****************************** C *Inzsr BEGSR * Parameters Expected by this Program: C *Entry PLIST C PARM In_String C PARM Soundex * Translate all alphabetic characters to upper case: C Lower:Upper XLATE In_String In_String * Alter Odd Starting Letter combinations: C EXSR Odd_Start * Strip Leading Blanks C EVAL In_String = %triml(In_String) * Place our Input Data into an array for further processing: C MOVEA In_String Input_Data * Make certain that the Soundex code variable is empty: C CLEAR Soundex C ENDSR *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 
Sound Off for Soundex

Figure 3: Display File SDX002DF

 A*=============================================================== A* To compile: A* A* CRTDSPF FILE(XXX/SDX002DF) SRCFILE(XXX/QDDSSRC) A* A*=============================================================== A*. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 A DSPSIZ(24 80 *DS3) A CSRINPONLY A* Main Input Screen: A R SCREEN1 A WINDOW(5 10 9 52 *NOMSGLIN) A CF03 A CF12 A PRINT A WDWBORDER((*COLOR WHT) (*DSPATR HI)) A WDWTITLE((*TEXT 'Soundex Test') (*C- A OLOR WHT)) A 2 1'Input String:' A IN_STRING 50A I 3 2CHECK(LC) A 5 2'Results of Last Run:' A OUT_STRING 50A O 6 2COLOR(WHT) A 7 2'Soundex code --------->' A SOUNDEX 4A O 7 26COLOR(WHT) A 8 1'F3=Exit' A COLOR(BLU) * Dummy Record: A R DUMMY A ASSUME A 1 5' ' A*. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 
Sound Off for Soundex

Figure 4: RPG Program SDX002RG

 *=============================================================== * To compile: * * CRTBNDRPG PGM(XXX/SDX002RG) SRCFILE(XXX/QRPGLESRC) * *=============================================================== * SDX002RG: Demonstrate The Soundex Routine *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 FSDX002DF CF E WORKSTN F INFDS(INFDS) * Information Data Structure To Get The Function Key pressed: D INFDS DS D FunctionKy 369 369 * Named Hex Constants For Function Keys D F03 C CONST(X'33') D F12 C CONST(X'3C') D ENTER C CONST(X'F1') **************** * Main Program * **************** C DOW FunctionKy <> F03 C AND FunctionKy <> F12 * Output Entry Screen: C EXFMT Screen1 * Process Screen Based on Key used: C SELECT C WHEN FunctionKy = ENTER C EXSR Process * Verify Input C OTHER * Do Nothing C ENDSL C ENDDO * Terminate the Program: C EVAL *inLR = *ON ***************************************** * Subroutine Section * ***************************************** * Call the Soundex Routine: C Process BEGSR C CALL 'SDX001RG' C PARM In_String C PARM Soundex C EVAL Out_String = In_String C CLEAR In_String C ENDSR *. 1 ...+... 2 ...+... 3 ...+... 4 ...+... 5 ...+... 6 ...+... 7 ...+... 8 
BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$

Book Reviews

Resource Center

  • SB Profound WC 5536 Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application. You can find Part 1 here. In Part 2 of our free Node.js Webinar Series, Brian May teaches you the different tooling options available for writing code, debugging, and using Git for version control. Brian will briefly discuss the different tools available, and demonstrate his preferred setup for Node development on IBM i or any platform. Attend this webinar to learn:

  • SB Profound WP 5539More than ever, there is a demand for IT to deliver innovation. Your IBM i has been an essential part of your business operations for years. However, your organization may struggle to maintain the current system and implement new projects. The thousands of customers we've worked with and surveyed state that expectations regarding the digital footprint and vision of the company are not aligned with the current IT environment.

  • SB HelpSystems ROBOT Generic IBM announced the E1080 servers using the latest Power10 processor in September 2021. The most powerful processor from IBM to date, Power10 is designed to handle the demands of doing business in today’s high-tech atmosphere, including running cloud applications, supporting big data, and managing AI workloads. But what does Power10 mean for your data center? In this recorded webinar, IBMers Dan Sundt and Dylan Boday join IBM Power Champion Tom Huntington for a discussion on why Power10 technology is the right strategic investment if you run IBM i, AIX, or Linux. In this action-packed hour, Tom will share trends from the IBM i and AIX user communities while Dan and Dylan dive into the tech specs for key hardware, including:

  • Magic MarkTRY the one package that solves all your document design and printing challenges on all your platforms. Produce bar code labels, electronic forms, ad hoc reports, and RFID tags – without programming! MarkMagic is the only document design and print solution that combines report writing, WYSIWYG label and forms design, and conditional printing in one integrated product. Make sure your data survives when catastrophe hits. Request your trial now!  Request Now.

  • SB HelpSystems ROBOT GenericForms of ransomware has been around for over 30 years, and with more and more organizations suffering attacks each year, it continues to endure. What has made ransomware such a durable threat and what is the best way to combat it? In order to prevent ransomware, organizations must first understand how it works.

  • SB HelpSystems ROBOT GenericIT security is a top priority for businesses around the world, but most IBM i pros don’t know where to begin—and most cybersecurity experts don’t know IBM i. In this session, Robin Tatam explores the business impact of lax IBM i security, the top vulnerabilities putting IBM i at risk, and the steps you can take to protect your organization. If you’re looking to avoid unexpected downtime or corrupted data, you don’t want to miss this session.

  • SB HelpSystems ROBOT GenericCan you trust all of your users all of the time? A typical end user receives 16 malicious emails each month, but only 17 percent of these phishing campaigns are reported to IT. Once an attack is underway, most organizations won’t discover the breach until six months later. A staggering amount of damage can occur in that time. Despite these risks, 93 percent of organizations are leaving their IBM i systems vulnerable to cybercrime. In this on-demand webinar, IBM i security experts Robin Tatam and Sandi Moore will reveal:

  • FORTRA Disaster protection is vital to every business. Yet, it often consists of patched together procedures that are prone to error. From automatic backups to data encryption to media management, Robot automates the routine (yet often complex) tasks of iSeries backup and recovery, saving you time and money and making the process safer and more reliable. Automate your backups with the Robot Backup and Recovery Solution. Key features include:

  • FORTRAManaging messages on your IBM i can be more than a full-time job if you have to do it manually. Messages need a response and resources must be monitored—often over multiple systems and across platforms. How can you be sure you won’t miss important system events? Automate your message center with the Robot Message Management Solution. Key features include:

  • FORTRAThe thought of printing, distributing, and storing iSeries reports manually may reduce you to tears. Paper and labor costs associated with report generation can spiral out of control. Mountains of paper threaten to swamp your files. Robot automates report bursting, distribution, bundling, and archiving, and offers secure, selective online report viewing. Manage your reports with the Robot Report Management Solution. Key features include:

  • FORTRAFor over 30 years, Robot has been a leader in systems management for IBM i. With batch job creation and scheduling at its core, the Robot Job Scheduling Solution reduces the opportunity for human error and helps you maintain service levels, automating even the biggest, most complex runbooks. Manage your job schedule with the Robot Job Scheduling Solution. Key features include:

  • LANSA Business users want new applications now. Market and regulatory pressures require faster application updates and delivery into production. Your IBM i developers may be approaching retirement, and you see no sure way to fill their positions with experienced developers. In addition, you may be caught between maintaining your existing applications and the uncertainty of moving to something new.

  • LANSAWhen it comes to creating your business applications, there are hundreds of coding platforms and programming languages to choose from. These options range from very complex traditional programming languages to Low-Code platforms where sometimes no traditional coding experience is needed. Download our whitepaper, The Power of Writing Code in a Low-Code Solution, and:

  • LANSASupply Chain is becoming increasingly complex and unpredictable. From raw materials for manufacturing to food supply chains, the journey from source to production to delivery to consumers is marred with inefficiencies, manual processes, shortages, recalls, counterfeits, and scandals. In this webinar, we discuss how:

  • The MC Resource Centers bring you the widest selection of white papers, trial software, and on-demand webcasts for you to choose from. >> Review the list of White Papers, Trial Software or On-Demand Webcast at the MC Press Resource Center. >> Add the items to yru Cart and complet he checkout process and submit

  • Profound Logic Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application.

  • SB Profound WC 5536Join us for this hour-long webcast that will explore:

  • Fortra IT managers hoping to find new IBM i talent are discovering that the pool of experienced RPG programmers and operators or administrators with intimate knowledge of the operating system and the applications that run on it is small. This begs the question: How will you manage the platform that supports such a big part of your business? This guide offers strategies and software suggestions to help you plan IT staffing and resources and smooth the transition after your AS/400 talent retires. Read on to learn: