23
Mon, Dec
3 New Articles

Build a Searchable Web-Service Using OmniFind Text Search Server for DB2 for i

Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Discover the advantages offered by full-text search capabilities in modern applications.

 

In modern IT infrastructures, data stored in DB2 for i is often accessed and interpreted using web interfaces. In order to be successful, these interfaces need to be simple to use and still deliver exceptional value. For some applications, a necessary step in making this a reality is searching and ordering unstructured text data that is stored in a database column, using user-provided search terms. This article presents a sample human resources web service that needs text search capabilities and explains how to implement text searches using embedded SQL within an RPG procedure.

 

Full text search using DB2 for i requires that OmniFind Text Search Server for DB2 for i (5733-OMF) be installed. Although the OmniFind product is provided at no additional charge, it's not shipped as part of the base operating system and must be ordered separately (product 5733-OMF). OmniFind Text Search Server for DB2 for i is available starting with DB2 for i 6.1. Additionally, the capabilities of the product have been significantly enhanced in the V1R2 release available for DB2 for i 7.1.

 

Figure 1 illustrates an example SQL schema; a human resource department might use this schema to recruit prospective new employees. For simplicity, this model assumes that each applicant has a name and resume, and applies for exactly one position in the company. The resume column in the applicant table is a BLOB column that contains the applicants' resumes; the resumes are stored as unstructured text, using common formats such as Microsoft Word, PDF, OpenOffice, Rich Text Format (RTF), etc. The position table contains the title, department, and a unique job code for each position.

 

081413LawrenceFigure1       

Figure 1: This is the RESUME_APP schema for a human resources application.

 

Using this model, it's straightforward to write an SQL query that returns all of the applicants that have applied for a specific position in the company. Listing 1 shows an SQL query that that could be utilized within a web service. The query returns applicants that have applied for a specific job (using the job's code). The result set contains columns for the applicant's name and a URL that might be used by the user of the service to retrieve the applicant's resume. This example assumes that the web service to retrieve the resume document already exists and that the URL for the service has the form http://www.example.com/applicant_resume/applicant_id where applicant_id is the applicant_id column's value from the applicant table.

 

SELECT
applicant.name AS "Applicant Name",
'http://www.example.com/applicant_resume/' ||
               CAST(applicant.applicant_id AS VARCHAR(10)) AS "Resume URL"
FROM resume_app.applicant applicant
     INNER JOIN
     resume_app.position job USING (position_id)
WHERE
job.job_code = 'JB001'
ORDER BY applicant.name ASC;

Listing 1: The SQL query returns applicants for job 'JB001'.

 

Figure 2 shows a sample set of results for the query in Listing 1. It's easy to see that this SQL query is not going to provide results that will satisfy a user of the service. The query returns such a large number of applicants that it would be impossible to review all of the resumes. In addition, the alphabetical ordering of the results is not helpful in determining the best candidates for the position.
 
081413LawrenceFigure2

Figure 2: The results from SQL query in Listing 1 are not specific enough to be helpful.

 

The query would be much more usable if it included the capability to return only the rows with resumes that matched some keywords that are supplied by the user of the service. A query that ordered the resulting rows by relevance to the keywords would be even more powerful.

 

Searching the unstructured text contained within the resume documents and ordering the results requires some advanced capabilities. To understand why, consider a search for variations of the phrase "Agile Methodology." A typical user would expect both of the documents shown in Figure 3 to match this search. Matching these documents is non-trivial; the resume on the left (for an applicant named Frances Allen) is a PDF and contains only a variation on the word Methodology (Methodologies). The document on the right (for an applicant named Edgar Codd) is a Microsoft Word document and uses a very different document layout for similar information.

 

081413LawrenceFigure3

Figure 3: Both documents contain the keywords "Agile" and "Methodology."

 

This is where the OmniFind Text Search Server for DB2 for i product shines. OmniFind implements the SQL built-in CONTAINS and SCORE functions for searching these documents and ordering the results by relevance.

 

In order to use the CONTAINS and SCORE functions to perform searches and orderings, a text search index must first be created and updated. This setup can be performed using IBM i Navigator, or in SQL, using the SYSTS_CREATE and SYSTS_UPDATE stored procedures. Due to the performance considerations of the indexing process, text search indexes are not immediately maintained; however, a text index can be either explicitly updated or scheduled to automatically update on a regular basis. IBM STG Lab Services has published a white paper that discusses the necessary steps for creating and configuring a text index.

 

For this example, a text search index was created over the resume column, and an update was performed using the SQL statements shown in Listing 2. When the index was created, the FORMAT INSO option was used to indicate to OmniFind that the column contains "rich text" documents such as Microsoft Word, RTF, PDF, OpenOffice, etc. The update process will then automatically extract the text from the documents for indexing.

 

CALL SYSPROC.SYSTS_CREATE('RESUME_APP',
                         'RESUME_IDX',
                         'RESUME_APP.APPLICANT(RESUME)',
                         'FORMAT INSO');

CALL SYSPROC.SYSTS_UPDATE('RESUME_APP', 'RESUME_IDX
', '');

Listing 2: These SQL procedure calls create and update a text search index.

 

Once the index has been created and updated, the SQL CONTAINS function can be used to search the documents for keywords. The CONTAINS function returns a value of 1 if the document matches the keyword or a 0 if it does not.

 

While the CONTAINS function indicates whether a document matches a search phrase, the SCORE function can be used to determine how well a document matches the search phrase, in comparison to other documents in the index. The value returned will be between 0 and 1, where 0 is not at all relevant and 1 is extremely relevant. The actual result of the SCORE function is usually used only for ordering search results by relevance; the score value is not meaningful by itself.

 

Listing 3 shows how the SQL query in Listing 1 can be modified to incorporate the CONTAINS and SCORE functions.

 

SELECT
applicant.name                               AS "Applicant Name",
SCORE(applicant.resume, 'Agile Methodology') AS "Score value",
'http://www.example.com/applicant_resume/' ||
   CAST(applicant.applicant_id AS VARCHAR(10)) AS "Resume URL"
FROM resume_app.applicant applicant
     INNER JOIN resume_app.position job USING (position_id)
WHERE job.job_code = 'JB001' AND
     CONTAINS(applicant.resume, 'Agile Methodology') = 1
ORDER BY SCORE(applicant.resume, 'Agile Methodology') DESC;

Listing 3: This SQL query uses CONTAINS and SCORE to filter and order the results.

 

Figure 4 shows the results of the query that was shown in Listing 3. In comparison to Figure 2, the result set includes only the relevant rows, and the most relevant rows appear first in the result set.

 

081413LawrenceFigure4

Figure 4: Listing 3 delivers these results.

 

Embedding the query in Listing 3 into a CGI program that exports the results shown in Figure 4 as HTML is a straightforward task. Listing 4 illustrates an RPG procedure that could be used as part of a CGI program. The procedure takes two parameters as input: a job code and search string. The procedure in Listing 4 runs an SQL query and returns the results as a string containing HTML code. In the result string, each row from the result set becomes a row in an HTML table. For simplicity, the example ignores some trivial tasks, such as escaping characters with special meanings for HTML (such as '<') in the output string.

 

The most interesting bit of code in Listing 4 is the APPLICANTS_CSR cursor, which is defined in a similar way as the query shown in Listing 3. Using static SQL improves performance by preparing the query at precompile time. In addition, the use of host variables reduces the risk of problems due to malevolent user-provided search phrases. Unlike the query in Figure 3, the score value is used only for ordering and does not appear in the result set; this is because the exact score value is not meaningful to the user of the service.

 

     *****************************************************************
     *Implementation of a Function to do a search
     *****************************************************************
     P doSearch       B
     D                 PI         16000A   varying
     D SqlJobCode                   5A   const
     D SqlSearchStr               255A   varying const

     D ApplicantNM     S           200A   varying
     D URLForResume   S           4096A   varying

     D resultStr       S         16000A   varying

     * CRLF is a newline character
     D CRLF           C                   x'0d25'

     /free
       // These DECLARE statements tell SQL what the ccsid of the
       // input or output character variables will be.
       EXEC SQL DECLARE :SqlSearchStr VARIABLE CCSID 37;
       EXEC SQL DECLARE :URLForResume VARIABLE CCSID 37;

       // define the cursor that is used to retrieve the results
       EXEC SQL DECLARE APPLICANTS_CSR CURSOR FOR
           SELECT applicant.name,
                   'http://www.example.com/applicant_resume/' ||
                   CAST(applicant.applicant_id AS VARCHAR(10)) AS url
           FROM   resume_app.applicant applicant
                   INNER JOIN resume_app.position job USING (position_id)
           WHERE job.job_code = :SqlJobCode AND
                   CONTAINS(applicant.resume, :SqlSearchStr) = 1
           ORDER BY SCORE(applicant.resume, :SqlSearchStr) DESC;

       // open cursor
       EXEC SQL OPEN APPLICANTS_CSR;

       // Initialize result with an HTML table tag and headers
       resultStr =
       '<b> Job Code: </b><I>' + SqlJobCode + '</I><BR/>' + CRLF +
       '<b> Search Phrase: </b><I> ' + SqlSearchStr + '</I><HR>' + CRLF +
       '<table border=1>' + CRLF +
       '<tr><td><b>Applicant Name</b></td>' +
       '<td><b>Resume URL </b></td></tr>' + CRLF;

       // fetch first row
       EXEC SQL FETCH APPLICANTS_CSR INTO :ApplicantNM, :URLForResume;

       ///////////////////////////////////////////////////////////////////
       // Loop over each row that is retrieved from the cursor until no
       // more rows.
       // For each row, build a string with HTML code for the table's row
     ///////////////////////////////////////////////////////////////////
       DOW SQLCOD = 0;

         resultStr = resultStr +
                     '<tr>' + CRLF +
                     ' <td>' + ApplicantNM + ' </td> ' + CRLF +
                     ' <td><a href="' + URLForResume   +
                     '"> Resume </a></td>' + CRLF +
                     '</tr>' + CRLF;

         // fetch next row
         EXEC SQL FETCH APPLICANTS_CSR INTO
             :ApplicantNM, :URLForResume;

       ENDDO;

       // close cursor
       EXEC SQL CLOSE APPLICANTS_CSR;

       // close HTML table tag
       resultStr = resultStr + '</table>' + CRLF;

       // return results
       return resultStr;

     /end-free
     P                E          

 

Listing 4: RPG procedure to perform search and return the results as an HTML table

 

Figure 5 shows how the HTML string that is returned by the procedure in Listing 4 might appear in a web browser.
 
081413LawrenceFigure5

Figure 5: Here's the web-browser view of the result from the RPG procedure.

 

An important idea is that the CONTAINS and SCORE functions are understood by the DB2 for i optimizer; thus, the optimizer knows how to optimize the SQL query with them in mind.

 

Figure 6 shows a "visual explain" for the SQL query that ran within the RPG function shown in Listing 4. In this query, the DB2 for i optimizer was able to combine the CONTAINS and SCORE function calls into a single SCORE call and was further able to rewrite the query so that the function call was implemented as a table function. In this example, the optimizer has determined that the join between the applicant table and the table function is faster than a scalar function call for each row in the applicant table. The ability of the optimizer to optimize an SQL query that involves a full text search is a strong motivation to use the built-in CONTAINS and SCORE functions, rather than external solutions for which the optimizer has no awareness.

 

081413LawrenceFigure6

Figure 6: Visual explain shows how the CONTAINS and SCORE functions have been optimized.

 

This article has shown why text search capabilities can be an advantage and how these capabilities can be integrated into an application by using DB2 for i and OmniFind Text Search Server for DB2 for i. The OmniFind Text Search server product provides an efficient way to index and search text data in a database column, even when that data is stored in rich text documents such as Microsoft Word or PDF.

 

It's also worthwhile to look at the OmniFind documentation in the Information Center.

 

You'll also benefit from looking into the new functionality that was added to OmniFind Text Search Server for DB2 for i 7.1. OmniFind V1R2 and DB2 for i 7.1 add extensive capabilities for searching XML data. In addition, there are new capabilities to index Integrated File System (IFS) stream files (also called UNIX or PC files), members of source physical files, and spooled files. The following links are for articles that provide more details on these topics:

 

 

 

Customers interested in the latest and greatest enhancements for OmniFind may find the resources on DeveloperWorks interesting.

 

BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$

Book Reviews

Resource Center

  • SB Profound WC 5536 Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application. You can find Part 1 here. In Part 2 of our free Node.js Webinar Series, Brian May teaches you the different tooling options available for writing code, debugging, and using Git for version control. Brian will briefly discuss the different tools available, and demonstrate his preferred setup for Node development on IBM i or any platform. Attend this webinar to learn:

  • SB Profound WP 5539More than ever, there is a demand for IT to deliver innovation. Your IBM i has been an essential part of your business operations for years. However, your organization may struggle to maintain the current system and implement new projects. The thousands of customers we've worked with and surveyed state that expectations regarding the digital footprint and vision of the company are not aligned with the current IT environment.

  • SB HelpSystems ROBOT Generic IBM announced the E1080 servers using the latest Power10 processor in September 2021. The most powerful processor from IBM to date, Power10 is designed to handle the demands of doing business in today’s high-tech atmosphere, including running cloud applications, supporting big data, and managing AI workloads. But what does Power10 mean for your data center? In this recorded webinar, IBMers Dan Sundt and Dylan Boday join IBM Power Champion Tom Huntington for a discussion on why Power10 technology is the right strategic investment if you run IBM i, AIX, or Linux. In this action-packed hour, Tom will share trends from the IBM i and AIX user communities while Dan and Dylan dive into the tech specs for key hardware, including:

  • Magic MarkTRY the one package that solves all your document design and printing challenges on all your platforms. Produce bar code labels, electronic forms, ad hoc reports, and RFID tags – without programming! MarkMagic is the only document design and print solution that combines report writing, WYSIWYG label and forms design, and conditional printing in one integrated product. Make sure your data survives when catastrophe hits. Request your trial now!  Request Now.

  • SB HelpSystems ROBOT GenericForms of ransomware has been around for over 30 years, and with more and more organizations suffering attacks each year, it continues to endure. What has made ransomware such a durable threat and what is the best way to combat it? In order to prevent ransomware, organizations must first understand how it works.

  • SB HelpSystems ROBOT GenericIT security is a top priority for businesses around the world, but most IBM i pros don’t know where to begin—and most cybersecurity experts don’t know IBM i. In this session, Robin Tatam explores the business impact of lax IBM i security, the top vulnerabilities putting IBM i at risk, and the steps you can take to protect your organization. If you’re looking to avoid unexpected downtime or corrupted data, you don’t want to miss this session.

  • SB HelpSystems ROBOT GenericCan you trust all of your users all of the time? A typical end user receives 16 malicious emails each month, but only 17 percent of these phishing campaigns are reported to IT. Once an attack is underway, most organizations won’t discover the breach until six months later. A staggering amount of damage can occur in that time. Despite these risks, 93 percent of organizations are leaving their IBM i systems vulnerable to cybercrime. In this on-demand webinar, IBM i security experts Robin Tatam and Sandi Moore will reveal:

  • FORTRA Disaster protection is vital to every business. Yet, it often consists of patched together procedures that are prone to error. From automatic backups to data encryption to media management, Robot automates the routine (yet often complex) tasks of iSeries backup and recovery, saving you time and money and making the process safer and more reliable. Automate your backups with the Robot Backup and Recovery Solution. Key features include:

  • FORTRAManaging messages on your IBM i can be more than a full-time job if you have to do it manually. Messages need a response and resources must be monitored—often over multiple systems and across platforms. How can you be sure you won’t miss important system events? Automate your message center with the Robot Message Management Solution. Key features include:

  • FORTRAThe thought of printing, distributing, and storing iSeries reports manually may reduce you to tears. Paper and labor costs associated with report generation can spiral out of control. Mountains of paper threaten to swamp your files. Robot automates report bursting, distribution, bundling, and archiving, and offers secure, selective online report viewing. Manage your reports with the Robot Report Management Solution. Key features include:

  • FORTRAFor over 30 years, Robot has been a leader in systems management for IBM i. With batch job creation and scheduling at its core, the Robot Job Scheduling Solution reduces the opportunity for human error and helps you maintain service levels, automating even the biggest, most complex runbooks. Manage your job schedule with the Robot Job Scheduling Solution. Key features include:

  • LANSA Business users want new applications now. Market and regulatory pressures require faster application updates and delivery into production. Your IBM i developers may be approaching retirement, and you see no sure way to fill their positions with experienced developers. In addition, you may be caught between maintaining your existing applications and the uncertainty of moving to something new.

  • LANSAWhen it comes to creating your business applications, there are hundreds of coding platforms and programming languages to choose from. These options range from very complex traditional programming languages to Low-Code platforms where sometimes no traditional coding experience is needed. Download our whitepaper, The Power of Writing Code in a Low-Code Solution, and:

  • LANSASupply Chain is becoming increasingly complex and unpredictable. From raw materials for manufacturing to food supply chains, the journey from source to production to delivery to consumers is marred with inefficiencies, manual processes, shortages, recalls, counterfeits, and scandals. In this webinar, we discuss how:

  • The MC Resource Centers bring you the widest selection of white papers, trial software, and on-demand webcasts for you to choose from. >> Review the list of White Papers, Trial Software or On-Demand Webcast at the MC Press Resource Center. >> Add the items to yru Cart and complet he checkout process and submit

  • Profound Logic Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application.

  • SB Profound WC 5536Join us for this hour-long webcast that will explore:

  • Fortra IT managers hoping to find new IBM i talent are discovering that the pool of experienced RPG programmers and operators or administrators with intimate knowledge of the operating system and the applications that run on it is small. This begs the question: How will you manage the platform that supports such a big part of your business? This guide offers strategies and software suggestions to help you plan IT staffing and resources and smooth the transition after your AS/400 talent retires. Read on to learn: