WARNING: This tutorial is still being written. Please report any errors or suggestions.
Quick access: The database engine used | Activating the indexing engine | Adding a scanner for a new type of file | Implementing a search page
The NWS framework includes an indexing engine that enables searches across the documents of your web application. The engine was designed to be as simple as possible to install in your NWS web applications. An equally simple API lets your web application query the database produced by indexing the documents. Through this API, you can return to your users the set of documents that match their search criteria.
The content of the scanned documents is stored in a relational database: Java DB. For information, Java DB is the distribution of the Derby database (developed by the Apache Software Foundation) supported by Sun Microsystems. Java DB is written entirely in Java and is therefore portable.
As a reminder, Java DB has been shipped with Java SE since version 6. However, Java DB is not directly accessible through the default CLASSPATH of Java SE. IMPORTANT: you must therefore configure your web server (Tomcat or any other J2EE application server) so that, at startup, its CLASSPATH also points to the derby.jar file. Without this step (specific to each server), the following steps will not work!
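A quick way to confirm that derby.jar is visible is to try loading the Derby embedded driver class. The sketch below is not part of NWS: the class DerbyClasspathCheck and its method are hypothetical names used for illustration. It assumes the driver class is org.apache.derby.jdbc.EmbeddedDriver, as in the Derby releases distributed as Java DB.

```java
// Hypothetical helper, not part of NWS: tests whether a class can be loaded
// from the classpath of the code that runs this check.
final class DerbyClasspathCheck {

    // Assumption: embedded JDBC driver class shipped in derby.jar (Java DB era releases).
    static final String DERBY_EMBEDDED_DRIVER = "org.apache.derby.jdbc.EmbeddedDriver";

    // Returns true if the named class can be loaded, false otherwise.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(DERBY_EMBEDDED_DRIVER)) {
            System.out.println("derby.jar is visible on the classpath");
        } else {
            System.out.println("derby.jar is missing from the classpath");
        }
    }
}
```

Run from the command line, this only tests the classpath of that JVM. To test the server configuration, the same check would have to be called from code running inside the web application.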
Activating the indexing engine is very simple. Adding a single parameter definition to the "web.xml" file is enough to launch the engine when your web application starts. In the example below (line 26), the SEARCH_ENGINE_ENABLED parameter tells the NWS web application that the indexing engine is required. If this parameter is not defined, it is treated as false.
01 <?xml version="1.0" encoding="UTF-8"?>
02 <!DOCTYPE web-app PUBLIC
03     "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
04     "http://java.sun.com/dtd/web-app_2_3.dtd">
05
06
07 <web-app>
08
09     <display-name>Virtual Caddy - NWS Sample</display-name>
10     <description>NWS Sample</description>
11
12     <context-param>
13         <param-name>TRACE_ENABLED</param-name>
14         <param-value>true</param-value>
15     </context-param>
16     <context-param>
17         <param-name>TRACE_LOCALHOST_ONLY</param-name>
18         <param-value>true</param-value>
19     </context-param>
20     <context-param>
21         <param-name>TRACE_URI</param-name>
22         <param-value>Trace.wp</param-value>
23     </context-param>
24
25     <context-param>
26         <param-name>SEARCH_ENGINE_ENABLED</param-name>
27         <param-value>true</param-value>
28     </context-param>
29
30     <servlet>
31         <servlet-name>NWS Servlet</servlet-name>
32         <servlet-class>corelib.services.web.server.ControllerServlet</servlet-class>
33     </servlet>
34
35     <servlet-mapping>
36         <servlet-name>NWS Servlet</servlet-name>
37         <url-pattern>*.wp</url-pattern>
38     </servlet-mapping>
39
40 </web-app>
Currently, documents are indexed by a background thread when the web application starts. It is nevertheless possible to force a rebuild of the database with the following lines of code.
SearchEngine searchEngine = SearchEngine.getInstance();
searchEngine.startScanning();
When the database is rebuilt, the indexing engine scans every supported type of document, recursively through all the folders of the web application. Note, however, that the indexing engine does not process the WEB-INF folder, because its contents cannot be downloaded over HTTP. Note also that the database managed by the indexing engine is stored in that folder.
You can also ask the indexing engine to ignore certain directories of the web application. To do this, add the SEARCH_ENGINE_EXCLUDES parameter. Its value is a list of directories separated by semicolons. Caution: this value is, of course, case-sensitive. The example below shows how to ignore all the documents contained in the admin directory.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC
    "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app>

    <!-- File begin -->

    <context-param>
        <param-name>SEARCH_ENGINE_ENABLED</param-name>
        <param-value>true</param-value>
    </context-param>

    <context-param>
        <param-name>SEARCH_ENGINE_EXCLUDES</param-name>
        <param-value>admin</param-value>
    </context-param>

    <!-- File end -->

</web-app>
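To illustrate how a semicolon-separated, case-sensitive exclusion list behaves, here is a minimal sketch. It does not show how NWS reads the parameter internally: the class ExcludedDirectories is hypothetical, and trimming the whitespace around each entry is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration, not the NWS implementation: holds the directory
// names listed in a SEARCH_ENGINE_EXCLUDES-style value.
final class ExcludedDirectories {

    private final Set<String> names = new LinkedHashSet<>();

    // Splits the parameter value on semicolons; empty entries are ignored.
    ExcludedDirectories(String paramValue) {
        if (paramValue == null) {
            return;
        }
        for (String part : paramValue.split(";")) {
            String name = part.trim(); // assumption: surrounding spaces are not significant
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
    }

    // Exact comparison: "admin" and "Admin" are two different directories.
    boolean isExcluded(String directoryName) {
        return names.contains(directoryName);
    }
}
```

With the value "admin;private", isExcluded("admin") returns true while isExcluded("Admin") returns false, which is the practical consequence of the case-sensitivity warning above.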
At the moment, only NWS web pages are handled by the indexing engine. Nevertheless, it is designed so that new types of parser can be integrated if the need arises. Within the architecture of the indexing engine, every parser must implement the interface corelib.services.web.searchengine.scanners.DocumentScanner. Here is an extract from the declaration of that interface.
package corelib.services.web.searchengine.scanners;

public interface DocumentScanner {

    public String getDocumentTitle();

    public WordDictionary scanWebPage( String documentFilename ) throws DocumentScannerException;

}
The first method, getDocumentTitle, returns the title of the scanned document. The second method, scanWebPage, analyzes the document passed as a parameter and extracts the words it contains. The location of the document is given as an absolute path on the file system. This method must return an object of type corelib.services.web.searchengine.scanners.WordDictionary. It is through this object that the indexing engine finds the words contained in the document.
The WordDictionary class provides, in particular, the method addWordScore, which adds a word and its associated score to the dictionary. If the word already exists for this document, only its score is added. If you do not want to split a character string into words yourself, you can also use the method public void parseString( String theString, WordScoreEnum currentScore ). The first parameter is the character string to split. The second is the score associated with the words in the string. The score, represented by an enumerated type, depends on the context in which the string was found (a document title, a subtitle, ...). For more information, please refer to the NWS javadoc.
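The accumulation behaviour described above can be pictured with a simplified stand-in. The class SimpleWordDictionary below is hypothetical and is not the NWS WordDictionary. Using an int score instead of WordScoreEnum, lower-casing words, splitting on anything that is not a letter or digit, and the getScore accessor are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in, not the NWS WordDictionary: keeps one cumulative
// score per word for a single document.
final class SimpleWordDictionary {

    private final Map<String, Integer> scores = new HashMap<>();

    // Adds the word with its score; if the word is already known,
    // the new score is added to the existing one.
    void addWordScore(String word, int score) {
        scores.merge(word.toLowerCase(Locale.ROOT), score, Integer::sum);
    }

    // Splits the text into words and gives every word the same score.
    void parseString(String text, int score) {
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                addWordScore(word, score);
            }
        }
    }

    // Returns the cumulative score of a word, or 0 if the word is unknown.
    int getScore(String word) {
        return scores.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }
}
```

For example, after parseString("Search engine search", 10), the word "search" holds a score of 20 and "engine" a score of 10, because the two occurrences of "search" accumulate.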
Once your DocumentScanner is written, you just have to register it in your web application for the associated file extension(s). To do so, once again add a parameter, SEARCH_ENGINE_ADDITIONAL_SCANNERS, to the file WEB-INF/web.xml.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE web-app PUBLIC
    "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app>

    <!-- File begin -->

    <context-param>
        <param-name>SEARCH_ENGINE_ENABLED</param-name>
        <param-value>true</param-value>
    </context-param>

    <context-param>
        <param-name>SEARCH_ENGINE_ADDITIONAL_SCANNERS</param-name>
        <param-value>ext1=fr.package.ClassBasedOnDocumentScanner;html=fr.package.OtherClass</param-value>
    </context-param>

    <!-- File end -->

</web-app>
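The value of SEARCH_ENGINE_ADDITIONAL_SCANNERS follows an extension=class pattern, with entries separated by semicolons. The sketch below shows one way such a value could be decoded into a map. The class ScannerMappingParser is hypothetical, and skipping malformed entries is an assumption, not documented NWS behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not the NWS implementation: decodes a value such as
// "ext1=some.ScannerClass;html=some.OtherClass" into extension -> class name.
final class ScannerMappingParser {

    static Map<String, String> parse(String paramValue) {
        Map<String, String> mapping = new LinkedHashMap<>();
        if (paramValue == null) {
            return mapping;
        }
        for (String rawEntry : paramValue.split(";")) {
            String entry = rawEntry.trim();
            int equals = entry.indexOf('=');
            // Assumption: entries without both an extension and a class name are skipped.
            if (equals <= 0 || equals == entry.length() - 1) {
                continue;
            }
            String extension = entry.substring(0, equals).trim();
            String className = entry.substring(equals + 1).trim();
            mapping.put(extension, className);
        }
        return mapping;
    }
}
```

Applied to the value used in the web.xml example above, parse returns two entries: ext1 mapped to fr.package.ClassBasedOnDocumentScanner and html mapped to fr.package.OtherClass.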
To close this tutorial, we will see how to query the indexing engine for the list of pages (ranked in order of relevance) that match a given search criterion. To obtain a set of results, invoke the method getSearchResultSet on the singleton returned by SearchEngine.getInstance(). This method accepts two parameters: the character string to search for (it is automatically split into words by the method WordDictionary.parseString) and the number of results to return. The following code shows how to write a web page class that queries the indexing engine.
01 package corelib.services.web.samples;
02
03 import corelib.services.web.searchengine.SearchEngine;
04 import corelib.services.web.searchengine.SearchResult;
05 import corelib.services.web.searchengine.SearchResultSet;
06 import corelib.services.web.server.WebPage;
07
08
09 public class Search extends WebPage {
10
11     private SearchResult [] results = null;
12
13
14     public String getRequestedString() {
15         return request.getParameter( "txtSearch" );
16     }
17
18
19     public SearchResult [] getResults() {
20         if ( this.results == null ) {
21             SearchEngine searchEngine = SearchEngine.getInstance();
22             SearchResultSet resultSet = searchEngine.getSearchResultSet( request.getParameter( "txtSearch" ), 50 );
23             this.results = resultSet.toArray();
24         }
25         return this.results;
26     }
27
28 }
WARNING: because of data binding, the lifecycle of a web page requires the results to be computed before the button click handlers are invoked. The easiest way is therefore to perform this step in the method that retrieves the results (here getResults, lines 19-26).
Then, you just have to write a web page to present the computed results. Note the use of data binding through a web component of type <web:Repeater>, which makes this easier.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<web:Html xmlns:web="corelib.services.web.components"
          codeBehind="corelib.services.web.samples.Search">
    <head>
        <title>Search page sample</title>
        <link rel="stylesheet" type="text/css" href="Nolege.css" />
        <script language="javascript" src="Includes/Global.js"></script>
    </head>
    <body>
        <h1 align="center">Search page sample</h1>

        <form action="Search.wp" method="post">
            <web:TextBox id="txtSearch" />
            <web:Button id="btnSearch" value="Rechercher" />
        </form>


        <h2>Your results for
            "<web:OutputText text="#{this.requestedString}" />"</h2>

        <ol>
            <web:Repeater values="#{this.results}" elementAlias="result">

                <li class="text" style="font-size: 11pt;">
                    <web:Link href="#{result.filename}">
                        <b><web:OutputText text="#{result.title}" /></b> <br/>
                        <web:OutputText text="#{result.filename}" />
                    </web:Link> -
                    <web:OutputText text="#{result.fileSize}" /> ko.
                </li>

            </web:Repeater>
        </ol>

    </body>
</web:Html>
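To make the idea of results "ranked in order of relevance" more concrete, here is a small sketch of one possible ranking scheme: sum the scores of the searched words for each document, sort by decreasing total, and keep the first results. This is only an assumption about how such a ranking could work; it does not describe the actual NWS algorithm, and the class RelevanceRanking and its in-memory index are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not the NWS algorithm: ranks documents by the
// sum of the scores of the searched words.
final class RelevanceRanking {

    // index: document name -> (word -> score); query: words separated by whitespace.
    static List<String> rank(Map<String, Map<String, Integer>> index, String query, int maxResults) {
        String[] words = query.toLowerCase(Locale.ROOT).split("\\s+");
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> document : index.entrySet()) {
            int total = 0;
            for (String word : words) {
                total += document.getValue().getOrDefault(word, 0);
            }
            if (total > 0) {
                totals.put(document.getKey(), total); // keep only matching documents
            }
        }
        List<String> ranked = new ArrayList<>(totals.keySet());
        // Highest total first; the sort is stable, so ties keep index order.
        ranked.sort((a, b) -> Integer.compare(totals.get(b), totals.get(a)));
        if (ranked.size() > maxResults) {
            return new ArrayList<>(ranked.subList(0, maxResults));
        }
        return ranked;
    }
}
```

The maxResults argument plays the same role as the second parameter of getSearchResultSet: it caps the number of documents returned to the page.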
Dominique LIARD - © 2007 SARL Infini Software - All rights reserved Other brands and product names in these documents are the property of their respective owners.