"Linux Gazette...making Linux just a little more fun!"


Searching a Web Site with Linux

By Branden Williams


As your website grows in size, so will the number of people who visit it. Most of these people are just like you and me: they want to go to your site, click a button, and get exactly the information they were looking for. To serve these users a bit better, the Internet community responded with the ``Site Search'': a way to search a single website for the information you are looking for. As a system administrator, I have been asked to provide search engines for people's websites so that their clients can get to their information as fast as possible.

Now the trick to most search engines (Internet-wide ones included) is that they index and search entire sites. Say, for instance, you are looking for a used car. You decide to look for an early-90s-model Nissan truck. You get on the web and go to AltaVista. If you search for ``used Nissan truck'', you will most likely come up with a few pages that have listings of cars. The pain comes when you follow one of those links and see a 400K HTML file with text listings of used trucks. You either go line by line until you find your choice or, like most people, use your browser's find command.

Now wouldn't it be nice if you could just search for your used truck and get the results you are looking for in one fell swoop?

A search CGI that I recently designed for a company called Resource Spectrum (http://www.spectrumm.com/) is what precipitated DocSearch. Resource Spectrum needed a solution similar to my truck analogy. They are a placement agency for highly skilled jobs that needed an alternative to posting their job listings to newsgroups. What was proposed was a searchable Internet listing of the jobs on their new website.

Now, the job listing came to us as a Word document that had been exported to HTML. I searched (no pun intended) long and hard for something I could use, but nothing turned up. All of the search engines I found only searched sites, not single documents.

This is where the idea for DocSearch came from.

I needed a simple, clean way to search that single HTML document so users could get the information they needed quickly and easily.

I got out the old Perl Reference and spent a few afternoons working out a solution to this problem. After a few updates, you see in front of you DocSearch 1.0.4. You can grab the latest version at ftp://ftp.inetinc.net/pub/docsearch/docsearch.tar.gz.

Let's go through the code here so we can see how this works. Before we really get into it, though, you need to make sure you have the CGI library (cgi-lib.pl) installed. If you do not, you can download it from http://www.bio.cam.ac.uk/cgi-lib/. It is simply a Perl library that contains several useful functions for CGIs. Place it in your cgi-bin directory and make it world readable and executable (chmod a+rx cgi-lib.pl).
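Once cgi-lib.pl is in place, pulling it into a script like this takes only a couple of lines. Here is a minimal sketch (the library path and the %input variable name are assumptions, though the $input{'term'} lookup later suggests the form data lands in %input):

require "cgi-lib.pl";   # load the CGI library from the cgi-bin directory
&ReadParse(*input);     # parse the submitted form data into %input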

Now you can start to configure DocSearch. First off, there are a few constants that need to be set. They are in reference to the characteristics of the document you are searching. For instance...

# The Document you want to search.
$doc = "/path/to/my/list.html";
Set this to the absolute path of the document you are searching.

# Document Title. The text to go inside the
# <title></title> HTML tags.
$htmltitle = "Nifty Search Results";
Set this to what you want the results page title to be.

# Optional Back link. If you don't want one, make the string null.
# i.e. $backlink = "";
$backlink = "http://www.inetinc.net/some.html";
If you want to provide a ``Go Back'' link, enter the URL of the file that we will be referencing.

# Record delimiter. The text which separates the records.
$recdelim = "&nbsp;";
This part is one of the most important aspects of the search. The document you are searching must have something in between the "records" to delimit the HTML document. In plain English, you will need to place an HTML comment or some other marker in between each possible result of the search. In my example, MS Word put the &nbsp; entity in between all of the records by default, so I just used that as the delimiter.
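To make that concrete, here is a hypothetical alternative: if you control the document yourself, you could separate the entries with an HTML comment and point $recdelim at it.

# Hypothetical example: each job entry in list.html separated by an
# HTML comment such as <!-- job -->
$recdelim = "<!-- job -->";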

Next we ReadParse() our information from the HTML form that was used as a front end to our CGI. Then to simplify things later, we go ahead and set the variable $query to be the term we are searching for.

$query = $input{'term'};
This step can be repeated for each query item you would like to use to narrow your search. If you want any of these items to be optional, just add a line like this in your code.

if ($query eq "") {
 $query = " ";
}
This will match virtually any record in the document.
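For instance, here is a hypothetical second form field (say, ``city'') handled the same way, so that leaving it blank does not narrow the search:

$city = $input{'city'};
if ($city eq "") {
	$city = " ";	# a lone space matches nearly every record
}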

Now comes a very important step. We need to make sure that any metacharacters are escaped. Perl's regular expressions treat metacharacters specially, so they can modify and change the search results. We want to make sure that any characters entered into the form are not going to change the outcome of our search in any way.

$query =~ s/([-+*?.()<>&|^\$%=\[\]\\])/\\$1/g;
Boy, does that look messy! That is just a regular expression to escape all of the metacharacters; for example, it will change a + into a \+.
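If you would rather not maintain that character class by hand, Perl's built-in quotemeta() function (or the \Q...\E escape inside a pattern) does the same job and covers every metacharacter; a one-line alternative:

$query = quotemeta($query);	# backslash-escape every non-word character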

Now we need to move right along and open up our target document. When we do this, we will need to read the entire file into one variable. Then we will work from there.

open (SEARCH, "$doc") || die "Cannot open $doc: $!";
undef $/;
$text = <SEARCH>;
close (SEARCH);
The only thing you may not be familiar with is the undef $/; statement. For our search to work correctly, we must undefine $/, the Perl input record separator that normally splits input into lines. This is necessary because we must read the entire file into one variable; unless $/ is undefined, only the first line will be read.
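A slightly tidier variation on the same idea, in case you want to keep the change to $/ from affecting the rest of the script, is to localize it inside a block (this is an alternative sketch, not what DocSearch itself does):

{
	local $/;	# undefine the input record separator in this block only
	open (SEARCH, "$doc") || die "Cannot open $doc: $!";
	$text = <SEARCH>;	# slurp the whole document at once
	close (SEARCH);
}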

Now we will start the output of the results page. It is good to customize it and make it appealing to the user. This is free-form HTML, so all you HTML guys, go at it.
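What that header block looks like is up to you; a minimal sketch (the markup here is an assumption, not the exact HTML DocSearch emits) might be:

print "Content-type: text/html\n\n";	# required HTTP header for a CGI
print "<html>\n<head><title>$htmltitle</title></head>\n<body>\n";
print "<h2>$htmltitle</h2>\n<hr>\n";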

Now we will do the real searching. Here is the meat of our search. You will notice there are two commented-out regular expressions in the loop. If you do not want to display any links or images from the document, uncomment those lines.

@records = split(/$recdelim/,$text);
We want to split up the file into an array of records. Each record is a valid search result, but is separate from the rest. This is where the record delimiter comes into play.
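As a quick worked example (hypothetical data), suppose the delimiter is &nbsp; and the document contains three entries:

# $text    = "<p>Job one</p>&nbsp;<p>Job two</p>&nbsp;<p>Job three</p>"
# @records = ("<p>Job one</p>", "<p>Job two</p>", "<p>Job three</p>")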

foreach $record (@records)
{
#	$record =~ s/<a.*<\/a>//ig;	# Do not print links inside this doc.
#	$record =~ s/<img.*>//ig;	# Do not display images inside this doc.
	if ( $record =~ /$query/i ) {
		print $record;
		$matches++;
	}
}
This basically prints out every $record that matches our search criteria. Again, you can change the number of search criteria you use by changing that if statement to something like this.

if ( ($record =~ /$query/i) && ($record =~ /$anotheritem/) ) {
This will try to match both queries against $record and, upon a successful match, dump that $record to our results page. Notice how we also increment a variable called $matches every time a match is made. This is not so much to tell the user how many records were found as it is a count that tells us when no matches were found, so we can tell the user that no, the system is not down; we simply did not match any records for that query.
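Pulling that together with the optional $city field sketched earlier, a hypothetical two-criterion version of the loop body would look like this:

	if ( ($record =~ /$query/i) && ($record =~ /$city/i) ) {
		print $record;	# record matches both terms
		$matches++;
	}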

Now that we are done searching and displaying the results of our search, we need to do a few administrative actions to ensure that we have fully completed our job.

First off, as I was mentioning before, we need to check for zero matches in our search and let the user know that we could not find anything to match his query.

if (!$matches) {
 $query =~ s/\\//g;

print << "End_Again";

 <center>
 <h2>Sorry! "$query" was not found!</h2><p>
 </center>
End_Again
}
Notice that lovely regular expression. Since we took all of that trouble to escape the metacharacters, we now need to remove the escape characters. That way, when users see that their $query was not found, they will not look at it and say ``But that is not what I entered!'' Then we dump the HTML that breaks the bad news to the user.

The only two things left to do are to end the HTML document cleanly and print the optional back link.

if ( $backlink ne "" ) {
	print "<center>";
	print "<h3><a href=\"$backlink\">Go back</a></h3>";
	print "</center>";
}

print << "End_Of_Footer";

</body>
</html>

End_Of_Footer
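One piece not shown above is the front-end form itself. All it needs is a text field named term pointing at the CGI; a minimal hypothetical version (adjust the action path to wherever you installed the script) looks like this:

<form method="POST" action="/cgi-bin/docsearch.cgi">
Search the listings: <input type="text" name="term" size="30">
<input type="submit" value="Search">
</form>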
All done. Now you are happy because the user is happy. Not only have you streamlined your website by letting users search a single page, but you have also given them exactly the results they want. The only possible outcome is more hits: by helping your users find the information they need, you encourage them to tell their friends about your site, and their friends will tell their friends, and so on. Putting the customer first sometimes does work!


Copyright © 1998, Branden Williams
Published in Issue 32 of Linux Gazette, September 1998

