crawlerLogs: Table of contents

What is crawlerLogs?
What do I need to run crawlerLogs?
How to install crawlerLogs?
Quick start
How to configure crawlerLogs?
    The browser file
    The codes file
    The configuration file
    The extensions file
    The logs file
    The pages file
    The robots file
    The signatures file
Normal use
Advanced use
For more information...




What is crawlerLogs?

crawlerLogs is a freeware designed to generate simple HTML reports about bot activity on web sites.
Bots, also called spiders, are automated programs which parse web pages in order to index them in search engines and/or web directories.
crawlerLogs parses web site logs (it is designed for Apache logs) and retrieves information about these bots (names, visited pages, dates).

Finally, this information is compiled into a single HTML report (with graphics). These reports can also be saved in a ZIP archive. All the configuration is simply written in plain text files. No database or obscure third-party libraries are required.

What do I need to run crawlerLogs?

crawlerLogs simply needs a working Java Runtime Environment.

The operating system can be Windows or Linux; it doesn't really matter as long as it can run Java (crawlerLogs has been compiled and tested on an Athlon XP 2000 with 512 MB of RAM, running Windows XP Pro and Linux Mandrake LE 2005).

You also need a web server (Apache is strongly recommended because all development has been based on Apache's common log format) which produces logs in this format:

192.168.0.1 - - [27/Nov/2005:06:32:53 +0100] "GET / HTTP/1.1" 200 4834 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR; rv:1.7.12) Gecko/20050919 Firefox/1.0.7"
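
As an illustration, here is a minimal sketch (not crawlerLogs' own code) of how such a line breaks down into the fields crawlerLogs works with (IP, date, requested page, response code, user-agent signature), using a regular expression for this log format:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineDemo {
        // One regex group per field of the Apache combined log format:
        // host ident user [date] "request" status bytes "referer" "user-agent"
        private static final Pattern LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

        public static void main(String[] args) {
            String line = "192.168.0.1 - - [27/Nov/2005:06:32:53 +0100] \"GET / HTTP/1.1\" "
                    + "200 4834 \"-\" \"Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR; "
                    + "rv:1.7.12) Gecko/20050919 Firefox/1.0.7\"";
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                System.out.println("IP:        " + m.group(1)); // 192.168.0.1
                System.out.println("Date:      " + m.group(4)); // 27/Nov/2005:06:32:53 +0100
                System.out.println("Request:   " + m.group(5)); // GET / HTTP/1.1
                System.out.println("Code:      " + m.group(6)); // 200
                System.out.println("Signature: " + m.group(9)); // Mozilla/5.0 (...) Firefox/1.0.7
            }
        }
    }

The last group, the user-agent string, is the "signature" field referred to throughout this document.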

How to install crawlerLogs?

crawlerLogs is provided as a zip archive (which contains 4 folders, a jar archive and, depending on the version, one or more bash scripts). Installing it simply means extracting the archive into a folder of your choice (assuming Java is correctly installed, and that the path to the Java binaries and the classpath are correctly defined!).
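
For example, under Linux, installing and launching could look like this (the archive and jar names below are assumptions; use the actual names from the version you downloaded):

    unzip crawlerLogs.zip -d crawlerLogs
    cd crawlerLogs
    java -jar crawlerLogs.jar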

Quick start

First, install the crawlerLogs software (see the previous section).

Go to the conf folder of crawlerLogs: you'll see 8 text files. These files are used to configure crawlerLogs (they can be edited with vi, emacs or WordPad).

These files define the various file paths used by crawlerLogs: just configure them properly, then run crawlerLogs! Go to the folder where the report has been saved (see the configuration file) and open logs.html in your favorite browser.

[Screen capture of a generated report]

Each icon before a bot's name is a link to a more detailed report for that bot, shown below:

[Screen capture of a bot detail page]

On the right side, the "see filtered requests" link leads to a page listing all the requests kept for parsing.
The "See the sandboxed requests" link leads to a page which stores the requests that have not been identified. Check this page regularly and manually add new bots when you can identify one.

If you run several successive parsings on the same log file, press F5 (refresh) to update the report's HTML view.

How to configure crawlerLogs?

Just edit the files and add, replace or delete lines! Then launch the software and watch the output log (during parsing) to verify that your changes have been taken into account.

Don't change the filenames, otherwise nothing will work properly! The various configuration files are explained below:

The browser file

Each line of this file contains a term related to the most common browsers or download managers. You don't have to put every browser signature here, just the most frequent terms.

For example, a lot of browsers include the term Mozilla in their signature (Firefox, Internet Explorer, Netscape), so just write Mozilla. Of course you can add any term here; more terms means more chances to filter out browser requests.
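
For instance, a browser file could contain lines like these (Mozilla comes from the example above; the other terms are merely illustrative):

    Mozilla
    Opera
    Wget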

The codes file

Each line here is an HTTP response code in numerical form. This file is used when the software computes statistics about the HTTP response codes returned by your server.

It might also be useful to see, for example, how many 404 (file not found) errors your server produces...
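
For instance, a codes file covering common response codes could look like this (one code per line; pick the codes you care about):

    200
    301
    304
    404
    500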

The configuration file

See the quick start section for an explanation of this file, or just follow the comments inside it. Configuring at least this file is mandatory if you want to use crawlerLogs! And don't make any change to this file's structure!

The extensions file

File extensions you don't want to deal with in your report are listed here. For example, you probably don't want crawlerLogs to process the numerous requests for the graphical elements of your site (assuming most of them are GIF and JPG images), so just put these extensions in this file (without the leading dot, just the extension!) and the parser will flush away every request concerning .gif and .jpg files.
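
Following that example, an extensions file could look like this (png and css are merely further illustrations):

    gif
    jpg
    png
    css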

The logs file

Simply put the names of the log files you want crawlerLogs to process here. These files must be located in the absolute path defined in the configuration file. For example, if your server produces logs for 2 sites and the files are named site1_logs and site2_logs, just put site1_logs and site2_logs in this file (one per line, as usual) and the software will produce a report (and a save) for each file name listed here.
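
Following that example, the logs file would simply contain:

    site1_logs
    site2_logs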

The pages file

Pages or directories you don't want crawlerLogs to deal with are listed here. For example, if you don't want mypage.html and the files in the "photos" subdirectory to be processed, add mypage and photos/ to this file. Be careful to put the trailing / when you specify a subdirectory, otherwise any file containing the term photo (photo_of_me.jpg, for example) would be flushed too.
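
Following that example, the pages file would contain:

    mypage
    photos/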

The robots file

The IP addresses or names of recognized bots are stored here. Requests coming from these addresses are recognized and attributed to a bot (and kept out of further parsing). This file is updated frequently, because new bots appear all the time and they are added automatically by the software.
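
Assuming one entry per line, as in the other configuration files (check the file shipped with your version), the robots file could contain entries like these (both values are merely illustrative):

    66.249.66.1
    Googlebot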

The signatures file

This file contains terms that help the software recognize a bot. If a bot doesn't fetch robots.txt, there's no (simple) way to tell it apart from a browser. Sometimes these bots have a customized signature (see the last field of the Apache log line in the "What do I need to run crawlerLogs?" chapter) that contains terms like "spider", "bot" or "crawl". To help the software make this distinction automatically, this file is used intensively when parsing unknown requests.
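
Following the terms quoted above, a signatures file could look like this:

    spider
    bot
    crawl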

Normal use

Depending on the version you have downloaded, crawlerLogs comes with some scripts.

Note: dates passed to these scripts MUST be in SQL format, i.e. YYYY-MM-DD. January is 00 and December is 11!

Advanced use

If you're an experienced user, you can make crawlerLogs generate its reports automatically by using your operating system's scheduling features (such as cron under Linux).

Editing such files is not in the scope of this tutorial; many tutorials can be found on the net.
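
For example, under Linux, a crontab entry along these lines would run crawlerLogs every night at 2:00 (the path and jar name are assumptions to adapt to your installation):

    0 2 * * * java -jar /path/to/crawlerLogs/crawlerLogs.jar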

For more information...

This freeware is published and distributed by www.portail-ile-reunion.com and is the exclusive property of the owners of the www.portail-ile-reunion.com site.

No modification, refactoring, disassembly or reselling is authorized. This software may be redistributed freely.

Go to www.portail-ile-reunion.com/crawlerlogs.php for more information.