What is crawlerLogs ?
What do I need to run crawlerLogs ?
How to install crawlerLogs ?
Quick start !!!
How to configure crawlerLogs ?
The browser file
The codes file
The configuration file
The extensions file
The logs file
The pages file
The robots file
The signatures file
Normal use
Advanced use
For more information...
crawlerLogs is freeware designed to generate simple HTML reports about bot activity on web sites.
Bots, also called spiders, are automated programs which parse web pages in order to index them in search engines and/or web directories.
crawlerLogs parses web site logs (it is designed for Apache logs) and retrieves information about these bots (names, visited pages, dates).
Finally, this information is compiled into a single HTML report (with graphics).
Reports can also be saved in a ZIP archive. All configuration is simply written in text files. No database or obscure third-party libraries are required.
crawlerLogs simply needs a working Java Runtime Environment (JRE), which means:
crawlerLogs is provided in a zip archive (which contains 4 folders, a jar archive and, depending on the version, one or more bash scripts).
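Installation boils down to extracting that archive. As a minimal sketch, assuming the archive is named crawlerlogs.zip and you want it in your home directory (both names are assumptions, adapt them to the file you downloaded):

    unzip crawlerlogs.zip -d ~/crawlerlogs
    cd ~/crawlerlogs && ls

You should then see the 4 folders, the jar archive and, depending on the version, the bash scripts.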
Here's the procedure (assuming Java is correctly installed and the path to the java binaries and the classpath are correctly defined!).
First, install the crawlerLogs software (see the previous section).
Go to the crawlerLogs /conf folder: you'll see 8 text files. These files are used to configure crawlerLogs (they can be edited with vi, emacs or WordPad).
Just edit the files and add, replace or delete lines, then launch the software and watch the output log (during parsing) to verify that your changes have been taken into account (a complete example follows these steps).
Don't change the filenames, otherwise nothing will work properly!
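Put together, a typical first run could look like the sketch below. The install path, the conf file name and the jar name are assumptions (use the names found in the archive you downloaded):

    java -version               # check that a JRE is installed and on the PATH
    cd ~/crawlerlogs            # assumed install folder from the previous section
    vi conf/logs.txt            # hypothetical file name: edit the 8 files in /conf as needed
    java -jar crawlerLogs.jar   # hypothetical jar name: launch crawlerLogs and watch its output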
Here's an explanation of the various configuration files:
Each line of this file contains a term related to the most common browsers or download managers.
You don't have to put every browser signature here, just put the most frequent terms.
For example, a lot of browsers include the term Mozilla in their signature (Firefox, Internet Explorer, Netscape), so just write Mozilla.
Of course you can add any term here; more terms means more chances to filter out browsers' requests.
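As a purely illustrative example, the browser file could contain lines such as the following (only Mozilla comes from the example above; Opera and Wget are guesses at other common signatures):

    Mozilla
    Opera
    Wget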
Each line here is an HTTP response code in numerical form. This file is used when the software computes statistics about the HTTP response codes returned by your server.
It might also be useful to see, for example, how many 404 errors (file not found) are produced by your server...
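For example, a codes file restricted to the responses you care about might contain the following (the exact selection is up to you; these are standard HTTP codes):

    200
    301
    304
    404
    500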
See the quick start section for an explanation of this file, or just follow the comments in the file. It's mandatory to configure at least this file if you want to use crawlerLogs! And don't make any change to this file's structure!
Here are listed the file extensions you don't want to deal with in your report. For example, you probably don't want crawlerLogs to deal with the numerous requests for the graphical elements of your site (most of them images in GIF and JPG format), so just put these extensions in this file (without the leading dot, just the extension!) and the parser will flush away every request for .gif and .jpg files!
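An illustrative extensions file could look like this (gif and jpg come from the example above; png, css and js are common additions you may or may not want):

    gif
    jpg
    png
    css
    js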
Simply put here the names of the log files you want crawlerLogs to process. These files must be located in the absolute path defined in the configuration file. For example, if your server produces logs for 2 sites and the files are named site1_logs and site2_logs, just put site1_logs and site2_logs in this file (one per line, as usual) and the software will produce reports (and saves) for each file name listed here.
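Following that example, the logs file would simply contain:

    site1_logs
    site2_logs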
Here are listed the pages or directories you don't want crawlerLogs to deal with. For example, if you don't want mypage.html and all files in the subdirectory "photos" to be processed, add mypage and photos/ to this file. Be careful to put the trailing / if you specify a subdirectory, otherwise any file whose name contains the term photos (photos_of_me.jpg for example) would be flushed too.
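With that example, the pages file would contain:

    mypage
    photos/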
Here are stored the IP addresses or host names of recognized bots. Requests coming from these addresses are recognized, attributed to a bot, and kept out of further parsing. This file is frequently updated because there are a lot of new bots, and they are added automatically by the software.
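Entries are added automatically, but for illustration a line is either an IP address or a host name. The two entries below are made-up examples of the kind of values you might see, not values shipped with the software:

    66.249.66.1
    crawl-66-249-66-1.googlebot.com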
This file contains terms that help the software recognize a bot. If a bot doesn't fetch robots.txt, there's no (simple) way to tell it apart from a browser. Sometimes these kinds of bots have a customized signature (see the last field of the Apache log line in the What do I need to run crawlerLogs ? chapter) that contains terms like "spider", "bot" or "crawl". To help the software make the difference automatically, this file is used intensively when parsing unknown requests.
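For example, following the terms quoted above, the signatures file would contain one term per line:

    spider
    bot
    crawl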
Depending on the version you have downloaded, crawlerLogs comes with some scripts.
If you're an experienced user, you can make crawlerLogs run and generate its reports automatically by using your operating system's features.
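For example, on a Unix-like system you could schedule a nightly run with cron. A minimal sketch, assuming crawlerLogs is installed in $HOME/crawlerlogs and the jar is named crawlerLogs.jar (both assumptions, adapt them to your setup), added via crontab -e:

    # hypothetical crontab entry: run crawlerLogs every night at 02:00
    0 2 * * * cd $HOME/crawlerlogs && java -jar crawlerLogs.jar >> crawlerlogs_cron.log 2>&1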
This freeware is published and distributed by www.portail-ile-reunion.com and is the exclusive property of the owners of the www.portail-ile-reunion.com site.
No modification, refactoring, disassembly or reselling is authorized. This software may be redistributed freely.
Go to www.portail-ile-reunion.com/crawlerlogs.php for more information.