DIOGENE is a technology developed by CINECA, the Technological Centre for GEM, to monitor information about the information industry that is published on the web.
DIOGENE consists of two modules: one that retrieves data and one that analyses data. A third module, for the treatment of meta-data, can be added.
A ROBOT, or spider, trawls for information in a predefined collection of sites and downloads relevant pages. The collected information is added to a database, upon which an advanced text retrieval engine, specialising in text analysis, can be constructed.
DIOGENE's operation can be distributed or centralised: the search engine answers a query either by referring back to the information held on the site that published it, or by referring to a local copy maintained centrally. The former solution has the advantage of a unified point of access to the information, without having to search each site of origin directly. The latter solution is useful for sites whose pages are refreshed very frequently (for example the portals of the national daily papers), where it can be of interest to maintain a historical archive.
Where no structured information is maintained in the database, searching is a full-text search for keywords. Where there is meta-data associated with documents, the search can also be carried out on specific fields.

1. The spider

As explained above, the spider downloads information from a predefined collection of sites. The system administrator uses an interface to specify the URLs of the sites of interest. The maintenance of the spider configuration files can be distributed or centralised, depending on whether the Webmasters collaborate or not.
In the former case, every Webmaster maintains a file named diogene.txt on their site, which contains the instructions the spider follows so that it downloads only the information of interest. When the spider is launched, it points directly to these sites and captures documents.
Where there is centralised management, a system administrator must still maintain the diogene.txt files. In this case, however, when the spider is launched it points to a local gateway where the configuration files for all the sites of interest are maintained.
Documents are then downloaded in the same way as in distributed operation.
The diogene.txt file follows the robots.txt standard adopted by standard Internet search engines.
Through a dedicated web interface, the system administrator can then start and stop the spider, set how often it is launched and determine how many download attempts to make per page.
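
Because diogene.txt follows the robots.txt conventions, a file of this kind can be read with any robots.txt parser. The sketch below is only an illustration using Python's standard `urllib.robotparser`; the file contents, the agent name `diogene-spider` and the example URLs are assumptions, not taken from DIOGENE itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of a diogene.txt file, written in robots.txt syntax.
DIOGENE_TXT = """\
User-agent: *
Disallow: /private/
Allow: /news/
"""


def build_rules(text: str) -> RobotFileParser:
    """Parse robots.txt-style rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


RULES = build_rules(DIOGENE_TXT)


def may_download(url: str, agent: str = "diogene-spider") -> bool:
    """Return True if the rules allow this (assumed) agent to fetch the URL."""
    return RULES.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/news/today.html",
                "http://example.org/private/draft.html"):
        print(url, may_download(url))
```

In centralised management the same parsing step would apply; only the place the file is read from (a local gateway rather than the remote site) changes.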

2. The document insertion

When the spider has finished downloading the files, these are inserted into the index of the search engine. The insertion criteria are as follows:
- If the document is new (i.e. it was not present when the index was previously updated), it is inserted.
- If the document already exists in the database:
  - The date is checked: if it is the same, the document is ignored; if the downloaded document is more recent, it is updated.
  - In the absence of a date, the checksum of the stored document is compared with the checksum of the new copy. If the checksums are equal the document is ignored; otherwise it is updated.
- If a document in the database has since been removed from the site, it is either deleted or kept in the database, depending on whether it was decided to maintain an archive.
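
The criteria above can be sketched as a single decision function. This is a minimal illustration, not DIOGENE's actual code: the `Doc` record, the action names and the use of SHA-256 as the checksum are assumptions, and a downloaded copy with an older date (a case the criteria do not mention) is treated here as "ignore".

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Doc:
    """Hypothetical record for a document, as indexed or as just downloaded."""
    url: str
    body: bytes
    date: Optional[datetime] = None


def checksum(body: bytes) -> str:
    # Assumption: any stable digest works; the source does not name one.
    return hashlib.sha256(body).hexdigest()


def decide(indexed: Optional[Doc], fetched: Optional[Doc], keep_archive: bool) -> str:
    """Return 'insert', 'update', 'ignore', 'keep' or 'delete'."""
    if indexed is None and fetched is None:
        raise ValueError("nothing to compare")
    if indexed is None:
        return "insert"                      # new document
    if fetched is None:
        # Document removed from the site: depends on the archive policy.
        return "keep" if keep_archive else "delete"
    if indexed.date is not None and fetched.date is not None:
        # Same (or older) date: ignore; more recent: update.
        return "update" if fetched.date > indexed.date else "ignore"
    # No usable date: fall back to comparing checksums.
    same = checksum(indexed.body) == checksum(fetched.body)
    return "ignore" if same else "update"
```

For example, `decide(None, Doc("http://example.org/a", b"text"), keep_archive=False)` yields `"insert"`, since the document was not in the index before.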
Once the insertion procedure is finished, the downloaded documents are kept if DIOGENE is operating in centralised mode and deleted if it is operating in distributed mode.

3. The search

The search engine interface can be customised according to the application in which DIOGENE is installed. A critical parameter in the definition of the search interface is the presence of meta-data. Where this is present (i.e. the text on which the search is carried out is structured in some way, or the document is marked up in XML), a specific search mask can be constructed that interrogates the meta-data. Where there is no meta-data, the search mask interrogates free text, using Boolean operators (AND, OR, NOT) and a proximity parameter between words.
In both cases it is possible to determine how the results are presented, by specifying whether they are ordered by relevance, date, alphabetical order or consultation frequency.
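
To make the free-text case concrete, the sketch below shows Boolean matching, a word-proximity test and a result ordering over a tiny in-memory collection. It is only an illustration of the ideas, not DIOGENE's engine: the sample documents and the helper names `has`, `near` and `search` are invented for the example, and Python's own `and`, `or` and `not` stand in for the Boolean operators.

```python
import re

# Invented sample collection; dates are ISO strings so they sort correctly.
DOCS = [
    {"title": "Spider notes", "date": "2001-05-02",
     "text": "The spider downloads pages from selected sites."},
    {"title": "Archive policy", "date": "2001-03-10",
     "text": "Pages from daily papers are kept in a historical archive."},
    {"title": "Index update", "date": "2001-04-21",
     "text": "New pages are inserted; the archive is optional."},
]


def tokens(text):
    """Lower-cased words of the text, in order."""
    return re.findall(r"\w+", text.lower())


def has(text, word):
    """True if the word occurs in the text."""
    return word.lower() in tokens(text)


def near(text, a, b, max_distance):
    """True if words a and b occur within max_distance positions of each other."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == b.lower()]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def search(predicate, order_by="date"):
    """Titles of matching documents, ordered by the chosen field."""
    hits = [d for d in DOCS if predicate(d["text"])]
    return [d["title"] for d in sorted(hits, key=lambda d: d[order_by])]


if __name__ == "__main__":
    # pages AND NOT archive
    print(search(lambda t: has(t, "pages") and not has(t, "archive")))
    # pages within 4 words of archive
    print(search(lambda t: near(t, "pages", "archive", 4)))
```

Ordering by relevance or consultation frequency would need a score or a usage counter per document, which this sketch leaves out.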

4. The treatment of meta-data

DIOGENE is flexible in its treatment of meta-data, which can be embedded in the files that contain the documents (structured text, XML markup) or external to them (in cards that accompany the document). In the configuration file, diogene.txt, it is possible to indicate where to find meta-data and the meaning of the various fields present. Once extracted, the information is inserted in the database and index and can then be used for searching.
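
As an illustration of the embedded case, the sketch below extracts fields from an XML document using a mapping from field names to element paths. The mapping is written as a Python dictionary purely for the example; how diogene.txt actually expresses this information is not described here, and the sample document and field names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: meaning of each field -> where to find it in the file.
FIELD_MAP = {
    "title": "./head/title",
    "author": "./head/author",
    "date": "./head/date",
}

# Invented sample document with embedded meta-data.
SAMPLE = """\
<doc>
  <head>
    <title>Annual report</title>
    <author>Example Author</author>
  </head>
  <body>Text of the document.</body>
</doc>
"""


def extract_metadata(xml_text, field_map):
    """Return a dict of the fields that are present and non-empty."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, path in field_map.items():
        node = root.find(path)
        if node is not None and node.text:
            record[field] = node.text.strip()
    return record


if __name__ == "__main__":
    print(extract_metadata(SAMPLE, FIELD_MAP))
```

Fields missing from a document (here, the date) are simply left out of the record, so a field-based search mask would not match on them.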

DIOGENE and GEM: I2

I2 is a DIOGENE application that manages information from various sites:
- The Web sites of GEM's service providers (FIZ-Karlsruhe, DIMDI,…)
- E-magazines
- Web sites about Information Technology
No meta-data are collected for these sites; the search engine allows you to make full-text searches on the documents.
The I2 server stores only information about the documents, so the results you obtain point directly to the documents on the sites that published them.
To date the I2 index contains information on approximately 25 000 documents from 18 Web sites.