Distributed searching across multiple library catalogues: the technical and operational challenges

Many individual library catalogues are now available on the web, and most of these use the Z39.50 information retrieval protocol for data transfer between the client application on the user side and the library catalogue database on the server side.

Why use a protocol?

Protocols are useful and powerful data transfer mechanisms. A protocol, in software terms, is a set of rules that allows data to be transferred between client and server applications. An analogy can be drawn with the mechanism used to send and receive information between mobile phones and the Internet. If your mobile is WAP enabled, that is, it uses the Wireless Application Protocol, then you can connect to the Internet, and information can be transferred from WAP enabled Internet sites to your mobile, whether it is a Nokia, a Motorola or any other brand of handset, and regardless of the underlying hardware or software differences between the brands.

With Z39.50 client software, the end user should be able to retrieve information from library catalogues which can be accessed by Z39.50 server software, regardless of the underlying database used to store the catalogue records, the library system software, the system server hardware or the operating system.

Apart from data transfer, why use Z39.50 to find information?

The major, and probably most important, use of the standard is to provide a single interface and search form that lets end users interrogate multiple catalogues or information databases at once. This saves the sometimes considerable time and effort spent accessing individual catalogues, then re-keying and running the same search queries on each one. It is one of the most useful added value services that can be provided via the protocol.

Z39.50 is especially useful for requesting information from databases and narrowing the request down to a specific field. Let us say that you are looking for a list of books authored by James Joyce. Typing "James Joyce" into a traditional search engine would retrieve thousands of hits which mention James Joyce, but would probably completely omit useful resources such as library catalogues, or other storage sites in which the information is embedded within database records. A search engine would also include lots of marginal or completely off track documents in which the string of text "James Joyce" occurs. Z39.50 is a protocol specifically set up to handle database retrievals and their more complex information transactions, and can find items within databases fronted by a Z server with a greater degree of specificity than is possible with search engines.
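To make the idea of a fielded search concrete, the following minimal Python sketch builds an author query in Prefix Query Format (PQF), a textual notation accepted by some Z39.50 toolkits such as Index Data's YAZ. The Use attribute numbers are standard Bib-1 values; the function name fielded_query is hypothetical, and the sketch only builds the query string without contacting any server.

```python
# Bib-1 Use attribute numbers for a few common access points.
BIB1_USE = {
    "title": 4,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
    "author": 1003,
    "keyword": 1016,
}


def fielded_query(field: str, term: str) -> str:
    """Return a PQF string that restricts `term` to one access point."""
    if '"' in term:
        # Keep the sketch simple: reject rather than escape embedded quotes.
        raise ValueError("term must not contain a double quote")
    use = BIB1_USE[field]  # raises KeyError for an unknown access point
    return f'@attr 1={use} "{term}"'


print(fielded_query("author", "James Joyce"))
```

A general search engine has no equivalent of the "1=1003" restriction, which is why it returns every page that merely mentions the name.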

The ability to search multiple catalogues also paves the way for subject based, regional, national and international distributed super catalogues, by clustering databases and catalogues together in useful groupings. Additionally, larger organisations in diverse geographic locations that would never normally be loaded together in a centralised union catalogue can be included in cross-catalogue searching. Thus, Z39.50 allows for the creation of important new resource discovery tools for library staff and researchers generally.

The traditional method of creating compilation bibliographic catalogues was through publishing a listing of the catalogue records in printed format. For larger union catalogues, with maybe several hundred thousand records, this was a slow and arduous process. Inevitably, the information could be years out of date by the time the catalogues were actually published. With computerisation, it is possible to compile union catalogues electronically. One approach is to set up a centralised database, with data loads from the catalogues of the participating libraries, as in the COPAC model. COPAC is the centralised database of bibliographic records and holdings of the UK based Consortium of University and Research Libraries. Such databases are, of course, much easier to update, but depending on the frequency of data loads, there will still be a gap between the cataloguing of more recent items and the appearance of their records in the centralised catalogue. The other main union catalogue model grew from co-operative cataloguing systems, such as that provided by OCLC, with participating libraries downloading catalogue records from the main database of records and adding their own holdings information to the centralised repository, or adding new records as appropriate.

Z39.50 allows for the creation of a truly distributed system, without data loads or duplication of effort in terms of data-entry. Searches from the client go directly to the target database, so there is a real-time view of the target system, with no data load gaps. Some target library systems can also provide circulation status information, thereby giving a view of the records that is as up to date as it is possible to achieve.

Distributed searching - the technical and operational challenges

IRIS OPAC currently provides researchers with a single user interface and seamless virtual integration of 6 major Irish University catalogues, the Enterprise Ireland information centre catalogue, plus the UK based Consortium of University and Research Libraries COPAC database, the British Library's Books and Journals catalogues and the U.S. Library of Congress. In total, researchers using IRIS OPAC have access to a collective set of more than 22,000,000 individual catalogue records. This is a potentially powerful research tool, with plans for more Z links to other national and international collections. Although these records are stored on multiple software and hardware platforms, and in geographically remote locations, the Z39.50 protocol provides the underlying data transfer mechanism that makes this integration possible. The Z39.50 client is supplied by SEREN (Sharing Resources in an Electronic Network), the Welsh equivalent of IRIS.

Although IRIS OPAC is relatively simple to use, and has a user interface quite like the traditional library online public access catalogue, there are some important technical differences, particularly with regard to information retrieval and cross-catalogue searching. These technical differences impact on the design of the user interface, and also affect the quantity and type of search queries we can offer end users, and the information display at list and full record level.

Level of Z knowledge at local level and at vendor level

To make an initial connection to a Z39.50 target database, it is necessary to set up basic connection links from the web client to the target library Z server. The local library system administrator must provide the IP address, database name, and Z39.50 port number. In addition, he/she requires a reasonable working knowledge of the local target Z server. In general, vendors do not seem to be providing local system administrators with the level of documentation and training necessary to provide understandable, complete and accurate information about the configuration and operation of their Z server software. This leads to delays in information provision at local level and makes the information gathering process more protracted than it need be.
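The three pieces of connection information named above can be held in a small record per target. The following Python sketch is illustrative only: the class name ZTarget, the host addresses and the database names are hypothetical, and the host:port/database string is simply one common way of writing a target address. Port 210 is the registered Z39.50 port, although individual sites may use another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZTarget:
    """Connection details for one Z39.50 target."""
    name: str
    host: str        # IP address or hostname of the Z server
    database: str    # Z39.50 database name
    port: int = 210  # registered Z39.50 port; a site may use a different one

    def address(self) -> str:
        """Return the target in host:port/database form."""
        return f"{self.host}:{self.port}/{self.database}"


# Hypothetical example values, not real IRIS targets.
targets = [
    ZTarget("Example University", "192.0.2.10", "opac"),
    ZTarget("Example Institute", "192.0.2.20", "catalogue", port=2100),
]

for target in targets:
    print(target.name, "->", target.address())
```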

The provision of Z server information on a local web page at the target end is very useful for external gateway resource discovery administrators. The level of information provided on the Z server configuration web pages at the COPAC and Library of Congress websites may seem detailed, but is probably the minimum required to allow an administrator to set up a valid connection, search on some key access points, and understand the operation of the target Z server.

Firewalls

In an era of tightening Internet security, it is necessary that the Z39.50 web client administrator liaises with the server side system administrator to ensure that the institution firewall is configured to allow searches that come in via the Z web client to reach the Z39.50 port at the server end. Policies on access vary from institution to institution, depending on the security level of the library server. If you are using a Z client, and find that you cannot connect to a target, and your basic connection information of IP address, Z database name, and port number is correct, it is worth checking whether the problem lies at the firewall, and discussing access issues with the firewall administrator.
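A quick first check, before contacting the firewall administrator, is to see whether a plain TCP connection to the target's Z39.50 port can be opened at all. The following Python sketch uses only the standard library; the function name port_reachable is hypothetical. A successful connection shows only that the port is open from where the check is run, not that the Z server behind it is answering searches correctly.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hostnames.
        return False
```

For example, port_reachable("192.0.2.10", 210) would test a (hypothetical) target on the registered Z39.50 port. The check should be run from the machine that hosts the Z web client, since firewall rules usually depend on the source address.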

Maintaining constant connections with targets

Links to target library Z servers can be disconnected if the Z server software itself is not running. If the hardware server has been taken down for backups or other routine maintenance, the Z server software may not be restarted automatically when the system is rebooted. Disconnections do not seem to be a problem where Z server software is also used to set up the local OPAC, as any problems with the server and ensuing disconnections would be immediately brought to the attention of the local system administrator and swiftly remedied.

However, within IRIS OPAC, three of our main University libraries are using a library management system which does not use Z39.50 client server software to provide their own proprietary web OPAC. There is Z server software available, but this must be run against a duplicated set of database records, searched by the BRS search engine. In this case, Z39.50 server disconnection problems may not be noticed by local system administrators, and greater vigilance is required. Resetting the Z server automatically after rebooting, by scripting this into the start up procedures, plus automatic monitoring of the connections with a product such as Index Data's Z-Spy, which automatically emails the local system administrator when a target library is not available, can assist in keeping connections constant.
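The monitoring idea can be sketched in a few lines of Python. This is not how Z-Spy works internally, which the text does not describe; it is a minimal illustration under stated assumptions. The target names and addresses are hypothetical, the probe is passed in as a function so that any availability test can be used, and sending the resulting message by email is left out.

```python
from typing import Callable, Dict, List, Tuple

Probe = Callable[[str, int], bool]


def find_unavailable(targets: Dict[str, Tuple[str, int]], probe: Probe) -> List[str]:
    """Return the names of targets for which the probe reports failure."""
    return [name for name, (host, port) in targets.items() if not probe(host, port)]


def alert_text(down: List[str]) -> str:
    """Build the body of a notification for the system administrator."""
    if not down:
        return "All targets available."
    return "Unavailable Z39.50 targets: " + ", ".join(sorted(down))


# Hypothetical targets, plus a stand-in probe that treats one host as down.
targets = {
    "Library A": ("192.0.2.10", 210),
    "Library B": ("192.0.2.20", 210),
}


def fake_probe(host: str, port: int) -> bool:
    return host != "192.0.2.20"


print(alert_text(find_unavailable(targets, fake_probe)))
```

Run on a schedule with a real probe, a check of this kind would flag a Z server that was not restarted after a reboot, even when the local web OPAC is unaffected.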

Speed and performance

When executing a query across multiple catalogues, the speed at which the full set of hits is returned is dictated by the speed of the slowest target database in the group. Furthermore, if a target is disconnected for any reason, it may delay the whole system until it times out. Within IRIS OPAC, record returns for the full set can range from 5-15 seconds, occasionally longer if the search produces a large number of hits from each site. It may be more useful, if technically possible, to incrementally display the hits from faster targets, rather than have users wait for a full set to be returned. Additionally, timeouts may need to be adjusted on client and server side software to ensure that the whole system is not unduly delayed.
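The difference between waiting for the slowest target and displaying hits incrementally can be illustrated with Python's standard concurrent.futures module. In this sketch the searches are simulated with sleeps, and the function names, target names and timings are all made up; it is not a description of how the IRIS or SEREN client is implemented. Results are yielded as each target responds, and any target that has not answered by the overall deadline is reported as timed out instead of holding up the rest.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed
from typing import Dict, Iterator


def search_target(name: str, delay: float) -> str:
    """Stand-in for a real Z39.50 search; sleeps to simulate response time."""
    time.sleep(delay)
    return f"{name}: hits returned"


def broadcast_search(delays: Dict[str, float], overall_timeout: float) -> Iterator[str]:
    """Yield each target's result as it arrives; report targets that miss the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = {pool.submit(search_target, name, d): name for name, d in delays.items()}
    finished = set()
    try:
        for future in as_completed(futures, timeout=overall_timeout):
            finished.add(futures[future])
            yield future.result()
    except TimeoutError:
        for name in futures.values():
            if name not in finished:
                yield f"{name}: timed out"
    finally:
        # Do not block on stragglers; their threads finish in the background.
        pool.shutdown(wait=False)


# Simulated response times in seconds (made-up values).
delays = {"fast": 0.05, "medium": 0.3, "slow": 1.5}
for line in broadcast_search(delays, overall_timeout=0.8):
    print(line)
```

The overall timeout plays the role of the client side timeout mentioned above: shortening it improves responsiveness at the cost of dropping slow targets from the result set.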

The effect on the local library system server of including its library catalogue within the IRIS default group of searchable catalogues needs to be addressed. Smaller library catalogues that previously dealt with a limited number of local users may now be searched automatically when the IRIS group is selected by an end user. At the moment, this is not a particular problem, given the limited number of users, mainly ILL staff, who use the system. Current IRIS targets are from the larger research institutions, with correspondingly robust servers. However, if IRIS OPAC is promoted more widely as a resource discovery mechanism, becomes a popular research tool, and includes targets from smaller libraries, it is possible that one could run into resource problems on a local system server which was never scaled to handle that number of users and requests. More research and diagnostic tools are needed to evaluate this important issue. Index Data are currently working with the Danish State Library Service, testing the performance issues associated with large-scale parallel searching. The challenge is to optimise performance to match that of non-distributed large scale centralised union catalogues such as the COPAC union catalogue model, while minimising the effect on systems at local level.

The design of the client-end interface also needs to be carefully looked at in order to speed up the search and retrieval process. The mode of entry, and search forms, on IRIS OPAC have been redesigned so that users need to negotiate fewer entry screens before reaching the point of search. In addition, sets of targets can be selected automatically via radio buttons, thereby cutting out the need to select each target checkbox individually. SEREN will also be working on changing the method of retrieval of hits at full record display level in future versions of the software to further enhance performance.

Catalogue record quality

Distributed search systems can only retrieve on access points such as author, title, subject, ISSN, ISBN, and provide display level information such as date of publication, and full record formats, if the original catalogue records contain that information in the first place. It's an obvious point, but one worth highlighting. On a distributed OPAC, good quality records, preferably MARC records catalogued to a high standard, provide enough information to retrieve an item on key field indexes. The record should also give sufficient detail for lending staff and other researchers to identify an item correctly. Poorly catalogued records, with missing date fields, inconsistent author formats, non-standard subject indexing and other anomalies, reduce retrieval accuracy. Additionally, mappings to the Bib-1 attributes are more difficult to achieve.

Record quality particularly affects the user end display side of IRIS OPAC. If, for example, a date of publication is missing or catalogued in an incorrect format, then obviously we cannot pull that information into our hit list screen, which displays title, author and publication date. At full record display level, the same record can be very brief on one target and very detailed on another. If one does a search for a specific ISBN or ISSN across multiple systems, it is relatively easy to see the differences in the quality of catalogue records between institutions.

Restrictions on record transfer imposed by the server-end library system vendor

Three of our IRIS library targets run on a library system whose vendor also supplies MARC records via a union database to the system purchaser. When IRIS users searched these targets on IRIS OPAC, we found that this system's Z39.50 target servers were configured by default not to transfer the full MARC record, but instead to provide a restricted number of fields. Users could search on author, title, subject, keyword, ISBN and ISSN, but would only receive and view MARC fields up to field 260. This was set in place to protect contractual arrangements with the suppliers of the original MARC records. However, this causes obvious difficulties where users want to view subject fields or indeed any other fields with tags higher than 260. Each library using that Z server has to negotiate separately with the vendor, so that the settings may be changed and more detailed record formats transferred to the IRIS web client.
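The practical effect of a cut-off at field 260 is easy to see with a small Python illustration. The record below is a made-up, heavily simplified stand-in for a MARC record (a list of tag and content pairs), and restrict_fields is a hypothetical function that mimics the visible effect of the restriction; it is not the vendor's implementation.

```python
from typing import List, Tuple

Field = Tuple[str, str]  # (MARC tag, field content)


def restrict_fields(record: List[Field], max_tag: int = 260) -> List[Field]:
    """Keep only fields whose numeric tag is at or below max_tag."""
    return [(tag, value) for tag, value in record if int(tag) <= max_tag]


# A made-up, simplified record for illustration only.
record = [
    ("100", "Author, Example"),
    ("245", "An example title"),
    ("260", "Place : Example Publisher, 2000"),
    ("650", "Example subject heading"),
    ("852", "Example holding location"),
]

for tag, value in restrict_fields(record):
    print(tag, value)
```

The subject heading (650) and holdings (852) fields are dropped, which is exactly the information that users and interlibrary loan staff most often need from the full record.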

Holdings and circulation status information

The first question which needs to be answered is whether the local system Z server supports the recently developed Z39.50 holdings schema. If not, then the format of holdings information within each target's library catalogue database and indexes needs to be documented in detail. Generally speaking, holdings are now held in field 852, but initial investigations of the IRIS libraries indicate that there are variations. Where items are held in electronic format only, is field 856 returned? How are holdings of serials at volume and issue level handled?

Retrieval of circulation status information can also be problematic. Circulation status information is not intrinsic to the MARC record, but is usually held in a separate database and linked to the item via the control number. If the item status information is updated and embedded in field 852 of the MARC record by the local library system, and the information transfers across to IRIS OPAC, then a view of the status is possible. However, where circulation status is only linked to the MARC record at local OPAC level, and the local end user display does not require it to be embedded in field 852, then obviously the information does not transfer across. Even if the information can be sent successfully via a Z session, the client end software must be able to interpret and process this information so that it displays correctly.

As it stands at the moment, TCD, using the GEAC Advance system, is the only library that can currently supply us with circulation status information. Users of IRIS need to switch to a MARC view of the record to actually view the TCD records' circulation status on IRIS OPAC. This whole area is currently under investigation and further development by SEREN and IRIS.

Implementation and Interpretation of the Z39.50 protocol

The protocol itself is designed to cater for a multiplicity of retrieval situations within database environments, and within different domains, and so is suitably wide ranging in scope. While some software developers have incorporated most of the Z39.50 protocol features and support most of the Z39.50 services and attribute sets, there are variations between vendors, and these variations, taken across the collective set of targets, may play havoc with cross-catalogue searching.

For example, if a researcher wants to use a Z OPAC for distributed searching across multiple databases, he/she will search on some key fields, such as author, title, subject or keyword. This will only work effectively if these query fields are mapped to attributes that are supported and configured as active within the Z server software at the target end. [These attributes belong to a set of retrieval mappings known as the Bib-1 attribute set, used by the Z39.50 protocol to let the systems exchange information on fields commonly used for bibliographic information retrieval].

Most targets supported title, author, ISBN and ISSN, but there was variable support for, for example, subject (Use attribute 21) and keyword (Use attribute 1016), with some systems providing support for both, some for one or the other, and one system not using Use attribute 1016 for keyword, but preferring another attribute completely. This vendor also required that the client send specific attribute combinations, rather than use defaults. So even within these basic query fields, there were already differences arising because of differing interpretations of the protocol, and variations in how library system vendors implemented it within their server software.
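One way to see what a broadcast search can safely offer is to intersect the Use attributes supported by each target. In the following Python sketch the support table is hypothetical (real values would have to come from each target's documentation or from testing), and the function name is an assumption; only the attribute numbers themselves are standard Bib-1 values (4 title, 7 ISBN, 8 ISSN, 21 subject, 1003 author, 1016 keyword).

```python
from typing import Dict, Set

# Hypothetical support table, for illustration only.
SUPPORTED_USE: Dict[str, Set[int]] = {
    "Target A": {4, 7, 8, 21, 1003, 1016},
    "Target B": {4, 7, 8, 1003, 1016},  # no subject (21)
    "Target C": {4, 7, 8, 21, 1003},    # no keyword (1016)
}


def common_use_attributes(support: Dict[str, Set[int]]) -> Set[int]:
    """Return the Use attributes that every target in the table supports."""
    if not support:
        return set()
    return set.intersection(*support.values())


print(sorted(common_use_attributes(SUPPORTED_USE)))
```

With this table, only title, ISBN, ISSN and author are usable across all three targets, so a search form offering subject or keyword would return incomplete results or errors from at least one of them.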

Early experiences with the Telnet version of IRIS have already given the board practical experience of differences in attribute support. In addition to the attributes, there were also significant differences in how systems supported more advanced retrieval tools such as boolean logic. For example, one of our targets supported the boolean AND operator only for combined author plus title queries, with others providing full boolean AND, OR and NOT options between all the main retrieval fields. Variations in support for truncation also made broadcast searching across targets more difficult. This lack of consistency between systems is a problem of technical interoperability.

Attempts have been made in the recent past to deal with some of the interoperability problems by defining what are known as profiles, i.e. subsets of the Z39.50 protocol, which must be supported to ensure consistent retrievals. Relatively well known profiles include ATS-1, MODELS and ONE. However, these profiles were themselves multiplying, and in some ways merely adding to the existing complexity.

To deal with technical problems such as these, and other possible interoperability problems, UKOLN set up the post of Interoperability officer, which was taken up by Dr Paul Miller. This appointment, together with initiatives from experts in the field such as Prof. William E Moen's work on the Z Texas project, front line implementors such as the Canadian VUC team, and other experts, led to a meeting in Bath in 1999, the outcome of which was the documentation of the Bath Profile. This profile was ratified by ISO earlier this year.

Quoting directly from the website, maintained by the National Library of Canada at http://www.nlc-bnc.ca/bath/: "The Bath Profile is an international Z39.50 specification supporting library applications and resource discovery. It describes and specifies a subset of ANSI/NISO Z39.50-1995, Information Retrieval (Z39.50): Application Service Definition and Protocol Specification (ISO 23950). The Profile defines searching across multiple servers to improve international and extranational search and retrieval among library catalogues, union catalogues, and other electronic resources worldwide. The Profile also describes and specifies a subset to allow basic cross-domain search and retrieval of networked resources including library catalogues, government information, museum systems, and archives." There is also a good non-technical article on the profile by Carrol Lunau of the NLC, entitled "The Bath Profile: what is it, and why should I care", at http://www.nlc-bnc.ca/bath/prof.pdf.

Local library system Z target indexing - need for a common standard of indexing?

The other main area of difficulty, in terms of computerised information retrieval, is the lack of a common indexing policy. Local library system administrators naturally set up the Z39.50 indexes for local OPAC functionality. Karen Coyle of the California Digital Library, in an article entitled "The Virtual Union Catalog: A Comparative Study" in the March 2000 issue of D-Lib Magazine (available at http://www.dlib.org/dlib/march00/coyle/03coyle.html), noted of their tests "that differences occur not only between library system "brands" but also within different installations of the same vendor system due to configuration choices made by the libraries". She also noted that "it was actually more difficult finding indexes common to all of the participating systems than we had anticipated."

This was also the case with IRIS OPAC, and is common to most, if not all, distributed library systems. She also comments: "the first step in creating a virtual union catalog is to create compatible local catalogs that are designed to support the virtual environment. It appears that a common use of Z39.50 in libraries today is not a distribution of our catalogs but a kind of harvesting in disparate databases. While this is an obvious statement of fact, we still seem to harbor a somewhat illogical hope that this harvesting will inexplicably yield consistent and accurate results".

While a certain amount can be done by testing queries across multiple library catalogue targets, then refining the client end interface and restricting users to specific input that works across all databases, a common standard of indexing is much more desirable. Within IRIS OPAC, differences in indexing manifested themselves most widely in the variations in author formats and in the indexing of control numbers such as ISBN and ISSN. There is an initiative by the Texas State Library and Archives Commission virtual union catalog (VUC) project to address this level of interoperability. The TZIG has drafted a set of indexing recommendations, "Recommendations for Indexing MARC 21 Records to support Z Texas and Bath Profile Bibliographic Searches (Functional Area A, Levels 0 & 1)", available at: http://www.unt.edu/wmoen/Z3950/MARC21Indexing/Z3950MARCIndexing.htm. This comprehensive document could serve as a useful model for some form of standardised indexing for distributed search systems such as IRIS OPAC.
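As a small illustration of restricting user input to a form that works across differently indexed targets, the following Python sketch reduces an ISBN as typed or as catalogued to a bare string of characters. It rests on an assumption: that the targets index ISBNs without hyphens or spaces, which would need to be confirmed for each system. The function name normalise_isbn is hypothetical, and the check digit is not validated.

```python
import re


def normalise_isbn(raw: str) -> str:
    """Strip hyphens, spaces and trailing qualifiers; upper-case an X check character."""
    # Take the leading run of digits, X, hyphens and spaces, ignoring
    # anything after it such as a binding qualifier in brackets.
    match = re.match(r"[\dXx\- ]+", raw.strip())
    if match is None:
        return ""
    return re.sub(r"[\- ]", "", match.group(0)).upper()


for example in ["0-19-852663-6", "019852663x (pbk.)", " 0 19 852663 6 "]:
    print(repr(example), "->", normalise_isbn(example))
```

Applying the same normalisation on the client side before a query is sent reduces the number of misses caused purely by differences in punctuation, although it cannot compensate for targets that index the field differently.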

It may be possible to influence local support within the IRIS member libraries for specific Bib-1 attributes and attribute combinations, and maybe to draft national or regional indexing specifications. It is less easy to influence the indexing or Bib-1 support of collections which may be accessible via a Z client, but may originate in a different country, or outside of local consortial arrangements. IRIS OPAC, for example, also links to the collections of the British Library, the UK based COPAC and the Library of Congress, as these are useful resources for local interlibrary loan staff and general bibliographic research. It is hoped that the Bath and TZIG initiatives will be adopted internationally, so that broadcast searching will become more accurate, and that Z39.50 client software generally can offer higher level retrieval functionality, for whatever grouping of catalogues and databases users would like to search.

Copyright © Margaret Merrick, IRIS Manager, November 2000