Record retrieval de-duplication

 

Introduction

This page describes the new de-duplication features of the Z39.50 target used for RLUK Record Retrieval. The de-duplication is still under development and the features detailed here are liable to change without notice.

The de-duplication is performed "on-the-fly" by the Z39.50 target and not when the records are loaded into the database.

 

How to access the Z39.50 target

The access details for the Z39.50 target are as follows:

    Machine: rluk.mimas.ac.uk
    Port: 2102
    Database: dedup-marc8
    Database: dedup-utf8

The "dedup-marc8" database returns records in the MARC21 MARC-8 character set, whilst "dedup-utf8" returns records in Unicode UTF-8.

You will need to provide a valid RLUK username and password.

The same database can be accessed in its normal non-de-duplicated state by using the database name "marc8" (or "utf8" if you want the records in Unicode.)

Further details on using the RLUK Z39.50 target are available.

 

Some aspects of the de-duplication process can be customised by the RLUK Member Institutions as detailed in the notes below. These customisations are made through the RLUK De-Duplication Service configuration page and apply to all users from an institution. A username and password are required to use the configuration page; please contact Ashley Sanders at the email address below if you wish to use this feature.

If you have any problems accessing the target please contact Ashley Sanders at a.sanders@manchester.ac.uk.

De-duplicated Result Sets

By default, all result sets of under 1000 records will be de-duplicated. Larger result sets are returned without duplicate removal. This limit can be changed through the configuration page.

Records that have been clustered together by the de-duplicating algorithm will be found at the "top" of the result set. Otherwise the records in a de-duplicated result set are ordered by their weight. Those records with the highest weight appear first in the result set. De-duplicated result sets cannot be sorted on any other field by the target (though your client may well be able to sort them itself.)

Record weights and the Primary Cluster Member

The record used as the Primary Cluster Member (or base record) is the record with the highest weight and will be first in the list of 952 tags added to the record. This record is the one you get to see when a set of records is identified as being duplicates.

The weight assigned to a record is generally the same as its number of tags. So if a record has 15 tags it gets assigned a weight of 15, i.e. each tag has a weight of one. You can give additional weight to records from particular libraries or to records with Library of Congress subject headings. These additional weights are specified through the configuration page.

The 035, 040, 090, 852 and 866 tags are usually given a weight of zero. The exception is when an 035 or 040 matches one of your preferred libraries and then that tag is given the weight you have specified through the configuration page.
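The weighting rules above can be sketched in a few lines of Python. This is only an illustration, not the target's actual code: the (tag, value) representation of a record, the `preferred` mapping and the idea that a field "matches" a library when it contains the library's code are all assumptions, and the extra weight for Library of Congress subject headings is left out.

```python
ZERO_WEIGHT_TAGS = {"035", "040", "090", "852", "866"}


def record_weight(fields, preferred=None):
    """Return the weight of a record.

    fields:    list of (tag, value) pairs, one per MARC field.
    preferred: hypothetical mapping of library code -> configured weight.
    """
    preferred = preferred or {}
    weight = 0
    for tag, value in fields:
        if tag not in ZERO_WEIGHT_TAGS:
            weight += 1  # every other tag counts as one
        elif tag in ("035", "040"):
            # Assumption: a field "matches" a preferred library when its
            # value contains that library's code.
            weight += max(
                (w for code, w in preferred.items() if code in value),
                default=0,
            )
    return weight
```

With this sketch, a record with a 245, a 100, an 040 and an 852 has a weight of 2, rising to 7 if the 040 matches a preferred library configured with a weight of 5.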

The de-duplication tag

By default a 952 tag is added to each record. The tag has the following subfields to make it compatible with the RLUK RLIN21 interface.

  • $a — RLUK record number.
  • $b — Three letter RLUK library code.
  • $c — Contains "scripts" if the record has 880 tags.
  • $d — Contains "URI" if the record has an 856 tag.
  • $h — Record level
  • $i — Contains LCC if the record has a 050 tag.
  • $j — Contains NLM if the record has a 060 tag.
  • $k — Contains DDC if the record has a 082 tag.
  • $n — MARC Organisation code.
  • $z — Record weight.

The name of this tag can be changed through the configuration page. However, for the tag to be recognised by the RLUK RLIN21 Interface it must be named "952".

If a record in a result set represents a cluster of records then there will be a de-duplication tag for each record in the cluster. This is the only way of knowing which records have been brought together by the de-duplicating algorithm.
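A client could list the members of a cluster by collecting the de-duplication tags from a returned record. The sketch below assumes the record has already been parsed into (tag, subfields) pairs by whatever MARC library your client uses; that representation and the function name are illustrative only.

```python
def cluster_members(fields, dedup_tag="952"):
    """Return (record number, library code, weight) for each cluster member.

    fields: list of (tag, subfields) pairs, where subfields is a dict
            keyed by subfield code (a hypothetical parsed form).
    """
    members = []
    for tag, subfields in fields:
        if tag == dedup_tag:
            members.append((
                subfields.get("a"),          # RLUK record number
                subfields.get("b"),          # three letter library code
                int(subfields.get("z", 0)),  # record weight
            ))
    return members
```

Pass a different `dedup_tag` if your institution has renamed the tag through the configuration page.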

Initial clustering

The following fields are used to form the initial clusters:

  • 010 $a
  • 020 $a $z
  • 021 $a $z
  • 022 $a $z
  • 245 $a $b

The LC number, ISBN and ISSN are all normalized and a 3,2,2,1 key is created from the title. Records are then brought together into initial clusters on these keys. For example, if records A and B match on ISBN and records C and D match on LCCN, and A and D then match on some other key, all four records will be brought together to form the initial cluster.

A record can belong to only one initial cluster.
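The transitive grouping described above behaves like a union-find over shared keys. The following Python sketch illustrates the idea; it is not the target's implementation. In particular, `title_key` assumes that a "3,2,2,1" key takes the first 3, 2, 2 and 1 characters of the first four title words, which this page does not spell out, and the key strings such as "isbn:..." are made up for the example.

```python
from collections import defaultdict


def title_key(title):
    # Assumption: take 3, 2, 2 and 1 characters from the first four words.
    words = title.upper().split()
    return "".join(word[:n] for word, n in zip(words, (3, 2, 2, 1)))


def initial_clusters(records):
    """records: {record_id: set of match keys}. Returns a list of id sets."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # key -> first record that carried it
    for rid, keys in records.items():
        for key in keys:
            if key in first_seen:
                parent[find(rid)] = find(first_seen[key])  # merge clusters
            else:
                first_seen[key] = rid

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return list(clusters.values())
```

Given A and B sharing an ISBN, C and D sharing an LCCN, and A and D sharing a title key, the sketch returns a single cluster containing all four records.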

Final Matching

Once we have the initial clusters, further fields are compared to create the final clusters. First, the records in a cluster are sorted by the number of tags in each record (largest first). The second record in the cluster is then compared with the first (using the fields shown below) and if all the comparisons are positive it remains clustered with the first record. If any comparison fails it becomes the primary cluster member (PCM) of a second cluster.

The third record is then compared to the first. If the comparisons are all positive it remains clustered with the first record. If it fails then it is compared against the PCM in the second cluster. If these comparisons are all positive it remains with this cluster, otherwise it becomes the PCM of a third cluster.

Likewise with the fourth and subsequent records in the initial cluster. A record will only ever be allocated to one final cluster.
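The procedure amounts to a greedy pass in which each record is compared only against the PCM of each existing cluster, in order. A minimal Python sketch, in which `tag_count` and `matches` are caller-supplied stand-ins for the tag count and for the full set of field comparisons:

```python
def final_clusters(initial_cluster, tag_count, matches):
    """Split one initial cluster into final clusters.

    tag_count: function returning the number of tags in a record.
    matches:   function(pcm, record) -> True if all comparisons pass.
    Each returned cluster is a list whose first element is its PCM.
    """
    ordered = sorted(initial_cluster, key=tag_count, reverse=True)
    clusters = []
    for record in ordered:
        for cluster in clusters:
            if matches(cluster[0], record):  # compare with the PCM only
                cluster.append(record)
                break
        else:
            # No existing PCM matched: the record starts a new cluster.
            clusters.append([record])
    return clusters
```

Because a record stops at the first PCM it matches, it can only ever land in one final cluster.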

Fields used in the final comparisons

The final matching process uses the following fields:

  • Leader chars 6, 7
  • Dates from 008, 260 $c
  • 245 $a $b $n $p
  • 300 $a
  • 250 $a
  • 100 $a
  • 110 $a $b $d
  • 111 $a $b $e
  • 130 $a
  • 260 $a $b

All fields are normalized before being used; punctuation is removed, all runs of whitespace are converted to a single space and all letters are converted to upper case.
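As a rough illustration, the normalization could look like this in Python. The sketch assumes that punctuation is deleted rather than replaced by a space, that only ASCII punctuation is involved and that leading and trailing spaces are trimmed; none of these details are stated on this page.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text):
    without_punctuation = text.translate(_DELETE_PUNCTUATION)
    single_spaced = re.sub(r"\s+", " ", without_punctuation).strip()
    return single_spaced.upper()
```

For example, `normalize("The  cat's\tpyjamas: a tale.")` gives "THE CATS PYJAMAS A TALE".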

Leader chars 6 and 7

Leader characters 6 and 7 must be the same for both records or the match fails.

Comparing dates

Dates are not checked for Serials.

If either record has a pre-1800 date, the match fails. Otherwise, ignoring dates of 0000 and 9999, the records are considered a match if they have at least one date in common. If no valid dates can be found in a record, the match fails.
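One possible reading of the date rules is sketched below in Python. The order of the checks is an assumption: here 0000 and 9999 are discarded before the pre-1800 test, so a placeholder 0000 does not count as an early date. The dates are taken to be integers already extracted from the 008 and 260 $c.

```python
IGNORED_DATES = {0, 9999}


def dates_match(dates_a, dates_b, is_serial=False):
    if is_serial:
        return True  # dates are not checked for serials
    valid_a = set(dates_a) - IGNORED_DATES
    valid_b = set(dates_b) - IGNORED_DATES
    if not valid_a or not valid_b:
        return False  # a record with no valid dates fails
    if any(year < 1800 for year in valid_a | valid_b):
        return False  # pre-1800 material is never matched
    return bool(valid_a & valid_b)  # at least one date in common
```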

Comparing titles

The titles are compared (after normalization) by calculating how similar the titles are on a scale of 0 to 1. If the similarity is less than 0.95 then the match fails.

The similarity is calculated from the Edit (or Levenshtein) Distance between the two titles. If the Edit Distance is d and the lengths of the two titles are m and n characters, then the similarity is simply: 1 - (d / max(m, n)).

The Edit Distance is simply the number of changes, insertions or deletions needed to turn one string into the other.

Setting the similarity threshold at 0.95 effectively allows one insertion, deletion or change of character for every twenty characters in the title.
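The calculation can be sketched in Python as follows, using the standard dynamic programming form of the Edit Distance. The function names are illustrative, and treating two empty titles as identical is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # change
            ))
        previous = current
    return previous[-1]


def title_similarity(a, b):
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty titles count as identical
    return 1 - edit_distance(a, b) / longest


def titles_match(a, b, threshold=0.95):
    return title_similarity(a, b) >= threshold
```

For example, "KITTEN" and "SITTING" have an Edit Distance of 3 and a maximum length of 7, giving a similarity of 1 - 3/7, or about 0.57, which is well below the threshold.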

Physical description

If neither record has a 300 $a then it is considered a match. Otherwise the largest number is extracted from each field and the two numbers compared. If no digits are found then the whole field is used in the comparison.
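A minimal Python sketch of this comparison follows. It assumes that the two values are compared for equality and that a record with a 300 $a never matches one without; this page does not state either point. A missing field is represented by `None`.

```python
import re


def extent_key(field):
    """Largest number in the field, or the whole field if it has no digits."""
    numbers = [int(n) for n in re.findall(r"\d+", field)]
    return max(numbers) if numbers else field


def physical_match(a, b):
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False  # assumption: present versus absent does not match
    return extent_key(a) == extent_key(b)
```

So "XII 340 P" and "340 PAGES" match on 340, while "1 V" and "2 V" do not.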

Edition

If neither record has a 250 $a then it is considered a match. Otherwise a number is looked for at the start of the $a and that number is used in the comparison. If no number is found then the whole field is used.
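The edition rule can be sketched in the same way. As before, comparison by equality and the handling of a record that lacks a 250 $a are assumptions, and `None` stands for a missing field.

```python
import re


def edition_key(field):
    """Leading number of the edition statement, or the whole field."""
    found = re.match(r"\s*(\d+)", field)
    return int(found.group(1)) if found else field


def edition_match(a, b):
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False  # assumption: present versus absent does not match
    return edition_key(a) == edition_key(b)
```

Under this sketch "2ND ED" and "2 AUFL" are treated as the same edition, whereas "REV ED 2" falls back to a whole-field comparison because the number is not at the start.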

Authors (main entry)

The 1XX fields are only compared if the title of either work consists entirely of the following "generic" title words:

ABSTRACTS, ALUMNI, ANNUAL, ANNUEL, ANUARIO, BERICHT, BIENNIAL, BOLETIN, BOOK, BRIEFE, BULLETIN, CATALOG, CATALOGUE, CHETYREKH, CIRCULAR, COLLECTED, COLLECTION, COMMITTEE, COMPLETAS, COMPLETE, COMPLETES, CONFERENCE, DESIATI, DEVIATI, DIGEST, DIRECTORY, DVENADTSATI, DVUKH, ERKERI, FINAL, HANDBOOK, HATOROV, IEEE, INFORME, INTERIM, IZBRANNYE, JAARVERSLAG, JAHRESBERICHT, JOURNAL, MEETING, MEMBERSHIP, MEMORIA, MENSUEL, MITTEILUNGEN, MONTHLY, NATIONAL, NEWS, NEWSLETTER, NOTES, OBRAS, OCCASIONAL, ODNOM, OEUVRES, PAPER, PAPERS, PIATI, PISEM, PISMA, PLAYS, POEMS, POETRY, POLNOE, POVESTI, PROCEEDINGS, PROGRAM, PROGRESS, PROIZVEDENIA, PROIZVEDENIE, PROIZVEDENIIA, PUBLICATION, PUBLICATIONS, QUARTERLY, RAKSTI, RAPPORT, RASSKAZY, RECORD, REPORT, REPORTS, REPRINTS, RESEARCH, REVIEW, REVISTA, SEJUMOS, SELECTED, SELECTIONS, SERIES, SHESTI, SHORT, SOBRANIE, SOCHINENII, SOCHINENIIA, SPECIAL, STORIES, STUDIES, STUDY, SYMPOSIUM, TECHNICAL, TOMAKH, TOME, TRANSACTIONS, TREKH, TRUDY, VEROEFFENTLICHUNGEN, VEROFFENTLICHUNGEN, VERSE, VOSMI, WERKE, WORKS, WORKSHOP, YEAR, YEARBOOK

Words from the following list are dropped from the 110, 111 and 130 fields before the comparison is made. However, should the field be made up entirely of words from the following list, then the whole field is used unaltered.

ADMINISTRATION, AGRICULTURAL, AGRICULTURE, AMERICAN, ARCHIVES, ART, ASSOCIATION, BOARD, BRITAIN, BUREAU, CALIFORNIA, CANADA, CENTER, COLLEGE, COMITE, COMMISSION, COMMITTEE, COMMUNICATIONS, COMPANY, CONFERENCE, CONGRESS, CONSEIL, COUNCIL, DEPARTMENT, DEPT, DEVELOPMENT, DIVISION, ECONOMIC, EDUCATION, ENGLISH, EXPOSITION, FOER, FRANCE, FUR, GENERAL, GREAT, HEALTH, HISTORY, HOUSE, IEEE, INC, INDIA, INFORMATION, INSTITUT, INSTITUTE, INSTITUTIONEN, INSTITUUT, INTERNATIONAL, ISRAEL, LAW, LIBRARY, MEETING, MUSEUM, NACIONAL, NATIONAL, NEW, OFFICE, POUR, PROVINCE, PUBLIC, QUEBEC, RESEARCH, SCHOOL, SCIENCE, SENATE, SERVICE, SERVICES, SOCIEDAD, SOCIETE, SOCIETY, SOUTH, STATE, STATES, STATISTICS, SYMPOSIUM, UNION, UNITED, UNIVERSITE, UNIVERSITET, UNIVERSITY, VOOR, WORKSHOP, YORK

Once the stop words have been removed the fields are compared to see how many words they have in common. If n is the number of common words and p and q are the number of words in the two author strings, then it is considered a match if n >= min(3, p, q)

If either record has no authors then the match fails.
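The author test can be sketched in Python as below. The two word sets are short excerpts of the lists above, chosen only for illustration, and the inputs are assumed to be already normalized strings. Counting common words with a set intersection is also an assumption about how "words in common" is measured.

```python
# Excerpts only; the full lists are given above.
GENERIC_TITLE_WORDS = {"ANNUAL", "COLLECTED", "PROCEEDINGS", "REPORT", "WORKS"}
CORPORATE_STOP_WORDS = {"ASSOCIATION", "COMMITTEE", "INSTITUTE",
                        "NATIONAL", "SOCIETY", "UNIVERSITY"}


def is_generic_title(title):
    words = title.split()
    return bool(words) and all(w in GENERIC_TITLE_WORDS for w in words)


def strip_stop_words(words, stop_words):
    kept = [w for w in words if w not in stop_words]
    return kept if kept else list(words)  # all stop words: keep unaltered


def authors_match(a, b, drop_stop_words=True):
    """drop_stop_words applies to 110, 111 and 130; pass False for a 100."""
    words_a, words_b = a.split(), b.split()
    if not words_a or not words_b:
        return False  # a record with no author fails
    if drop_stop_words:
        words_a = strip_stop_words(words_a, CORPORATE_STOP_WORDS)
        words_b = strip_stop_words(words_b, CORPORATE_STOP_WORDS)
    common = len(set(words_a) & set(words_b))
    return common >= min(3, len(words_a), len(words_b))
```

For example, "ROYAL SOCIETY OF CHEMISTRY" and "ROYAL SOCIETY CHEMISTRY" reduce to three and two words with two in common, and since min(3, 3, 2) is 2 they match.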

Publisher

If either record has no 260 then the match is failed.

The following words are removed from the 260 $b prior to comparison.

A, AB, AN, AND, ASSOCIATION, B, BOOKS, BROS, BROTHERS, BUREAU, BY, C, CENTER, CHULPAN, CHULPANBU, CHULPANGUK, CIE, CO, COMMITTEE, COMPANY, COUNCIL, D, DAIGAKU, DE, DEPT, DEPARTMENT, DIST, DISTRIBUTED, DOCS, E, EDICION, EDICIONES, EDITION, EDITIONS, ET, ETC, EV, F, FOR, FR, FRERES, FROM, G, GEBR, GEBRUEDER, GMBH, GOVT, H, HER, HIS, I, IMPR, IMPRIMERIE, IN, INC, IZD, INCORPORATED, INSTITUTE, INTERNATIONAL, J, K, KENKYU, KENKYUJO, KENKYUKAI, KOKURITSU, KOM, KYOKAI, L, LA, LAS, LIMITED, LTD, M, MADE, N, NATIONAL, NEW, O, OF, OFF, OFFICE, P, PR, PRESS, PRINT, PRINTED, PRINTER, PUB, PUBL, PUBLICATION, PUBLICATIONS, PUBLISHED, PUBLISHER, Q, R, RECORDS, RELEASED, S, SA, SALE, SENTA, SHUPPAN, SHUPPANBU, SHUPPANSHA, SON, SONS, SUPT, T, THE, U, UND, UNIV, UNIVERSITY, US, V, VEB, VERLAG, VO, W, X, Y, YONGUSIL, YONGUSO, YONGUWON, Z

All text up to and including the first comma or colon is removed from the $a.

If both records have a 260 $b then only the $b is used in the comparison (except when the record is a serial, then both the $a and $b are used.) Otherwise the $a and $b are used in the comparison.

Once the stop words have been removed the fields are compared to see how many words they have in common. If n is the number of common words and p and q are the number of words in the two publisher strings, then it is considered a match if n >= min(2, p, q).
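Putting the publisher rules together gives something like the Python sketch below. The stop word set is a short excerpt of the list above. Each record's 260 is passed as a (place, name) pair for $a and $b, or `None` if the record has no 260; the place is assumed to still contain its comma or colon when it is passed in, and the threshold formula is applied literally even when no words remain.

```python
import re

# Excerpt only; the full list is given above.
PUBLISHER_STOP_WORDS = {"AND", "BOOKS", "CO", "COMPANY", "INC", "LTD", "OF",
                        "PRESS", "THE", "UNIV", "UNIVERSITY", "VERLAG"}


def place_words(place):
    # Drop everything up to and including the first comma or colon.
    found = re.search(r"[,:]", place)
    return (place[found.end():] if found else place).split()


def publisher_words(place, name, use_place):
    words = [w for w in (name or "").split() if w not in PUBLISHER_STOP_WORDS]
    if use_place and place:
        words = place_words(place) + words
    return words


def publishers_match(a, b, is_serial=False):
    if a is None or b is None:
        return False  # either record has no 260
    both_have_name = bool(a[1]) and bool(b[1])
    use_place = is_serial or not both_have_name
    words_a = publisher_words(a[0], a[1], use_place)
    words_b = publisher_words(b[0], b[1], use_place)
    common = len(set(words_a) & set(words_b))
    return common >= min(2, len(words_a), len(words_b))
```

With this sketch, "OXFORD UNIVERSITY PRESS" and "OXFORD UNIV PRESS" both reduce to the single word OXFORD and therefore match, regardless of the place of publication.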