New Futures for Authority Data

23 October 2008

These are notes from a talk by Thom Hickey from OCLC at the Libraries Australia Forum 2008.

Thom Hickey
What is the future of authorities? They need to have wider and wider coverage, for instance names used in journal articles, with machine-generated and machine-assisted input into the authority files. It’s not just librarians who are creating this data: researchers and genealogists are adding tags to records and creating relationships between records.

APIs will be used to provide multiple views, better context, better navigation and more mashups.

Hardware has become interesting in the past couple of years with developments like the introduction of cheap Linux clusters (eg: 130 CPU clusters), where all of WorldCat can be stored in memory and things can be done that couldn’t be done when it had to reside on disk. This gives about a 100-fold speed increase (eg: what used to take a month now takes 8 hours).

Controlling names in WorldCat

Matching names was done manually, and they wanted to use standard identifiers. Every 3 months they would go through and try to match the names to the authorities (the German and French authority files were also used). So far only 1% of names have been linked to the authority file. WorldCat Identities: aggressive linking.

Their cataloguing client just uses HTTP, so Thom has written a program that acts like a client: it logs in to find a record, verifies the record hasn’t changed, inserts the link into the record and adds it back into the database. At 2 seconds per update over ~26 million records, it would take nearly 2 years to do it all. How do you do it quicker? You throw extra computing power at it and run 40 concurrent clients. Of course this ties up resources on the servers, so you do it overnight during the servers’ typical down time. With 40+ clients updating 20 records/second, it will then take 2-3 months to update the links.
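A quick check of the arithmetic above, as a rough sketch. The record count and the rates are the figures quoted in the talk; the 4 to 6 hour nightly window is an assumption (the talk only said "overnight"), picked because it roughly reproduces the 2-3 month estimate.

```python
RECORDS = 26_000_000        # approximate number of records to update (from the talk)
SECONDS_PER_UPDATE = 2      # a single client doing one update at a time (from the talk)
CLUSTER_RATE = 20           # records/second with 40+ concurrent clients (from the talk)

SECONDS_PER_DAY = 24 * 60 * 60


def days_single_client(records, seconds_per_update):
    """Days needed if one client runs around the clock."""
    return records * seconds_per_update / SECONDS_PER_DAY


def nights_needed(records, records_per_second, window_hours):
    """Nights needed if updates only run in a nightly window.

    window_hours is an assumed figure, not one given in the talk.
    """
    total_hours = records / records_per_second / 3600
    return total_hours / window_hours


serial_days = days_single_client(RECORDS, SECONDS_PER_UPDATE)
print(f"one client, non-stop: {serial_days:.0f} days ({serial_days / 365:.2f} years)")

for hours in (4, 6):  # assumed nightly windows
    nights = nights_needed(RECORDS, CLUSTER_RATE, hours)
    print(f"20 records/s, {hours} h per night: {nights:.0f} nights")
```

Run non-stop, 20 records/second would finish in about 15 days, so the 2-3 month figure only makes sense if the updates are confined to a few hours each night.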

VIAF – Virtual International Authority File

VIAF creates links between authority files from around the world. OCLC is working with the Library of Congress, the BnF and the DNB.

  • 7.5 million person name authority records
  • 25 million bibliographic records
  • 1.2 million links between files

Matching names and dates makes it relatively easy. Combining authority records with their associated bibliographic records brings in a lot more information and makes for a much richer data source to match against. Match against:

  • names and dates in headings
  • standard numbers
  • titles
  • co-authors
  • publishers
  • personal names as subjects

Most names appear in only 1 file. Some names are identical across all three files. A common case is where the same person has different names in different files. The trickiest case is where 1 file has 2 names that are the same but belong to different people.

2 issues with matching and merging:

  • Pyotr Ilyich Tchaikovsky and Peter Tschaikowski – same person different names
  • Marcel Fournier and Marcel Fournier – same name different person
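The two cases above can be illustrated with a toy matcher. This is only a sketch of the general idea of combining several kinds of evidence, not VIAF's actual algorithm: the `AuthorityRecord` class, the `evidence` and `same_person` functions and the thresholds are all hypothetical, titles are assumed to be already normalised to a common form, and the Fournier dates and titles are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRecord:
    source: str                      # which authority file the record came from
    name: str
    birth: Optional[int] = None
    death: Optional[int] = None
    titles: frozenset = frozenset()  # normalised titles from linked bib records


def evidence(a, b):
    """Return the kinds of evidence that two records share."""
    found = set()
    if a.name.casefold() == b.name.casefold():
        found.add("name")
    if a.birth is not None and a.birth == b.birth and a.death == b.death:
        found.add("dates")
    if a.titles & b.titles:
        found.add("title")
    return found


def same_person(a, b):
    """Hypothetical rule: a name alone is never enough to merge."""
    if a.birth is not None and b.birth is not None and a.birth != b.birth:
        return False                 # conflicting dates block the match
    ev = evidence(a, b)
    supporting = ev - {"name"}
    needed = 1 if "name" in ev else 2
    return len(supporting) >= needed


# Same person, different names: dates and a shared title carry the match.
lc = AuthorityRecord("LC", "Tchaikovsky, Pyotr Ilyich", 1840, 1893,
                     frozenset({"swan lake", "eugene onegin"}))
dnb = AuthorityRecord("DNB", "Tschaikowski, Peter", 1840, 1893,
                      frozenset({"swan lake"}))

# Same name, different people (dates and titles are invented for the example).
f1 = AuthorityRecord("BnF", "Fournier, Marcel", 1901, None, frozenset({"title a"}))
f2 = AuthorityRecord("BnF", "Fournier, Marcel", 1950, None, frozenset({"title b"}))

print(same_person(lc, dnb))  # True
print(same_person(f1, f2))   # False
```

The point of the sketch is that the name is just one signal among several: the Tchaikovsky records match without a name match, while the Fournier records fail despite one.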

When they are merging things, they can create some fairly large records. Standard MARC records are limited to just under 100,000 bytes, so they are doing it in MARC XML, where files can be megabytes (eg: Shakespeare).

How do you work out which is which? What makes a match? Wouldn’t it be nice to use any of these and get the correct match?

  • 1,300,000 titles
  • 526,000 double
  • 67,000 joint authors
  • 47,000 LCCN
  • 15,000 partial date, partial title
  • 6,000 partial date and publisher
  • 4,600 partial title and publisher
  • 4,100 name as subject
  • 2,100 standard number

What are the next steps?

  • Merged display
  • Better documentation
  • More participants
  • Geographical information

OCLC is collecting all the records.

WorldCat Identities

They have made a page for every name, organisation or corporation. The pages are ranked by holdings, so a lot of famous musicians and authors are listed. The most common are dead men.

They were trying to build a wiki page for every name in VIAF. It didn’t happen, and it has become an individual summary page in WorldCat Identities.

People typically come into WorldCat Identities from somewhere else, such as Google or WorldCat. As you navigate through WorldCat, there is an “about the author” link that sends the user off to the WorldCat Identities page. On the Identities page people like the graphical publishing timelines, showing articles the person has published, works by them and works about them. They are trying to do interesting things with the records by introducing visual cues into the records: the larger the font, the more holdings an institution has (similar to a tag cloud but used as a list).
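The font-size cue can be sketched as a simple mapping from holdings counts to sizes. The talk did not say how the sizes are actually calculated; the logarithmic scaling, the 10 to 24 pixel range and the `font_sizes` function below are assumptions made for illustration, and the holdings figures are invented.

```python
import math


def font_sizes(holdings, min_px=10.0, max_px=24.0):
    """Map each entry's holdings count to a font size (log-scaled).

    holdings: dict of label -> positive holdings count.
    """
    logs = {label: math.log(count) for label, count in holdings.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    if span == 0:
        # All entries have equal holdings: use the middle of the range.
        return {label: (min_px + max_px) / 2 for label in holdings}
    return {
        label: min_px + (value - lo) / span * (max_px - min_px)
        for label, value in logs.items()
    }


# Invented holdings counts, purely for illustration.
sizes = font_sizes({"Library A": 12, "Library B": 340, "Library C": 5200})
for label, px in sizes.items():
    print(f"{label}: {px:.1f}px")
```

A log scale is a common choice for tag-cloud style displays because holdings counts tend to vary over several orders of magnitude.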

What’s to stop abuse, such as people merging things they shouldn’t? The system keeps a record of what has been done and has the ability to roll back. It’s why wikis work so well. People have the freedom to do things, but not to totally ruin the system. The system has been built so that there are unique URLs for each person. If records are merged, the old URLs still get you to the new URL.
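One way to get both properties (old URLs that keep working after a merge, and merges that can be undone) is a redirect table plus a merge log. The sketch below is a guess at the general shape, not a description of how WorldCat Identities is implemented; the `IdentityStore` class and the identifiers are hypothetical.

```python
class IdentityStore:
    """Toy store: merges are recorded as redirects and can be rolled back."""

    def __init__(self):
        self.redirects = {}   # merged-away id -> id it was merged into
        self.history = []     # log of merges, newest last

    def resolve(self, identity_id):
        """Follow redirects until reaching an id that has not been merged away."""
        while identity_id in self.redirects:
            identity_id = self.redirects[identity_id]
        return identity_id

    def merge(self, old_id, new_id):
        """Merge old_id into new_id; old_id keeps working as a redirect."""
        src, dst = self.resolve(old_id), self.resolve(new_id)
        if src == dst:
            raise ValueError("records are already merged")
        self.redirects[src] = dst
        self.history.append((src, dst))

    def rollback(self):
        """Undo the most recent merge and return the id that was restored."""
        src, _ = self.history.pop()
        del self.redirects[src]
        return src


store = IdentityStore()
store.merge("identities/a", "identities/b")
store.merge("identities/b", "identities/c")
print(store.resolve("identities/a"))  # identities/c

store.rollback()                      # undo the second merge
print(store.resolve("identities/a"))  # identities/b
```

Because merges never delete an identifier, a mistaken merge costs nothing permanent: rolling back simply removes the redirect.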

The service is about getting the pages used.  They’ve tried to build it so that there are a variety of discovery methods.

  • Searchable via SRU and OpenURL. The searches from WorldCat into Identities are via OpenURL
  • Sitemaps for harvesters from Google, Yahoo etc
  • HTML for harvesters to follow links and so it can be used by mobile devices
  • Links to articles in Wikipedia. They link to Wikipedia, but they would like Wikipedia to link to them, which is a bit tricky.
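For readers who haven’t used SRU, a search is just an HTTP GET with a handful of standard parameters. The sketch below builds such a URL; `operation`, `version`, `query` and `maximumRecords` are standard SRU 1.1 parameters, but the base URL is a placeholder and the `sru_search_url` helper is hypothetical, not the actual WorldCat Identities endpoint.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,              # CQL; a bare quoted term lets the server choose the index
        "maximumRecords": str(max_records),
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
url = sru_search_url("https://example.org/identities/search", '"tchaikovsky"', 5)
print(url)
```

The response to a request like this is an XML document containing the matching records, which is what makes the service easy to reuse in mashups.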

