Archive for 2008

New Futures for Authority Data

Thursday, October 23rd, 2008

These are notes from a talk by Thom Hickey from OCLC at the Libraries Australia Forum 2008.

Thom Hickey
What is the future of authorities? They need wider and wider coverage – for instance, names used in journal articles, plus machine-generated and machine-assisted input into the authority files. It's not just librarians creating this data; researchers and genealogists are adding tags to records and creating relationships between records.

APIs will be used to provide multiple views, better context, better navigation and more mashups.

Hardware has become interesting in the past couple of years with developments like cheap Linux clusters (eg: 130-CPU clusters) where all of WorldCat can be stored in memory, making possible things that couldn't be done when it had to reside on disk. This gives roughly a 100-fold speed increase (eg: what used to take a month now takes 8 hours).

Controlling names in WorldCat

Matching names was done manually, and they wanted to use standard identifiers. Every 3 months they would go through and try to match names to the authorities (also using the German and French authority files). So far only 1% of names have been linked to an authority file. WorldCat Identities takes a more aggressive approach to linking.

Their cataloguing client just uses HTTP, so Thom has written a program that acts like a client: it logs in to find a record, verifies the record hasn't changed, inserts the link into the record and adds it back into the database.  At 2 seconds per update over ~26 million records, it would take 2 years to do it all.  How do you do it quicker? You throw extra computing power at it and run 40 concurrent clients.  Of course this ties up resources on the servers, so you do it overnight during the typical down time.  With 40+ clients updating 20 records per second it will take 2-3 months to update the links.
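The fetch-verify-update-save loop with 40 concurrent workers can be sketched as follows. The actual OCLC client protocol isn't described in the talk, so an in-memory dictionary with a version number per record stands in for the database here; everything else (names, record format) is illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the cataloguing database: record id -> (version, text).
# The real program spoke HTTP to the cataloguing client; a dict plus a
# lock is just a hypothetical substitute to show the update pattern.
store = {i: (1, "record %d" % i) for i in range(1000)}
lock = threading.Lock()

def add_authority_link(record_id):
    """Fetch a record, verify it hasn't changed, insert the link, save."""
    with lock:
        version, text = store[record_id]          # fetch
    linked = text + " [viaf-link]"                # insert the link
    with lock:
        current_version, _ = store[record_id]
        if current_version != version:            # changed since we read it?
            return False                          # skip rather than clobber
        store[record_id] = (version + 1, linked)  # write back
        return True

# 40 concurrent workers, as in the talk.
with ThreadPoolExecutor(max_workers=40) as pool:
    results = list(pool.map(add_authority_link, list(store)))

print(sum(results), "records updated")
```

The version check before writing is what lets the bot run safely alongside human cataloguers: if someone edited the record between the read and the write, the bot skips it rather than overwriting their work.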

VIAF – Virtual International Authority File

Create links between authority files from around the world.  Working with the Library of Congress, the Bibliothèque nationale de France (BnF) and the Deutsche Nationalbibliothek (DNB).

  • 7.5 million person name authority records
  • 25 million bibliographic records
  • 1.2 million links between files

Matching on names and dates alone is relatively easy.  Bringing in the authority records' associated bibliographic records adds a lot more information and makes a much richer data source to match against. They match against:

  • names and dates in headings
  • standard numbers
  • titles
  • co-authors
  • publishers
  • personal names as subjects

Most names are in only one file.  Some names are the same across all three files.  A common case is where the names are different.  The trickiest case is where one file has two entries with the same name that are different people.

Two issues with matching and merging:

  • Pyotr Ilyich Tchaikovsky and Peter Tschaikowski – same person different names
  • Marcel Fournier and Marcel Fournier – same name different person

When they merge things, they can create some fairly large records. A standard MARC record is limited to just under 100,000 bytes (the record length field in the leader is only 5 digits), so they are doing it in MARC XML, where files can run to megabytes – eg: Shakespeare.
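The size limit comes straight from the binary MARC (ISO 2709) structure: the record length is stored as five zero-padded ASCII digits at the start of the leader, so 99,999 bytes is the largest length the format can express. A minimal sketch (the toy leader below is made up):

```python
# In binary MARC the leader's first five characters hold the record
# length as ASCII digits, so the largest representable record is
# 99,999 bytes - hence the move to MARC XML for huge merged records.

def leader_length_field(record_bytes: bytes) -> int:
    """Read the record length from positions 00-04 of the leader."""
    return int(record_bytes[0:5])

MAX_MARC_RECORD = 99999  # the largest value five digits can hold

# A toy leader for a 1,234-byte record (rest of the leader elided).
leader = b"01234" + b" " * 19
print(leader_length_field(leader))  # 1234
print(MAX_MARC_RECORD)             # 99999
```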

How do you work out which is which?  What makes a match?  Ideally you could use any of these criteria and get the correct match:

  • 1,300,000 titles
  • 526,000 double
  • 67,000 joint authors
  • 47,000 LCCN
  • 15,000 partial date, partial title
  • 6,000 partial date and publisher
  • 4,600 partial title and publisher
  • 4,100 name as subject
  • 2,100 standard number
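One common way to combine evidence like this is a cascade: try the strongest signal first and fall through to weaker ones. The sketch below is in that spirit only – the rule set, record fields and toy records are invented, not VIAF's actual schema or algorithm.

```python
# Cascading match rules: return the first rule two records satisfy.
# Field names and sample data are illustrative, not VIAF's.

def match(a, b):
    """Return the name of the first rule records a and b satisfy, or None."""
    rules = [
        ("title",           lambda: bool(a["titles"] & b["titles"])),
        ("joint author",    lambda: bool(a["coauthors"] & b["coauthors"])),
        ("lccn",            lambda: a.get("lccn") and a["lccn"] == b.get("lccn")),
        ("date+publisher",  lambda: a["dates"] == b["dates"]
                                    and bool(a["publishers"] & b["publishers"])),
        ("name as subject", lambda: a["name"] in b["subjects"]),
    ]
    for name, applies in rules:
        if applies():
            return name
    return None

lc  = {"name": "Tchaikovsky, Peter Ilich", "titles": {"Swan Lake"},
       "coauthors": set(), "lccn": "n00000001", "dates": "1840-1893",
       "publishers": set(), "subjects": set()}
dnb = {"name": "Tschaikowski, Pjotr", "titles": {"Swan Lake"},
       "coauthors": set(), "lccn": None, "dates": "1840-1893",
       "publishers": set(), "subjects": set()}

print(match(lc, dnb))  # title
```

Ordering the rules by reliability matters: a shared title plus matching dates is far stronger evidence than a name string alone, which is exactly why the two names for Tchaikovsky above can still be linked.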

What are the next steps?

  • Merged display
  • Better documentation
  • More participants
  • Geographical information

OCLC is collecting all the records.

WorldCat Identities

For every name – person or corporate body – they have made a page.  Pages are ranked by holdings, so a lot of famous musicians and authors are listed.  The most common are dead men.

They originally tried to build a wiki page for every name in VIAF – that didn't happen, and it has instead become an individual summary page in WorldCat Identities.

People typically come into WorldCat Identities from somewhere else – Google or WorldCat.  As you navigate through WorldCat, an “about the author” link sends the user off to the WorldCat Identities page. On the Identities page people like the graphical publishing timelines, showing articles the person has published, works by them and works about them. They are trying to do interesting things with the records by introducing visual clues – the larger the font, the more holdings an institution has (similar to a tag cloud, but used as a list).
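The font-size trick described above is ordinary tag-cloud scaling applied to a list. A minimal sketch, assuming a simple linear mapping (the pixel range and holdings numbers are made up for illustration):

```python
# Map holdings counts onto a font-size range, tag-cloud style.

def font_sizes(counts, min_px=10, max_px=28):
    """Linearly map each institution's holdings count to a font size."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {name: round(min_px + (n - lo) * (max_px - min_px) / span)
            for name, n in counts.items()}

holdings = {"Library A": 500, "Library B": 50, "Library C": 275}
print(font_sizes(holdings))  # {'Library A': 28, 'Library B': 10, 'Library C': 19}
```

Real tag clouds often use a logarithmic scale instead, since holdings counts tend to be heavily skewed towards a few very popular works.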

What’s to stop abuse – people merging things they shouldn’t?  The system keeps a record of what has been done and has the ability to roll back.  It’s why wikis work so well: people have the freedom to do things, but can’t totally ruin the system.  The system has been built so there are unique URLs for each person, and if records are merged the old URLs still get you to the new URL.
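Keeping old URLs working after merges can be as simple as a table of "old id → id it was merged into", followed until an unmerged id is reached. A sketch of that idea – the ids and the table itself are invented, not OCLC's implementation:

```python
# Map each merged identity to the identity it was merged into.
merged_into = {"id/123": "id/456", "id/456": "id/789"}

def canonical(identity_id):
    """Follow merge redirects until we reach the current record."""
    seen = set()
    while identity_id in merged_into:
        if identity_id in seen:            # guard against a redirect loop
            raise ValueError("merge cycle at " + identity_id)
        seen.add(identity_id)
        identity_id = merged_into[identity_id]
    return identity_id

print(canonical("id/123"))  # id/789
print(canonical("id/999"))  # id/999 (never merged)
```

On the web this would surface as an HTTP redirect from the old URL to the canonical one, so bookmarks and inbound links survive a merge.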

The service is about getting the pages used.  They’ve tried to build it so that there are a variety of discovery methods.

  • Searchable via SRU and OpenURL.  The searches from WorldCat into Identities are via OpenURL
  • Sitemaps for harvesters from Google, Yahoo etc
  • HTML for harvesters to follow links and so it can be used by mobile devices
  • Links to articles in Wikipedia – they link to Wikipedia, but they would like Wikipedia to link to them, which is a bit tricky.

Social networks and the Libraries Australia forum

Friday, October 17th, 2008

On October the 23rd Libraries Australia is holding its annual forum at the Powerhouse Museum.  I’ve decided to use the event as a practical introduction for many National Library staff members to how social networks can be used to enhance an event.

I’m going up there with my laptop and will sit in the audience and blog the event, make some Twitter posts and upload some photos of the event to Flickr.  After the event I’ll attempt to get the speakers to upload their presentations to Slideshare.  All of these resources will be aggregated on a page I’ve made that uses the APIs from Twitter, Flickr, Technorati and Slideshare to display anything that is posted using the tag for the event, laf2008.
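Aggregating everything tagged laf2008 mostly comes down to polling each service's API for that tag. This sketch only builds the request URLs rather than fetching anything – the Flickr method (flickr.photos.search) is real, but the API key is a placeholder and the Twitter search endpoint shown is the 2008-era one, so treat the exact hosts as assumptions:

```python
from urllib.parse import urlencode

TAG = "laf2008"

def flickr_search_url(api_key):
    """Build a Flickr REST call that searches photos by tag."""
    params = {"method": "flickr.photos.search", "api_key": api_key,
              "tags": TAG, "format": "json", "nojsoncallback": 1}
    return "https://api.flickr.com/services/rest/?" + urlencode(params)

def twitter_search_url():
    """Build a (2008-era) Twitter search call for the event hashtag."""
    return "https://search.twitter.com/search.json?" + urlencode({"q": "#" + TAG})

print(flickr_search_url("YOUR_KEY"))
print(twitter_search_url())
```

The aggregation page then just fetches each URL on a timer, parses the JSON and renders the combined results in one place.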

I’m not sure if there’ll be any other members of the audience who will participate on the day, but hopefully in the days and weeks following the event, as a few people blog about it or post photos, all of their thoughts will be aggregated in one location.

Whatever the outcome is, it will be an interesting first step in how to use a few more of these tools in our work.

Webjam 8 presentation

Monday, October 13th, 2008

A video of the presentation I gave at Webjam 8 has been posted.  Thanks again to Lachlan for organising such a fantastic night.

State Library of New South Wales Flickr mashup

Wednesday, October 1st, 2008

The State Library of New South Wales has joined the Flickr Commons, which means that with one small change to my code I can reproduce my “then and now” mashup for their collection.  There aren’t many geotagged photos as yet, but that’s the beauty of this – anyone can go in and add tags, and it will evolve into something useful over time.  Check it out.

State Library of New South Wales then and now mashup

Powerhouse street view mashup

Tuesday, August 19th, 2008

Recently I’ve been thinking of more and more ways that museums and libraries can expose their collections by methods other than typing a search term into a search box – yawn…

Like most Australians I’ve been playing around with the Street View data Google has added for Australian cities, and it got me thinking: how could this be used by museums and libraries?  An obvious candidate is photographs, as many photographs are street views. Earlier this year the Powerhouse Museum in Sydney joined the Flickr Commons with their Tyrrell collection.  In a report on their first 3 months on Flickr, Seb Chan noted that over 50% of the photos were geocoded.  This provided the perfect scenario for an experiment, as every Flickr account has a GeoRSS feed.  Could this feed be combined with Google Maps Street View to show historical photos and contemporary street images side by side?

After about 30 minutes of coding I had a nice proof-of-concept demonstration page.  The interface is a little clunky, but it works.  It could be improved by using the Flickr API rather than the RSS feed to generate the images. There are some issues you can’t resolve, like the rotation of the street view compared to the photo, but that isn’t really a show stopper.
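The glue for a mashup like this is just reading latitude/longitude pairs out of the GeoRSS feed. Flickr's feeds carry a georss:point element (a space-separated "lat lon" pair); the tiny feed below is a made-up stand-in, so the title and coordinates are illustrative only:

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <item>
      <title>George Street, Sydney</title>
      <georss:point>-33.8678 151.2073</georss:point>
    </item>
  </channel>
</rss>"""

GEORSS_POINT = "{http://www.georss.org/georss}point"

def geotagged_items(feed_xml):
    """Yield (title, lat, lon) for each item carrying a georss:point."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        point = item.find(GEORSS_POINT)
        if point is not None:
            lat, lon = map(float, point.text.split())
            yield item.findtext("title"), lat, lon

for title, lat, lon in geotagged_items(SAMPLE_FEED):
    print(title, lat, lon)
```

Each (lat, lon) pair can then be handed to the Google Maps API to position a Street View panorama next to the historical photo.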

I would love to know your thoughts.  Am I on to something here?