Posts Tagged ‘laf2008’

Communicating University Research Identity

Thursday, October 23rd, 2008

These are notes from a talk by Simon Porter from the University of Melbourne at the Libraries Australia Forum 2008.

Simon Porter

All you needed to know about a University used to be in a single book.  The number of pages increased over time, from 16 pages in 1870 to 227 pages in 2004.  Although the scope remained the same, the size increased.  Now the University calendar is a brand for an online resource.  From the webpage you get sent to a faculty home page.  The information isn’t collated in the way it used to be, and it is often stored in many places rather than a central repository.

This provides an important contextual framework for history and a way to structure a history.  Because we are now in the online space, we can do different things with it.  We can collect stories not from just one individual, but from many individuals, and relate them together.

In 2003 the University ran on disparate systems, with the information replicated all over the place.  By 2005 much of the information started to be in one place.  By 2006, they could take this information that had been private and make it public, giving each academic their own web page showing their information, their publications, their awards and honours.

With the list of publications on their pages they can construct OpenURLs to try and source the publications online.  They can then also link to other academics that have worked on the same projects or grants.  This is required as part of Government reporting.
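As a rough illustration of the OpenURL idea, here is a minimal Python sketch that turns a publication's details into an OpenURL 1.0 (Z39.88-2004) query string.  The resolver address and the keys of the citation dictionary are made up for the example; the talk didn't say how Melbourne builds its links.

```python
from urllib.parse import urlencode

# Hypothetical resolver address (an assumption); each institution has its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def build_openurl(citation: dict) -> str:
    """Build an OpenURL 1.0 key/encoded-value query for a journal article.

    The dictionary keys (article_title, journal_title, ...) are invented
    for this sketch and are not part of any standard.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": citation["article_title"],
        "rft.jtitle": citation["journal_title"],
        "rft.aulast": citation["author_last_name"],
        "rft.date": citation["year"],
    }
    # urlencode percent-encodes the values so they are safe in a URL.
    return RESOLVER_BASE + "?" + urlencode(params)


example = build_openurl({
    "article_title": "An Example Article",
    "journal_title": "Journal of Examples",
    "author_last_name": "Citizen",
    "year": "2008",
})
```

The resulting link can sit next to each publication on the academic's page, and the resolver decides where an online copy can be found.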

Cornell University’s VIVO project.  They don’t have the same reporting requirements that we have in Australia, but they’ve built it (using RDF).  Expertise island in Ireland is a similar project.

The data has gone from being facts to being identities, not just representing the information that is there, but making an authority.  They have responsibilities to present the correct information now that the information is public rather than private.

What about privacy?  At the University of Melbourne it was expected that part of your duty was to the public.  There are some issues, so academics have the option of hiding their contact details or making them available.

Next Generation Library Catalogues

Thursday, October 23rd, 2008

These are notes from a talk by Eric Lease Morgan from University of Notre Dame at the Libraries Australia Forum 2008.

Eric Lease Morgan

The environment is changing: cheap computers that are globally connected have changed the way libraries work and what they are about.

When items are analogue it is important to create surrogates of them.  Libraries had to create a catalogue to describe their physical holdings, as there was no way to access them directly.  Now that items are often born digital, it’s not as necessary to create surrogates as it used to be.  Things like full text indexing can supplement a catalogue.  Indexing was ignored by libraries for decades, then Google came along and proved it could be done.  As items are born digital, a person no longer has to come to a library and access an item in a specific physical space; it can be accessed from anywhere.  Enormous amounts of information can be held on things like USB drives (all of WorldCat can be stored on an iPod), and storage is cheaper than in the past.

Librarianship consists of 4 processes:

  1. Collection: done by bibliographers and can be supplemented through the use of databases
  2. Preservation: done by archivists, most challenging in the current environment
  3. Organisation: done by cataloguers supplemented by databases and XML
  4. Re-distribution: done by reference librarians

These processes won’t be made obsolete by technology; it will just change the way they are done.  If you think about books, you don’t have much of a future, but if you think about what is in books, then you have a future.

There are two services the user can interact with:

  1. query against the index
  2. query against the content

In the past, users could only do queries against an index.  Now users can do queries directly against the content, for example carrying out a full text search on a book or a newspaper. The real future is in the growth of services against the content. This means users can partake in things like:

  • Annotating
  • Creating tag clouds
  • Taking quotations and citing them
  • Saving items to ‘my favourites’
  • Working out how often words are used, or which words are unique across a collection
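The last of these services is easy to picture in code.  This is only a sketch in Python, with a deliberately naive tokeniser and function names I've made up; it counts how often words occur in a text and finds the words that appear in only one document of a collection.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """How often each word occurs in a single text."""
    return Counter(tokenize(text))


def unique_words(collection):
    """For each document, the words found in no other document.

    `collection` maps a document id to its full text.
    """
    vocab = {doc_id: set(tokenize(text)) for doc_id, text in collection.items()}
    doc_freq = Counter()
    for words in vocab.values():
        doc_freq.update(words)  # counts documents, not occurrences
    return {
        doc_id: {w for w in words if doc_freq[w] == 1}
        for doc_id, words in vocab.items()
    }


collection = {"a": "the cat sat", "b": "the dog sat down"}
unique = unique_words(collection)
```

None of this is possible against a catalogue record alone; it needs the full text, which is the point being made about services against the content.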

Libraries are always part of a larger hosting community.  Learn how to take advantage of this fact and put searches against the catalogue into the user’s context.  This used to be done face to face: you built a relationship with the librarian, so why is it so impersonal on the web?  You can replicate this with a computer, but not replace it.

You need to know your user. For example, if a user is searching for nuclear physics, the results you should return are different if the user is a physicist or a high school student.

Databases are great for organising and maintaining content, but they are lousy when it comes to search.  You have to know the structure of the database in order to do a search.  Indexes are the opposite.  An index is a list of words with pointers to where each word can be found.  You don’t need to know the structure of the database and you can do things like relevance ranking.
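To make "a list of words with pointers" concrete, here is a tiny inverted index in Python.  It is only a sketch: the function names are my own, and it leaves out relevance ranking and everything else a real indexer does.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "Introduction to nuclear physics",
    2: "Physics for high school students",
    3: "A history of libraries",
}
index = build_index(documents)
hits = search(index, "nuclear physics")
```

The searcher only supplies words; nothing about tables, fields or record structure is needed to get an answer.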

“Next-generation” catalogues such as VuFind, Evergreen, Primo and AquaBrowser are all very, very similar, with the exception of Evergreen, which is an integrated library system.  Discovery systems deal with MARC records, EAD and XML; these systems normalise them to create an index, most of them using Lucene as the indexer.  They are essentially open source with a layer on top.


The library catalogue isn’t really YOUR catalogue.  Include everything related to your audience in an index, not just stuff that you own.  Make sure everything in there is accessible via Google, Yahoo, MSN.  Put as much open access content in there as possible.  Gather it and include it in your index; you can’t rely on others to do it, and it’s easier to search and do things with the data when you have control of it.  Apply a library eye to incoming queries (eg: munge the query into a phrase search to enrich the query).  We need to use fewer library standards and more W3C standards.  Repurpose the system by exploiting SOA and RESTful computing techniques.
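One way to picture the query munging is below.  This is a sketch in Python that assumes a Lucene-style query syntax (double quotes for a phrase, OR between alternatives); the function name and the exact rewrite rule are my own assumptions, not something shown in the talk.

```python
def to_phrase_query(raw_query: str) -> str:
    """Enrich a multi-word query with an exact-phrase alternative.

    Single words and queries that already contain quotes are passed
    through untouched (apart from whitespace clean-up).
    """
    query = " ".join(raw_query.split())  # collapse stray whitespace
    if len(query.split()) < 2 or '"' in query:
        return query
    # Keep the original words as a fallback so no results are lost.
    return f'"{query}" OR ({query})'


enriched = to_phrase_query("  nuclear   physics ")
```

A search engine that scores phrase matches higher would then tend to float records containing the exact phrase to the top, while still returning records that only contain the individual words.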


How we do things is changing, requiring retraining and a shift in attitudes to investigate ways to exploit the current environment.

New Futures for Authority Data

Thursday, October 23rd, 2008

These are notes from a talk by Thom Hickey from OCLC at the Libraries Australia Forum 2008.

Thom Hickey

What is the future of authorities?  They need to have wider and wider coverage, for instance names used in journal articles, and machine-generated and machine-assisted input into the authority files.  It’s not just librarians that are creating this data, but researchers and genealogists adding tags to records and creating relationships between records.

APIs will be used to provide multiple views, better context, better navigation, more mashups.

Hardware has become interesting in the past couple of years with developments like the introduction of cheap Linux clusters (eg: 130 CPU clusters) where all of WorldCat can be stored in memory and things can be done that couldn’t be done when it had to reside on disk.  This gives about a 100-fold speed increase (eg: what used to take a month now takes 8 hours).

Controlling names in WorldCat

Matching names was done manually, and they wanted to use standard identifiers.  Every 3 months they would go through and try to match them to the authorities (also using the German and French files).  So far only 1% of names have been linked to the authority file.  WorldCat Identities: aggressive linking.

Their cataloguing client just uses HTTP, so Thom has written a program that acts like a client: it logs in to find a record, verifies the record hasn’t changed, inserts the link into the record and adds it back into the database.  If it takes 2 seconds per update over ~26 million records, then it will take nearly 2 years to do it all.  How do you do it quicker?  You throw extra computing power at it and do it with 40 concurrent clients.  Of course this ties up resources on the servers, so you do it overnight during the typical down time on the server.  With 40+ clients updating 20 records/second it will then take 2-3 months to update the links.
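The numbers above can be checked with a few lines of Python.  The record count, the 2 seconds per update and the 40 clients come from the talk; the length of the overnight window was not given, so the 5 hours used here is purely an assumption chosen to show how "2-3 months" can come about.

```python
RECORDS = 26_000_000
SECONDS_PER_UPDATE = 2
CLIENTS = 40
NIGHTLY_WINDOW_HOURS = 5  # assumption: not stated in the talk

SECONDS_PER_DAY = 24 * 60 * 60


def serial_days(records, seconds_per_update):
    """Days needed by one client working around the clock."""
    return records * seconds_per_update / SECONDS_PER_DAY


def nightly_days(records, seconds_per_update, clients, window_hours):
    """Nights needed when several clients run only in a nightly window."""
    records_per_second = clients / seconds_per_update
    records_per_night = records_per_second * window_hours * 3600
    return records / records_per_night


one_client = serial_days(RECORDS, SECONDS_PER_UPDATE)
overnight = nightly_days(RECORDS, SECONDS_PER_UPDATE, CLIENTS, NIGHTLY_WINDOW_HOURS)
```

One client comes out at roughly 600 days, and 40 clients in a 5 hour nightly window at a little over 70 nights.  Running the same 40 clients around the clock would finish in about 15 days, which is why the size of the overnight window dominates the estimate.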

VIAF – Virtual International Authority File

Create links between authority files from around the world.  Working with Library of Congress, BnF and DNB. 

  • 7.5 million person name authority records
  • 25 million bibliographic records
  • 1.2 million links between files

Matching names and dates makes it relatively easy.  Authority records and their associated bibliographic records bring in a lot more information and make a much richer data source to match against.  Match against:

  • names and dates in headings
  • standard numbers
  • titles
  • co-authors
  • publishers
  • personal names as subjects

Most names are only in 1 file.  Some names are the same across all three files.  A common case is where the names are different.  The trickiest case is where 1 file has 2 names that are the same, but belong to different people.

2 issues with matching and merging:

  • Pyotr Ilyich Tchaikovsky and Peter Tschaikowski – same person different names
  • Marcel Fournier and Marcel Fournier – same name different person
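A toy version of evidence-based matching might look like the Python below.  It is a sketch only: the field names, the rule of requiring two kinds of shared evidence and the sample records are all invented for illustration (apart from Tchaikovsky's dates), and this is not VIAF's actual algorithm.

```python
EVIDENCE_FIELDS = ("titles", "coauthors", "publishers", "standard_numbers")


def shared_evidence(rec_a, rec_b):
    """List the kinds of evidence two authority records have in common."""
    evidence = []
    if rec_a.get("dates") and rec_a.get("dates") == rec_b.get("dates"):
        evidence.append("dates")
    for field in EVIDENCE_FIELDS:
        if set(rec_a.get(field, ())) & set(rec_b.get(field, ())):
            evidence.append(field)
    return evidence


def is_probable_match(rec_a, rec_b, minimum=2):
    """Treat records as the same person if enough evidence overlaps.

    The name itself is deliberately ignored: it can differ for the same
    person and be identical for different people.
    """
    return len(shared_evidence(rec_a, rec_b)) >= minimum


# Same person, different names (titles are placeholder values).
lc = {"name": "Pyotr Ilyich Tchaikovsky", "dates": "1840-1893", "titles": ["Work A", "Work B"]}
dnb = {"name": "Peter Tschaikowski", "dates": "1840-1893", "titles": ["Work B"]}

# Same name, different people (all values invented).
fournier_1 = {"name": "Marcel Fournier", "dates": "1900-1980", "titles": ["Work C"]}
fournier_2 = {"name": "Marcel Fournier", "dates": "1945-", "titles": ["Work D"]}
```

Here the two Tchaikovsky records match on dates and a shared title even though the names differ, while the two Marcel Fournier records share nothing but the name and stay separate.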

When they are merging things, they can create some fairly large records.  Standard MARC records are limited to about 100,000 bytes, so they are doing it in MARC XML, where files can be megabytes (eg: Shakespeare).

How do you work out which is which?  What makes a match?  Wouldn’t it be nice to use any of these and get the correct match?

  • 1,300,000  titles
  • 526,000 double
  • 67,000 Joint authors
  • 47,000 LCCN
  • 15,000 Partial date, partial title
  • 6,000 partial date & publisher
  • 4,600 Partial title and publisher
  • 4,100 Name as subject
  • 2,100 Standard number

What are the next steps?

  • Merged display
  • Better documentation
  • More participants
  • Geographical information

OCLC is collecting all the records.

WorldCat Identities

For every name, organisation or corporation they have made a page.  They have ranked it based on holdings, so a lot of famous musicians, and authors are listed.  The most common are dead men.

They were trying to build a wiki page for every name in VIAF – it didn’t happen and it has become an individual summary page in WorldCat identities.

People typically come into WorldCat Identities from somewhere else, such as Google or WorldCat.  As you navigate through WorldCat, there is an “about the author” link that sends the user off to the WorldCat Identities page.  On the identities page people like the graphical publishing timelines, showing articles the person has published, works by them, works about them.  They are trying to do interesting things with the records by introducing visual cues into the records: the larger the font, the more holdings an institution has (similar to a tag cloud but used as a list).

What’s to stop abuse, such as people merging things they shouldn’t?  The system has a record of what has been done and has the ability to roll back.  It’s why wikis work so well.  People have the freedom to do things, but not totally ruin the system.  The system has been built so there are unique URLs for each person.  If records are merged, the old URLs would still get you to the new URL.
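The combination of a change log, roll back and redirecting old URLs can be sketched in a few lines of Python.  The class and method names are invented for the example, and this says nothing about how WorldCat Identities is actually implemented.

```python
class IdentityStore:
    """Toy store of identity pages that supports merge, redirect and undo."""

    def __init__(self):
        self.pages = {}      # identity id -> set of works
        self.redirects = {}  # merged (old) id -> surviving id
        self.history = []    # log of merges, newest last

    def add(self, identity_id, works):
        self.pages[identity_id] = set(works)

    def merge(self, old_id, new_id):
        """Fold old_id into new_id, remembering enough to undo it."""
        snapshot = (old_id, new_id, set(self.pages[old_id]), set(self.pages[new_id]))
        self.history.append(snapshot)
        self.pages[new_id] |= self.pages.pop(old_id)
        self.redirects[old_id] = new_id

    def resolve(self, identity_id):
        """Follow redirects so an old id still leads to a live page."""
        while identity_id in self.redirects:
            identity_id = self.redirects[identity_id]
        return identity_id

    def roll_back(self):
        """Undo the most recent merge."""
        old_id, new_id, old_works, new_works = self.history.pop()
        self.pages[old_id] = old_works
        self.pages[new_id] = new_works
        del self.redirects[old_id]


store = IdentityStore()
store.add("id-1", {"work-a"})
store.add("id-2", {"work-b"})
store.merge("id-1", "id-2")
after_merge = store.resolve("id-1")
```

Because every merge is logged with a snapshot of both pages, a bad merge costs little: rolling back restores both pages and removes the redirect.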

The service is about getting the pages used.  They’ve tried to build it so that there are a variety of discovery methods.

  • Searchable via SRU and OpenURL.  The searches from WorldCat into Identities are via OpenURL
  • Sitemaps for harvesters from Google, Yahoo etc
  • HTML for harvesters to follow links and so it can be used by mobile devices
  • Links to articles in Wikipedia.  They link to Wikipedia, but they would like Wikipedia to link to them, which is a bit tricky.

Social networks and the Libraries Australia forum

Friday, October 17th, 2008

On October the 23rd Libraries Australia is holding its annual forum at the Powerhouse Museum.  I’ve decided to use the event as a practical introduction to many staff members of the National Library on how social networks can be used to enhance an event.

I’m going up there with my laptop and will sit in the audience and blog the event, make some Twitter posts and upload some photos of the event to Flickr.  After the event I’ll attempt to get the speakers to upload their presentations to SlideShare.  All of these resources will be aggregated at a page I’ve made that uses the APIs from Twitter, Flickr, Technorati and SlideShare to display anything that is posted using the tag for the event, laf2008.

I’m not sure if there’ll be any other members in the audience that will participate on the day, but hopefully in the days and weeks following the event, as a few people blog about it or take photos, all of their thoughts will be aggregated in one location.

Whatever the outcome is, it will be an interesting first step in how to use a few more of these tools in our work.