giCentre - Department of Information Science

Vocabulary Cluster Graph

Our interactive 'clustering' graph was designed to help compare the books in the collection based on the similarity of their vocabularies.

The more similar two publications are, the closer they appear to one another in the graph - groups of books with similar vocabularies cluster.

The tool works by measuring the unique vocabulary of a given text and comparing it with the vocabularies of other texts. Vocabulary is considered as a fraction of total word count, so a book with an overwhelmingly large vocabulary is not automatically deemed the 'most similar' book to all or many shorter books.
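The effect of this kind of normalisation can be illustrated with a small sketch. The exact formula used by the tool is not given here, so the example below swaps in the Jaccard index (shared words divided by all distinct words in either text) as a simple stand-in; the function names and the sample sentences are hypothetical.

```python
import re

def vocabulary(text):
    """Return the set of distinct lower-cased words in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def raw_overlap(a, b):
    """Count shared distinct words. This favours texts with large vocabularies."""
    return len(vocabulary(a) & vocabulary(b))

def normalised_similarity(a, b):
    """Jaccard index (an assumed stand-in, not the tool's own measure):
    shared distinct words divided by all distinct words in either text."""
    va, vb = vocabulary(a), vocabulary(b)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

# Hypothetical sample texts, invented for illustration only.
short = "the cat sat on the mat"
similar = "the cat sat on the rug"
huge = "the cat sat on the mat and a dog ran over every hill beyond that quiet town"

# The raw count ranks the long text as closest to `short`,
# while the normalised score ranks the genuinely similar text first.
print(raw_overlap(short, similar), raw_overlap(short, huge))
print(normalised_similarity(short, similar), normalised_similarity(short, huge))
```

With a raw count, the long text wins simply because it contains more words; once the score is normalised, the short text with nearly the same vocabulary comes out ahead.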

The colours of the books represent the authors - and show just how similar a given author's texts tend to be, in terms of the vocabulary used. Shakespeare's plays can be seen to be closely related in terms of vocabulary (see the gray documents) and the light blue cluster shows that the same is true of many, but not all, of Dickens' titles.

In order to draw the graph, each publication is linked to its 6 most similar publications. The colours and tensions used in the linking lines show the strengths of the relationships between books - for example, there are strong similarities between the two Mark Twain books in the database.
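Linking each publication to its most similar neighbours can be sketched as follows. This is a minimal illustration rather than the tool's own code: the `nearest_links` function, the dict-of-dicts input format, and the placeholder titles and scores are all assumptions.

```python
def nearest_links(similarity, k=6):
    """Map each item to its k most similar other items, strongest first.

    `similarity[a][b]` is a score where higher means more alike
    (a hypothetical input format chosen for this sketch).
    """
    links = {}
    for item, scores in similarity.items():
        others = [(score, other) for other, score in scores.items() if other != item]
        others.sort(key=lambda pair: pair[0], reverse=True)
        links[item] = [other for _, other in others[:k]]
    return links

# Placeholder titles and invented scores, for illustration only.
scores = {
    "A": {"B": 0.9, "C": 0.2, "D": 0.5},
    "B": {"A": 0.9, "C": 0.3, "D": 0.4},
    "C": {"A": 0.2, "B": 0.3, "D": 0.8},
    "D": {"A": 0.5, "B": 0.4, "C": 0.8},
}

# k=2 keeps the toy example small; the graph described above uses 6.
links = nearest_links(scores, k=2)
print(links)
```

Note that links built this way need not be mutual: in the toy data, D is among B's two nearest neighbours, but B is not among D's.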

We can interact with the graph to find out more about the books and explore the stability of the clusters with the following keyboard and mouse controls:

  • drag publications and watch them find their 'place' in the vocabulary graph
  • hold down the shift key and the left mouse button to zoom
  • hold down the shift key and the right mouse button to pan
  • click a book for a short summary of the document
  • press R to restore the original view
  • press L to display the lines that link the most similar publications
    (pressing L again will turn these lines off)

Advanced Functionality for Further Exploration

Use the space key to toggle movement on and off. When movement is off, publications can be dragged without returning to their original positions.

Hold down the up or down arrow on the keyboard for a second or two to increase or decrease the minimum distance of separation between nodes. This can be useful for unclustering closely associated publications.
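One way a minimum separation between nodes might be enforced is sketched below. How the graph actually does this is not described here, so the `enforce_min_separation` function, the single relaxation pass, and the sample coordinates are assumptions made for illustration.

```python
import math

def enforce_min_separation(positions, min_dist):
    """Push apart any pair of nodes closer than min_dist.

    Performs one relaxation pass; a real layout would typically repeat
    this every frame until the nodes settle (assumed behaviour).
    """
    names = list(positions)
    moved = {name: [float(x), float(y)] for name, (x, y) in positions.items()}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dx = moved[b][0] - moved[a][0]
            dy = moved[b][1] - moved[a][1]
            gap = math.hypot(dx, dy)
            if gap >= min_dist:
                continue
            # Unit vector from a to b; pick an arbitrary direction if they coincide.
            ux, uy = (dx / gap, dy / gap) if gap > 0 else (1.0, 0.0)
            shift = (min_dist - gap) / 2  # each node moves half the shortfall
            moved[a][0] -= ux * shift
            moved[a][1] -= uy * shift
            moved[b][0] += ux * shift
            moved[b][1] += uy * shift
    return {name: tuple(xy) for name, xy in moved.items()}

# Hypothetical node positions: A and B start too close together.
start = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 0.0)}
spread = enforce_min_separation(start, min_dist=3.0)
print(spread)
```

Raising `min_dist` spreads tightly packed nodes further apart, which corresponds to the unclustering effect described above; nodes that are already far enough apart are left where they are.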