Pirarucu

Pirarucu Digital Library Project

Student employees: Peter Djalaliev and Matthew Grieco

Faculty Advisors: Hugh Blackmer and Skip Williams

Date: September 9, 2002

Project Title: Pirarucu Digital Library project

Abstract

The Pirarucu project was started and is still being developed by faculty members and students from Washington & Lee University.  Our vision of the project is to create a widely accessible digital library, which not only provides an easy-to-understand and efficient way to store and manage information, but also places a strong emphasis on providing accessible, powerful and easy-to-use tools for collaboration between people, a feature we couldn't find in any existing digital library or collaboration tool online.

Final Report


Section 1: Introduction

This design document describes the purpose, vision and technical implementation of a digital library that is to be used in a collaborative knowledge-sharing environment. It is meant to give the reader an overall idea of the architecture and functionality of the digital library. The document may be changed and expanded over time.

MAIN GOALS OF OUR PROJECT

The main goals of our project are to build a digital library, a computer-mediated environment, which is able to:

  • be accessible globally
  • store information in the form of files or references to files in a way that allows them to be accessed easily
  • store metadata about this information
  • provide users with a variety of interfaces to access the information stored in the database.
  • allow for the improvement and meta-improvement (improving the way we improve) of knowledge by a community of people. By a community of people we mean any group of people collaborating with each other.
  • communicate with other digital libraries around the world

SPECIFIC REQUIREMENTS OF THE DIGITAL LIBRARY

More specifically, the digital library will provide users with the following functionality:

  • create their own personal digital libraries inside the digital library in order to manage efficiently the personal files they have to deal with on a daily basis
  • organize the contents of their personal digital libraries so that they can access them easily
  • publish documents in their personal digital libraries, so they can be viewed globally
  • collaborate with people sharing the same interests by exchanging knowledge online. They can exchange both files through the digital library server and ideas through the discussion threads the library is going to offer them.
  • manage the access rights to the files in their own personal digital libraries by granting specific rights to specific files to specific people or by declaring those files public.
  • search through the information other users have posted in the digital library and declared as public.
  • pull information out of the digital library in a format suitable for teaching purposes.
  • pull information from other digital libraries across the world

DEFINING OUR TERMINOLOGY

Before explaining our vision of this digital library in detail, we need to define the controlled vocabulary we are going to use:

OBJECT – the building block of the information being stored and exchanged. It could be a file, a hyperlink, or a reference. A reference is some pointer to an object that is not a file. For example, the reference could be the call number of a book in a library.

RECORD – an entity that is wrapped around an object and describes it. It consists of a unique identifier, metadata and a link to the object itself.

USER – any human being that interacts with the library (as a contributor or recipient of information). Concerning their access privileges, we identified three possible groups of users:

  • An OWNER is a user who submits a record in the library and has full rights to manage it.
  • A COLLABORATION MEMBER is a user who uses the digital library to collaborate with other users and share records with them. He has the right to view, change and delete his own records, but only view the records of the other collaboration members. In the future he will be able to make changes to a record belonging to another user and submit a new version of that record. He is considered the owner of the new version. Each collaboration member should have the ability to receive notifications from the server whenever there is a change to the set of files shared by the group. Every collaboration member will most likely also be an owner and share part of the records in his personal digital library with his collaboration group.
  • A member of the GENERAL PUBLIC has the rights to search and read records already existing in the library. He can also create his own user account and submit new records, thus becoming an owner.

Concerning how users interact with the digital library, they fall into at least one of two categories:

  • A CONTRIBUTOR is a user who submits information
  • A RECIPIENT is a user who looks for information

PERSONAL DIGITAL LIBRARY (PDL) – all the records an owner has created and keeps in the library. For example, my PDL could be all the documentation I am compiling about a project and keeping in the digital library. The PDL has to be on a server that is shareable. It is not necessary for all the elements of a PDL to be shared with other users; the owner might decide that he doesn’t want to let anybody else access some of the records in his PDL.

COLLECTION – a set of records from a personal digital library grouped together because they share a common characteristic, for example a common subject.

COLLABORATION – a set of one or more collections and individual records shared between a group of users who collaborate with each other.

DIGITAL LIBRARY – a set of nodes that communicate to maintain data concurrency. In the beginning our digital library will be standalone, with only one node.

VETTER – a person who approves the content of a record before the record is declared available to the public.

CLASSIFICATION – a specific attribute of a record that allows the user to categorize the record. The classification has two parts, a name and a value. The name specifies the classifier that the user is classifying the record by and the value is the specific value of that classifier. For example, if the user wants to create a category in which all records deal with the culture of Tanzania, the classification will have the name "culture" and the value "Tanzania".
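The terms above can be read as a small data model. The following is a minimal sketch in Python, used here only for illustration; the class names, field names and the choice of a UUID as the unique identifier are assumptions, not part of the project's actual database schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List
import uuid


class ObjectKind(Enum):
    FILE = "file"
    HYPERLINK = "hyperlink"
    REFERENCE = "reference"  # e.g. the call number of a book


@dataclass
class DigitalObject:
    kind: ObjectKind
    locator: str  # file path, URL, or the text of the reference


@dataclass
class Classification:
    name: str   # the classifier, e.g. "culture"
    value: str  # its value, e.g. "Tanzania"


@dataclass
class Record:
    """Wraps an object with an identifier, metadata and classifications."""
    owner: str
    obj: DigitalObject
    metadata: Dict[str, str]
    classifications: List[Classification] = field(default_factory=list)
    # Assumption: a random UUID serves as the unique identifier.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Hypothetical example values, for illustration only.
record = Record(
    owner="alice",
    obj=DigitalObject(ObjectKind.REFERENCE, "EXAMPLE-CALL-NUMBER-123"),
    metadata={"title": "Field notes", "description": "Notes on Tanzanian culture"},
    classifications=[Classification("culture", "Tanzania")],
)
```

In this reading, a personal digital library is simply the set of `Record` values sharing one `owner`.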

Section 2: Our vision - improvement and meta improvement of communities

These concepts are our vision of the digital library. Even though the method of their implementation could change and more ideas might be added over time, these concepts are the building blocks of our digital library.

The currently existing digital libraries can almost always be classified in one of the following two categories:

EDUCATIONAL TOOLS FOR TEACHING AND LEARNING

On one hand, there are the existing digital library projects on the Internet. Almost every big academic institution supports an initiative for building a digital library. However, most of these projects aim to provide educational tools for teaching and learning in a particular field. Few of these libraries give their users the ability to collaborate with each other. Even the ones that provide this ability limit the users to a specific scientific field. Many of those projects also limit the scope of the user to a particular community, for example the academic body of a particular university. Also, very few of these libraries attempt to communicate with other digital libraries around the world.

COMMERCIAL COLLABORATION SOFTWARE

On the other hand, there are the collaboration software applications offered by commercial software development companies. They provide powerful tools for collaboration and exchanging knowledge, but they have to be paid for. As a result, the users have to be part of some organization that has purchased the collaboration software. Also, not all of these tools provide a web interface to make the shared information available globally.

SO, WHAT IS THE PROBLEM?

When a physics professor from Yale, the science librarian from Washington and Lee and a person doing field research in the Amazon valley in Brazil want to collaborate, they form a community – a group of people sharing similar academic interests. Their collaboration results in a number of discoveries that they compile into documents. They need a way to exchange ideas about the work they are doing. They need a way to share documents, so that when the professor writes a report, each of the others can get a copy, make any changes he thinks should be made and return the new version of the document to the other members of the collaboration group. The group needs to improve and meta-improve its shared knowledge. In other words, it needs to improve the quality and quantity of the information it shares and also the practices it uses to improve this information.

Neither the existing digital library projects nor the available commercial tools give this group a suitable environment in which to collaborate. The existing digital libraries do not provide strong enough support for collaboration and the commercial collaboration tools are too expensive. So, their only means of collaboration is e-mail, which is slow and makes it very hard to keep their shared knowledge organized and easily accessible.

OUR VISION

We want to build a digital library that is free and open-source. It will allow users to store and manage the documents they have to deal with on an everyday basis in an organized manner that allows them to find specific information they need quickly and easily. It will also place a stronger emphasis on collaboration between communities of people sharing similar interests, no matter whether or not they belong to the same organization or academic field. The collaboration interface the library will offer has to be as powerful and easy-to-use as possible. The library will also offer powerful and easy-to-use educational and research user interfaces, but the powerful collaboration interface is what will make this library different from all other existing projects we have encountered.

Section 3: The Structure of the Digital Library

In order to be able to store information and provide access to it efficiently, our digital library will have the following structure:

CORE

The core of the digital library will provide the main internal functionality of the library – storage and also means for building tools for communicating with the outside environment.

  • DATABASE SERVER - At the center of the core will be the database server, which will store the information. The university has a full version of Microsoft SQL Server 2000, which could be used for that purpose; it is powerful and secure. Another option we considered is an Oracle database server, but we do not own a copy of one. An ER model has been created for the rough version of the database, but it is still subject to change. The implementation can be found at http://brazilia.wlu.edu/db/brazilia.mdb
  • MEANS FOR BUILDING COMMUNICATION TOOLS - Besides the back-end database, the core has to provide means for building the tools through which the database will interact with the outside world.
    • WEB SERVER - We need a web server to make the database visible on the Web. We are planning to use a Microsoft Windows 2000 server.
    • MICROSOFT .NET FRAMEWORK - In order to display the data in the database on the Web, we need some scripting means to handle the translation from the database tables containing the information to HTML or some other format used for transmission of information over the Web. Microsoft Visual Studio .NET enables the creation of powerful ASP.NET web applications running on the web server we already have. We chose the .NET platform for our project because, although it might not provide the best performance possible, it provides easy access to powerful means of web development. This allows us to spend more time thinking about the concepts of the project and less time on its implementation, which is very important in the initial stage of the project. In the future, when people start using the digital library and require better performance, we might rewrite the whole thing on another platform.
    • GIS SERVER - Since we are also planning to deal with GIS data, the core of the library will also need a GIS server to handle this data and convert it to web-friendly formats.

USER INTERFACES

Each different interaction between the digital library and the outside world requires a different interface. For example, we need interfaces to communicate with other libraries (one for input of information from other libraries and one for output to other libraries) and with human users (one to allow people to submit information to the library and one to allow people to access information from the library). Each of these interfaces is built on top of the core of the library.

COMMUNICATION BETWEEN DIGITAL LIBRARIES

DIGITAL LIBRARIES OF THE SAME TYPE - SERVERS AND PROXIES

The digital library we are going to build consists of a single node, i.e. a standalone server, but in the future it might consist of multiple nodes communicating and exchanging information with each other via the Internet. In a digital library consisting of two or more nodes, each node can usually serve both as a server, hosting records, and as a proxy, storing information about records on another server. For example, if the digital library consists of two nodes, Brazil and US, both of them will be servers because both of them will store records submitted by local users. Moreover, US will be a proxy for the records stored on Brazil and Brazil will be a proxy for the records stored on US. A proxy could act like an HTTP proxy: it caches recently requested records from the server and, the next time the same record is requested, provides its own copy. Under certain conditions it can check with the server to make sure its copy of the record is up to date. Alternatively, the proxy could store only references to the records on the server. In that case the proxy will have to request a copy of the record every time a user attempts to access that record.
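The caching variant of the proxy can be sketched as follows. This is a minimal illustration in Python, assuming that each record carries a version number that the proxy can compare cheaply; the class and method names are hypothetical and no network transport is modeled.

```python
from typing import Dict, Tuple


class Server:
    """A node that hosts records submitted by its local users."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[int, str]] = {}  # id -> (version, content)
        self.fetches = 0  # counts full record transfers, for illustration

    def put(self, record_id: str, content: str) -> None:
        version = self._records.get(record_id, (0, ""))[0] + 1
        self._records[record_id] = (version, content)

    def version_of(self, record_id: str) -> int:
        return self._records[record_id][0]

    def fetch(self, record_id: str) -> Tuple[int, str]:
        self.fetches += 1
        return self._records[record_id]


class CachingProxy:
    """Serves cached copies and can revalidate them against the server."""

    def __init__(self, server: Server) -> None:
        self._server = server
        self._cache: Dict[str, Tuple[int, str]] = {}

    def get(self, record_id: str, revalidate: bool = False) -> str:
        cached = self._cache.get(record_id)
        stale = (
            cached is not None
            and revalidate
            and cached[0] != self._server.version_of(record_id)
        )
        if cached is None or stale:
            cached = self._server.fetch(record_id)
            self._cache[record_id] = cached
        return cached[1]


brazil = Server()
brazil.put("r1", "first draft")
us_proxy = CachingProxy(brazil)
us_proxy.get("r1")                            # fetched from Brazil and cached
us_proxy.get("r1")                            # served from the local copy
brazil.put("r1", "second draft")
latest = us_proxy.get("r1", revalidate=True)  # version changed, so refetched
```

The reference-only variant corresponds to a proxy that never stores the content and calls `fetch` on every request, trading bandwidth for guaranteed freshness.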

DIGITAL LIBRARIES OF DIFFERENT TYPES

When we are communicating with different digital libraries, we need to think about exchange formats. Exchanging knowledge with another digital library, in the simplest sense, means taking a group of records from our digital library and sending them to the other one. However, every database uses its own schema, i.e. the records of every database look different, depending on the underlying concepts the creators had in mind while building the database. That’s why, before we transfer information from one database to another, we need to convert it to the schema of the recipient database. This could be done by moving the data in XML format. SQL Server 2000 has XML input and output services integrated into it, so it can export and import data in XML format. XML is very useful for inter-database communication because XSLT provides us with an easy way to translate the information from one XML schema to another, thus transforming the schema of the information being transported into the schema of the recipient database. XML has the further advantage of being an open standard, so data stores can be expected to implement the same XML rather than differing proprietary versions. In the future, the digital libraries will probably be able to use the SOAP protocol to transfer the XML data from one to another.
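The idea of translating a record from one XML schema to another can be illustrated with a short sketch. It does not use XSLT: as a simplified stand-in, it renames elements in Python with the standard library's `xml.etree.ElementTree`. The element names on both sides and the field mapping are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from our field names to a partner library's names.
FIELD_MAP = {"title": "name", "description": "summary", "creator": "author"}


def translate_record(xml_text: str) -> str:
    """Rename known fields and wrap them in the partner's <item> element."""
    source = ET.fromstring(xml_text)
    target = ET.Element("item", {"id": source.get("id", "")})
    for child in source:
        if child.tag in FIELD_MAP:  # fields without a mapping are dropped
            ET.SubElement(target, FIELD_MAP[child.tag]).text = child.text
    return ET.tostring(target, encoding="unicode")


ours = (
    '<record id="42"><title>River survey</title>'
    "<description>Fish counts</description><language>en</language></record>"
)
theirs = translate_record(ours)
```

An XSLT stylesheet would express the same mapping declaratively, which is what makes it attractive when many partner schemas have to be supported.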

Section 4: Technical implementation

THE INFORMATION CONTRIBUTION PROCESS

Every user has documents on his personal computer or some other storage space that he works with. These documents are often not organized in any way other than being arranged into folders and sub-folders. This, however, makes the information in them difficult to manage and access, especially as their number grows. The digital library will enable this user to manage and access his personal information much more efficiently. When he submits part of his personal documents to the library, he creates his own personal digital library. In terms of the digital library, each document is known as an object. An object could be either a file or a pointer to a file somewhere on the Internet. With the object the user submits metadata describing the object. The library takes the object and the metadata, assigns them a unique identifier and encapsulates all this into a record. At this point the user becomes an owner and has access rights to all records that he owns. The owner is provided with a user interface that allows him to view the contents of and search through his own personal digital library.

CREATING A USER ACCOUNT

Before he can start submitting documents, one has to become a user of our digital library. He does that simply by filling out a form with personal data that helps us manage his account and communicate with him: first and last name, e-mail address, desired username and password. The e-mail address will be used to send the user notifications connected only to his personal digital library or the collaborations he is participating in. No external sources will have authorized access to this e-mail address. A user is not guaranteed to get the username he desires, because the digital library will not allow duplicate usernames. The reasons for this are mostly security and correct logging of user activities. Whenever somebody tries to access a resource in the digital library, he identifies himself with his username, and this username, together with information about the resource he accessed, will be stored in our database for possible future reference. Allowing duplicate usernames would lead to ambiguous information in our database about who tried to access some resource in the digital library.
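The uniqueness rule and the access log can be sketched as below. This is an illustrative Python fragment, not the project's ASP.NET code; the class and method names are hypothetical, usernames are assumed to compare case-insensitively, and password handling is left out entirely.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


class UsernameTaken(Exception):
    """Raised when the desired username already exists."""


class AccountRegistry:
    def __init__(self) -> None:
        self._accounts: Dict[str, Dict[str, str]] = {}
        # Each entry: (username, resource, time of access)
        self.access_log: List[Tuple[str, str, datetime]] = []

    def register(self, username: str, first: str, last: str, email: str) -> None:
        key = username.lower()  # assumption: case-insensitive usernames
        if key in self._accounts:
            raise UsernameTaken(username)
        self._accounts[key] = {"first": first, "last": last, "email": email}

    def log_access(self, username: str, resource: str) -> None:
        key = username.lower()
        if key not in self._accounts:
            raise KeyError(username)
        self.access_log.append((key, resource, datetime.now(timezone.utc)))


registry = AccountRegistry()
registry.register("alice", "Alice", "Example", "alice@example.org")
registry.log_access("alice", "record/1")
```

Because each username maps to exactly one account, every entry in `access_log` identifies a single user.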

INFORMATION SUBMISSION

When a user logs in to the digital library, he is shown his personal digital library (PDL). He creates a new record in his PDL through a form that asks him for some identification of the object that is going to be stored and also for some metadata about this object. The object can have three possible types of identification: a direct upload, a hyperlink, or some other reference, for example a book call number. An object can have exactly one type of identification, so it can’t have both a reference and a file. The metadata the user is asked for will help him manage the object, as well as help other people find the object when they are searching the digital library for information they need. All metadata fields are optional, except for the title and the description fields. The contents of some of these fields come from the user, while others are generated by the digital library when the record is submitted. The user-populated metadata elements and their descriptions are as follows:

  • title – title of the document, mandatory element
  • description – a short description of the contents of the object being submitted, mandatory element
  • subject – a space- or comma-separated list of keywords associated with this object that are not part of the title or the description. If all the keywords are in either the title or the description, this field should be left blank.
  • public – whether the record has public or restricted visibility. If the record is public, it will be seen by every user who searches for some information. Otherwise, only the user who owns the record will have access to it. More detailed access rights will be developed in the future. After this is done, users will be able to assign specific rights to specific users. Access rights will include the right to read from, write to and share the record. A record declared as public means that everybody has all rights to the record.
  • language – the language that the text contents of the object are in
  • date of creation – when the object was created. For example, if the object is a report, the date when the report was composed.
  • creator – the original author of the object
  • contributors – any people who contributed to the creation of the object.
  • publisher – the entity which published the object
  • rights – any additional information about the intellectual rights held by any individuals or institutions and associated with the object
  • coverage – the names of any geographical regions associated with the contents of the object
  • decimal longitude – the longitude of any point associated with the current object, measured in decimal degrees
  • decimal latitude – the latitude of any point associated with the current object, measured in decimal degrees
  • lower time bound – the lower bound of any period of time associated with the contents of the object. If there is a specific date, the lower time bound should be this date
  • upper time bound – the upper bound of any period of time associated with the contents of the object. If there is a specific date, the upper time bound should be this date
  • text of additional hyperlink – if there is any website associated with this object, this field would contain some text, other than its URL, identifying this website
  • URL of additional hyperlink – the URL of the website mentioned above

The list of the automatically generated metadata elements and their description includes:

  • filename – the name of the file uploaded, if the object is a file
  • file size – the size of the file uploaded, if the object is a file
  • file format – the format of the object. If the object is a file, the format will be the MIME type of the file. If it is a hyperlink, this will be the address the hyperlink is pointing to. If it is a reference, this field will contain the text of the reference
  • date submitted – the date the record was submitted in the database
  • owner – the username of the user whose PDL this record is part of.
  • version – the number of the version of the record

After filling in any metadata he wishes to specify for the record, the user is asked whether he wants the record to be part of any collections he has or collaborations he is part of. Collections and collaborations are discussed later in this section.

FORMAT AND CONTENT CONTROL OF INFORMATION

When a contributor submits a record, the interface he is using will be responsible for ensuring that the record being submitted follows the requirements imposed by the database schema – having a value for all required fields, etc. This will guarantee that all records that ever come to the database will follow the same schema. This will be particularly important when we want to communicate with other digital libraries. When we translate from one database schema to another, we will have to assume that all the records we are translating follow the database schema we are translating from.
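A submission check of this kind could look like the following minimal sketch. It is written in Python for illustration and encodes only the two rules stated in this document: title and description are mandatory, and an object has exactly one type of identification. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("title", "description")
IDENTIFICATION_KINDS = ("file", "hyperlink", "reference")


def validate_submission(
    metadata: Dict[str, str],
    identification: Dict[str, Optional[str]],
) -> List[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    errors: List[str] = []
    for name in REQUIRED_FIELDS:
        if not metadata.get(name, "").strip():
            errors.append(f"missing required field: {name}")
    provided = [kind for kind in IDENTIFICATION_KINDS if identification.get(kind)]
    if len(provided) != 1:
        errors.append("exactly one of file, hyperlink or reference is required")
    return errors


ok = validate_submission(
    {"title": "Field report", "description": "Notes from the river survey"},
    {"hyperlink": "http://example.org/report"},
)
bad = validate_submission(
    {"title": "Field report"},
    {"file": "report.doc", "reference": "EXAMPLE-CALL-NUMBER-123"},
)
```

Returning all problems at once, rather than stopping at the first, lets a form show the contributor everything that needs fixing in one pass.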

No control over the content of the information will be exercised until the information is declared as public. When an owner declares a record as public, the record first goes to a vetter, who is responsible for approving the content of the document. Only then is the record available to the public.
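The vetting step can be pictured as a small state machine. The sketch below is illustrative Python; the state names are hypothetical, and the behavior on rejection (the record returns to private) is an assumption, since this document does not specify what happens when a vetter declines a record.

```python
from enum import Enum


class Visibility(Enum):
    PRIVATE = "private"
    PENDING = "pending vetting"
    PUBLIC = "public"


class VettedRecord:
    def __init__(self, title: str) -> None:
        self.title = title
        self.visibility = Visibility.PRIVATE

    def declare_public(self) -> None:
        """The owner asks for publication; the record waits for the vetter."""
        if self.visibility is Visibility.PRIVATE:
            self.visibility = Visibility.PENDING

    def vet(self, approved: bool) -> None:
        """Apply the vetter's decision (rejection -> PRIVATE is an assumption)."""
        if self.visibility is not Visibility.PENDING:
            raise ValueError("record is not waiting for vetting")
        self.visibility = Visibility.PUBLIC if approved else Visibility.PRIVATE

    def visible_to_public(self) -> bool:
        return self.visibility is Visibility.PUBLIC


report = VettedRecord("Field report")
report.declare_public()
hidden_while_pending = not report.visible_to_public()
report.vet(approved=True)
```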

INFORMATION ACCESS – USER GROUPS

Users who come to the digital library to look for information are known as recipients. A recipient can be an owner, a collaboration member or a member of the general public. A user can be an owner and a collaboration member and also a member of the general public. In this case he will have a variety of interfaces to choose from and each interface will provide him with unique information access abilities.

OWNERS

An owner has a set of records stored in the digital library. This set of records is known as his personal digital library. When he logs on, he will be able to access an interface that will allow him to:

  • view the contents of his personal digital library
  • search through the contents
  • create collections
  • move records in and out of collections
  • manage the access rights of the records and collections in his personal digital library

COLLABORATION MEMBERS

Collaboration members will be provided with an interface to work with the other people from their collaboration group. The collaboration interface will allow users to:

  • view and search through the records shared by the group
  • download objects, make changes and submit the new version back in the collaboration
  • view the activity log for each record shared by the group to see what was accessed when and by whom.
  • engage in discussions with the other members from the collaboration group through a bulletin board or a messaging system.

MEMBERS OF THE GENERAL PUBLIC

Members of the public are all users who are not owners and who are not collaboration members. Their common characteristic is that they have the right to access only records that are declared as public by their owners. They will be able to search for information through a tree-like category structure and through a search engine.

INFORMATION ACCESS – USER INTERFACES

PERSONAL DIGITAL LIBRARY

Once the user has logged in, he can access his personal digital library. He can view his account summary, the list of the records in his PDL and the records themselves. He can add new records through the process described above, delete records and edit the current records. He can also access his collections and collaborations, as well as organize his records using classifications. Classifications will be discussed in the SEARCHING CAPABILITIES section of this document.

COLLECTIONS

Collections are a means for users to organize the records in their PDL. A collection is usually a group of records having something in common besides being owned by the same person – a shared subject, approximately the same date of creation or some other criterion. Users cannot share collections with each other, i.e. all records in a collection have the same owner. Any record can be part of any number of collections or of no collection at all. Collections, technically speaking, are only groups of references to records; they do not contain copies of those records. This means that if a record is changed in one collection, the changes will apply to all other collections the record is in. The only way at the moment to make some changes and keep the old record is to create a new record. In the future, users will be able to create different versions of the record.

When a user is in his PDL, he can create a collection by entering the name of the collection he wishes to create. No user can have two collections with the same name. The user can also specify which records he wants to put in the collection.

Users can edit the contents of the collection, i.e. which records are in the collection. They do this through a set of check boxes, one for each record, which they check or uncheck in order to add a record to or remove it from the collection. When a record is removed from a collection, it is not deleted permanently. The digital library only removes the reference to the record from the collection list.

Users can use classifications to organize the records in their collections. Classifications are described further in the SEARCHING CAPABILITIES section of this design document.

While viewing the list of records in a collection, the user can manage these records the same way he manages them from his PDL – view, edit and delete them. After he closes the editing view for a specific record, the user will be taken back to the list of records in the collection. When the user edits the contents of a record or deletes the record itself, he is acting on the original record and any changes made will be reflected in all other references to the record.

The user can also delete the collection. As above, when he deletes the collection, the digital library deletes only the references to the records in the collection and not the records themselves.
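The reference semantics of collections can be summarized in a short sketch. It is illustrative Python with hypothetical names: a collection is modeled as a set of record identifiers, so removing a record from a collection or deleting the collection never touches the record, while deleting the record removes it everywhere.

```python
from typing import Dict, Iterable, Set


class PersonalLibrary:
    def __init__(self) -> None:
        self.records: Dict[str, str] = {}           # record id -> title
        self.collections: Dict[str, Set[str]] = {}  # name -> set of record ids

    def create_collection(self, name: str, record_ids: Iterable[str] = ()) -> None:
        if name in self.collections:  # no two collections with the same name
            raise ValueError(f"collection {name!r} already exists")
        self.collections[name] = {rid for rid in record_ids if rid in self.records}

    def remove_from_collection(self, name: str, record_id: str) -> None:
        self.collections[name].discard(record_id)   # the record itself stays

    def delete_collection(self, name: str) -> None:
        del self.collections[name]                  # only references are dropped

    def delete_record(self, record_id: str) -> None:
        del self.records[record_id]
        for members in self.collections.values():
            members.discard(record_id)


pdl = PersonalLibrary()
pdl.records = {"r1": "Map of Manaus", "r2": "Fish survey"}
pdl.create_collection("amazon", ["r1", "r2"])
pdl.create_collection("maps", ["r1"])
pdl.records["r1"] = "Map of Manaus (revised)"  # one edit, seen via both collections
pdl.delete_collection("maps")
```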

COLLABORATIONS

The concept of collaborations is one of the cornerstones of our project. The whole digital library was built largely with collaboration in mind. In a world where geographical boundaries do not matter as much as they did before, people need some means to collaborate with each other. At the moment, a collaboration is, in short, a list of records or collections shared between users. Every collaboration has an owner, the user who created the collaboration. He has the right to manage the access rights of other users to the collaboration and to delete the collaboration. All users have the right to access files shared in the collaboration, add a record from their PDL to the collaboration, edit the records they own from the collaboration view and delete the records they own from the collaboration. A collaboration, just like a collection, is only a group of references to records, and changes made to any record through a collaboration are saved permanently.

When a user has logged in to his PDL, he can create a collaboration in much the same way he creates a collection – by filling in a form with the name of the collaboration and choosing the records and collections he wants to share in the collaboration. The name of the collaboration at the moment has to be unique in the whole digital library. In the future, this might be changed to provide better flexibility.

The owner of the collaboration has the right to manage and delete the collaboration. He chooses which users have access to the collaboration. When a user is added to a collaboration, he is then able to share records from his PDL in the collaboration. If a user’s access is disabled, the digital library deletes the references to all his records from the collaboration. If the user is given access to the collaboration again, he has to share his documents and collections all over again.

The owner of the collaboration can also delete the collaboration. Upon deletion, the digital library deletes the collaboration and the references in it without making any changes to the original records.

Once a user is given access to the collaboration, he can view all the records in the collaboration. If there are any collections shared in this collaboration, the library displays the records from those collections in the collaboration list view, not the collections. If a record has been shared by the same owner once as a stand-alone record and once as a part of a collection, the library will display the record only once.
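The display rule just described, expanding shared collections into their records and showing each record only once, can be sketched as follows. The Python function and its parameter names are hypothetical, and records are represented only by their identifiers.

```python
from typing import Dict, List, Sequence


def collaboration_view(
    shared_records: Sequence[str],
    shared_collections: Sequence[str],
    collections: Dict[str, List[str]],
) -> List[str]:
    """List record ids for display: collections expanded, duplicates removed."""
    candidates = list(shared_records)
    for name in shared_collections:
        candidates.extend(collections[name])
    seen = set()
    view: List[str] = []
    for record_id in candidates:
        if record_id not in seen:  # keep the first occurrence only
            seen.add(record_id)
            view.append(record_id)
    return view


view = collaboration_view(
    shared_records=["r1"],
    shared_collections=["amazon"],
    collections={"amazon": ["r1", "r2", "r3"]},
)
```

Here `r1` is shared both on its own and through the `amazon` collection, yet it appears once in `view`.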

Users can edit the set of records and collections they share in a collaboration through a set of checkboxes, one for each record they have in their PDL. In the future people will be able to search through their PDL and thus will not have to browse through a checkbox list of all their records, which could be very long. Users can also drop the collaboration from their list of collaborations, thus refusing to participate in it. When a user drops a collaboration, only the references to the records he has shared are deleted from that collaboration.

Collaboration members can also view and edit the original records through a collaboration. If the user is the owner of a record, he can view it, edit it and delete it. Otherwise, he can only view the record without making any changes. Even the owner of the collaboration cannot write to records he doesn’t own. In the future when we implement a more complex access rights system, the owner of a record will be able to grant specific rights to other users. Also, in the future users will be able to take a record, make some changes and submit it back to the digital library as a new version of the record.
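The current rights rule inside a collaboration is simple enough to state as a function. This is a minimal Python sketch with hypothetical names; it covers only the present rule (owners of a record may view, edit and delete it, other members may only view it) and not the finer-grained rights planned for the future.

```python
from typing import FrozenSet, Set


def allowed_actions(user: str, record_owner: str, members: Set[str]) -> FrozenSet[str]:
    """Return what `user` may do with a record shared in a collaboration."""
    if user not in members:
        return frozenset()  # not part of the collaboration at all
    if user == record_owner:
        return frozenset({"view", "edit", "delete"})
    return frozenset({"view"})


# Hypothetical collaboration: "ana" created it, "ben" owns the record.
members = {"ana", "ben", "carla"}
collaboration_owner_rights = allowed_actions("ana", "ben", members)
record_owner_rights = allowed_actions("ben", "ben", members)
```

Note that the function takes no argument for the collaboration owner: owning the collaboration grants no extra rights over records owned by others.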

SEARCHING CAPABILITIES

Users will have 4 search engines they will be able to use in the digital library. They will search in essentially the same way, but they will differ in the scope of records they search. Public members will be able to use a public search engine without having to log in first. This search engine can search through all records declared as public by their respective owners. Users who are registered in the database can use three more search engines: one to search in their personal digital library, one to search in a specific collection and one to search in a specific collaboration. The engine that searches in the personal digital library also searches through all records in collections and all records owned by the searching user and shared in collaborations. This is because all of these records are also in the user’s personal digital library.

The search engines will be as intuitive as we can build them. We think that the most intuitive way for a human being to search for information is to use his own natural language. Databases, however, do not use the same language as humans: the most common way to communicate with a database is SQL. That’s why the search engine we will offer will have a parser that translates human language into SQL. The parser will be able to parse questions that limit the search criteria in space or time, i.e. it will be able to understand queries like "rainforests in Asia" and "major battles between 1941 and 1945". The parser, knowing the structure of the database, will convert the search query into an SQL statement and run this statement on the database.
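To make the idea concrete, here is a deliberately small sketch of such a translation for the two example queries. It is not the project's parser: the table `records` and its columns `id`, `title`, `place` and `year` are hypothetical, and the pattern matching covers only the phrases "in <Place>" and "between <year> and <year>".

```python
import re


def parse_query(text):
    """Translate a short natural-language query into (sql, params)."""
    clauses, params = [], []
    remaining = text

    # Time restriction, e.g. "between 1941 and 1945".
    time_match = re.search(r"\bbetween (\d{4}) and (\d{4})\b",
                           remaining, re.IGNORECASE)
    if time_match:
        clauses.append("year BETWEEN ? AND ?")
        params.extend([int(time_match.group(1)), int(time_match.group(2))])
        remaining = remaining[:time_match.start()] + remaining[time_match.end():]

    # Space restriction, e.g. "in Asia" (a capitalised place name).
    place_match = re.search(r"\bin ([A-Z][\w-]*(?: [A-Z][\w-]*)*)", remaining)
    if place_match:
        clauses.append("place = ?")
        params.append(place_match.group(1))
        remaining = remaining[:place_match.start()] + remaining[place_match.end():]

    # Whatever is left is treated as keywords matched against the title.
    for word in remaining.split():
        clauses.append("title LIKE ?")
        params.append(f"%{word}%")

    sql = "SELECT id, title FROM records"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

The user's words are passed as bound parameters rather than pasted into the SQL string, so the generated statement stays safe to run whatever the user types.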

The search engine parser will also exploit one more way of organizing records in a personal digital library or a collection: classifications. Classifications are regular attributes of different versions of a record. A classification is a combination of 2 fields: name and value. The name specifies what the user is classifying the record by. The value field is the value of that classifier. For example, the user might be collecting information about the culture in Tanzania. He creates a collection and gives it a proper name. After collecting information and converting it to records in the digital library, he realizes that the large number of records makes it very difficult for him to find information. A portion of his records consists of pictures from different places. He therefore decides to create a classification for each of them with a name field of "location" and a value field holding the name of the place where the picture was taken. Next time, he can use the search engine to search using this classification.
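A classification can be pictured as a row of (version, name, value). The sketch below uses an in-memory SQLite database; the table and column names are hypothetical, and the sample rows are made-up data used only to show the lookup, not the project's actual schema or content.

```python
import sqlite3


def build_demo_db():
    """Create a throwaway database with a few classified record versions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE record_versions (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE classifications (
            version_id INTEGER REFERENCES record_versions(id),
            name TEXT,
            value TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO record_versions VALUES (?, ?)",
        [(1, "Market at dawn"), (2, "Harbour boats"), (3, "Crater rim")],
    )
    conn.executemany(
        "INSERT INTO classifications VALUES (?, ?, ?)",
        [(1, "location", "Arusha"),
         (2, "location", "Dar es Salaam"),
         (3, "location", "Ngorongoro")],
    )
    conn.commit()
    return conn


def find_by_classification(conn, name, value):
    """Return titles of record versions classified as name = value."""
    rows = conn.execute(
        "SELECT v.title FROM record_versions v "
        "JOIN classifications c ON c.version_id = v.id "
        "WHERE c.name = ? AND c.value = ? ORDER BY v.id",
        (name, value),
    )
    return [title for (title,) in rows]
```

Calling `find_by_classification(build_demo_db(), "location", "Arusha")` returns only the picture classified with that location.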

ACCESS TO GIS DATA

Another interface users will be able to use is access to GIS data. Many people doing academic research have to deal with geographical data at some point. This interface will use ESRI’s ArcIMS server to display GIS data in web-friendly formats: images and database records. Users will be able to search for information associated with specific geographic locations, and together with the search results they will see a dynamically generated map showing the exact locations associated with those results.

Also, users will be able to click on or drag their mouse cursor over GIS maps, thus selecting a point or a bounding box, which will then be used to search the database for information associated with this point or points inside this bounding box.
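The bounding-box search can be sketched independently of the map server. The example below does not use ArcIMS at all; it assumes, hypothetically, that each record stores a longitude and latitude, and it ignores boxes that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoRecord:
    title: str
    lon: float
    lat: float


def normalize_box(x1, y1, x2, y2):
    """Order the two drag corners as (min_lon, min_lat, max_lon, max_lat),
    so that dragging in any direction yields the same box."""
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def records_in_box(records, box):
    """Return the records whose location lies inside the box (edges included)."""
    min_lon, min_lat, max_lon, max_lat = box
    return [
        r for r in records
        if min_lon <= r.lon <= max_lon and min_lat <= r.lat <= max_lat
    ]
```

A single click can be handled with the same function by passing a small box around the clicked point.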

EDUCATIONAL INTERFACE

The digital library will also offer an educational interface that can be used by all types of users. This interface will allow users to select records from the library (no matter whether these records are owned by the user, shared with a collaboration group or declared as public) and arrange them in a logical format. The library will then generate a web page that the user can use for teaching purposes.

PARTS OF THIS PROJECT THAT WILL BE LEFT FOR THE FUTURE

Some of the features discussed in this document have been implemented and are being tested by different users. However, we knew from the beginning that we would leave some of the features to be implemented in the future. They are as follows:

  • inter-database communication is not implemented. For the moment, our digital library is standalone, i.e. it consists of only one node from the structure we discussed
  • a more detailed user access rights model is yet to be implemented
  • even though we constructed the back-end database with different versions of the same record in mind, the digital library still doesn’t support different versions
  • integrated access to GIS data is not supported yet
  • the search engines provide only basic functionality and have none of the intuitiveness and ease of use we have in mind
  • we have yet to implement the teaching and educational interfaces, as well as those that will allow users to view different documents (such as Microsoft Office documents) without downloading the file
  • there is no functional vetter to control the information submitted by users to the digital library