Scott Nicholson (scott@scottnicholson.com) and R. David Lankes (rdlankes@iis.syr.edu), Information Institute of Syracuse, Syracuse University School of Information Studies, 245 Hinds Hall, Syracuse, NY 13244
One of the valuable offerings of librarians in the digital age is the human intermediation of information needs. In physical libraries, these reference questions are answered and few artifacts remain from the transaction; therefore, the knowledge created through the work of the librarian leaves with the patron. Due to the medium of communication, digital reference transactions capture the knowledge of information professionals. There are hundreds of digital reference services generating knowledge every day; however, the lack of a schema for archiving reference transactions from multiple services makes it difficult to create a fielded, searchable knowledge base. This schema will also allow researchers to develop tools that practitioners can employ; this will create a collaborative environment for digital reference evaluation. The goal of this work is to outline the steps needed to develop this schema, present the results of a survey of digital reference services, explore some of the pitfalls in the process, and envision the future uses of this Digital Reference Electronic Warehouse (DREW).
The future, and some might even say the present, for the library professional is the digital library. Instead of waiting for the user to come to their information containers in a physical collection, librarians select high-quality materials for users to access through the Internet. It is relatively easy to put a collection of static files online; however, the library is more than just a collection of documents. A crucial part of a library is the human intermediary – the librarian. This intermediary connects the users to the information needed, and can assist with advice about using the information retrieval systems and working with information.
However, many users turn to the Web search tools for their information retrieval needs. While these tools provide the user with Web pages that match a word on the topic, the quality of the results is questionable. Most Web search tools are for-profit companies and bombard users with advertising. In addition, search engine optimizers work to place commercial sites at the top of lists; this has resulted in many searches leading to page after page of commercial results. This commercial information is appropriate for some information-seeking needs, and this is an area where the Web search tools excel. However, it can be frustrating to find non-commercial information, and this is an opportunity for libraries.
There clearly is a need for intermediation with the location of material online. Users have turned to question-based search tools such as AskJeeves with the hopes of finding such assistance; however, these tools perform no better than a general search tool. There is another type of Web search tool that can take a user’s question and match it to a set of results that are likely to be on topic with little advertising and no direct charge – a digital reference service. In fact, those teaching about Web search tools should always take the opportunity to present a digital reference service as a Web search tool with built-in intelligence.
Many libraries have started services where they allow users to submit questions via e-mail or Web forms. Librarians will then research the question and provide an answer and related documents to the user. Some libraries offer this service using a live-chat model, where the user is interacting with a librarian with little time elapsing between question and response. These services are usually free, although the user base may be limited to users that are affiliated with the library offering the services. Google has entered this domain with their “Google Answers” service where a user offers a question and sets the payment for a pre-approved Google Answerer to answer the question.
Some digital reference services, commonly known as AskA services, connect the user directly to an expert in the field instead of to a librarian. Services such as Ask Dr. Math (http://mathforum.org/dr.math) and AskNSDL (http://nsdl.org/asknsdl) allow users to ask questions of experts in the topic. This is a different model of the reference process, but the information contained in these transactions is valuable. Lankes 1 presented a model that contrasted these two types of services in his research agenda for digital reference.
There are hundreds of these services around the world providing answers and resources in response to user needs. If collected into a knowledge base, these transactions would be incredibly useful to researchers exploring this process. Information seeking research has been an active line of exploration for decades, and there are many theories developed from small samples that could be explored with this larger dataset. In addition, by examining the common works referred to in different types of questions, automatically generated directories of high-quality material could be created and shared. The goal of the DREW project is to create a large database of reference transactions for researchers to better understand the process and create tools for measurement and evaluation that managers of reference services can employ.
There are several different types of digital multidisciplinary knowledge bases currently available. Precursors to today’s knowledge bases are bibliographic databases such as ArticleFirst and database aggregators like DIALOG. As these tools have grown to include access to full-text resources, they become true multidisciplinary knowledge bases. The difficulty in using these databases comes through the methods of retrieval. Searchers have to match the words used by the author when searching free-text fields such as the title, abstract, and text of the document. Conversely, searchers could attempt to match words selected by indexers such as subject headings. Users can get frustrated with these tools, as they tend to match either too few or too many articles 2.
Another type of multidisciplinary knowledge base available is the World Wide Web. Web search tools provide a portal to this knowledge base. Most current Web search tools allow the user to search large portions of the textual data available on a conveniently-accessed subset of the Web. These search tools cannot access large portions of the World Wide Web known as the Invisible Web 3; in fact, one study claims that the well-known search tools index only about 0.03% of the Web 4.
In addition, as these search tools index the words used on the page, the user has to search on the words used by the authors of the page. Due to the commercial nature of these tools, many Web authors use Search Engine Optimization (SEO) techniques to push their pages to the top of listings 5. If these two issues are combined – search tools only index a small portion of the Web and some companies are changing their pages to aggressively hold the top positions in the rankings of search tools – then it is expected that the typical user who only explores the first page of rankings will become frustrated with repetition of results.
One solution to these problems is human intermediation. Some search tools have integrated human intermediation through directory-based search tools; Yahoo, for example, started as a directory-based search tool. These tools allow a user to discover a small subset of resources that were selected using some type of quality criteria through a hierarchical organization structure. Over time, search tool companies have removed or reduced emphasis on these directory tools, promoting the full-text search tools in their stead.
There are some updated directory-based Web search tools that harness the power of human intermediation. The Open Directory (http://dmoz.org) and About.com (http://about.com) use experts to select Web sites on a topic and provide users with a directory-based access method. For scholarly research, Infomine (http://infomine.ucr.edu) is a high-quality directory out of the
The setting for the current paper is in digital reference, which is human intermediation provided in direct response to a user’s query. Most of the time, the answer to a digital reference question contains text as well as links to Web pages, journal articles, and other high-quality information. Therefore, the answer will connect the same types of resources discussed in the previous few paragraphs. The transaction will also have some metadata, such as subject headings, attached to it by either the user or by a staff member during the digital reference process.
In addition, the resources selected by an expert during the digital reference process will be of high quality. By gathering answers from many different resources, directories of these quality materials can be automatically generated. By appending commonly used query terms into the directory, the directory can be more easily searchable. Therefore, the knowledge base created through the archiving of digital reference transactions will be more easily searchable, contain references to high-quality resources, and provide indirect access to the human intermediation process of librarians and experts from a multitude of backgrounds.
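The automatic directory generation described above can be pictured with a minimal Python sketch. The record layout (a `subject` string and a list of `resources`), the `min_citations` threshold, and the sample records are all hypothetical, chosen only for illustration; the paper does not prescribe a data format or a ranking rule.

```python
from collections import Counter, defaultdict


def build_directory(transactions, min_citations=2):
    """Group cited resources by subject, keeping those cited often enough."""
    counts = defaultdict(Counter)
    for t in transactions:
        # Count each resource once per transaction so that a single
        # long answer cannot dominate the tally.
        for url in set(t["resources"]):
            counts[t["subject"]][url] += 1

    directory = {}
    for subject, counter in counts.items():
        kept = [(url, n) for url, n in counter.most_common() if n >= min_citations]
        if kept:
            directory[subject] = kept
    return directory


# Hypothetical archived transactions (illustrative data only).
transactions = [
    {"subject": "Mathematics", "resources": ["http://mathforum.org/dr.math"]},
    {"subject": "Mathematics", "resources": ["http://mathforum.org/dr.math",
                                             "http://example.org/algebra"]},
    {"subject": "Science", "resources": ["http://www.madsci.org/"]},
]
directory = build_directory(transactions)
```

In this sketch a resource enters the directory only when several independent answers cite it, which is one simple way to treat repeated expert selection as a signal of quality.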
Most reference services maintain some type of archive. That archive may be accessible only to the administrators, it may be a useful archive for those answering questions, or it may be available to users of the system. There are a few existing publicly accessible projects of archiving digital reference queries. A number of projects, such as Ask-A-Scientist (http://www.madsci.org/) and Google Answers (http://answers.google.com/answers), allow anyone to search their internal archive of question/answer pairs. While this is useful, it lacks the richness available if the transactions are collected by multiple services.
One of the largest shared archives of reference transactions is QuestionPoint's KnowledgeBase 6. The purpose of the QuestionPoint KnowledgeBase is to provide reference librarians and their patrons with a repository for hard-to-find answers, answers to frequently asked questions, pathfinders and bibliographies on specific subjects, and the intellectual content resulting from aiding scholars in their research. Use of QuestionPoint's Knowledge Base is limited to those institutions participating in the QuestionPoint service, which allows for collaborative reference work.
This is a notable project because it is a large-scale shared reference depository with over 7,300 edited transactions as of July 2004; in addition, this knowledge base is growing as there are more than 11,000 transactions submitted and awaiting review (P. Rumbaugh, personal communication, July 6, 2004). Transactions are selected in two ways: any question submitted to the global network of reference librarians for an answer is considered, and individual libraries have the ability to select any local transaction and submit it to QuestionPoint for consideration. Once identified, the transactions are cleaned, removing all personal information about both the user and the librarian. The text of the question and answer are cleaned for clarity, free-text keywords are assigned, and classification headings are assigned from the top two levels of the Library of Congress Classification scheme. After ensuring that there are not similar transactions on the topic area, the transaction is placed in the knowledge base. At this time, a "review" date can be set to trigger a manual review of the information in the transaction to ensure it is up to date.
One goal of the DREW project is to maintain a relationship with other major reference archives such as QuestionPoint. Examining these similar projects allows us to determine the needs of DREW and learn from the exploration of others. Due to the time and resources invested by OCLC and the Library of Congress in the development of the QuestionPoint KnowledgeBase, their process and policies can serve as a model to libraries creating a cleaned archive to aid patrons and librarians. DREW, being a project to provide data for researchers about the process, requires a different type of warehouse. The transactions will not be edited for content, although personally identifiable information will be removed. Transactions on the same topic are desired, as that will allow the discovery of trends and changes over time. One of the areas of exploration, to be discussed later, is automation of several of the cleaning processes such as assignment of subject headings.
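As a rough illustration of removing personally identifiable information while leaving the content of a transaction unedited, the following Python sketch masks e-mail addresses, North American style telephone numbers, and any names already held in fielded form. The patterns, the `scrub` function, and the placeholder tokens are assumptions made for this example and are not part of DREW; real transcripts would need considerably more careful handling.

```python
import re

# Simplified patterns (assumptions for illustration, not a complete PII detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")


def scrub(text, known_names=()):
    """Mask e-mail addresses, phone numbers, and known names in free text."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    # Names cannot be found reliably by pattern alone, so only names already
    # captured in fielded form (e.g. a patron name field) are masked here.
    for name in known_names:
        if name:
            text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


sample = "Hi, I'm Pat Doe (pat.doe@example.org, 315-555-0100). What is the tallest mountain?"
cleaned = scrub(sample, known_names=["Pat Doe"])
```

The question text itself is left untouched, in keeping with the goal that transactions are not edited for content.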
Therefore, DREW will complement these archives and knowledge bases focused on aiding librarians and their users directly. In order to do this, one goal of DREW is to create a schema that is compatible with different existing knowledge base projects. The challenge of this project is overcoming the complexity of many different services and user types. The landscape of digital reference is one of many types of services, librarians, and users interacting with a similar base of resources. There will be patterns across services, although teasing them out of the complex data is a challenge. The authors turn to complexity theory as the theoretical support for the success of this project.
To date, knowledge base work in digital reference has been primarily a deductive process. That is, either a service makes every transaction searchable, or it applies an extensive transformation process of question selection, editing, and incorporation into a pre-determined subject hierarchy. These deductive, and largely manual, processes have obvious scale problems. Further, these processes tend to be input-only systems, in that they must be manually weeded of outdated information. Other issues in the deductive construction of knowledge bases are:
Context Dependencies: Information in knowledge bases is very context dependent. It is quite possible that the only application of the information in a digital reference transcript is to that given interchange between librarian and patron.
Metadata Creation: Time, labor and money are involved in creating metadata for transcripts and digital reference interchanges so that they may be later discovered and retrieved by end users. While some of this effort may be part of the reference process itself (for example classifying a question for distribution in QuestionPoint), it may still require effort to confirm and refine this classification data for inclusion in a knowledge base.
Chunking: It is well known that users will ask several questions in both real-time and asynchronous transactions. How those questions and answers are “broken apart” is often dependent on human intervention and a great deal of interpretation.
Fact Shifting and Temporal Dependencies: Answers to reference questions are often time dependent. From the name of the U.S. President to the height of
This is a simple and incomplete list. Issues of quality have not even been mentioned. These facts alone have stymied knowledge base builders; and this in an environment where true scale has not even come into play. How will any team of humans be expected to maintain a collection of questions and answers in an environment of millions of possible records? This is arguably a more difficult problem than maintaining a collection of any other type of documents, for the simple fact that a knowledge base is not conceptualized as a set of documents with provenance and date, but as a collection of the more nebulous “knowledge.”
While the use of full-text approaches such as vector-based information retrieval may mitigate some of these problems, they do not solve core difficulties of fact shifting, nor do they take into account the dynamic nature of the information presented. As the knowledge base grows, the relationships between pieces of information may change as well. This situation is complicated when archives from different services are combined.
The authors argue that attempting to devise, scale, and equip a deductive approach to knowledge bases is ultimately unworkable. The authors further argue it is time to try a radically different, inductive approach. Simply put: let the knowledge base, or more specifically, the agents representing digital reference output, organize themselves.
The inductive approach proposed in this prospective is grounded in Complexity Theory and, more specifically, the concept of Complex Adaptive Systems as conceptualized by
Put simply, complex adaptive systems are grounded in the creation of autonomous agents that self-organize based on relatively simple rules. This organization is emergent, in that it is not the product of some pre-determined course, but a result of the interactions of the agents themselves. The most common analogy is that of flocking birds. Systems to simulate the flocking behavior of birds are effectively replicated by creating independent agents in a virtual space with a set of very simple rules like “you must move forward: get as close as you can to those agents near you; do not hit anything.” Such simulations demonstrate very effectively that such systems produce complex results with swarms of birds on a screen avoiding obstacles…even though they were never programmed to do obstacle avoidance…or swarming.
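The flocking analogy can be made concrete with a small Python sketch. It is a deliberately reduced simulation written for this illustration: agents are plain `(x, y, vx, vy)` tuples, the constants are arbitrary, and "do not hit anything" is approximated by steering away from other agents that come too close, with no obstacles modeled. No rule mentions flocks or groups; any grouping that appears comes from the local rules alone.

```python
import math
import random

SPEED = 1.0             # every agent always moves forward at this speed
NEIGHBOR_RADIUS = 10.0  # how far an agent can "see" (arbitrary choice)
MIN_DISTANCE = 1.0      # closer than this counts as a near collision


def step(agents):
    """Advance all agents one tick; each agent is a tuple (x, y, vx, vy)."""
    updated = []
    for i, (x, y, vx, vy) in enumerate(agents):
        steer_x = steer_y = 0.0
        for j, (ox, oy, _, _) in enumerate(agents):
            if i == j:
                continue
            dx, dy = ox - x, oy - y
            dist = math.hypot(dx, dy)
            if dist < MIN_DISTANCE:
                # Rule: do not hit anything (steer away from very close agents).
                steer_x -= dx
                steer_y -= dy
            elif dist < NEIGHBOR_RADIUS:
                # Rule: get as close as you can to the agents near you.
                steer_x += dx / dist
                steer_y += dy / dist
        vx, vy = vx + 0.05 * steer_x, vy + 0.05 * steer_y
        norm = math.hypot(vx, vy) or 1.0  # guard against division by zero
        # Rule: you must move forward (constant speed, possibly new heading).
        vx, vy = SPEED * vx / norm, SPEED * vy / norm
        updated.append((x + vx, y + vy, vx, vy))
    return updated


rng = random.Random(0)  # fixed seed so the run is repeatable
flock = []
for _ in range(20):
    angle = rng.uniform(0, 2 * math.pi)
    flock.append((rng.uniform(0, 20), rng.uniform(0, 20),
                  math.cos(angle), math.sin(angle)))
for _ in range(50):
    flock = step(flock)
```

Plotting the positions in `flock` after each call to `step` is one way to observe whether groups form; the sketch itself makes no claim about the resulting pattern.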
Models using these principles have also effectively been created to simulate the activities of financial markets, traffic flows and population studies. The point is that complex adaptive systems, consisting of the interactions of autonomous agents, have been effectively used to create systems impossible to create in a deductive manner where thousands of rules and lines of code would have to be used to anticipate every possible contingency. Already artificial intelligence systems have moved away from these so-called frame-based and expert system approaches toward neural nets and inductive simulations.
These systems are also dynamic, in that the agents constantly adapt to a changing environment. They constantly seek an optimal state in changing conditions. So the virtual birds will avoid obstacles in new ways as new obstacles are added. In simulations of biological systems agents will adapt to changes in weather or food supply. It is this dynamism that makes an inductive approach particularly suitable to digital reference knowledge bases.
In order to examine the contents of DREW and develop new, inductive approaches to knowledge base analysis and construction, the research team must first define the autonomous agents in the complex knowledge base environment. These agents, according to Holland 8, must have three mechanisms:
Tags: Mechanisms that agents utilize for aggregation and flows of information
Internal Models: A representation of the environment used by an agent to anticipate and adapt to the environment
Building Blocks: Components of internal models combined to build, test and re‑build internal models.
The “Internal Models” and “Building Blocks” will be the result of future research. Tagging, or the mechanisms used for information flow and identification, however, are central to the present study. These tags can be thought of as fields or metadata elements. By identifying common elements in digital reference transactions (knowledge base agents), these agents can be compared, clustered, and examined. In order to take the first step in building a digital reference knowledge base as a complex adaptive system, the researchers turned to existing standards for representing digital reference transactions.
The National Information Standards Organization has been developing a protocol for the exchange of questions between services, called NetRef 10. While this standard is appropriate for questions while they are being answered, it is not appropriate for the long-term archiving of the exchange. One goal of the DREW project, therefore, is to create a schema for the archiving of digital reference transactions once the question-answering process is complete. It is important that this archival schema be compatible with the NISO standard, and perhaps can eventually become part of that standard. Theoretically, it should be easier for systems implementing the NetRef protocol to work with the DREW archival schema.
As these questions are answered, individual reference services create archives of question/answer pairs. These are the artifacts of human intermediation, and represent valuable information that previously was lost in traditional reference. Sometimes these archives are searchable by the public, and other times they are kept as referral tools for the librarians and experts to use in answering questions. This distributed knowledge base of digital reference archives contains the expertise and knowledge of many minds; however, there is currently no way to merge these separate archives into a single knowledge base. If these reference transactions from different services could be collected, cleaned, and anonymized into a single data warehouse, the amount of expertise available to users and researchers would be staggering. However, the challenges involved in creating this type of warehouse are just as staggering. The goal of this work is to present the preliminary research in determining the fields that could make up an archival schema and to present current and future plans of the DREW project.
The first step in creating a data warehouse is to determine the fields that will be collected. As there are many different digital reference services, any schema for capturing information from these different services will result in compromises. In order to better understand what fields would be appropriate to capture, a survey was taken of representatives of digital reference services.
In order to develop the fields needed for the archiving of digital reference transactions, we start by exploring what is currently captured and then work toward implementation in an iterative manner. The first stage is a survey of digital reference services with the goal of learning:
· what fields are currently collected by services,
· what fields are services not currently collecting, but are willing to collect, and
· what fields services are not willing to collect
in each of four categories – Patron, Question, Answer, and Expert.
First, field lists were created from Janes’s work 11 and a small group of digital reference services and used to develop a survey instrument. This instrument was tested with a set of volunteer librarians from these services; these librarians added additional fields to the instrument. The instrument was then delivered at the 2003 Virtual Reference Desk conference and through a Web-based survey. The online survey was promoted through the DIG_REF listserv as well as through direct contact of services doing digital reference research. If an institution had different types of reference services (such as live chat and Web form-based asynchronous), it was requested that they fill out the instrument twice.
The survey gathered demographic information such as the communication methods used for question acceptance and question resolution, number of questions received per month, platform used, and consortia information. The survey continued with a series of questions about the collection status of the fields listed in Table 1. There were other open-ended questions asked about some of the fields, such as the location of subject lists, other fields collected but not listed in each category, and other comments.
Table 1: Fields in Survey of Digital Reference Services
Patron Information | Expert/Responder Information
Name | Name
E-mail | E-mail
Telephone | Telephone
City | City
State | State
Country | Country
Grade/Education Level | Title
Professional Role | Institution
Member of organization (library, school, etc.) | Qualifications

Question Information | Response Information
Subject (From a List) | Response Text
Subject (Free text supplied by User) | Resources consulted
Text of Question | Date of response
Purpose (e.g. How do you plan to use this information?) | Time of response
Desired form of answer |
Previously consulted sources |
Requested deadline for response |
Date of question |
Time of question |
Routing information (i.e. question referrals) |
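To show how the four survey categories might translate into a fielded record, here is a minimal Python sketch using dataclasses. The class names, attribute names, and the `collected_fields` helper are our own hypothetical renderings of the surveyed fields, not the DREW schema itself, which the paper has yet to define; the expert `email` attribute in particular is an assumption made by analogy with the patron fields. Every field is optional because services differ in what they capture.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Patron:
    name: Optional[str] = None
    email: Optional[str] = None
    telephone: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    education_level: Optional[str] = None
    professional_role: Optional[str] = None
    organization: Optional[str] = None


@dataclass
class Expert:
    name: Optional[str] = None
    email: Optional[str] = None  # assumed, parallel to the patron field
    telephone: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    title: Optional[str] = None
    institution: Optional[str] = None
    qualifications: Optional[str] = None


@dataclass
class Question:
    subject_from_list: Optional[str] = None
    subject_free_text: Optional[str] = None
    text: Optional[str] = None
    purpose: Optional[str] = None
    desired_form: Optional[str] = None
    previously_consulted: List[str] = field(default_factory=list)
    deadline: Optional[str] = None
    date: Optional[str] = None
    time: Optional[str] = None
    routing: List[str] = field(default_factory=list)


@dataclass
class Response:
    text: Optional[str] = None
    resources_consulted: List[str] = field(default_factory=list)
    date: Optional[str] = None
    time: Optional[str] = None


@dataclass
class Transaction:
    patron: Patron = field(default_factory=Patron)
    expert: Expert = field(default_factory=Expert)
    question: Question = field(default_factory=Question)
    response: Response = field(default_factory=Response)


def collected_fields(transaction):
    """Return 'category.field' names for every field that holds a value."""
    names = []
    for category, values in asdict(transaction).items():
        for name, value in values.items():
            if value not in (None, [], ""):
                names.append(f"{category}.{name}")
    return names


# Hypothetical record from a service that captures very little.
t = Transaction(
    question=Question(text="Where can I find census data?", date="2004-07-06"),
    response=Response(text="See the sources listed.",
                      resources_consulted=["http://example.org/census"]),
)
fields = collected_fields(t)
```

A helper such as `collected_fields` mirrors the survey's central question of which fields a given service actually collects.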
There were 53 responses to the survey, which represented 49 different organizations. Respondents who had different reference services (such as chat and e-mail) that kept different archives in the same organization were asked to fill out a survey for each service. There was little duplication by members of the same consortial group in the survey responses.
Of those services that could be affiliated with an institution, slightly more than half (53%) were from academic libraries. The remaining services were fairly evenly split between public (15%), special and other libraries (17%) and AskA services without a specific library affiliation (14%).
About half (47%) of the responses were from chat-based services, 38% were from Web-based asynchronous services, and the remaining 15% used e-mail or another communication platform for reference. Combining the communication type variable with the service affiliation did show some differences, as can be seen in Table 2. For example, chat was more commonly used in academic libraries, while asynchronous Web-based form was the common method in public libraries and independent services. This would prove an interesting finding to explore on a larger basis to see if it is generalizable and to attempt to shed light on the reasons behind the differences.
Table 2: Type of library versus communication method of reference service
Type of library | Chat | Web form | Email / Other
Academic | 54% | 30% | 17%
Public | 29% | 71% | 0%
Special/Other | 50% | 50% | 0%
Independent | 34% | 50% | 17%
Another question was the average number of transactions per month. Upon examination of this field, it was noted that the answers ranged from 10 to 30,000 (for Tutor.com’s Online Classroom). This range of answers is represented in the data in Table 3. In each case, the standard deviation is greater than the mean, which means the data are badly skewed. The median was calculated to give a less biased idea of the central point of the data. The median number of Web-form based questions was 80 per month, and the median number of chat questions was 120 per month. The non-normal nature of this data makes a trustworthy generalization difficult to produce.
Table 3: Mean, standard deviation, and median of reference questions answered each month
Communication method | Mean | Standard Deviation | Median
Chat | 1906 | 6410 | 120
Web form | 164 | 192 | 80
E-mail | 30 | 31 | 18
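The effect of a few very large services on these summary statistics can be reproduced with a short Python sketch. The monthly counts below are invented for illustration and are not the survey responses; they only mimic the reported pattern of many small services and one very large one.

```python
import statistics

# Hypothetical monthly question counts for seven services (not survey data).
monthly_counts = [10, 40, 80, 120, 150, 300, 30000]

mean = statistics.mean(monthly_counts)
stdev = statistics.stdev(monthly_counts)  # sample standard deviation
median = statistics.median(monthly_counts)

# With one extreme value, the standard deviation exceeds the mean and the
# mean sits far above the median, so the median better reflects a typical
# service.
print(f"mean={mean:.0f} stdev={stdev:.0f} median={median}")
```

In this toy sample a single outlier pulls the mean well above what all but one service handles, which is why the median is the more informative figure for skewed data of this kind.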
Another demographic collected was the platform used by the reference service. The results after cleaning the data are in Table 4. The entries for E-mail, Web form + E-mail, and In-house tool may refer to the same type of service – some type of system using existing e-mail and Web servers. If these are combined, then there are three clear popular choices – Question Point, Tutor.com, and some type of in-house use of existing resources.
Table 4: Percentage of respondents using each reference tool
Platform / Software | Percentage of Respondents
(E-mail, Web form, or In-House tool combined) | 27%
Question Point | 23%
Tutor.com | 21%
E-mail | 13%
24/7 | 8%
Web form + E-Mail | 8%
In-house tool | 6%
Altarama RefTracker | 4%
QABuilder 2.0 | 4%
Docutek VRL Plus | 2%
eAssist NetAgent | 2%
ExpertCity's Desktopstreaming | 2%
LivePerson (HumanClick) | 2%
Open Ask A Question | 2%
PHP Live Support | 2%
Much of the upcoming analysis is split on the distinctions of communication form used, as the types of fields collected in chat may be different than the fields collected via a Web form and via e-mail. The eventual goal is to create one schema that will serve all of these communication platforms.
A series of questions on the survey sought information about the communication practices of different service types. For example, all surveyed e-mail and Web form-based services e-mailed a copy of the answer or transaction to the patron; however, only 72% of the chat-based services regularly sent a copy of the transaction to the user.
A similar set of questions explored through which format questions are eventually resolved. These results, in Table 5, show that there is not much crossover between formats. Chat reference is resolved in chat about 80% of the time, and Web form questions are resolved via Web forms or e-mail most of the time. The high percent of other forms of answers that started as chat reference is probably because the synchronous connection has already been made, and it is then convenient to complete the transaction via the phone.
Table 5: Formats of Final Resolution of Reference Transactions
Incoming Question Format | E-mail answer | Web form answer | Chat answer | Other form (telephone, visit)
E-mail | 98% | 0% | 1% | 1%
Web form | 23% | 74% | 0% | 3%
Chat | 7% | 3% | 80% | 10%
In order to understand what information is being collected by services, the analysis is presented in two parts. First, the fields currently collected by services are presented. Following that, the discussion turns to the data that informs the rest of this schema: what fields are services either currently collecting or willing to collect?
Table 6 lists the fields, sorted by category and overall usage, of what services currently collected during the reference process. Looking at the overall results, the most common set of fields currently collected about a reference transaction are: patron e-mail and name; question text, date, and time; and the response text, date, and time. This aggregate set of fields disguises patterns that appear when the results are broken out by communication method used.
Since the two common communication methods are Web form and chat, they will be examined individually. Chat services tend to be more freeform, and therefore may not explicitly collect many fields. Some services ask the user to set up an account before the chat session; this will result in more information about the patron, but not more information about the specific information need behind a reference transaction. Even though chat services tended to collect less information than average, many still collect the patron name and e-mail; question text, date, time, and referral/routing information; and the response text, date, and time. One field of note here is the above-average collection of referral/routing information. Many chat services reported capturing fields like IP address, which was the most common information put into the “Other” open-ended survey questions. In addition, as seen earlier, chat sessions end in a different communication channel 20% of the time; they therefore have a stronger need to capture this type of transferal information.
The group of Web form reference services captured more information on average than other types of services; this is not surprising as the process of asking a question via a Web-based form is more structured than asking the same question via e-mail or chat. The most common fields currently collected via Web form-based asynchronous reference are: patron e-mail, name, country, and state; question text, date, and time; response text, date, time, and responses collected. Since the information is collected in small fielded pieces, it is then easier to keep it in those pieces in a data warehouse. It is because of this that DREW will start by aggregating Web form-based services, and then move to more free-form services as the warehouse develops.
One interesting pattern is the lack of information collected about the person answering the question during the process. There are two types of individuals who answer questions – those who are trained to do research and answer a question from existing resources (such as librarians) and those who are able to answer questions in a specific topic area because they are trained experts in that area. Librarians are trained to provide citation information, and document the authoritativeness of an answer through the support of external works. Experts, on the other hand, provide the authority for their answer based upon their credentials. If services do not keep information about the person who answered the question, then the authority behind an expert-answered question disappears. Because of this, it is important to encourage experts who are answering questions to supply references to works that would contain the answer to the question, even when they know the answer without looking anything up. As these experts may not have been trained as librarians, the administrator of the system needs to ensure that training is available in the basics of creating a response that will have supported authority with no identity of the answerer.
Table 6: Percentage of services currently collecting specified fields
|
|
Overall |
Web form |