Nicholson, S.
(2003). Bibliomining for automated collection development in a digital library
setting: Using data mining to discover web-based scholarly research works. Journal of the American
Society for Information Science and Technology 54(12). 1081-1090.
Bibliomining for Automated Collection Development in a
Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly
Research Works
Scott Nicholson
4-127 Center for
Science and Technology
Phone: 315-443-1640
Fax: 315-443-5806
http://www.bibliomining.org
scott@scottnicholson.com
This is a preprint of an article accepted
for publication in Journal of the American Society for Information Science and
Technology ©2003 John Wiley & Sons.
0. ABSTRACT
This research creates an intelligent agent for automated
collection development in a digital library setting. It uses a predictive model based on facets of
each Web page to select scholarly works.
The criteria came from the academic library selection literature, and a
The resulting models could be used in the selection process
to automatically create a digital library of Web-based scholarly research
works. In addition, the technique can be extended to create a digital library
of any type of structured electronic information.
Keywords
Digital Libraries, Collection Development, World Wide Web,
Search Engines, Bibliomining, Data Mining, Intelligent Agents
Web sites contain information that ranges from the highly
significant through to the trivial and obscene, and because there are no
quality controls or any guide to quality, it is difficult for searchers to take
information retrieved from the Internet at face value. The Internet will not become a serious tool
for professional searchers until the quality issues are resolved.
The Quality of Electronic Information Products and Services, IMO
One purpose of the academic library is to provide access to
scholarly research. Librarians select
material appropriate for academia by applying a set of explicit and tacit
selection criteria. This manual task has
been manageable for the world of print. However, in order to aid selectors with
the rapid proliferation and frequent updating of Web documents, an automated
solution must be found to help searchers find scholarly research works
published on the Web. Bibliomining,
a.k.a. data mining for libraries, provides a set of tools that can be used to
discover patterns in large amounts of raw data, and can provide the patterns
needed to create a model for an automated collection development aid (Nicholson
and Stanton, in press; Nicholson, 2002).
One of the difficulties in creating this solution is
determining the criteria and specifications for the underlying decision-making
model. A librarian makes this decision
by examining facets of the document and determining from those facets if the
work is a research work. The librarian
is able to do this because he/she has seen many examples of research works and
papers that are not research works, and recognizes patterns of facets that
appear in research works.
Therefore, to create this model, many samples of Web-based
scholarly research papers are collected along with samples of other Web-based
material. For each sample, a program in Perl (a pattern-matching computer
language) analyzes the page and determines the value for each criterion. Different bibliomining techniques are then
applied to the data in order to determine the best set of criteria to discriminate
between scholarly research and other works.
The best model produced by each technique is tested with a different set
of Web pages. The models are then judged
using measures called accuracy and return, which are based on the traditional
evaluation techniques of precision and recall.
Finally, the performance of each model is examined with a set of pages
that are difficult to classify.
Researchers need a digital library consisting of
Web-based scholarly works due to the rapidly growing amount of academic
research published on the Web. The
general search tools overwhelm the researcher with non-scholarly documents, and
the subject-specific academic search tools may not meet the needs of those in
other disciplines. An automated
collection development agent is one way to quickly discover online academic
research works.
In order to create a tool for identifying Web-based scholarly
research, a decision-making model for selecting scholarly research must first
be designed. Therefore, the goal of the
present study is to develop a decision-making model that can be used by a Web
search tool to automatically select Web pages that contain scholarly research
works, regardless of discipline. This
tool could then be used as a filter for the pages collected by a traditional
Web page spider, which could aid in the collection development task for a
scholarly digital library.
To specify the types of resources that this predictive model
will identify, the term “scholarly research works” must be defined. For this study, scholarly research is limited
to research written by students or faculty of an academic institution, works
produced by a non-profit research institution, or works published in a scholarly peer-reviewed journal. Research, as defined by
The models are judged using measures named accuracy and
return; these are based on the traditional IR measures of precision and
recall. Accuracy (precision) and return (recall) are both used in their classical
information retrieval sense, as first defined by Cleverdon
(1962). Accuracy is measured by dividing the number of pages that are correctly
identified as scholarly research by the total number of pages identified as
scholarly research by the model. Return is determined by dividing the number of pages correctly identified as scholarly research by the
total number of pages in the test set that are scholarly research. When
applied to the Web as a whole, return cannot be easily defined. However, a higher return in the test
environment may indicate which tool will be able to discover more scholarly
research published on the Web.
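The two measures can be illustrated with a short calculation. The following Python sketch is not part of the original study (which used Perl, SAS, and Clementine); the function name and argument names are illustrative assumptions. The counts in the example are the logistic regression test results reported later in this paper: 463 of 500 scholarly pages and 473 of 500 other pages classified correctly.

```python
def accuracy_and_return(true_positives, false_positives, false_negatives):
    """Compute accuracy (precision) and return (recall) as fractions.

    true_positives:  scholarly pages the model identified as scholarly
    false_positives: other pages the model identified as scholarly
    false_negatives: scholarly pages the model failed to identify
    """
    identified_as_scholarly = true_positives + false_positives
    actually_scholarly = true_positives + false_negatives
    accuracy = (true_positives / identified_as_scholarly
                if identified_as_scholarly else 0.0)
    return_ = (true_positives / actually_scholarly
               if actually_scholarly else 0.0)
    return accuracy, return_


# Test set of 500 scholarly and 500 other pages:
# 463 scholarly pages found, 37 missed, and 500 - 473 = 27 other pages
# wrongly identified as scholarly.
accuracy, return_ = accuracy_and_return(463, 27, 37)
print(f"accuracy = {accuracy:.1%}, return = {return_:.1%}")
```

With these counts, accuracy is 463/490 and return is 463/500, which match the 94.5% and 92.6% reported for that model.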
Problematic
pages are Web pages that might appear to this agent to be scholarly research
works (as defined above in 1.2.1), but are not.
Categories of problematic pages are author biographies, syllabi, vitae,
abstracts, corporate research, research that is in languages other than
English, and pages containing only part of a research work. Future researchers will want to incorporate
some of these categories into digital library tools and this level of failure
analysis will assist those researchers in adjusting the models presented in
this research.
First, a set of criteria used in academic libraries for print
selection is collected from the literature, and a
This data collection tool is used to gather information on
5,000 pages with scholarly research works and 5,000 pages without these
works. This data set is split, with the
majority of the pages used to train the models and the rest used to test the
models. The training set is used to create different models using logistic
regression, memory-based reasoning (through non-parametric n-nearest neighbor discriminant analysis), decision trees, and neural
networks.
Another set of data is used to tweak the models and make them
less dependent on the training set. Each
model is then applied to the testing set.
Accuracy and return are determined for each model, and the best models
are identified.
This section explores closely related literature and the
placement of this research in the areas of the selection of quality materials,
data mining and similar projects.
Should the librarian be a filter for quality? S.D. Neill argues for it in his 1989
piece. He suggests librarians, along
with other information professionals, become information analysts. In this article, he suggests that these
information analysts sift through scientific articles and remove those that are
not internally valid. By looking for
those pieces that are “poorly executed, deliberately (or accidentally) cooked,
fudged, or falsified” (Neill, 1989, pg. 6), information
analysts can help in filtering for quality of print information.
Piontek and Garlock also discuss the role of librarians in selecting
Web resources. They argue that collection
development librarians are ideal in this role because of “their experience in
the areas of collection, organization, evaluation, and presentation” (1996, pg.
20). Academic librarians have been accepted as quality filters for decades. Therefore, the literature from library and
information science will be examined for appropriate examples from print
selection and Internet resource selection of criteria for quality.
The basic tenet in selection of materials for a library is to
follow the library’s policy, which in an academic library is based upon
supporting the school’s curriculum (Evans, 2000). Because of this, there are not many published
sets of generalized selection criteria for academic libraries.
One of the most well-known researchers in this area is S. R. Ranganathan. His
five laws of librarianship (as cited in Evans, 2000) are a classical base for
many library studies. There are two
points he makes in this work that may be applicable here. First, if something is already known about an
author and the author is writing in the same area, then the same selection
decision can be made with some confidence.
Second, selection can be made based upon the past selection of works from
the same publishing house. The name
behind the book may imply quality or a lack thereof, and this can make it
easier to make a selection decision.
Library Acquisition Policies and Procedures (Futas, 1995) is a collection of selection policies from
across the country. By examining these
policies from academic institutions, one can find the following criteria for
quality works that might be applicable in the Web environment:
· Authenticity
· Scope and depth of coverage
· Currency of date
· Indexed in standard sources
· Favorable reviews
· Reference materials like encyclopedias, handbooks, dictionaries, statistical compendia, standards, style manuals, and bibliographies.
Before the Internet was a popular medium for information,
libraries were faced with electronic database selection. In 1989, a wish list was created for database
quality by the Southern California Online Users Group (Basch,
1990). This list had 10 items, some of
which were coverage, scope, accuracy, integration, documentation, and
value-to-cost ratio.
This same users group discussed quality on the Internet in
1995 (as cited in Hofman and Worsfold,
1999). They noted that Internet
resources were different from the databases because those creating the
databases were doing so to create a product that would produce direct fiscal
gain, while those creating Internet resources, in general, were not looking for
this same gain. Because of this fact,
they felt that many Internet resource providers did not have the impetus to
strive for a higher-quality product.
The library community has produced some articles on selecting
Internet resources. Only those criteria
dealing with quality that could be automatically judged will be discussed from
these studies. The first such published
piece, by
A year later, a more formal list of guidelines for selecting
Internet resources was published. Created by Pratt, Flannery, and Perkins
(1996), this remains one of the most thorough lists of criteria to be
published. Some of the criteria they
suggest that relate to this problem are:
· Produced by a national or international organization, academic institution, or commercial organization with an established reputation in a topical area
· Indexed or archived electronically when appropriate
· Document is reproduced in other formats, but Internet version is most current
· Available on-line when needed
· Does not require a change in existing hardware or software
Another article from 1996 by the creators of the Infofilter project looked at criteria based on content,
authority, currency, organization, the existence of a search engine on the
site, and accessibility. However, their
judging mechanisms for these criteria were based upon subjective human
judgments for the most part. Exceptions
were learning the institutional affiliation of the author, pointers to new
content, and response time for the site.
One new criterion is introduced in a 1998 article about
selecting Web-based resources for a science and technology library collection:
the stability of the Web server where the document lives. While this does not necessarily represent the
quality of the information on the page, it does affect the overall quality of
the site. Sites for individuals may not
be as acceptable as sites for institutions or companies (McGeachin,
1998).
Three Web sites provide additional appropriate criteria in selecting
quality Internet resources. The first is
a list of criteria by Alastair Smith in the Victoria
University of Wellington LIS program in
The second site adopts criteria for selecting reference
materials presented in Bopp and Smith’s 1991
reference services textbook. Many of the
criteria presented have already been discussed in this review, but one new
quality-related idea was presented.
Discriminating the work of faculty or professionals from the work of
students or hobbyists may aid in selecting works that are more accurate and
reliable. While this is not always the
case, an expert will usually write a better work than a novice (Hinchliffe, 1997).
The final site, that of the DESIRE project, is the most
comprehensive piece listed here. The
authors (Hofman and Worsfold,
1999) looked at seventeen online sources and five print sources to generate an
extensive list of selection criteria to help librarians create pages of links
to Internet sites. However, many of the
criteria have either already been discussed here or require a human for
subjective judging.
There were only a few new criteria appropriate to the
research at hand. In looking at the scope of the page, these authors suggest
looking for the absence of advertising to help determine the quality of the page. Metadata might also provide a clue to the
type of the material on the page. In
looking at the content of the page, references, a bibliography, or an abstract
may indicate a scholarly work. Pages that are merely advertising will
probably not be useful to the academic researcher. A page that is inward
focused will have more links to pages on its own site than links to other
sites, and may be of higher quality. In addition, clear headings can indicate
a site that is well organized and of higher quality. The authors also suggest looking at factors
in the medium used for the information and the system on which the site is
located. One new criterion in this area
is the durability of the resource; sites that immediately direct the user to
another URL may not be as durable as sites with a more “permanent” home.
Once the criteria have been operationalized
and collected with the Perl program for a large
sample of pages that are linked to academic library Web sites and for another
sample of sites that are not scholarly, patterns must be found to help classify
a page as scholarly. Data mining will be
useful for this, as it is defined as “the basic process employed to analyze
patterns in data and extract information” (Trybula, 1997, pg.
199). Data mining is actually the core
of a larger process, known as knowledge discovery in databases (KDD). KDD is the process of taking low-level data
and turning it into another form that is more useful, such as a summarization
or a model (Fayyad, Piatetsky-Shapiro, and Smyth,
1996).
There are a large number of tools available to the data
miner, and the tools used must match the task. In the current task, the goal is
to look at a database of classified documents, and decide if a new document
belongs in an academic library.
Therefore, this is a classification problem. According to the
In order to use standard statistics, a technique would be
needed that can handle both continuous and categorical variables and will
create a model that will allow the classification of a new observation. According to Sharma (1996), logistic
regression would be the technique to use.
In this, the best combination of variables is discovered that maximizes
the correct predictions for the current set and is used to predict membership
of the new observation. This methodology
looks for the best combination of variables to produce a prediction. For this project, however, there will be
different types of Web pages that are deemed appropriate, and thus it may prove
difficult to converge on a single solution using logistic regression.
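As a rough illustration of how a fitted logistic regression model classifies a new observation, the following Python sketch combines facet values into a single probability. The coefficient values, the intercept, the facet names, and the 0.5 cutoff are all hypothetical and chosen only for the example; they are not the values estimated in this study.

```python
import math


def scholarly_probability(facets, coefficients, intercept):
    """Combine facet values into a probability with the logistic function."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in facets.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for one categorical-style facet pair and one
# continuous facet; a real model would estimate these from training data.
coefficients = {
    "has_bibliography": 2.5,
    "has_banner_ad": -3.0,
    "avg_sentence_length": 0.05,
}
intercept = -1.5

paper_like = {"has_bibliography": 1, "has_banner_ad": 0,
              "avg_sentence_length": 22.0}
ad_like = {"has_bibliography": 0, "has_banner_ad": 1,
           "avg_sentence_length": 10.0}

for page in (paper_like, ad_like):
    p = scholarly_probability(page, coefficients, intercept)
    print("scholarly" if p >= 0.5 else "other", round(p, 3))
```

Because a single weighted sum is used for every page, this kind of model has one decision surface, which is the limitation noted above when several distinct types of page are all acceptable.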
Memory-based reasoning (MBR) uses a memory of past situations
directly to classify a new observation.
N-neighbor non-parametric discriminant
analysis is one statistical technique used for MBR. This concept was discussed
in 1988 by Stanfill and Waltz in The Memory Based
Reasoning Paradigm at a DARPA workshop. In MBR, some type of distance function
is applied to judge the distance between a new observation and each existing
observation, with optional variable weighting. The program then looks at a
number of the preclassified neighbors closest to the
new observation and makes a decision (Berry and Linoff,
1997).
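The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS discriminant analysis procedure used in this study: the weighted Euclidean distance, the majority vote, the choice of three neighbors, and the two-facet example data are all assumptions made for the example.

```python
import math
from collections import Counter


def classify_by_neighbors(new_observation, memory, k=3, weights=None):
    """Classify a new observation from its k closest preclassified neighbors.

    memory is a list of (facet_vector, label) pairs.
    weights optionally scales the contribution of each variable.
    """
    if weights is None:
        weights = [1.0] * len(new_observation)

    def distance(a, b):
        # Weighted Euclidean distance between two facet vectors.
        return math.sqrt(sum(w * (x - y) ** 2
                             for w, x, y in zip(weights, a, b)))

    nearest = sorted(memory,
                     key=lambda item: distance(new_observation, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-facet observations, both scaled to the range 0-1.
memory = [
    ((0.9, 0.8), "scholarly"), ((0.8, 0.7), "scholarly"),
    ((0.7, 0.9), "scholarly"), ((0.1, 0.2), "other"),
    ((0.2, 0.1), "other"), ((0.3, 0.2), "other"),
]
print(classify_by_neighbors((0.85, 0.8), memory))
print(classify_by_neighbors((0.15, 0.15), memory))
```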
Decision/Classification trees use a large group of examples
to create rules for making decisions. They
do this in a manner similar to discriminant
analysis: the algorithm looks for the variable that is the best discriminator of the group,
and splits the group on that variable.
It then looks at each subgroup for the best discriminator and splits the
group again. This continues until a set
of classification rules is generated. New
observations are then easily classified with the rule structure (Johnston and Weckert, 1990).
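The repeated splitting can be sketched as follows in Python. This is an illustrative sketch, not the Clementine algorithm used in this study: measuring the "best discriminator" by Gini impurity, restricting facets to true/false values, and the facet names in the example are all assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group: 0.0 when every label is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the facet whose yes/no split leaves the purest subgroups."""
    best_name, best_score = None, float("inf")
    for name in rows[0]:
        yes = [l for r, l in zip(rows, labels) if r[name]]
        no = [l for r, l in zip(rows, labels) if not r[name]]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best_name, best_score = name, score
    return best_name


def build_tree(rows, labels, max_depth=3):
    """Split recursively until a subgroup is pure or the depth limit is hit."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or max_depth == 0:
        return majority
    name = best_split(rows, labels)
    yes = [(r, l) for r, l in zip(rows, labels) if r[name]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[name]]
    if not yes or not no:
        return majority
    return {
        "facet": name,
        "yes": build_tree([r for r, _ in yes], [l for _, l in yes], max_depth - 1),
        "no": build_tree([r for r, _ in no], [l for _, l in no], max_depth - 1),
    }


def classify(tree, row):
    """Follow the rule structure down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["facet"]] else tree["no"]
    return tree


# Hypothetical training examples with two true/false facets.
rows = [
    {"has_bibliography": True, "has_banner_ad": False},
    {"has_bibliography": True, "has_banner_ad": True},
    {"has_bibliography": False, "has_banner_ad": True},
    {"has_bibliography": False, "has_banner_ad": False},
]
labels = ["scholarly", "scholarly", "other", "other"]
tree = build_tree(rows, labels)
print(tree)
```

In this toy data the bibliography facet separates the two groups perfectly, so the tree consists of a single rule.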
Neural networks are based on the workings of neurons in the
brain, where a neuron takes in input from various sources, processes it, and
passes it on to one or more other neurons.
The network accepts 0-1 measurements of each variable. It then creates a hidden layer of neurons,
which weights and combines the variables in various ways. Each hidden neuron then
feeds into an output neuron, and the weights and combinations of the neurons are
adjusted with each observation in the training set through back-propagation
until an optimal combination of weights is found (Hinton, 1992).
Neural networks are very versatile, as they do not look for
one optimal combination of variables; instead, several different combinations
of variables can produce the same result.
They can be used in very complicated domains where rules are not easily
discovered. Because of its ability to handle complicated problems, a neural
network may be the best choice for this problem (Berry and Linoff,
1997).
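To make the flow of values through such a network concrete, the following Python sketch shows only the forward pass of a network with one hidden layer. The weights are hypothetical and set by hand; the back-propagation step that would learn them from the training set is not shown, and this is not the Clementine implementation used in this study.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0-1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Pass 0-1 facet measurements through one hidden layer to one output."""
    hidden = [
        sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical weights: two inputs, two hidden neurons, one output neuron.
hidden_weights = [[4.0, -4.0], [-4.0, 4.0]]
hidden_biases = [0.0, 0.0]
output_weights = [4.0, -4.0]
output_bias = 0.0

# Inputs are (has bibliography, has banner ad), each measured as 0 or 1.
paper_like = forward([1.0, 0.0], hidden_weights, hidden_biases,
                     output_weights, output_bias)
ad_like = forward([0.0, 1.0], hidden_weights, hidden_biases,
                  output_weights, output_bias)
print(round(paper_like, 3), round(ad_like, 3))
```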
Several researchers have discussed the appropriateness of
using data mining techniques in libraries.
May Chau presents
several possible theoretical links between academic librarianship and data
mining. She explores Web mining (data
mining on the World Wide Web) as a tool to help the user find information. Not only can Web mining be used to create
better search tools, but also it can be used to track the searching behavior of
users. By tracking this information,
librarians could create better Web sites and reference tools (1999).
In addition, Kyle Banerjee explores
ways that data mining can help the library.
In discussing possible applications, he says “full-text, dynamically
changing databases tend to be better suited to data mining technologies” (1998,
pg. 31). As the Web is a full-text,
dynamically changing database, it is indeed appropriate to use these
technologies to analyze it.
A new term to describe the data mining process in libraries is Bibliomining (Nicholson and Stanton, In press). Bibliomining is defined as “the combination of data mining, bibliometrics, statistics, and reporting tools used to extract patterns of behavior-based artifacts from library systems” (Nicholson, 2002). Instead of behavior-based artifacts, however, this project is using bibliomining to discover patterns in artifacts contained in and associated with Web pages. The techniques to discover novel and actionable patterns still apply.
There are many manually-collected digital libraries of
scholarly research works; two of the largest are Infomine (http://infomine.ucr.edu)
in the
There are currently several projects that automatically
gather scholarly Web pages. Lawrence, Giles, and Bollacker
have created CiteSeer (now called ResearchIndex),
which is based around citations and link analysis. In order to verify that the
page is a research article, the tool looks to see if there is a works cited
section (Lawrence, Giles, and Bollacker, 1999). Another project to identify scholarly research
works is CORA. This tool selects
scholarly Web pages in the computer science domain by visiting computer science
department Web sites and examining all of the Postscript and PDF documents,
keeping those which have sections commonly found in a research paper (McCallum, Nigam, Rennie, and Seymore, 1999). Both ResearchIndex
and CORA might benefit from an expansion of their inclusion criteria using the
models presented in this paper.
In addition, Yulan and Cheung
(2000) created PubSearch. This tool creates customized searches for a
user by taking a selection of articles and searching for related articles
through citation and author analysis. This tool, therefore, is useful for users
who have already done research in an area and would like to discover similar
research. This research could provide a filter for PubSearch
to use in order to go beyond the user’s specified Web sites.
A list of
criteria used to select academic research was gathered from a literature review
of criteria used in selecting print and electronic documents for academic
libraries (Nicholson, 2000). This list
was presented to a panel of 42 librarians.
The criteria were ranked and the librarians were allowed to suggest new
criteria. The list was then changed to
remove low-ranking criteria and add new suggested criteria. This process was repeated until consensus was
reached. A summary of the final list of
criteria follows.
Author Criteria
Author has written before
Experience of the author
Authenticity of author
Content Criteria
Work is supported by other literature
Scope and depth of coverage
Work is a reference work
Page is only an advertisement
Pages are inward focused
Writing level of the page
Existence of advertising on the site
Original material, not links or abstracts
Organizational Criteria
Appropriate indexing and description
There is an abstract for the work
Pages are well-organized
Currency of date / Systematically updated
Producer/Medium Criteria
Document is reproduced in other forms
Available on-line when needed
Does not require new hardware or software
Past success/failure of the publishing house
Produced by a reputable provider
Unbiased material
Stability of the Web server
Response time for the site
Site is durable
External Criteria
Indexed in standard sources
Favorable reviews
Linked to by other sites
A Perl program was then created that would retrieve a Web
page and analyze it in regard to each criterion. The part of the program to analyze each
criterion was developed and tested before being integrated into the entire
program. Once the program was complete,
it was tested on other pages to ensure that the program was working correctly.
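The kind of per-criterion analysis described here can be sketched briefly. The example below is written in Python rather than Perl and is not the program used in this study; it measures a handful of the facets named in this paper (academic URL, reference to "Table 1" or "Figure 1", the phrase "presented at", traditional headings, sentence counts and lengths). The function name, the crude tag stripping, the sentence-splitting rule, and the short heading list are all assumptions made for the illustration.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of traditional headings; the full list is an assumption.
HEADINGS = ("abstract", "findings", "discussion")


def extract_facets(url, html):
    """Measure a few page facets from a URL and its HTML source."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude removal of HTML tags
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    host = urlparse(url).hostname or ""
    return {
        "academic_url": host.endswith(".edu"),
        "has_table_or_figure_1": bool(
            re.search(r"\b(?:Table|Figure)\s+1\b", text)),
        "has_presented_at": "presented at" in text.lower(),
        # Counts word occurrences as a rough proxy for headings.
        "heading_count": sum(
            len(re.findall(r"\b%s\b" % h, text, re.IGNORECASE))
            for h in HEADINGS),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }


# Hypothetical page used only to exercise the function.
sample_html = ("<html><body><h1>Abstract</h1><p>This paper was presented at "
               "a workshop. See Table 1 for results.</p></body></html>")
facets = extract_facets("http://www.example.edu/paper.html", sample_html)
print(facets)
```

Keeping each facet as a separate entry mirrors the approach described above, in which the analysis for each criterion was developed and tested on its own before being combined.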
In order to
collect pages containing scholarly research works, several techniques were
employed. Requests were posted to scholarly discussion lists, online journals
and conference proceedings were explored, and search tools were
utilized. Web pages were accepted only if they were free
to access; were written by someone in academia or at a non-profit research institution,
or published in a scholarly peer-reviewed journal; were in HTML or text; and
contained the full text of the research report on a single Web page.
As some sites had many
scholarly works, no more than 50 different works were taken from a single site.
After 4,500 documents were collected for the model creation sets, another 500
were collected for the test set. Care
was taken to ensure that none of the documents in the test set came from the
same Web site as any other document in the model or test set.
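The two sampling constraints above (at most 50 works per site, and a test set whose sites appear nowhere else) can be expressed as simple checks. The Python sketch below is illustrative only; treating the URL host name as the "site", the function names, and the example URLs are assumptions rather than the procedure actually used.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Treat the host name as the Web site (an assumption)."""
    return urlparse(url).hostname


def cap_per_site(urls, limit=50):
    """Keep at most `limit` URLs from any single site, in original order."""
    taken = Counter()
    kept = []
    for url in urls:
        site = site_of(url)
        if taken[site] < limit:
            taken[site] += 1
            kept.append(url)
    return kept


def is_site_disjoint(model_urls, test_urls):
    """True if no test site repeats within the test set or the model set."""
    model_sites = {site_of(u) for u in model_urls}
    test_sites = [site_of(u) for u in test_urls]
    no_repeats = len(test_sites) == len(set(test_sites))
    no_overlap = model_sites.isdisjoint(test_sites)
    return no_repeats and no_overlap


# Hypothetical URLs: 60 works on one site and 1 on another.
candidates = ["http://a.example.edu/p%d.html" % i for i in range(60)]
candidates.append("http://b.example.org/x.html")
print(len(cap_per_site(candidates)))
```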
In order to create models that can discriminate between pages
with scholarly works and those without, a set of pages not containing scholarly
works was gathered. Since this agent was
designed to work with the pages captured by a typical Web spider, the
non-scholarly pages for model-building were taken from the Web search tools.
The first step in selecting random pages was to use Unfiltered MetaSpy (http://www.metaspy.com). MetaSpy presents
the text of the last 12 searches done in MetaCrawler. These queries were extracted from the MetaSpy page and duplicates were removed.
These queries were then put into several major search
tools. The first ten URLs were extracted
from the resulting page and one was selected at random and verified to make
sure the page was functioning through a Perl
program. Each page was then manually
checked to ensure that it did not contain scholarly research. The next query from Search Voyeur was then
used to perform another search. This
process continued until 4,500 URLs were gathered for the model building sets.
The same technique was used for the test set with another search tool providing
the pages.
Each of the 10,000 URLs was then given to the Perl program to process.
For each page, the HTML was collected and analyzed, and the URL was
submitted to four different Web search tools and analysis tools in order to
collect values for some of the criteria.
After this, the datasets were cleaned by manually examining them for
missing data, indicators the page was down, or other problems.
After the data were cleaned, the datasets were prepared for
model development and testing. One set
of 8,500 document surrogates was created for model creation, and a second set
of 500 document surrogates was created for tweaking the models. The third dataset consisted of the 1,000
documents selected for testing. Each of
these sets had equal numbers of documents with and without scholarly research
works. Finally, the dataset of
surrogates for the problematic pages was prepared.
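The division of the 9,000 model-building surrogates into a model creation set and a tweaking set, each balanced between the two classes, can be sketched as follows. The Python below is illustrative: this paper does not state how the 500 tweaking surrogates were chosen, so the random selection, the fixed seed, and the function name are assumptions.

```python
import random


def split_model_and_tweak(scholarly, other, tweak_per_class=250, seed=0):
    """Split two equally sized classes into balanced model and tweak sets."""
    rng = random.Random(seed)
    scholarly, other = list(scholarly), list(other)
    rng.shuffle(scholarly)
    rng.shuffle(other)
    tweak = scholarly[:tweak_per_class] + other[:tweak_per_class]
    model = scholarly[tweak_per_class:] + other[tweak_per_class:]
    return model, tweak


# Placeholder surrogates: 4,500 with scholarly works and 4,500 without.
scholarly = [("scholarly", i) for i in range(4500)]
other = [("other", i) for i in range(4500)]
model_set, tweak_set = split_model_and_tweak(scholarly, other)
print(len(model_set), len(tweak_set))
```

The separately collected 1,000 test surrogates are not touched by this split, which keeps them independent of both model creation and tweaking.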
Four models were then created and tested using different data
mining techniques. In SAS 6.12, logistic
regression and n-nearest neighbor nonparametric discriminant
analysis were used to create models.
Clementine 5.0 was used to create a classification tree and a neural
network for prediction. Each model was
created with the large dataset and tested against the tweaking dataset. If
there were settings available, these were adjusted until the model produced the
best results with the tweaking dataset.
Once settings were finalized, the testing dataset was run through the
model. The actual group membership was compared to the predicted group
membership in order to determine accuracy and return for each model.
Stepwise logistic regression selects a subset of the
variables to create a useful, yet parsimonious, model. In this case, SAS selected 21 criteria for
inclusion in the model.
The R² for this regression was .6973. On the model-building dataset, the model was
99.3% accurate. All of the criteria were used
to start the stepwise regression, and the ones that remained in this model were:
· Clearly stated authorship at the top of the page
· Number of age warnings and adult-content keywords
· Statement of funding or support at the bottom of page
· Number of times a traditional heading appeared on the page (such as Abstract, Findings, Discussion, etc.)
· Presence of labeled bibliography
· Presence of a banner ad from one of the top banner ad companies
· Existence of reference to “Table 1” or “Figure 1”
· Existence of phrase “presented at”
· Academic URL
· Organizational URL
· Existence of a link in Yahoo!
· Number of full citations to other works
· Existence of meta tags
· Number of words in the meta keyword and dc.subject meta tags
· Average sentence length
· Average word length
· Total number of sentences in document
· Average number of sentences per paragraph
· Ratio of total size of images on page to total size of page
· Number of misspelled words according to Dr. HTML
· Average length of misspelled words.
The model created by logistic regression correctly classified 463 scholarly works and 473 randomly chosen pages. Therefore, it has an accuracy of 94.5% and a return of 92.6%. It had problems with non-scholarly pages that were in the .edu domain, that contained a large amount of text, or that contained