Information Retrieval
and Organisation
Dell Zhang
Birkbeck, University of London
IR Chapter 01
Boolean Retrieval
Limits of Scanning
I For many purposes, you need more:
I Process large collections containing billions or
trillions of words quickly
I Allow for more flexible matching operations, e.g.
Romans NEAR countrymen
I Rank answers according to importance (when a
large number of documents is returned)
I Let’s look at the performance problem first:
I Solution: do preprocessing, i.e. index the documents in advance
Term-Document Incidence Matrix
           Anthony    Julius  The      Hamlet  Othello  Macbeth  ...
           and        Caesar  Tempest
           Cleopatra
Anthony    1          1       0        0       0        1
Brutus     1          1       0        1       0        0
Caesar     1          1       0        1       1        1
Calpurnia  0          1       0        0       0        0
Cleopatra  1          0       0        0       0        0
mercy      1          0       1        1       1        1
worser     1          0       1        1       1        0
...
I Entry is 1 if term occurs.
I Example: Calpurnia occurs in Julius Caesar.
I Entry is 0 if term doesn’t occur.
I Example: Calpurnia does not occur in The
Tempest.
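A minimal sketch of how the matrix above could answer a Boolean query: each row is treated as a bit vector, and the rows are combined with bitwise operations. The query Brutus AND Caesar AND NOT Calpurnia and the names PLAYS, INCIDENCE, row and plays_for are assumptions made for illustration; they do not appear on the slides.

```python
# Columns of the incidence matrix, in the order shown above.
PLAYS = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One row per term, written as a bit string (leftmost bit = first play).
INCIDENCE = {
    "Anthony":   "110001",
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
    "Cleopatra": "100000",
    "mercy":     "101111",
    "worser":    "101110",
}


def row(term):
    """Return the incidence row of a term as an integer bit vector."""
    return int(INCIDENCE[term], 2)


def plays_for(bits):
    """Translate a bit vector back into the list of matching plays."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits & (1 << (n - 1 - i))]


# NOT must be restricted to the six columns, hence the mask.
mask = (1 << len(PLAYS)) - 1
result = row("Brutus") & row("Caesar") & (~row("Calpurnia") & mask)
matching = plays_for(result)
print(matching)
```

The AND of the three vectors is 100100, which corresponds to the first and fourth columns of the matrix.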
Indexing Large Collections
I Consider N = 10^6 documents, each with about
1000 tokens
I On average 6 bytes per token, including spaces
and punctuation ⇒ the size of document
collection is about 6 GB
I Assume there are M = 500,000 distinct terms
in the collection
Building Incidence Matrix
I M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
I We would use about 60 GB to index 6 GB of text,
which is clearly very inefficient.
I But, wait a minute, the matrix has no more
than one billion 1s (10^6 documents × 1000 tokens each).
I The matrix is extremely sparse, i.e. at least 99.8% of
its entries are 0s.
I What is a better representation?
I We only record the 1s.
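The size estimates above can be checked with a few lines of arithmetic. This is only a sketch: it assumes one bit per matrix entry and 1 GB = 10^9 bytes, and the variable names are chosen here for illustration.

```python
N = 10**6                # number of documents
TOKENS_PER_DOC = 1000    # average tokens per document
BYTES_PER_TOKEN = 6      # average bytes per token, incl. spaces and punctuation
M = 500_000              # number of distinct terms
GB = 10**9               # assumption: decimal gigabytes

collection_bytes = N * TOKENS_PER_DOC * BYTES_PER_TOKEN
matrix_bits = M * N                  # one entry per (term, document) pair
matrix_bytes = matrix_bits // 8      # assumption: one bit per entry

# Each token contributes at most one 1 to the matrix.
max_ones = N * TOKENS_PER_DOC
min_zero_fraction = 1 - max_ones / matrix_bits

print(f"collection: {collection_bytes / GB:.1f} GB")
print(f"matrix:     {matrix_bytes / GB:.1f} GB for {matrix_bits:.0e} entries")
print(f"zeros:      at least {min_zero_fraction:.1%}")
```

The matrix comes out at 62.5 GB, which the slide rounds to about 60 GB.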
Index Construction
I Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar...
I Tokenize the text, turning each document into
a list of tokens:
Friends Romans countrymen So...
I Do linguistic preprocessing, producing a list of
normalized tokens, which are the indexing
terms:
friend roman countryman so...
I Index the documents that each term occurs in
by creating an inverted index, consisting of a
dictionary and postings.
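The four steps above can be sketched as follows. This is a simplified illustration, not the module's implementation: normalization here is only lowercasing (reducing "Friends" to "friend" would need a stemmer or lemmatizer, which is omitted), the second document is made up, and the function names are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a document into tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Simplified linguistic preprocessing: lowercase only."""
    return token.lower()


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    # Dictionary sorted by term; each postings list sorted by document ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    1: "Friends, Romans, countrymen. So let it be with Caesar",
    2: "Romans and countrymen praised Caesar",  # hypothetical document
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, "->", postings)
```

Keeping each postings list sorted by document ID is what later allows two lists to be intersected in a single pass.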
Index Construction
I Later on in this module, we’ll talk about
optimizing inverted indexes:
I Index construction: how can we create inverted
indexes for large collections?
I How much space do we need for dictionary and
index?
I Index compression: how can we efficiently store
and process indexes for large collections?
I Ranked retrieval: what does the inverted index
look like when we want the “best” answer?
Intersecting Postings Lists
Brutus       → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia    → 2 → 31 → 54 → 101
Intersection ⇒ 2 → 31
I Can be done in time linear in the total length of
the two postings lists, provided both are sorted by document ID
Intersecting Postings Lists
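A minimal sketch of the linear-time intersection, assuming both postings lists are Python lists sorted by document ID. The function name intersect is chosen here for illustration. Two pointers walk the lists in parallel, and the pointer at the smaller document ID is the one that advances.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
common = intersect(brutus, calpurnia)
print(common)
```

Each loop iteration advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.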
Query Optimization
I What is the best order for query processing?
I Consider a query that is an AND of n terms,
n > 2
I For each of the terms, get its postings list,
then AND them together
I Example query:
I Brutus AND Calpurnia AND Caesar
Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia → 2 → 31 → 54 → 101
Caesar    → 5 → 31
Query Optimization
I Simple and effective optimization:
I Process the terms in order of increasing document frequency
I Start with the shortest postings list, then keep
cutting further
I In this example, first Caesar, then Calpurnia,
then Brutus
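The ordering heuristic above can be sketched like this, using postings list length as the frequency of a term. The helper names intersect and intersect_many and the dictionary layout are assumptions made for the example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_by_term):
    """AND all terms together, shortest postings list first.

    Returns the processing order and the resulting postings list.
    """
    order = sorted(postings_by_term, key=lambda t: len(postings_by_term[t]))
    if not order:
        return order, []
    result = postings_by_term[order[0]]
    for term in order[1:]:
        if not result:
            break  # intermediate result is already empty
        result = intersect(result, postings_by_term[term])
    return order, result


query = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Calpurnia": [2, 31, 54, 101],
    "Caesar": [5, 31],
}
order, result = intersect_many(query)
print(order, result)
```

Because an intermediate result can never be longer than the shortest list seen so far, starting with the rarest term keeps every later intersection small.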
Commercial Boolean IR: Westlaw
I Largest commercial legal search service in
terms of the number of paying subscribers
(www.westlaw.com)
I Over half a million subscribers performing
millions of searches a day over tens of terabytes
of text data
I The service was started in 1975.
I In 2005, Boolean search (called “Terms and
Connectors” by Westlaw) was still the default,
and used by a large percentage of users...
I ... although ranked retrieval has been available
since 1992.
Westlaw Example Queries
I Information need: Information on the legal
theories involved in preventing the disclosure of
trade secrets by employees formerly employed
by a competing company
I “trade secret” /s disclos! /s prevent /s employe!
I Information need: Requirements for disabled
people to be able to access a workplace
I disab! /p access! /s work-site work-place
(employment /3 place)
I Information need: Cases about a host’s
responsibility for drunk guests
I host! /p (responsib! liab!) /p (intoxicat! drunk!)
/p guest