Boolean Retrieval: Information Retrieval and Organization by Dell Zhang, Lecture notes of Construction

This document explains the process of Boolean retrieval in Information Retrieval and Organization. The author uses the example of finding plays in Shakespeare's Collected Works that contain the words 'Brutus' and 'Caesar' but not 'Calpurnia' to explain the concept. The document also covers the limits of scanning, the term-document incidence matrix, the inverted index, indexing large collections, and processing Boolean queries.

What you will learn

  • How is the term-document incidence matrix used in Boolean Retrieval?
  • What is Boolean Retrieval in Information Retrieval and Organization?
  • How can Boolean Retrieval be used to find specific information in a large collection?
  • What are the limitations of scanning for information retrieval?
  • What is an inverted index and how is it used in Boolean Retrieval?

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

Information Retrieval
and Organisation
Dell Zhang
Birkbeck, University of London

IR Chapter 01

Boolean Retrieval

Limits of Scanning

  • For many purposes, you need more:
      • Process large collections containing billions or trillions of words quickly
      • Allow for more flexible matching operations, e.g. Romans NEAR countrymen
      • Rank answers according to importance (when a large number of documents is returned)
  • Let’s look at the performance problem first:
      • Solution: do preprocessing

Term-Document Incidence Matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
...

  • Entry is 1 if term occurs.
      • Example: Calpurnia occurs in Julius Caesar.
  • Entry is 0 if term doesn’t occur.
      • Example: Calpurnia does not occur in The Tempest.
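To make the matrix concrete, here is a minimal Python sketch of answering the query Brutus AND Caesar AND NOT Calpurnia with bitwise operations on the rows shown above. The names `PLAYS`, `INCIDENCE`, `to_bits`, and `matching_plays` are illustrative assumptions, and only the six plays visible in the matrix are included.

```python
# Columns of the incidence matrix above (the real matrix has more plays).
PLAYS = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 row per term, copied from the matrix.
INCIDENCE = {
    "Anthony":   [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def to_bits(row):
    """Pack a 0/1 row into an int, leftmost column as the highest bit."""
    bits = 0
    for entry in row:
        bits = (bits << 1) | entry
    return bits


def matching_plays(bits):
    """Return the titles of the plays whose bit is set in `bits`."""
    n = len(PLAYS)
    return [play for i, play in enumerate(PLAYS) if (bits >> (n - 1 - i)) & 1]


def brutus_and_caesar_not_calpurnia():
    mask = (1 << len(PLAYS)) - 1  # restricts NOT to the six columns
    brutus = to_bits(INCIDENCE["Brutus"])
    caesar = to_bits(INCIDENCE["Caesar"])
    calpurnia = to_bits(INCIDENCE["Calpurnia"])
    return matching_plays(brutus & caesar & (~calpurnia & mask))


print(brutus_and_caesar_not_calpurnia())
```

On these six columns the row vectors are 110100 (Brutus), 110111 (Caesar), and 101111 (NOT Calpurnia), whose bitwise AND is 100100.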

Indexing Large Collections

  • Consider N = 10^6 documents, each with about 1000 tokens
  • On average 6 bytes per token, including spaces and punctuation ⇒ the size of the document collection is about 6 GB
  • Assume there are M = 500,000 distinct terms in the collection

Building Incidence Matrix

  • M × N = 500,000 × 10^6 = half a trillion 0s and 1s.
      • At one bit per entry, we would use about 60 GB to index 6 GB of text, which is clearly very inefficient.
  • But, wait a minute, the matrix has no more than one billion 1s (10^6 documents with about 1000 tokens each).
      • The matrix is extremely sparse, i.e. at least 99.8% of it is filled with 0s.
  • What is a better representation?
      • We only record the 1s.
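The figures on the last two slides can be checked with a few lines of arithmetic. This is a rough sketch that assumes 1 GB = 10^9 bytes and one bit per matrix entry; the variable names are illustrative.

```python
N = 10**6              # documents
TOKENS_PER_DOC = 1000  # approximate tokens per document
BYTES_PER_TOKEN = 6    # average, including spaces and punctuation
M = 500_000            # distinct terms

collection_bytes = N * TOKENS_PER_DOC * BYTES_PER_TOKEN  # size of the raw text
matrix_cells = M * N                                     # entries in the incidence matrix
matrix_bytes = matrix_cells // 8                         # assuming one bit per entry

# Each token can set at most one entry to 1, so this is an upper bound.
max_ones = N * TOKENS_PER_DOC
zero_fraction = 1 - max_ones / matrix_cells

print(f"collection: {collection_bytes / 1e9:.1f} GB")
print(f"matrix:     {matrix_cells:.1e} entries, {matrix_bytes / 1e9:.1f} GB")
print(f"zeros:      at least {zero_fraction:.1%}")
```

Under these assumptions the matrix comes to 62.5 GB, which the slide rounds to 60 GB.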

Index Construction

  • Collect the documents to be indexed:
      Friends, Romans, countrymen. So let it be with Caesar...
  • Tokenize the text, turning each document into a list of tokens:
      Friends Romans countrymen So...
  • Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
      friend roman countryman so...
  • Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
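The four steps can be sketched in Python as follows. This is a toy illustration, not the module's implementation: the three documents are made up, `tokenize` and `normalize` are hypothetical helpers, and normalization here only lowercases, so it yields "romans" rather than the stemmed "roman" shown on the slide.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Placeholder for linguistic preprocessing: lowercase only, no stemming."""
    return token.lower()


def build_inverted_index(docs):
    """Map each term (dictionary) to a sorted list of docIDs (postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by docID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Toy documents keyed by docID (assumed for illustration).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar.",
    3: "Romans and Caesar",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, "->", doc_ids)
```

Keeping each postings list sorted by docID is what later allows two lists to be intersected in a single pass.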

Index Construction

  • Later on in this module, we’ll talk about optimizing inverted indexes:
      • Index construction: how can we create inverted indexes for large collections?
      • How much space do we need for dictionary and index?
      • Index compression: how can we efficiently store and process indexes for large collections?
      • Ranked retrieval: what does the inverted index look like when we want the “best” answer?

Intersecting Postings Lists

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Intersection ⇒ 2 → 31

  • Can be done in time linear in the total length of the two postings lists, provided both lists are sorted by docID
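A minimal sketch of the linear-time merge, assuming each postings list is a Python list of docIDs in increasing order. The function name `intersect` is an illustrative choice.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot occur later in p2, so skip it
        else:
            j += 1  # p2[j] cannot occur later in p1, so skip it
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. Without sorted lists, every docID in one list would have to be searched for in the other.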

Query Optimization

  • What is the best order for query processing?
  • Consider a query that is an AND of n terms, n > 2
  • For each of the terms, get its postings list, then AND them together
  • Example query:
      • Brutus AND Calpurnia AND Caesar

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Caesar → 5 → 31

Query Optimization

  • Simple and effective optimization:
      • Process the terms in order of increasing document frequency
      • Start with the shortest postings list, then keep cutting further
      • In this example, first Caesar, then Calpurnia, then Brutus
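The ordering heuristic can be sketched as below, assuming sorted lists of docIDs and using list length as the term's document frequency. `intersect_many` is a hypothetical helper name, and the pairwise `intersect` is repeated so the block runs on its own.

```python
def intersect(p1, p2):
    """Linear-time intersection of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together several postings lists, shortest list first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)  # increasing document frequency
    result = list(ordered[0])
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
caesar = [5, 31]
print(intersect_many([brutus, calpurnia, caesar]))
```

Starting with the shortest list keeps every intermediate result no longer than that list, so the later and longer lists are merged against very little data.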

Commercial Boolean IR: Westlaw

  • Largest commercial legal search service in terms of the number of paying subscribers (www.westlaw.com)
  • Over half a million subscribers performing millions of searches a day over tens of terabytes of text data
  • The service was started in 1975.
  • In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users...
  • ... although ranked retrieval has been available since 1992.

Westlaw Example Queries

  • Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company
      • Query: “trade secret” /s disclos! /s prevent /s employe!
  • Information need: Requirements for disabled people to be able to access a workplace
      • Query: disab! /p access! /s work-site work-place (employment /3 place)
  • Information need: Cases about a host’s responsibility for drunk guests
      • Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
  • Notation: a space means OR, ! is a trailing wildcard, double quotes give a phrase search, and /s, /p and /3 ask for matches in the same sentence, in the same paragraph, or within 3 words.