45
Information Retrieval Lecture 4

Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Info

rmat

ion R

etri

eval

Lect

ure

4

Page 2: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Rec

ap o

f la

st t

ime

Post

ings

poin

ter

stora

ge

Dic

tionar

y st

ora

ge

Com

pre

ssio

nW

ild-c

ard q

uer

ies

Page 3: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

This

lec

ture

Quer

y ex

pan

sion

What

quer

ies

can w

e now

pro

cess

?

Index

const

ruct

ion

Esti

mat

ion o

f re

sourc

es

Page 4: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Spel

l co

rrec

tion a

nd

expan

sion

Page 5: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Expan

sion

Wild

-car

ds

wer

e one

way

of

“expan

sion”:

i.e.

, a

single

“te

rm”

in t

he

quer

y hit

s m

any

post

ings.

Ther

e ar

e oth

er f

orm

s of

expan

sion:

Spel

l co

rrec

tion

Thes

auri

Soundex

(hom

onym

s).

Page 6: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Spel

l co

rrec

tion

Expan

d t

o t

erm

s w

ithin

(sa

y) e

dit

dis

tance

2Ed

it∈

{Inse

rt/D

elet

e/R

epla

ce}

Expan

d a

t quer

y ti

me

from

quer

ye.

g.,

Ala

nis

Mor

iset

teSp

ell co

rrec

tion is

expen

sive

and s

low

s th

e quer

y (u

pto

a f

acto

r of

10

0)

Invo

ke o

nly

when

index

ret

urn

s (n

ear-

)zer

o

mat

ches

.W

hat

if

docs

conta

in m

is-s

pel

lings?

Why

not

atin

dex

tim

e?

Page 7: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Thes

auri

Thes

auru

s: lan

guag

e-sp

ecif

ic lis

t of

synonym

s fo

r te

rms

likel

y to

be

quer

ied

Gen

eral

ly h

and-c

raft

edM

achin

e le

arnin

g m

ethods

can a

ssis

t –

more

on t

his

in lat

er lec

ture

s.

Page 8: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Soundex

Cla

ss o

f heu

rist

ics

to e

xpan

d a

quer

y in

to

phonet

ic e

quiv

alen

tsLa

nguag

e sp

ecif

icE.

g.,

che

bys

hev→

tche

byc

heff

How

do w

e use

thes

auri

/soundex

?C

an “

expan

d”

quer

y to

incl

ude

equiv

alen

ces

Quer

y ca

rty

res→

car

tyre

sau

tom

obile

tir

es

Can

expan

d index

Index

docs

conta

inin

g c

arunder

aut

omob

ile, as

wel

l

Page 9: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Quer

y ex

pan

sion

Usu

ally

do q

uer

y ex

pan

sion r

ather

th

an index

expan

sion

No index

blo

wup

Quer

y pro

cess

ing s

low

ed d

ow

nD

ocs

fre

quen

tly

conta

in e

quiv

alen

ces

May

ret

riev

e m

ore

junk

pum

a→

jagu

ar

retr

ieve

s docu

men

ts o

n

cars

inst

ead o

f on s

nea

kers

.

Page 10: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Languag

e det

ecti

on

Man

y of

the

com

ponen

ts d

escr

ibed

req

uir

e la

nguag

e det

ecti

on

For

docs

/par

agra

phs

at index

ing t

ime

For

quer

y te

rms

at q

uer

y ti

me

–m

uch

har

der

For

docs

/par

agra

phs,

gen

eral

ly h

ave

enough

text

to a

pply

mac

hin

e le

arnin

g m

ethods

For

quer

ies,

gen

eral

ly lac

k su

ffic

ient

text

Augm

ent

wit

h o

ther

cues

, su

ch a

s cl

ient

pro

per

ties

/spec

ific

atio

n f

rom

applic

atio

nD

om

ain o

f quer

y ori

gin

atio

n,

etc.

Page 11: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

What

quer

ies

can w

e pro

cess

?

We

hav

eBa

sic

inve

rted

index

wit

h s

kip p

oin

ters

Wild

-car

d index

Spel

l-co

rrec

tion

Quer

ies

such

as

an*e

r* A

ND

(mor

iset

/3

tor

onto

)

Page 12: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Asi

de

–re

sult

s ca

chin

g

If 2

5%

of

your

use

rs a

re s

earc

hin

g f

or

bri

tney

AN

Dsp

ears

then

you p

robab

ly d

onee

d s

pel

ling

corr

ecti

on,

but

you d

on’t

nee

d t

o k

eep o

n

inte

rsec

ting t

hose

tw

o p

ost

ings

lists

Web

quer

y dis

trib

uti

on is

extr

emel

y sk

ewed

, an

d y

ou c

an u

sefu

lly c

ache

resu

lts

for

com

mon q

uer

ies

–m

ore

lat

er.

Page 13: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

const

ruct

ion

Page 14: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

const

ruct

ion

Thus

far,

consi

der

ed index

spac

eW

hat

about

index

const

ruct

ion t

ime?

What

str

ateg

ies

can w

e use

wit

h

limit

ed m

ain m

emory

?

Page 15: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Som

ewhat

big

ger

corp

us

Num

ber

of

docs

= n

= 4

0M

Num

ber

of

term

s =

m =

1M

Use

Zip

f to

est

imat

e num

ber

of

post

ings

entr

ies:

n +

n/2

+ n

/3 +

….

+ n

/m ~

nln

m =

56

0M

en

trie

sN

o p

osi

tional

info

yet

Che

ck fo

ryo

urse

lf

Page 16: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Rec

all in

dex

const

ruct

ion

Te

rmD

oc

#I

1d

id1

en

act

1ju

lius

1ca

esa

r1

I1

was

1ki

lled

1i'

1th

e1

cap

ito

l1

bru

tus

1ki

lled

1m

e1

so2

let

2it

2b

e2

wit

h2

cae

sar

2th

e2

no

ble

2

bru

tus

2h

ath

2

told

2

you

2ca

esa

r 2

was

2am

bit

iou

s2

Docu

men

ts a

re p

arse

d t

o e

xtr

act

word

s an

d t

hes

e ar

e sa

ved w

ith t

he

Docu

men

t ID

. Doc

1D

oc 2

I did

ena

ct J

uliu

sC

aesa

r I w

as k

illed

i'

the

Cap

itol;

Bru

tus

kille

d m

e.

So

let i

t be

with

Cae

sar.

The

nobl

eB

rutu

s ha

th to

ld y

ouC

aesa

r was

am

bitio

us

Page 17: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Key

ste

pT

erm

Do

c #

I1

did

1e

nac

t 1

juliu

s1

cae

sar

1I

1w

as1

kille

d1

i'1

the

1ca

pit

ol

1b

rutu

s1

kille

d1

me

1so

2le

t2

it2

be

2w

ith

2ca

esa

r2

the

2n

ob

le

2b

rutu

s2

hat

h

2to

ld

2yo

u2

cae

sar

2w

as2

amb

itio

us

2

Ter

mD

oc #

ambi

tious

2be

2br

utus

1br

utus

2

capi

tol

1ca

esar

1ca

esar

2ca

esar

2di

d1

enac

t1

hath

1I

1I

1i'

1it

2ju

lius

1ki

lled

1ki

lled

1le

t2

me

1no

ble

2so

2th

e1

the

2to

ld2

you

2w

as1

was

2w

ith2

Aft

er a

ll docu

men

ts h

ave

bee

n p

arse

d t

he

inve

rted

fi

le is

sort

ed b

y te

rms

We

focu

s on t

his

sort

ste

p.

Page 18: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

const

ruct

ion

As

we

build

up t

he

index

, ca

nnot

explo

it

com

pre

ssio

n t

rick

spar

se d

ocs

one

at a

tim

e.

The

final

post

ings

entr

y fo

r an

y te

rm is

inco

mple

te u

nti

l th

e en

d.

(act

ual

ly y

ou c

an e

xplo

it c

om

pre

ssio

n,

but

this

bec

om

es a

lot

more

com

ple

x)

At

10

-12

byt

es p

er p

ost

ings

entr

y, d

eman

ds

seve

ral te

mpora

ry g

igab

ytes

Page 19: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Syst

em p

aram

eter

s fo

r des

ign

Dis

k se

ek ~

1 m

illis

econd

Block

tra

nsf

er f

rom

dis

k ~

1 m

icro

seco

nd

per

byt

e ( f

ollo

win

g a

seek

)A

ll oth

er o

ps

~ 1

0 m

icro

seco

nds

E.g.,

com

par

e tw

o p

ost

ings

entr

ies

and

dec

ide

thei

r m

erge

ord

er

Page 20: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Bott

lenec

k

Pars

e an

d b

uild

post

ings

entr

ies

one

doc

at a

ti

me

Now

sort

post

ings

entr

ies

by

term

(th

en b

y doc

wit

hin

eac

h t

erm

)D

oin

g t

his

wit

h r

andom

dis

k se

eks

would

be

too s

low

If ev

ery

com

paris

on to

ok 1

dis

k se

ek, a

ndn

item

s co

uld

beso

rted

with

nlog

2n c

ompa

rison

s, h

ow lo

ng w

ould

this

take

?

Page 21: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Sort

ing w

ith f

ewer

dis

k se

eks

12

-byt

e (4

+4

+4

) re

cord

s (t

erm

, doc

, fr

eq).

Thes

e ar

e gen

erat

ed a

s w

e par

se d

ocs

.M

ust

now

sort

56

0M

such

12

-byt

e re

cord

s by

term

.D

efin

e a

Block

= 1

0M

such

rec

ord

sca

n “

easi

ly”

fit

a co

uple

into

mem

ory

.

Will

sort

wit

hin

blo

cks

firs

t, t

hen

mer

ge

the

blo

cks

into

one

long s

ort

ed o

rder

.

Page 22: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Sort

ing 5

6 b

lock

s of

10M

re

cord

s

Firs

t, r

ead e

ach b

lock

and s

ort

wit

hin

: Q

uic

ksort

take

s ab

out

2 x

(10M

ln 1

0M

) st

eps

Exer

cise

: es

tim

ate

tota

l tim

e to

rea

d e

ach

Exer

cise

: es

tim

ate

tota

l tim

e to

rea

d e

ach

blo

ck f

rom

dis

k an

d a

nd

blo

ck f

rom

dis

k an

d a

nd q

uick

sort

qui

ckso

rtit

.it

.5

6 t

imes

this

est

imat

e -

giv

es u

s 5

6 s

ort

ed

runs

of

10

M r

ecord

s ea

ch.

Nee

d 2

copie

s of

dat

a on d

isk,

thro

ughout.

Page 23: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Mer

gin

g 5

6 s

ort

ed r

uns

Mer

ge

tree

of

log

25

6 ~

6 lay

ers.

Duri

ng e

ach lay

er,

read

into

mem

ory

runs

in

blo

cks

of

10

M,

mer

ge,

wri

te b

ack. 21 43

1 342

Dis

k

Page 24: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Mer

ge

tree

Sort

ed r

uns.

12

56

55

28 r

uns,

20M

/run

14 r

uns,

40M

/run

7 r

uns,

80M

/run

4 r

uns

… ?

2 r

uns

… ?

1 r

uns

… ?

Page 25: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Mer

gin

g 5

6 r

uns

Tim

e es

tim

ate

for

dis

k tr

ansf

er:

6 x

(5

6ru

ns

x 1

20

MB

x 1

0-6

sec)

x 2

~ 2

2hrs

.

Wor

k ou

t how

thes

e tra

nsfe

rs a

re s

tage

d,

and

the

tota

l tim

e fo

r m

ergi

ng.

Dis

k bl

ock

trans

fer t

ime.

Why

is th

is a

nO

vere

stim

ate?

Page 26: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Exer

cise

-fi

ll in

this

tab

le

Tim

eSt

ep

56

init

ial quic

ksort

sof

10

M r

ecord

s ea

ch

Rea

d 2

sort

ed b

lock

s fo

r m

ergin

g,

wri

te b

ack

Mer

ge

2 s

ort

ed b

lock

s

1 2 3 4 5

Add (

2)

+ (

3)

= t

ime

to r

ead/m

erge/

wri

te

56

tim

es (

4)

= t

ota

l m

erge

tim

e

?

Page 27: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Larg

e m

emory

index

ing

Suppose

inst

ead t

hat

we

had

16

GB

of

mem

ory

for

the

above

index

ing t

ask.

Exer

cise

: ho

w m

uch

tim

e to

index

?R

epea

t w

ith

a co

uple

of

valu

es o

f n,

m.

In p

ract

ice,

spid

erin

g inte

rlac

ed w

ith

index

ing.

Spid

erin

g b

ott

lenec

ked b

y W

AN

spee

d a

nd

man

y oth

er f

acto

rs -

more

on t

his

lat

er.

Page 28: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Impro

vem

ents

on b

asic

mer

ge

Com

pre

ssed

tem

pora

ry f

iles

com

pre

ss t

erm

s in

tem

pora

ry d

icti

onar

y ru

ns

How

do w

e m

erge

com

pre

ssed

runs

to

gen

erat

e a

com

pre

ssed

run?

Giv

en t

wo γ

-enco

ded

runs,

mer

ge

them

into

a

new

γ-e

nco

ded

run

To d

o t

his

, fi

rst γ-

dec

ode

a ru

n into

a

sequen

ce o

f gap

s, t

hen

act

ual

rec

ord

s:33,1

4,1

07,5

… →

33, 47, 154, 159

13,1

2,1

09,5

… →

13, 25, 134, 139

Page 29: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Mer

gin

g c

om

pre

ssed

runs

Now

mer

ge:

13, 25, 33, 47, 134, 139, 154, 159

Now

gen

erat

e new

gap

seq

uen

ce13,1

2,8

,14,8

7,5

,15,5

Finis

h b

y γ-

enco

din

g t

he

gap

seq

uen

ceBu

t w

hat

was

the

poin

t of

all th

is?

If w

e w

ere

to u

nco

mpre

ss t

he

enti

re r

un in

mem

ory

, w

e sa

ve n

o m

emory

How

do w

e gai

n a

nyt

hin

g?

Page 30: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

“Zip

per

” unco

mpre

ss/d

ecom

pre

ss

When

mer

gin

g t

wo r

uns,

bri

ng t

hei

r γ-

enco

ded

ver

sions

into

mem

ory

Do N

OT

unco

mpre

ss t

he

enti

re g

ap

sequen

ce a

t once

–only

a s

mal

l se

gm

ent

at

a ti

me

Mer

ge

the

unco

mpre

ssed

seg

men

tsC

om

pre

ss m

erged

seg

men

ts a

gai

n

Com

pre

ssed

, m

erged

outp

ut

Unco

mpre

ssed

segm

ents

Com

pre

ssed

inputs

Page 31: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Impro

ving o

n b

inar

y m

erge

tree

Mer

ge

more

than

2 r

uns

at a

tim

eM

erge

k>2

runs

at a

tim

e fo

r a

shal

low

er t

ree

mai

nta

in h

eap o

f ca

ndid

ates

fro

m e

ach r

un

15

24

36

….

….

Page 32: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Dyn

amic

index

ing

Docs

com

e in

ove

r ti

me

post

ings

updat

es f

or

term

s al

read

y in

dic

tionar

ynew

ter

ms

added

to d

icti

onar

y

Docs

get

del

eted

Page 33: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Sim

ple

st a

ppro

ach

Mai

nta

in “

big

” m

ain index

New

docs

go into

“sm

all”

auxili

ary

index

Sear

ch a

cross

both

, m

erge

resu

lts

Del

etio

ns

Inva

lidat

ion b

it-v

ecto

r fo

r del

eted

docs

Filt

er d

ocs

outp

ut

on a

sea

rch r

esult

by

this

in

valid

atio

n b

it-v

ecto

r

Peri

odic

ally

, re

-index

into

one

mai

n index

Page 34: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

More

com

ple

x a

ppro

ach

Fully

dyn

amic

updat

esO

nly

one

index

at

all ti

mes

No b

ig a

nd s

mal

l in

dic

es

Act

ive

man

agem

ent

of

a pool of

spac

e

Page 35: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Fully

dyn

amic

updat

es

Inse

rtin

g a

(va

riab

le-l

ength

) re

cord

e.g.,

a t

ypic

al p

ost

ings

entr

y

Mai

nta

in a

pool of

(say

) 6

4K

B ch

unks

Chunk

hea

der

mai

nta

ins

met

adat

a on

reco

rds

in c

hunk,

and its

fre

e sp

ace

Rec

ord

Rec

ord

Rec

ord

Rec

ord

Free

spa

ce

Hea

der

Page 36: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Glo

bal

tra

ckin

g

In m

emory

, m

ainta

in a

glo

bal

rec

ord

addre

ss

table

that

say

s, f

or

each

rec

ord

, th

e ch

unk

it’s

in.

Def

ine

one

chunk

to b

e cu

rren

t.In

sert

ion

if c

urr

ent

chunk

has

enough f

ree

spac

eex

tend r

ecord

and u

pdat

e m

etad

ata.

else

look

in o

ther

chunks

for

enough s

pac

e.el

se o

pen

new

chunk.

Page 37: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Chan

ges

to d

icti

onar

y

New

ter

ms

appea

r ove

r ti

me

cannot

use

a s

tati

c per

fect

has

h f

or

dic

tionar

y

OK

to u

se t

erm

char

acte

r st

ring w

/poin

ters

fr

om

post

ings

as in lec

ture

2.

Page 38: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

on d

isk

vs. m

emory

Most

ret

riev

al s

yste

ms

keep

the

dic

tionar

y in

m

emory

and t

he

post

ings

on d

isk

Web

sea

rch e

ngin

es f

requen

tly

keep

both

in

mem

ory

mas

sive

mem

ory

req

uir

emen

t

feas

ible

for

larg

e w

eb s

ervi

ce inst

alla

tions,

le

ss s

o f

or

stan

dar

d u

sage

wher

equer

y lo

ads

are

lighte

r

use

rs w

illin

g t

o w

ait

2 s

econds

for

a re

sponse

More

on t

his

when

dis

cuss

ing d

eplo

ymen

t m

odel

s

Page 39: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Dis

trib

ute

d index

ing

Suppose

we

had

sev

eral

mac

hin

es a

vaila

ble

to

do t

he

index

ing

how

do w

e ex

plo

it t

he

par

alle

lism

Tw

o b

asic

appro

aches

stri

pe

by

dic

tionar

y as

index

is

built

up

stri

pe

by

docu

men

ts

Page 40: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

ing in t

he

real

worl

d

Typ

ical

ly,

don’t

hav

e al

l docu

men

ts s

itti

ng

on a

loca

l fi

lesy

stem

Docu

men

ts n

eed t

o b

e sp

ider

edC

ould

be

dis

per

sed o

ver

a W

AN

wit

h v

aryi

ng

connec

tivi

tyM

ust

sch

edule

dis

trib

ute

d s

pid

ers/

index

ers

Could

be

(sec

ure

conte

nt)

in

Dat

abas

esC

onte

nt

man

agem

ent

applic

atio

ns

Emai

l ap

plic

atio

ns

htt

p o

ften

not

the

most

eff

icie

nt

way

of

fetc

hin

g t

hes

e docu

men

ts -

nat

ive

API

fe

tchin

g

Page 41: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

ing in t

he

real

worl

d

Docu

men

ts in a

var

iety

of

form

ats

word

pro

cess

ing f

orm

ats

(e.g

., M

S W

ord

)sp

read

shee

tspre

senta

tions

publis

hin

g f

orm

ats

(e.g

., p

df)

Gen

eral

ly h

andle

d u

sing f

orm

at-s

pec

ific

“f

ilter

s”co

nve

rt f

orm

at into

tex

t +

met

a-dat

a

Page 42: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Index

ing in t

he

real

worl

d

Docu

men

ts in a

var

iety

of

languag

esau

tom

atic

ally

det

ect

languag

e(s)

in a

docu

men

tto

keniz

atio

n,

stem

min

g,

are

languag

e-dep

enden

t

Page 43: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

“Ric

h”

docu

men

ts

(How

) D

o w

e in

dex

im

ages

?

Res

earc

her

s hav

e dev

ised

Quer

y Ba

sed o

n

Imag

e C

onte

nt

(QBI

C)

syst

ems

“show

me

a pic

ture

sim

ilar

to t

his

ora

nge

circ

le”

wat

ch f

or

nex

t w

eek’

s le

cture

on v

ecto

r sp

ace

retr

ieva

l

In p

ract

ice,

im

age

sear

ch b

ased

on m

eta-

dat

a su

ch a

s fi

le n

ame

e.g.,

monal

isa.

jpg

Page 44: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Pass

age/

sente

nce

ret

riev

al

Suppose

we

wan

t to

ret

riev

e not

an e

nti

re

docu

men

t m

atch

ing a

quer

y, b

ut

only

a

pas

sage/

sente

nce

-sa

y, in a

ver

y lo

ng

docu

men

t

Can

index

pas

sages

/sen

tence

s as

min

i-docu

men

ts

But

then

you lose

the

ove

rall

rele

vance

as

sess

men

t fr

om

the

com

ple

te d

ocu

men

t -

more

on t

his

as

we

study

rele

vance

ran

king

More

on t

his

when

dis

cuss

ing X

ML

sear

ch

Page 45: Information Retrievalale/BICI/IR/Slides/bertinoro4.pdf · Information Retrieval Lecture 4. Recap of last time Postings pointer storage Dictionary storage Compression Wild-card queries

Res

ourc

es, an

d b

eyond

MG

5.

Thus

far,

docu

men

ts e

ither

mat

ch a

quer

y or

do n

ot.

It’s

tim

e to

bec

om

e m

ore

dis

crim

inat

ing -

how

wel

l does

a d

ocu

men

t m

atch

a q

uer

y?G

ives

ris

e to

ran

king a

nd s

cori

ng