Information retrieval
meaning
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
keywords: unstructured, large scale - provides a more natural and acceptable way of human-machine interaction compared with daunting database-style searching, also gives more challenge to data organization and query processing.(while In fact, no data is truly unstructured)
IR also covers supporting users in browsing or filtering document
collections or further processing a set of retrieved documents
scale
- web search
billions of documents stored on millions of computers
gather documents to indexed
build efficient system
exploit hypertext
protect from being boosted - personal information retrieval
spotlight, instant search
email program, search and classification - enterprise, institutional, and domain-specific search
An example information retrieval problem
Shakespeare's collected works, containing the words Brutus and Caesar and not Calpurnia.
grep
(How about requiring lager data, more flexible query, ranked retrieval more quickly)
incidence matrix
extremely sparse
terminology
- boolean retrieval model
a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms. - term
the smallest unit we treat as the element of the set - document
units we have decided to build a retrieval system over - collection/corpus
the group of documents - information need
the topic about which the user desires to know more. - query
what the user convey to the computer. - relevant
a document is relevant if it is the one that the user perceives as containing information of value with respect to their personal informational need. - effectiveness
the quality of its search results
- pricision
TP/(TP+FP) - recall
TP/(TP+FN)
inverted index/inverted file/index
- vocabulary/lexicon
the set of terms - dictionary
the data structure of the items - posting
each item in the list - posting list
- postings
all posting lists
a first take at building an inverted index
- collect documents to be indexed
- tokenize the text, turning each document into a list of tokens
- do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
- Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
- storage
memory - disk(a linked list of fixed length arrays for each term)
processing boolean queries
- simple conjunctive query
- query optimization
process in increasing order of term frequency
- asymmetric
- difference is large
The extended Boolean model versus ranked retrieval
- ranked retrieval model
such as the vector space model, in which users freely use free text queries - the extended Boolean model
proximity operator: specify that two terms in a query must occur close to each other in a document