Computer Science 252
Neuromorphic Computing
Assignment 5: Latent Semantic Analysis / Indexing
Objectives
- Understand the LSA (LSI) algorithm by coding it from scratch in Python.
- Apply LSA to a real-world problem like semantic web search.
Part I: Background reading
Read this four-page tutorial to see what you will need to implement for this assignment. The tutorial uses technical vocabulary from Natural Language Processing, such as stemming (removing suffixes) and stop words (articles, prepositions), but is otherwise straightforward.
Part II: Code up LSA
As usual, we’ll complete the project one step at a time. The tutorial lays out the steps very clearly, but to give you a more concrete goal to aim for, here is the output of my solution:
Step 1: Set term weights and construct the term-document matrix A and query matrix
a [1. 1. 1.] 0
arrived [0. 1. 1.] 0
damaged [1. 0. 0.] 0
delivery [0. 1. 0.] 0
fire [1. 0. 0.] 0
gold [1. 0. 1.] 1
in [1. 1. 1.] 0
of [1. 1. 1.] 0
shipment [1. 0. 1.] 0
silver [0. 2. 0.] 1
truck [0. 1. 1.] 1
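One way to sketch Step 1 in code, assuming raw term counts as the term weights (the document strings and the query below are reconstructed from the counts printed above):

```python
import numpy as np

# Documents and query reconstructed from the output above; the vocabulary
# is the alphabetized list of all words appearing in the documents.
docs = [
    "shipment of gold damaged in a fire",            # d1
    "delivery of silver arrived in a silver truck",  # d2
    "shipment of gold arrived in a truck",           # d3
]
query = "gold silver truck"
terms = sorted(set(" ".join(docs).split()))

# Term-document matrix A: entry (i, j) is the count of term i in document j.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)
# Query vector q, built the same way from the query string.
q = np.array([query.split().count(t) for t in terms], dtype=float)

for t, row, qi in zip(terms, A, q):
    print(f"{t:10s} {row} {int(qi)}")
```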
Step 2: Decompose matrix A and find the U, S and V matrices,
where A = USV^T
U=
[[-0.4201 -0.0748 -0.046 ]
[-0.2995 0.2001 0.4078]
[-0.1206 -0.2749 -0.4538]
[-0.1576 0.3046 -0.2006]
[-0.1206 -0.2749 -0.4538]
[-0.2626 -0.3794 0.1547]
[-0.4201 -0.0748 -0.046 ]
[-0.4201 -0.0748 -0.046 ]
[-0.2626 -0.3794 0.1547]
[-0.3151 0.6093 -0.4013]
[-0.2995 0.2001 0.4078]]
S=
[4.0989 2.3616 1.2737]
V=
[[-0.4945 -0.6458 -0.5817]
[-0.6492 0.7194 -0.2469]
[-0.578 -0.2556 0.775 ]]
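NumPy can produce this decomposition directly. A minimal sketch (the matrix is copied from Step 1; note that `np.linalg.svd` returns the singular values as a 1-D array and returns V already transposed, and that the signs of individual singular-vector columns may be flipped relative to the output above, which will not change the final similarities):

```python
import numpy as np

# Term-document matrix A from Step 1 (rows = alphabetized terms, cols = d1..d3).
A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], dtype=float)

# Thin SVD: U is 11x3, s holds the 3 singular values, Vt is V^T (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print("U=\n", np.round(U, 4))
print("S=\n", np.round(s, 4))
print("V=\n", np.round(Vt, 4))  # printed as V above, but it is V^T
```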
Step 3: Implement a Rank 2 Approximation by keeping the first two columns of
U and V and the first two rows and columns of S.
U=
[[-0.4201 -0.0748]
[-0.2995 0.2001]
[-0.1206 -0.2749]
[-0.1576 0.3046]
[-0.1206 -0.2749]
[-0.2626 -0.3794]
[-0.4201 -0.0748]
[-0.4201 -0.0748]
[-0.2626 -0.3794]
[-0.3151 0.6093]
[-0.2995 0.2001]]
S=
[4.0989 2.3616]
V=
[[-0.4945 -0.6458 -0.5817]
[-0.6492 0.7194 -0.2469]]
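With NumPy the rank-2 truncation is just slicing. A sketch, recomputing the factors from Step 1's matrix (since `np.linalg.svd` returns the singular values as a 1-D array and V already transposed, "the first two rows and columns of S" becomes the first two entries of `s`):

```python
import numpy as np

# Term-document matrix A from Step 1.
A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U[:, :k]     # first k columns of U       -> 11 x 2
s_k = s[:k]        # first k singular values    -> length 2
Vt_k = Vt[:k, :]   # first k rows of V^T        -> 2 x 3
```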
Step 4: Find the new document vector coordinates in this reduced 2-dimensional
space.
d1: [-0.4945 -0.6492]
d2: [-0.6458 0.7194]
d3: [-0.5817 -0.2469]
Step 5: Find the new query vector coordinates in the reduced 2-dimensional
space.
q = [-0.214 0.1821]
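Steps 4 and 5 read the document coordinates straight off the truncated V^T (each column is one document) and fold the query into the same space via q_k = q^T U_k S_k^{-1}. A sketch building on the previous steps (column signs from `np.linalg.svd` may be flipped relative to the printed output, which leaves the Step 6 similarities unchanged):

```python
import numpy as np

# Term-document matrix A and query vector q from Step 1
# (q marks gold, silver, truck in the alphabetized vocabulary).
A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], dtype=float)
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :2], s[:2], Vt[:2, :]

doc_coords = Vt_k.T          # row j = 2-D coordinates of document j+1
q_coords = (q @ U_k) / s_k   # q^T U_k S_k^{-1}: query in the reduced space

print("q =", np.round(q_coords, 4))
```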
Step 6: Rank documents in decreasing order of query-document cosine
similarities.
sim(q,d1) = -0.054
sim(q,d2) = +0.991
sim(q,d3) = +0.448
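Step 6 is plain cosine similarity on the reduced coordinates. A sketch using the coordinates printed in Steps 4 and 5:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Reduced-space coordinates from Steps 4 and 5.
q = np.array([-0.2140, 0.1821])
docs = {
    "d1": np.array([-0.4945, -0.6492]),
    "d2": np.array([-0.6458,  0.7194]),
    "d3": np.array([-0.5817, -0.2469]),
}

# Rank documents by decreasing similarity to the query: d2, d3, d1.
for name in sorted(docs, key=lambda n: -cos_sim(q, docs[n])):
    print(f"sim(q,{name}) = {cos_sim(q, docs[name]):+.3f}")
```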