Computer Science 252
Neuromorphic Computing
Assignment 5: Latent Semantic Analysis / Indexing
Objectives

 Understand the LSA (LSI) algorithm by coding it from scratch in Python.
 Apply LSA to a real-world problem such as semantic web search.
Part I: Background reading
Read this four-page tutorial to see what you will need to implement for this assignment. The tutorial uses technical vocabulary from Natural Language Processing, like stemming (removing suffixes) and stop words (articles, prepositions), but is otherwise straightforward.
Part II: Code up LSA
As usual, we’ll complete the project one step at a time. The tutorial lays out the steps very clearly, but to give you a more concrete goal to aim for, here is the output of my solution:
Step 1: Set term weights and construct the term-document matrix A and the query matrix
a [1. 1. 1.] 0
arrived [0. 1. 1.] 0
damaged [1. 0. 0.] 0
delivery [0. 1. 0.] 0
fire [1. 0. 0.] 0
gold [1. 0. 1.] 1
in [1. 1. 1.] 0
of [1. 1. 1.] 0
shipment [1. 0. 1.] 0
silver [0. 2. 0.] 1
truck [0. 1. 1.] 1
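One way to produce the Step 1 listing is a plain term-frequency count over the tutorial's three-document corpus. A minimal sketch; the document texts below are reconstructed from the term counts above and are an assumption, not quoted from this handout:

```python
import numpy as np

# Corpus assumed from the term counts above (not given verbatim in this handout).
docs = [
    "shipment of gold damaged in a fire",            # d1
    "delivery of silver arrived in a silver truck",  # d2
    "shipment of gold arrived in a truck",           # d3
]
query = "gold silver truck"

# Vocabulary: unique terms in alphabetical order, matching the listing above.
terms = sorted({w for d in docs for w in d.lower().split()})

# Raw term-frequency weights: A[i, j] = count of terms[i] in document j.
A = np.array([[d.lower().split().count(t) for d in docs] for t in terms], float)
q = np.array([query.split().count(t) for t in terms], float)

for t, row, qt in zip(terms, A, q):
    print(t, row, int(qt))
```

Raw counts are the simplest weighting scheme; the tutorial also discusses fancier term weights, which would only change how A is filled in.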
Step 2: Decompose the matrix A to find the matrices U, S, and V,
where A = USV^T
U=
[[-0.4201  0.0748 -0.046 ]
[-0.2995 -0.2001  0.4078]
[-0.1206  0.2749 -0.4538]
[-0.1576 -0.3046 -0.2006]
[-0.1206  0.2749 -0.4538]
[-0.2626  0.3794  0.1547]
[-0.4201  0.0748 -0.046 ]
[-0.4201  0.0748 -0.046 ]
[-0.2626  0.3794  0.1547]
[-0.3151 -0.6093 -0.4013]
[-0.2995 -0.2001  0.4078]]
S=
[4.0989 2.3616 1.2737]
V^T=
[[-0.4945 -0.6458 -0.5817]
[ 0.6492 -0.7194  0.2469]
[-0.578  -0.2556  0.775 ]]
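NumPy's `numpy.linalg.svd` does the decomposition in one call; note that it returns the singular values as a 1-D array and the right singular vectors already transposed. A sketch (the sign of each singular-vector pair is arbitrary, so your columns may differ from the listing above by an overall sign flip):

```python
import numpy as np

# Term-document matrix A from Step 1 (rows = terms, columns = d1, d2, d3).
A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], float)

# Thin SVD: U is 11x3, s is the length-3 vector of singular values,
# and Vt is V^T (3x3) -- numpy hands back V already transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.round(s, 4))  # singular values, largest first
```

You could also code the decomposition from the eigenvectors of AᵀA and AAᵀ if your instructor wants the SVD itself done from scratch; check which is expected.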
Step 3: Implement a rank-2 approximation by keeping the first two columns of
U, the first two rows of V^T, and the first two rows and columns of S.
U=
[[-0.4201  0.0748]
[-0.2995 -0.2001]
[-0.1206  0.2749]
[-0.1576 -0.3046]
[-0.1206  0.2749]
[-0.2626  0.3794]
[-0.4201  0.0748]
[-0.4201  0.0748]
[-0.2626  0.3794]
[-0.3151 -0.6093]
[-0.2995 -0.2001]]
S=
[4.0989 2.3616]
V^T=
[[-0.4945 -0.6458 -0.5817]
[ 0.6492 -0.7194  0.2469]]
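Truncating to rank 2 is just array slicing. A sketch continuing from the Step 2 decomposition:

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                # number of latent dimensions to keep
U_k = U[:, :k]       # first two columns of U   (11x2)
s_k = s[:k]          # two largest singular values
Vt_k = Vt[:k, :]     # first two rows of V^T    (2x3)

# A_k is the closest rank-2 matrix to A in the least-squares sense.
A_k = U_k @ np.diag(s_k) @ Vt_k
```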
Step 4: Find the new document vector coordinates in this reduced 2-dimensional
space.
d1: [-0.4945  0.6492]
d2: [-0.6458 -0.7194]
d3: [-0.5817  0.2469]
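In the reduced space, each document's coordinates are simply a row of V (a column of the truncated V^T); no extra arithmetic is needed. A sketch (signs may be flipped per axis relative to the listing above):

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rows of V = columns of V^T; keep only the first two latent dimensions.
doc_coords = Vt[:2, :].T   # 3x2: row i holds the 2-D coordinates of document i+1
for i, d in enumerate(doc_coords, start=1):
    print(f"d{i}:", np.round(d, 4))
```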
Step 5: Find the new query vector coordinates in the reduced 2-dimensional
space.
q = [-0.214  -0.1821]
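The query is folded into the same latent space with q_k = q^T U_k S_k^{-1}. A sketch; dividing elementwise by the singular values applies S_k^{-1}:

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :2], s[:2]

# Query term counts for "gold silver truck", in the same alphabetical
# term order as the Step 1 listing.
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], float)

# Project the query: q_k = q^T U_k S_k^{-1}.
q_k = (q @ U_k) / s_k
print(np.round(q_k, 4))
```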
Step 6: Rank documents in decreasing order of query-document cosine
similarities.
sim(q,d1) = -0.054
sim(q,d2) = +0.991
sim(q,d3) = +0.448
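Putting the whole pipeline together, documents are ranked by cosine similarity in the latent space; the cosine is unaffected by the per-axis sign ambiguity of the SVD, since a sign flip hits the query and document coordinates together. A sketch:

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0],
              [1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 2, 0],
              [0, 1, 1]], float)
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1], float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :2], s[:2], Vt[:2, :]

doc_coords = Vt_k.T        # rows: d1, d2, d3 in the 2-D latent space
q_k = (q @ U_k) / s_k      # query projected into the same space

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = {f"d{i}": cosine(q_k, d) for i, d in enumerate(doc_coords, start=1)}
for name in sorted(sims, key=sims.get, reverse=True):
    print(f"sim(q,{name}) = {sims[name]:+.3f}")
```

The loop prints the documents already sorted, so d2 (the silver-truck document) comes out on top, matching the intuition that it best answers the query "gold silver truck".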