After performing K-means clustering, let us suppose that we examine the clusters by sight and assign names
to them. For example, one cluster may represent documents about sports, another may represent documents
about politics, and yet another may represent documents about animals. Let us assume that we assign each
cluster a name such as sports, politics, and animals.
Sometimes, words are used in multiple contexts. For example, the word duck is ambiguous. Sometimes it
means a waterfowl and would fall into the animals category. Sometimes it is used in politics, as in a lame
duck congress, and would fall into the politics category. Sometimes it is used in sports, as in the name of the
National Hockey League team the Anaheim Ducks, and would fall into the sports category. Knowing the
context in which the word is used makes the clustering much better. To understand why, suppose that we had two
documents, one with the words duck and water, and the other with the words duck and ice. Without
understanding the context of the word duck, our similarity metric may actually find that these documents
are similar. However, if we understand that when duck appears with water it probably refers to
an animal, whereas when duck appears with ice it probably refers to sports, then with this
knowledge our similarity metric would find these documents not very similar at all.
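The duck/water versus duck/ice example can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the intended solution: the class name ContextSimilarity, the CONTEXT table, the disambiguate helper, and the choice of Jaccard similarity as the metric are all hypothetical, and Map.of and Set.of require Java 9 or later.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ContextSimilarity {
    // Hypothetical context table (an assumption for illustration):
    // ambiguous word -> (cue word seen in the same document -> category).
    static final Map<String, Map<String, String>> CONTEXT = Map.of(
        "duck", Map.of("water", "animals", "congress", "politics", "ice", "sports"));

    // Replace each ambiguous word with a category-tagged token, e.g. "duck/sports",
    // when one of its cue words also appears in the document.
    static Set<String> disambiguate(Set<String> doc) {
        Set<String> out = new HashSet<>();
        for (String word : doc) {
            String tagged = word;
            Map<String, String> cues = CONTEXT.get(word);
            if (cues != null) {
                for (String other : doc) {
                    if (cues.containsKey(other)) {
                        tagged = word + "/" + cues.get(other);
                        break;
                    }
                }
            }
            out.add(tagged);
        }
        return out;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> d1 = Set.of("duck", "water");
        Set<String> d2 = Set.of("duck", "ice");
        // Without context, the shared token "duck" makes the documents look related.
        System.out.println(jaccard(d1, d2));
        // With context, "duck/animals" and "duck/sports" no longer match.
        System.out.println(jaccard(disambiguate(d1), disambiguate(d2)));
    }
}
```

Without context the two documents share one of three distinct words, so the similarity is about 0.33. After tagging, they share nothing and the similarity drops to 0.0.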
Suppose we had a library of words that are used in multiple contexts such as:
String[] multiContextWords = {"duck", "crane", "book", /* … */};
Suppose also that we have a multi-dimensional array that shows the multi-context words and common
words that are used with them:
Attachments: