Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content, as opposed to lexical similarity, which can be estimated from their syntactic representation (e.g. their string format). Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts, or instances, through a numerical description obtained by comparing the information supporting their meaning or describing their nature.
Similarity is subjective and highly dependent on the domain and application. For example, two fruits may be judged similar because of their colour, size, or taste. Care should be taken when calculating distance across dimensions or features that are unrelated: the relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity scores are usually measured in the range [0, 1].
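As a brief sketch of that normalization step (the feature names and values below are invented for illustration), min-max scaling rescales each feature into [0, 1] so that a feature with a large numeric range cannot dominate the distance calculation:

    def min_max_normalize(values):
        # Rescale a list of feature values into [0, 1] so no single feature
        # dominates whichever distance measure is applied afterwards.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    weights_g  = [120, 180, 95, 200]   # fruit weight in grams (large range)
    sweetness  = [0.2, 0.9, 0.4, 0.7]  # sweetness score (already a small range)
    print(min_max_normalize(weights_g))  # now comparable to sweetness on [0, 1]

After scaling, every feature contributes on a comparable scale to the distance calculation.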
A similarity measure is a measure of how much alike two data objects are. In the context of data mining, a similarity measure is a distance computed over the dimensions that represent the features of the objects. If this distance is small, there is a high degree of similarity, whereas a large distance indicates a low degree of similarity.
A similarity measure is also known as a similarity function: a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects.
Similarity between two documents, or between a document and query terms: a similarity measure can be used to calculate the similarity between two documents, two queries, or a document and a query.
Document ranking: a similarity measure's score can be used to rank documents.
All clustering algorithms use similarity or so-called “distance functions” to determine cluster membership. A few of the most popular similarity measures are discussed in the following subsections.
It is a standard metric for geometrical problems. It is the ordinary distance between two points and can be easily measured with a ruler in two- or three-dimensional space. Euclidean distance is widely used in clustering problems, including clustering text, and it is the default distance measure used with the K-means algorithm. Measuring the distance between text documents: given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance of the two documents is defined as
DE(ta, tb) = √( Σ t∈T |wt,a − wt,b|² )
where the term set is T = {t1, t2, …, tn} and the weights are given by wt,a = tf-idf(da, t).
Euclidean distance is the most common distance measure. In most cases, when people talk about distance, they mean Euclidean distance, and it is often referred to simply as “distance”. When data is dense or continuous, it is the preferred proximity measure.
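As an illustrative sketch of the formula above (the documents and tf-idf weights are made up for demonstration; real weights would come from a tf-idf computation over a corpus):

    import math

    # Toy tf-idf weight vectors over a shared term set T.
    doc_a = {"data": 0.42, "mining": 0.31, "cluster": 0.00, "distance": 0.18}
    doc_b = {"data": 0.10, "mining": 0.05, "cluster": 0.55, "distance": 0.20}

    def euclidean_distance(wa, wb):
        # Square root of the sum of squared weight differences over all terms.
        terms = set(wa) | set(wb)
        return math.sqrt(sum((wa.get(t, 0.0) - wb.get(t, 0.0)) ** 2 for t in terms))

    print(euclidean_distance(doc_a, doc_b))  # smaller value => more similar documents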
Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the total of the differences between the points’ x-coordinates and y-coordinates.
Suppose we have two points A and B. To find the Manhattan distance between them, we sum up the absolute variation along the x-axis and the y-axis, i.e. we find how much the two points differ along each axis. In more mathematical terms, the Manhattan distance between two points is measured along axes at right angles.
In a plane with p1 at (x1, y1) and p2 at (x2, y2), Manhattan distance = |x1 – x2| + |y1 – y2|
This metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, or the taxi-cab metric.
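A minimal sketch of the calculation, generalising |x1 – x2| + |y1 – y2| to any number of coordinates (the example points are arbitrary):

    def manhattan_distance(p1, p2):
        # Sum of absolute coordinate differences, e.g. |x1 - x2| + |y1 - y2| in the plane.
        return sum(abs(a - b) for a, b in zip(p1, p2))

    print(manhattan_distance((1, 2), (4, 6)))  # |1-4| + |2-6| = 7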
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.
The cosine similarity metric finds the normalized dot product of the two attribute vectors. By determining the cosine similarity, we effectively find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.
It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.
Cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0, 1]. One reason for its popularity is that it is very efficient to evaluate, especially for sparse vectors.
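The normalized dot product described above can be sketched as follows (the example vectors are arbitrary):

    import math

    def cosine_similarity(a, b):
        # Dot product of the vectors divided by the product of their magnitudes.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    print(cosine_similarity([1, 0, 2], [2, 0, 4]))  # 1.0: same orientation, different magnitude
    print(cosine_similarity([1, 0], [0, 1]))        # 0.0: vectors at 90 degrees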
The Jaccard coefficient is used to measure similarity between sets; it is calculated by dividing the size of the intersection of the sets by the size of their union.
So far we have discussed metrics for finding the similarity between objects that are points or vectors. With Jaccard similarity, the objects are sets, so let us first cover some basics about sets.
A set is an (unordered) collection of objects, e.g. {a, b, c}; we write its elements separated by commas inside curly brackets {}. Because sets are unordered, {a, b} = {b, a}.
The cardinality of A, denoted |A|, counts how many elements are in A.
The intersection of two sets A and B, denoted A ∩ B, contains all items that are in both A and B.
The union of two sets A and B, denoted A ∪ B, contains all items that are in either set.
The Jaccard coefficient measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. To find the Jaccard similarity between two sets A and B, we take the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B:
J(A, B) = |A ∩ B| / |A ∪ B|
The Jaccard coefficient can also be used to calculate the similarity between a query and a given document by treating each as a set of terms, as sketched below.
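A small sketch of the set-based calculation, treating a query and a document as sets of their terms (the sample texts are invented for illustration):

    def jaccard(a, b):
        # |A ∩ B| / |A ∪ B| for two finite sets.
        return len(a & b) / len(a | b)

    query = set("data mining similarity".split())
    document = set("similarity measures in data mining and clustering".split())
    print(jaccard(query, document))  # 3 shared terms out of 7 distinct terms ≈ 0.43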