What is Similarity Measures

Published: Jan 29, 2019

Table of contents

  1. Semantic Similarity
  2. Similarity Measures
  3. Euclidean Distance
  4. Manhattan Distance
  5. Cosine Similarity
  6. Jaccard Coefficient

Semantic Similarity

Semantic similarity is a metric defined over a set of documents or terms, where the distance between items is based on the likeness of their meaning or semantic content, as opposed to similarity estimated from their syntactic representation (e.g. their string format). Semantic similarity measures are mathematical tools that estimate the strength of the semantic relationship between units of language, concepts, or instances. They produce a numerical score by comparing the information that supports the meaning of each item or describes its nature.

Similarity is subjective and highly dependent on the domain and application. For example, two fruits may be judged similar on the basis of colour, size, or taste. Care should be taken when calculating distance across dimensions or features that are unrelated. The values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity scores are typically expressed in the range [0, 1].
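
The normalization step mentioned above can be sketched as follows. This is a minimal illustration that assumes min-max scaling, which is only one of several possible normalization schemes; the function name `min_max_normalize` and the fruit data are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale each feature (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant feature carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical fruit data: (weight in grams, sweetness on a 1-10 scale).
fruits = [[150.0, 6.0], [120.0, 8.0], [1200.0, 7.0]]
scaled = min_max_normalize(fruits)
```

Without this step the weight column, whose raw values are in the hundreds, would dominate any distance computed over both features.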

Similarity Measures

A similarity measure quantifies how alike two data objects are. In the context of data mining, it is usually derived from a distance between points whose dimensions represent features of the objects. A small distance indicates a high degree of similarity, whereas a large distance indicates a low degree of similarity.

A similarity measure is also known as a similarity function: a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity measure exists, such measures are usually in some sense the inverse of distance metrics: they take on large values for similar objects and either zero or a negative value for very dissimilar objects.
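
The inverse relationship between distance and similarity can be illustrated with a short sketch. The mapping `1 / (1 + d)` used here is an assumption chosen for illustration; it is one common convention, not a definition given in this essay, and the function name is hypothetical.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity score in (0, 1].

    Assumed convention: similarity = 1 / (1 + distance), so a distance of
    zero gives 1.0 and the score approaches 0 as the distance grows.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


near = distance_to_similarity(0.0)
far = distance_to_similarity(99.0)
```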

Similarity between two documents, or between a document and a query: a similarity measure can be used to calculate the similarity between two documents, two queries, or one document and one query.

Document ranking: the similarity score can be used to rank documents.

Most clustering algorithms use similarity measures, or so-called "distance functions", to determine cluster members. A few of the most popular similarity measures are discussed in the following subsections.
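
Document ranking by similarity score can be sketched as below. The scoring function `shared_term_fraction`, the helper `rank_documents`, and the sample texts are all hypothetical stand-ins; any similarity measure with the same call shape could be plugged in instead.

```python
def shared_term_fraction(query, document):
    """Fraction of distinct query terms that also occur in the document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & document_terms) / len(query_terms)


def rank_documents(query, documents, similarity):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, similarity(query, text)) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = {
    "d1": "cats chase mice",
    "d2": "dogs chase cats and mice",
    "d3": "stock prices fall",
}
ranking = rank_documents("cats and mice", documents, shared_term_fraction)
```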

Euclidean Distance

Euclidean distance is a standard metric for geometrical problems. It is the ordinary distance between two points and can be easily measured with a ruler in two- or three-dimensional space. Euclidean distance is widely used in clustering problems, including clustering text. It is also the default distance measure used with the K-means algorithm. To measure the distance between text documents, consider two documents da and db, represented by their term vectors ta and tb respectively. The Euclidean distance of the two documents is defined as:

\[ D_E(t_a, t_b) = \sqrt{\sum_{t \in T} \left( w_{t,a} - w_{t,b} \right)^2} \]

where the term set is T = {t1, t2, ..., tn} and the term weight is w_{t,a} = tf-idf(da, t).

Euclidean distance is the most commonly used distance. In most cases, when people speak of distance without qualification, they mean Euclidean distance. When data is dense or continuous, it is often the most suitable proximity measure.
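
The Euclidean distance between two documents can be sketched as follows. The example assumes each document has already been turned into a mapping from term to weight (for instance tf-idf weights); the weights shown are made up for illustration, and the function name is hypothetical.

```python
import math


def euclidean_distance(weights_a, weights_b):
    """Euclidean distance between two term-weight mappings.

    A term missing from one document is treated as having weight 0.0.
    """
    terms = set(weights_a) | set(weights_b)
    squared = sum(
        (weights_a.get(term, 0.0) - weights_b.get(term, 0.0)) ** 2
        for term in terms
    )
    return math.sqrt(squared)


# Hypothetical term weights for two short documents.
doc_a = {"apple": 3.0}
doc_b = {"pie": 4.0}
distance = euclidean_distance(doc_a, doc_b)
```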

Manhattan Distance

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, in two dimensions it is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.

Suppose we have two points A and B. To find the Manhattan distance between them, we sum the absolute differences along the x-axis and the y-axis; that is, we measure how far apart A and B are along each axis separately. More formally, the Manhattan distance between two points is the distance measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2), Manhattan distance = |x1 – x2| + |y1 – y2|

The Manhattan distance is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, or the taxicab metric.
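
The formula above generalizes directly to any number of dimensions. A minimal sketch, with a hypothetical function name and made-up points:

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between points p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))


# |1 - 4| + |2 - 6| = 3 + 4
distance = manhattan_distance((1, 2), (4, 6))
```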

Cosine Similarity

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

The cosine similarity metric is the normalized dot product of the two vectors. Computing it amounts to finding the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered.
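
The normalized dot product described in this section can be sketched as follows. The function name and the sample vectors are hypothetical; the sketch raises an error for zero vectors, for which the angle is undefined.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-5.0, 0.0])
```

The three calls cover the cases named in the text: same orientation, a 90° angle, and diametrically opposed vectors, each with differing magnitudes.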

Jaccard Coefficient

The Jaccard coefficient measures the similarity between sets. It is calculated by dividing the size of the intersection by the size of the union of the sets.

So far we have discussed metrics for finding the similarity between objects that are points or vectors. For Jaccard similarity, the objects are sets, so we first review some basic set notation.

A set is an unordered collection of objects, written as elements separated by commas inside curly brackets, e.g. {a, b, c}. Because sets are unordered, {a, b} = {b, a}.

The cardinality of A, denoted |A|, is the number of elements in A.

The intersection of two sets A and B, denoted A ∩ B, contains all items that are in both A and B.

The union of two sets A and B, denoted A ∪ B, contains all items that are in either set.

The Jaccard coefficient measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. The Jaccard similarity between two sets A and B is therefore the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B:

J(A, B) = |A ∩ B| / |A ∪ B|

To calculate the similarity between a query and a given document using the Jaccard coefficient, each is treated as a set of terms and the same ratio is applied.
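
A minimal sketch of the Jaccard coefficient applied to a query and a document. It assumes a naive lowercase, whitespace-based tokenizer and returns 1.0 when both sets are empty; both choices are assumptions made for illustration, and the names `tokenize` and `jaccard_coefficient` are hypothetical.

```python
def tokenize(text):
    """Split text into a set of lowercase terms (naive whitespace split)."""
    return set(text.lower().split())


def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


query = tokenize("similarity measures")
document = tokenize("similarity measures for text clustering")
score = jaccard_coefficient(query, document)
```

Here the two sets share 2 terms out of 5 distinct terms overall, so the score is 2/5.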

This essay was reviewed by
Alex Wood

Cite this Essay

What is Similarity Measures. (2019, January 28). GradesFixer. Retrieved November 20, 2024, from https://gradesfixer.com/free-essay-examples/what-is-similarity-measures/