Published: Jun 20, 2019
Abstract: An Information Retrieval (IR) system finds the documents in a large collection that are relevant to a user query. Queries submitted to search engines are often short and ambiguous, and their meaning may change over time. As a result, understanding the information need behind a query has become an important research problem, and many search engines emphasize query classification. To make retrieval more effective, this paper proposes a domain term extraction algorithm and a Query Classification Algorithm (QCA). The system classifies each query into predefined target categories: domain terms are extracted from the query, and each term is mapped to its relevant categories, which are stored in a database. Using the categories produced by the QCA, the system retrieves the relevant documents from the document collection with the vector space IR model.
I. INTRODUCTION
An Information Retrieval (IR) system finds the documents in a large collection that are relevant to a user query. An IR system comprises basic components such as document indexing, searching and ranking. Current IR systems, including search engines, have a standard interface consisting of a single input box that accepts keywords. The keywords submitted by the user are matched against the collection index to find the documents that contain them. When a query contains multiple topic-specific keywords that accurately describe the user's information need, the system is likely to return good matches. However, when the query is short, and because natural language is inherently ambiguous, this simple retrieval model is prone to errors and omissions.
Understanding the meaning of search queries is a key task which lies at the heart of search research. Query classification is a difficult task since queries usually consist of only a few terms, often leading to significant ambiguity.
Semantic analysis is important for query understanding and therefore for building a successful search engine. A user may know what information he or she wants but still fail to express it as a well-formed query. As a result, understanding the information need behind a query has become an important research problem.
This paper therefore proposes a domain term extraction algorithm and a Query Classification Algorithm (QCA). In the proposed system, a concept term strategy is used to identify the relevant category for an ambiguous domain term. The concept terms are stored in a NoSQL graph database. Based on the concept term strategy and the graph database, the system uses the QCA to classify the query and resolve its ambiguous domain terms. The system then performs information retrieval with the classified user query.
In the query classification based IR system, the QCA and the vector space model are used together to retrieve the information relevant to the user query. By analyzing concept terms, the system aims to retrieve documents that are more relevant to the user's requirements.
The rest of the paper is organized as follows: related work is described in Section II. Background theory is presented in Section III. The proposed system design is presented in Section IV. The proposed methodology is described in Section V, and the experimental results of the system are presented in Section VI. Finally, the conclusion is given in Section VII.
II. RELATED WORK
In 2006, W. Yue, Z. Chen and X. Lu proposed a novel information retrieval algorithm based on query expansion and classification. The algorithm is motivated by the observation that traditional information retrieval methods often give low precision for very short queries, although they can achieve high recall. Their approach attempted to capture more relevant documents through query expansion and text classification. Their experiments showed that the proposed algorithm is more precise and efficient than traditional query expansion methods.
In 2012, S. M. Fathalla and Y. F. Hassan presented a hybrid method for user query reformation and classification based on a fuzzy semantic approach and a K-Nearest Neighbors (KNN) classifier. The overall process consists of query pre-processing, fuzzy membership calculation, query classification and reformation. Classification is performed with the KNN classifier using sentence-level semantics rather than keyword-based semantics alone. After classification, the user's query is reformulated and submitted to a search engine, which gives better results than submitting the original query. Their experiments show a significant improvement in search results over traditional keyword-based search engines.
In 2015, C. Xia and X. Wang adopted a new web query classification method that consists of three steps. In the first step, some context information is labeled to enrich the training set. In the second step, the list of labeled queries is split into word sequences, and a graph whose nodes and edges are indexed with category labels is constructed. In the third step, a linear equation is trained to evaluate the probability that a given query belongs to a certain category. Their method can decrease the training time by 10% compared with the Support Vector Machine (SVM).
III. BACKGROUND THEORY
A. Domain Term Extraction
Domain term extraction is a categorization or classification task in which terms are categorized into a set of predefined domains. It has been applied to tasks such as key phrase extraction, word sense disambiguation, cross-lingual text categorization and query classification.
B. Query Classification
Queries submitted by users to search engines are often short and ambiguous, and their meaning may change over time. Search engines increasingly emphasize query classification because the web keeps growing, with millions of resources added every day. Query classification assigns a search query to one or more predefined categories based on its topics; that is, it classifies a user query Q_i into a list of n categories c_i1, c_i2, ..., c_in. Its importance is underscored by the many search engine services that depend on it. A direct application is to provide better search results for users interested in different categories: search result pages can be grouped according to the categories predicted by the query classification method.
Query classification is a two-step process. The first is the learning step, in which a classification model is constructed. The second is the classification step, in which the model is used to predict the class label of given data. Given a category in an intermediate taxonomy, a query is mapped directly to a target category if and only if one or more terms in each node along the path of the target category also appear along the path of the matched intermediate category.
C. Information Retrieval
An Information Retrieval (IR) system accepts a user query, interprets the user's requirements, searches a database for relevant documents, ranks the documents according to their relevance, and returns them to the user. There are four main IR models:
1) Boolean Model: A document matches the query if the set of terms associated with the document satisfies the Boolean expression representing the query. Boolean expression of terms uses the standard Boolean operators: and, or and not. The result of the query is the set of matching documents.
2) Vector Space Model: In the vector space model text is represented by a vector of terms. Terms are typically words and phrases. If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high dimensional vector space. Any text can then be represented by a vector in this high dimensional space. If a term belongs to a text, it gets a non-zero value in the text-vector along the dimension corresponding to the term. A vector-based IR method represents both documents and queries with high-dimensional vectors, while computing their similarities by vector inner product.
3) Language Model: Statistical language models are based on probability and have foundations in statistical theory. This approach first estimates a language model for each document, and then ranks the documents by the likelihood of the query under each document's language model.
4) Probabilistic Model: Probabilistic IR models are based on probability theory. They estimate the probability that a given document is relevant to a query and rank the documents by that estimate.
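As a small illustration of the first model in the list above, Boolean retrieval can be sketched with set operations. This is only a toy sketch: the document ids, the term sets, and the helper name `matching` are hypothetical and do not come from the proposed system.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"speech", "coding", "voip"},
    "d2": {"image", "coding"},
    "d3": {"speech", "recognition"},
}


def matching(term):
    """Return the ids of all documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


all_ids = set(docs)

# Query: speech AND (coding OR recognition) AND NOT image
result = (
    matching("speech")
    & (matching("coding") | matching("recognition"))
    & (all_ids - matching("image"))
)
print(sorted(result))
```

The result is an unranked set of matching documents, which is the main contrast with the vector space model used later in the paper.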
IV. PROPOSED SYSTEM DESIGN
The system has three main steps. In the first step, it uses the domain term extraction algorithm to extract the domain terms from the user query. In the second step, it classifies each extracted domain term into a category by using the QCA and the Neo4j graph database. In the final step, it retrieves the information relevant to the user query by using the classified query.
V. PROPOSED METHODOLOGY
This system proposes a domain term extraction algorithm and a query classification algorithm. Using the classified query, the system retrieves the relevant information according to the vector space IR model.
A. Vector Space IR Model
In the vector space IR model, a document is represented as a vector of term weights. The number of dimensions in the vector space is equal to the number of terms used in the overall documents collection. A query in the vector space model is treated as if it were just another document allowing the same vector representation to be used for the queries as documents. This representation naturally leads to the use of the vector inner product as the measure of similarity between the query and a document.
1) TF-IDF Scheme: In the TF-IDF scheme, a document in the vector space model is represented as a weight vector in which each component weight is computed with some variation of the TF or TF-IDF scheme. In this scheme, N is the total number of documents in the system, df_i is the number of documents in which term t_i appears at least once, and f_ij is the raw frequency count of term t_i in document d_j.
2) Weighting Scheme for Query: A query q is represented in exactly the same way as a document in the document collection. The term weight w_iq of each term t_i in q can also be computed in the same way as in a normal document.
3) Similarity Measure: The Dice similarity measure is used to compute the similarity between the document vector d_j and the query vector q.
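The three items above can be sketched in a few lines of Python. The formulas themselves are not reproduced in this text, so the sketch assumes a commonly used variant: term frequency normalized by the largest count in the document, multiplied by log(N / df_i), and the Dice coefficient for weighted vectors, 2(d . q) / (|d|^2 + |q|^2). The toy collection and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, df, n_docs):
    """Assumed weighting: (f_ij / max_f) * log(N / df_i) for each term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_f = max(counts.values())
    return {
        term: (f / max_f) * math.log(n_docs / df[term])
        for term, f in counts.items()
        if df.get(term, 0) > 0  # skip terms unseen in the collection
    }


def dice_similarity(d, q):
    """Dice coefficient for weighted vectors: 2 * (d . q) / (|d|^2 + |q|^2)."""
    dot = sum(w * q[term] for term, w in d.items() if term in q)
    denom = sum(w * w for w in d.values()) + sum(w * w for w in q.values())
    return 2 * dot / denom if denom else 0.0


# Hypothetical toy collection, already tokenized and lowercased.
collection = {
    "d1": "speech coding techniques for voip".split(),
    "d2": "digital image processing techniques".split(),
    "d3": "data structures and algorithms".split(),
}
n_docs = len(collection)
# df_i: number of documents in which each term appears at least once.
df = Counter(term for tokens in collection.values() for term in set(tokens))

doc_vectors = {d: tf_idf_vector(t, df, n_docs) for d, t in collection.items()}
# The query is weighted in the same way as a document.
query_vector = tf_idf_vector("voip speech coding techniques".split(), df, n_docs)

ranking = sorted(
    doc_vectors,
    key=lambda d: dice_similarity(doc_vectors[d], query_vector),
    reverse=True,
)
print(ranking)
```

Documents that share no weighted term with the query receive a similarity of zero and fall to the end of the ranking.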
B. Explanation of the System
This system performs the IR process by using the classified user query. First, the user query is classified with the web query classification algorithm. Then, the information required by the user is retrieved according to the vector space IR model.
Sample User Query: “explain about VOIP speech coding techniques”.
After accepting the user query, the system performs tokenization and stop word removal, which removes the word “about” from the query. The system then extracts the domain terms from the query by using the domain term extraction algorithm. In the sample user query, the domain terms are “VOIP”, “speech”, “coding” and “techniques”.
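The preprocessing described above can be sketched as follows. The details of the domain term extraction algorithm are not given in this section, so the sketch assumes the simplest possible stand-in: a token is kept as a domain term if it appears in a known domain vocabulary. The stop word list, the vocabulary, and all function names are hypothetical.

```python
import re

# Assumption: the text only confirms that "about" is removed as a stop word.
STOP_WORDS = {"about"}

# Assumption: a hypothetical vocabulary standing in for the stored domain terms.
DOMAIN_VOCABULARY = {"voip", "speech", "coding", "techniques"}


def tokenize(query):
    """Split the query into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", query)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def extract_domain_terms(tokens):
    """Keep only tokens found in the domain vocabulary (case-insensitive)."""
    return [t for t in tokens if t.lower() in DOMAIN_VOCABULARY]


tokens = remove_stop_words(tokenize("explain about VOIP speech coding techniques"))
domain_terms = extract_domain_terms(tokens)
print(tokens)
print(domain_terms)
```

With this vocabulary, “explain” survives stop word removal but is not kept as a domain term, which matches the four domain terms listed for the sample query.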
The system then classifies each domain term from the user query. To support classification, the possible categories for each domain term are stored in the Neo4j graph database. Using this database, the system extracts the matched terms for each domain term and then looks up the categories of each matched term.
TABLE I. DENSITY RESULT FOR EACH CATEGORY
Domain Term | Matched Term | Related Category | Density
VOIP | Voice Over Internet Protocol (VOIP) | Digital Signal Processing (DSP) | 1
speech | Speech Processing, Speech Recognition | Digital Signal Processing (DSP) | 1
coding | Coding Tools and Techniques | Software Engineering (SE) | 1
techniques | Database Index Techniques | Data Structure (DS) | 0.10566
techniques | Methodology and techniques | Digital Image Processing (DIP) | 0.10566
techniques | Programming Techniques, Design Tools and Techniques, Coding Tools and Techniques | Software Engineering (SE) | 0.31699
TABLE II. SCORE RESULTS FOR EACH CATEGORY
Rank | Category | Score
1 | Digital Signal Processing (DSP) | 2
2 | Software Engineering (SE) | 1.31699
3 | Digital Image Processing (DIP) | 0.10566
4 | Data Structure (DS) | 0.10566
To find the highest-scoring category, the system calculates the density and the score for each category. Tables I and II show the density and score results for each category. After calculating the scores, the system chooses the category with the highest score as the category most relevant to the domain terms.
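The relation between the two tables can be checked with a short sketch. It assumes that a category's score is the sum of the densities of its rows in Table I, which is consistent with the values in Table II (for example, 1 + 0.31699 = 1.31699 for SE). The density formula itself is not given in this section, so the densities are copied from Table I rather than computed.

```python
from collections import defaultdict

# (domain term, related category, density) rows copied from Table I.
density_rows = [
    ("VOIP", "DSP", 1.0),
    ("speech", "DSP", 1.0),
    ("coding", "SE", 1.0),
    ("techniques", "DS", 0.10566),
    ("techniques", "DIP", 0.10566),
    ("techniques", "SE", 0.31699),
]

# Assumption: score(category) = sum of densities over all rows of that category.
scores = defaultdict(float)
for _term, category, density in density_rows:
    scores[category] += density

# Rank the categories by score, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_category = ranked[0][0]
print(ranked)
print(best_category)
```

Under this assumption DSP ranks first, which agrees with the category appended to the classified query.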
Classified User Query: “explain about VOIP speech coding techniques - digital signal processing”
Using the classified query, the system retrieves the relevant documents and ranks them by similarity. The retrieval results for the classified user query are shown in Table III.
TABLE III. RETRIEVAL RESULTS USING CLASSIFIED USER QUERY
ID | Document | Category | Similarity
1 | Speech coding techniques for VOIP.pdf | DSP | 0.35690
2 | Digital signal processing - Wikipedia.html | DSP | 0.23715
3 | What is a DSP - Digital Signal Processor.html | DSP | 0.19320
4 | Digital Signal Processing.html | DSP | 0.18900
5 | Digital Signal Processing - Wikiversity.html | DSP | 0.17350
6 | Guide to Digital Signal Process.pdf | DSP | 0.13203
7 | ArithmeticCoding.pdf | DIP | 0.11785
8 | An Introduction to Digital Signal Processing.html | DSP | 0.11321
9 | Digital Image Processing Introduction.html | DIP | 0.11046
10 | digital signal processing.pdf | DSP | 0.09485
11 | BOOK digital-image-processing-part-one.pdf | DIP | 0.09181
12 | Digital image processing - Wikipedia.html | DIP | 0.08899
13 | Remote Sensing and Digital Image Processing.doc | DIP | 0.08246
14 | Data Structures – Geeks for Geeks.html | DS | 0.08128
VI. EXPERIMENTAL RESULTS OF THE SYSTEM
To evaluate the performance of the system, each ambiguous query was tested against 220 training documents of different file types (.doc, .pdf, .html). These training documents belong to 22 categories. Accuracy is measured as follows:
Precision = True Positive / (True Positive + False Positive) (6)
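As a minimal sketch of Eq. (6), precision can be computed from the two counts as shown below. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def precision(true_positive, false_positive):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was retrieved."""
    retrieved = true_positive + false_positive
    return true_positive / retrieved if retrieved else 0.0


# Hypothetical example: 8 relevant documents among 10 retrieved.
print(precision(8, 2))
```

The guard for an empty result set is a design choice of this sketch; Eq. (6) itself is undefined when no documents are retrieved.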
The experimental results of the query classification algorithm based information retrieval system are shown in Table IV.
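TABLE IV. ACCURACY RESULTS
ID | Category | Sub Category | Accuracy | Avg
1 | Information Science (IS) | Software Engineering | 80% | 86.25%
1 | Information Science (IS) | DBMS | 100% |
1 | Information Science (IS) | Information System | 75% |
1 | Information Science (IS) | Data Mining | 90% |
2 | Application (App) | Business App | 85% | 85.75%
2 | Application (App) | Windows App | 88% |
2 | Application (App) | Web App | 90% |
2 | Application (App) | Human Computer Interaction | 80% |
3 | Hardware | Electronic Circuits | 90% | 85.00%
3 | Hardware | Computer Networking | 87% |
3 | Hardware | Network Security | 85% |
3 | Hardware | Cryptography | 90% |
3 | Hardware | Embedded System | 88% |
3 | Hardware | Computer Architecture | 75% |
3 | Hardware | Digital Image Processing | 85% |
3 | Hardware | Digital Signal Processing | 80% |
4 | Software | Data Structure | 90% | 85.83%
4 | Software | Programming Language | 75% |
4 | Software | Operating System (OS) | 85% |
4 | Software | Distributed System | 75% |
4 | Software | Artificial Intelligence | 90% |
4 | Software | Analysis of Parallel Algorithm | 100% |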
VII. CONCLUSION
The proposed IR system, based on the domain term extraction algorithm and the QCA, retrieves documents that are more relevant to the user's requirements, with average accuracy between 85.00% and 86.25% across the four top-level categories tested. The system can classify the intended category of a user query and resolve ambiguous domain terms. The proposed query classification method addresses the lack of semantic correlation in traditional IR systems. By classifying the query into target categories, the system increases the precision of information retrieval and provides a set of relevant documents based on semantic retrieval.