
The Description and Evaluation of Data Mining

Table of contents

  1. Data Mining
  2. What is data mining?
  3. Data Mining Functions
  4. Classification
  5. Associations
  6. Sequential/Temporal patterns
  7. Clustering/Segmentation
  8. Cluster Analysis
  9. Induction
  10. Decision trees
  11. Rule induction
  12. Neural networks
  13. On-line analytical processing
  14. OLAP Example
  15. Comparison of OLAP and OLTP
  16. Data Visualization
  17. The Primary Tasks of Data Mining
  18. Conclusion

Data Mining

The paper will first define and describe what data mining is. It will also seek to determine why data mining is useful, showing that data mining is concerned with the analysis of data and the use of techniques for finding patterns and regularities in sets of data. The background of data mining will be examined in order to validate its claims, comparing the fields it drew from, such as inductive learning and statistics. The data mining models, including the verification model and the discovery model, will be described. The data warehouse will also be described, showing the effect of a well-defined data warehouse on the quality of the extracted data. The processes and models of data warehousing will be covered, including the differences between an online transaction processing (OLTP) system and the data warehouse. Problems with data warehouses will be considered in the context of data mining, and the criteria for a data warehouse will be listed. Data mining problems and issues will be examined on the basis that data mining systems rely on databases to supply the raw data for input.

The data mining functions will then be examined; data mining methods may be classified by the function they perform or according to the class of application they can be used in. The data mining techniques, cluster analysis, induction, and neural networks to name a few, will be covered, followed by the applications of data mining and, finally, a description of online analytical processing (OLAP).

What is data mining?

There has been a dramatic increase in the amount of information or data being stored in electronic format. The increased use of electronic data-gathering devices such as point-of-sale terminals, web pages, and remote sensing devices has contributed to this explosion of available data.

Data storage became easier as large amounts of computing power became available at low cost: with the cost of processing power and storage falling, keeping data became cheap. New machine learning methods for knowledge representation based on logic programming were also introduced alongside traditional statistical analysis of data. These new methods tend to be computationally intensive, hence a demand for more processing power.

It is recognized that information is at the heart of business operations and that decision-makers can make use of the data stored to gain valuable insight into the business. Database management systems give access to the data stored, but this is only a small part of what can be gained from the data. Traditional on-line transaction processing (OLTP) systems are good at putting data into databases quickly, safely, and efficiently, but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. This is where data mining, or Knowledge Discovery in Databases (KDD), has obvious benefits for any enterprise.

The term data mining has been stretched beyond its limits to apply to any form of data analysis.

Basically, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns by identifying the underlying rules and features in the data. Data mining is "asking a process engine to show answers to questions we do not know how to ask" (Bichoff & Alexander, June 1997, p310).

The idea is that it is possible to strike gold in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one has noticed them before.

Data mining analysis tends to work from the data up, and the best techniques are those developed with an orientation towards large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data and uses a methodology to develop an optimal representation of the structure of the data, during which knowledge is acquired. Once knowledge has been acquired it can be extended to larger sets of data, working on the assumption that the larger data set has a structure similar to the sample data. This is analogous to a mining operation where large amounts of low-grade materials are sifted through in order to find something of value.

Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described below.

Classification

Classification has been described as "learning to map an example into one of several classes" (Lain, July 1999, p254). Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple, and these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

When learning classification rules the system has to find the rules that predict the class from the predicting attributes. Firstly, the user defines conditions for each class; the data mining system then constructs descriptions for the classes. Basically, the system should, given a case or tuple with certain known attribute values, be able to predict what class this case belongs to.

Once classes are defined, the system should infer the rules that govern the classification; that is, the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented as: if the left-hand side (LHS) then the right-hand side (RHS), such that in all instances where the LHS is true, the RHS is also true, or at least very probable. The categories of rules are:

  • exact rule - permits no exceptions, so each object of the LHS must be an element of the RHS
  • strong rule - allows some exceptions, but the exceptions have a given limit
  • probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability P(RHS)

Other types of rules are classification rules where LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.
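
As an illustration, the sketch below induces exact single-attribute rules of this form from a toy training set. The attribute names and values (income, age, risk) are hypothetical, and a real system would search far richer class descriptions.

    # A minimal sketch of learning exact classification rules from a training
    # set of tuples. All attribute names and values are hypothetical.

    training_set = [
        {"income": "high", "age": "young", "risk": "good"},
        {"income": "high", "age": "old",   "risk": "good"},
        {"income": "low",  "age": "young", "risk": "bad"},
        {"income": "low",  "age": "old",   "risk": "bad"},
    ]

    predicted = "risk"              # the predicted attribute (the class)
    predictors = ["income", "age"]  # the predicting attributes

    # An exact rule "if attr = value then class" is correct when its LHS
    # covers every positive example of the class and none of the negatives.
    for cls in {t[predicted] for t in training_set}:
        positives = [t for t in training_set if t[predicted] == cls]
        negatives = [t for t in training_set if t[predicted] != cls]
        for attr in predictors:
            for value in {t[attr] for t in training_set}:
                if (all(t[attr] == value for t in positives)
                        and all(t[attr] != value for t in negatives)):
                    print(f"if {attr} = {value} then {predicted} = {cls}")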

Associations

Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "56% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 56) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
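
A minimal sketch of computing such a confidence factor over a set of records follows. The item names and records are hypothetical, and practical association-mining algorithms avoid rescanning all records for every candidate rule.

    # Confidence factor: the share of records containing the LHS items that
    # also contain the RHS items. Records are sets of (hypothetical) items.

    records = [
        {"A", "B", "C", "D", "E"},
        {"A", "B", "C", "D"},
        {"A", "B", "C"},
        {"B", "D"},
    ]

    def confidence(records, lhs, rhs):
        covered = [r for r in records if lhs <= r]   # records containing LHS
        if not covered:
            return 0.0
        return sum(1 for r in covered if rhs <= r) / len(covered)

    pct = 100 * confidence(records, {"A", "B", "C"}, {"D", "E"})
    print(f"{pct:.0f}% of records containing A, B and C also contain D and E")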

Sequential/Temporal patterns

Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.

Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying pattern. Used on, for example, a set of insurance claims, they can identify frequently occurring sequences of medical procedures applied to patients, which can help identify good medical practice as well as potential medical insurance fraud.
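
The sketch below illustrates the idea on hypothetical purchase histories: it counts, per customer, which products were bought in some order before others, and reports what frequently precedes a microwave oven purchase.

    # A minimal sketch of sequential-pattern detection. Each customer's
    # orders are listed in time order; customer and product names are
    # hypothetical.

    from collections import Counter

    histories = {
        "cust1": [{"toaster"}, {"blender"}, {"microwave"}],
        "cust2": [{"toaster"}, {"microwave"}],
        "cust3": [{"blender"}, {"toaster"}],
    }

    pair_counts = Counter()
    for orders in histories.values():
        seen, pairs = set(), set()
        for order in orders:
            # Pair every earlier product with each product in this order,
            # counting each (earlier, later) pair once per customer.
            pairs |= {(early, late) for late in order for early in seen}
            seen |= order
        pair_counts.update(pairs)

    for (earlier, later), n in pair_counts.most_common():
        if later == "microwave":
            print(f"{earlier} bought before microwave by {n} customer(s)")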

Clustering/Segmentation

Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.

Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then find descriptions that describe each of these subsets.

There are a number of approaches to forming clusters. One approach is to form rules that dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition.

Cluster Analysis

In an unsupervised learning environment the system has to discover its own classes and one way in which it does this is to cluster the data in the database.

Clustering and segmentation basically partition the database so that each partition or group is similar according to some criteria or metric. Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members, and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions (i.e. groups or subsets) as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.

Many data mining applications make use of clustering according to similarity for example to segment a client/customer base. Clustering according to optimization of set functions is used in data analysis.

Clustering/segmentation in databases is the process of separating a data set into components that reflect a consistent pattern of behavior. Once the patterns have been established, they can be used to divide the data into more understandable subsets and to provide sub-groups of a population for further analysis or action, which is important when dealing with very large databases.
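
As a concrete illustration of the set-function approach, the following is a minimal k-means sketch: it partitions hypothetical two-dimensional points so that the members of each group are close under Euclidean distance, iteratively improving the partition.

    # A minimal k-means sketch; the points and choice of k are hypothetical.
    import random

    def kmeans(points, k, iterations=20):
        centers = random.sample(points, k)
        for _ in range(iterations):
            # Assignment step: each point joins its nearest center's group.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                i = min(range(k), key=lambda c: (x - centers[c][0]) ** 2
                                              + (y - centers[c][1]) ** 2)
                clusters[i].append((x, y))
            # Update step: move each center to the mean of its cluster.
            for i, cl in enumerate(clusters):
                if cl:
                    centers[i] = (sum(p[0] for p in cl) / len(cl),
                                  sum(p[1] for p in cl) / len(cl))
        return clusters

    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    for cluster in kmeans(points, k=2):
        print(cluster)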

Induction

A database is a store of information, but more important is the information that can be inferred from it. There are two main inference techniques available: deduction and induction.

Deduction is a technique to infer information that is a logical consequence of the information in the database, e.g. the join operator applied to two relational tables, where the first concerns employees and departments and the second departments and managers, infers a relation between employees and managers.
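
A minimal sketch of this join in code, with hypothetical employee, department, and manager names:

    # Deduction by joining employee-department with department-manager.
    employee_dept = [("alice", "sales"), ("bob", "it")]
    dept_manager = [("sales", "carol"), ("it", "dave")]

    employee_manager = [
        (emp, mgr)
        for emp, dept in employee_dept
        for d, mgr in dept_manager
        if dept == d        # match on the shared department attribute
    ]
    print(employee_manager)   # [('alice', 'carol'), ('bob', 'dave')]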

Induction is the technique to infer information that is generalized from the database. This is higher-level information or knowledge in that it is a general statement about objects in the database. The database is searched for patterns or regularities.

Decision trees

Decision trees are a simple form of knowledge representation, and they classify examples into a finite number of classes: the nodes are labeled with attribute names, the edges are labeled with possible values for that attribute, and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, taking the edges corresponding to the values of the attributes in the object.
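
A minimal sketch of such a tree and the path-following classification it supports; the weather attributes and classes are hypothetical.

    # Internal nodes are (attribute, {value: subtree}) pairs; leaves are
    # class labels. Attributes and classes are hypothetical.
    tree = ("outlook", {
        "sunny": ("humidity", {
            "high": "dont_play",
            "normal": "play",
        }),
        "overcast": "play",
        "rainy": "dont_play",
    })

    def classify(node, obj):
        # Follow the path down the tree along edges matching obj's values.
        while isinstance(node, tuple):
            attribute, edges = node
            node = edges[obj[attribute]]
        return node

    print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # play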

Rule induction

A data mining system has to infer a model from the database; that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple (the predicted attributes), while the remaining attributes are the predicting attributes. A class can then be defined by a condition on the attributes. When the classes are defined, the system should be able to infer the rules that govern classification; in other words, the system should find the description of each class.

Production rules have been widely used to represent knowledge in expert systems, and they have the advantage of being easily interpreted by human experts because of their modularity, i.e. a single rule can be understood in isolation and doesn't need reference to other rules. The propositional structure of such rules can be summed up as if-then rules.
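
A minimal sketch of production rules in code, using hypothetical credit-risk conditions; each rule is a self-contained if-then pair that can be read in isolation, reflecting the modularity described above.

    # Each rule pairs an LHS condition with the class asserted by its RHS.
    # The conditions and classes are hypothetical.
    rules = [
        (lambda t: t["income"] == "high" and t["debt"] == "low", "good_risk"),
        (lambda t: t["income"] == "low", "bad_risk"),
    ]

    def classify(record):
        for condition, cls in rules:
            if condition(record):   # if the LHS holds, conclude the RHS
                return cls
        return "unknown"

    print(classify({"income": "high", "debt": "low"}))  # good_risk
    print(classify({"income": "low", "debt": "high"}))  # bad_risk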

Neural networks

Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the innate ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an expert in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

  • sales forecasting
  • industrial process control
  • customer research
  • data validation
  • risk management
  • target marketing etc.

Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs, which simply follow instructions in a fixed sequential order.

The issue of where the network gets its weights from is important, but suffice it to say that the network learns to reduce the error in its prediction of events already known.
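
A minimal sketch of this learning process: a single artificial neuron (a logistic unit) whose connection weights are repeatedly adjusted by gradient descent to reduce its prediction error on a toy dataset. The data, learning rate, and number of passes are hypothetical.

    import math

    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # learns OR
    weights, bias, rate = [0.0, 0.0], 0.0, 0.5

    def predict(x):
        s = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 / (1 + math.exp(-s))      # sigmoid activation

    for _ in range(1000):                  # repeated exposure to the data
        for x, target in data:
            error = predict(x) - target    # error on an event already known
            for i in range(len(weights)):  # adjust each connection weight
                weights[i] -= rate * error * x[i]
            bias -= rate * error

    for x, target in data:
        print(x, target, round(predict(x), 2))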

On-line analytical processing

A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers which are optimized for handling specific data management problems. Relational database management systems (RDBMSs) were once used for the complete spectrum of database applications; it has become apparent, however, that there are major categories of database applications which are not suitably serviced by relational database systems. One such category is on-line analytical processing (OLAP), a term coined by E. F. Codd, who defined it as:

The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data.

One question is what multidimensional data is and when it becomes OLAP. It is essentially a way to build associations between dissimilar pieces of information using predefined business rules about the information you are using. Dimensional databases are not without problems: they are not suited to storing all types of data, such as lists (for example customer addresses and purchase orders). Relational systems are also superior in security, backup, and replication services, as these tend not to be available at the same level in dimensional systems. The advantage of a dimensional system is the freedom it offers: the user is free to explore the data and receive the type of report they want without being restricted to a set format.

OLAP Example

An example OLAP database may comprise sales data which has been aggregated by region, product type, and sales channel. A typical OLAP query might access a multi-gigabyte, multi-year sales database in order to find all product sales in each region for each product type. After reviewing the results, an analyst might further refine the query to find sales volume for each sales channel within the region/product classifications. As a last step, the analyst might want to perform year-to-year or quarter-to-quarter comparisons for each sales channel. This whole process must be carried out on-line with rapid response time so that the analysis process is undisturbed. OLAP queries can be characterized as on-line transactions which:

  • access very large amounts of data, e.g. several years of sales data;
  • analyze the relationships between many types of business elements, e.g. sales, products, regions, and channels;
  • involve aggregated data, e.g. sales volumes, budgeted dollars, and dollars spent;
  • compare aggregated data over hierarchical time periods, e.g. monthly, quarterly, yearly;
  • present data in different perspectives, e.g. sales by region vs. sales by channel by product within each region;
  • involve complex calculations between data elements, e.g. expected profit as a function of sales revenue for each type of sales channel in a particular region;
  • respond quickly to user requests, so that users can pursue an analytical thought process without being slowed down by the system.
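
A minimal sketch of the first two steps of the query scenario above, assuming the pandas library is available; the sales figures are hypothetical, and a production OLAP server would precompute such aggregates rather than derive them from raw rows on demand.

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["east", "east", "west", "west", "east"],
        "product": ["tv", "radio", "tv", "radio", "tv"],
        "channel": ["retail", "web", "retail", "web", "web"],
        "amount":  [100, 40, 80, 30, 60],
    })

    # All product sales in each region for each product type.
    print(sales.pivot_table(values="amount", index="region",
                            columns="product", aggfunc="sum"))

    # Refined: sales volume per channel within region/product classes.
    print(sales.groupby(["region", "product", "channel"])["amount"].sum())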

Comparison of OLAP and OLTP

OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple.

A typical customer order entry OLTP transaction might retrieve all of the data relating to a specific customer and then insert a new order for the customer. Information is selected from the customer, customer order, and detail line tables. Each row in each table contains a customer identification number, which is used to relate the rows from the different tables. The relationships between the records are simple and only a few records are actually retrieved or updated by a single transaction.

The difference between OLAP and OLTP has been summarized as follows: OLTP systems handle mission-critical production data accessed through simple queries, while OLAP systems handle management-critical data accessed through iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require special configurations.

OLAP database systems use multidimensional structures to store data and relationships between data. Multidimensional structures can be best visualized as cubes of data, and cubes within cubes of data. Each side of the cube is considered a dimension.

Each dimension represents a different category, such as product type, region, sales channel, and time. Each cell within the multidimensional structure contains aggregated data relating elements along each of the dimensions. Multidimensional databases are a compact and easy-to-understand vehicle for visualizing and manipulating data elements that have many interrelationships.

OLAP databases support common analytical operations including: consolidation, drill-down, and "slicing and dicing".

Consolidation - involves the aggregation of data such as simple roll-ups or complex expressions involving inter-related data.

Drill-down - OLAP databases can also go in the reverse direction and automatically display the detail data of which consolidated data is comprised. This is called drill-down. Consolidation and drill-down are inherent properties of OLAP.

"Slicing and Dicing" - Slicing and dicing refers to the ability to look at the database from different viewpoints. Slicing and dicing is often performed along a time axis in order to analyze trends and find patterns.

OLAP systems should also have the means to store multidimensional data in a compressed form. Relational database designs, by contrast, concentrate on reliability and transaction processing speed rather than on decision support needs.

Data Visualization

Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data and as such can work well alongside data mining. Data mining allows the analyst to focus on certain patterns and trends and explore them in depth using visualization. On its own, data visualization can be overwhelmed by the volume of data in a database, but used in conjunction with data mining it can help with exploration.

The Primary Tasks of Data Mining

The two high-level primary goals of data mining in practice tend to be prediction and description. Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns describing the data (Fayyad, July 1996, p12). The primary tasks produce patterns where the naked eye could not possibly see them: the system predicts according to trends, and the description aspect presents the possibilities and scenarios to the user.


Conclusion

Data mining is the area of knowledge discovery that takes the transformed data and finds the patterns. The patterns found are transformed into knowledge that users can draw on to make calculated decisions. This paper defined data mining as asking for answers to questions we do not know how to ask. This is done using the data mining functions of classification, associations, sequential/temporal patterns, and clustering/segmentation. The techniques used by data mining include cluster analysis, induction, decision trees, rule induction, and neural networks. The difference between OLTP and OLAP was discussed, an example of OLAP was given, and the two were compared. Data visualization was explained, and the primary tasks of data mining were given as the prediction and description of trends.
