SUSHUMA

INTRODUCTION

1) What Motivated Data Mining? Why Is It Important?

Data mining can be viewed as a result of the evolution of information technology. The database system industry has witnessed an evolutionary path in the development of the following functionalities (shown in the figure below): data collection and database creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining). For instance, the early development of data collection and database creation mechanisms served as a prerequisite for later development of effective mechanisms for data storage and retrieval, and query and transaction processing.

Fig: Evolution of database system technology

Since the 1960s, database and information technology has been evolving from primitive file processing systems to sophisticated and powerful database systems. Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to the development of relational database systems, data modeling tools, and indexing and accessing methods. Efficient methods for on-line transaction processing (OLTP), where a query is viewed as a read-only transaction, have contributed to the evolution of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.

Database technology since the mid-1980s has been characterized by the popular adoption of relational technology and an increase of research and development activities on new and powerful database systems. These promote the development of advanced data models such as extended-relational, object-oriented, object-relational, and deductive models. Application-oriented database systems, including spatial, temporal, multimedia, active, stream, and sensor databases, scientific and engineering databases, knowledge bases, and office information bases, have flourished.

Data can now be stored in many different kinds of databases and information repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site in order to facilitate management decision making. Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP).

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become “data tombs”—data archives that are infrequently visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker’s intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into “golden nuggets” of knowledge.

2) What Is Data Mining?

Data mining refers to extracting or “mining” knowledge from large amounts of data. Note that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data”. “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data.

Another synonym for data mining is Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process is depicted in the figure below and consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. In this narrow view, data mining is only one step in the knowledge discovery process. The term is also used more broadly: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
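The seven steps can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout, the function names, and the toy "pattern" (average sale amount per city) are invented for the example and are not part of the KDD definition.

```python
from collections import defaultdict

# Two hypothetical data sources with noisy, inconsistent records.
source_a = [
    {"city": "Chicago", "amount": 120.0, "year": 2004},
    {"city": "chicago ", "amount": None, "year": 2004},  # missing value (noise)
]
source_b = [
    {"city": "Toronto", "amount": 80.0, "year": 2004},
    {"city": "Toronto", "amount": 100.0, "year": 2003},
    {"city": "Chicago", "amount": 60.0, "year": 2004},
]

def clean(records):
    """Step 1: drop records with missing amounts, normalize city names."""
    return [
        {**r, "city": r["city"].strip().title()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, year):
    """Step 3: keep only the data relevant to the analysis task."""
    return [r for r in records if r["year"] == year]

def transform(records):
    """Step 4: consolidate by grouping amounts per city."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["city"]].append(r["amount"])
    return dict(grouped)

def mine(grouped):
    """Step 5: extract a (very simple) pattern: average amount per city."""
    return {city: sum(v) / len(v) for city, v in grouped.items()}

def evaluate(patterns, threshold):
    """Step 6: keep patterns that pass an interestingness threshold."""
    return {c: avg for c, avg in patterns.items() if avg >= threshold}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{c}: average sale {avg:.2f}" for c, avg in sorted(patterns.items())]

data = select(integrate(clean(source_a), clean(source_b)), year=2004)
report = present(evaluate(mine(transform(data)), threshold=85.0))
```

In a real system each step is far richer, and the sequence is iterative rather than a single pass.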

The architecture of a typical data mining system may have the following major components (as shown in the figure below):

Fig: Architecture of a typical data mining system

· Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

· Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

· Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

· Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

· Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

· User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query.
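The concept hierarchies mentioned under the knowledge base component can be pictured as a parent lookup that lifts attribute values to a higher level of abstraction. The sketch below assumes a hypothetical location hierarchy (city, then province or state, then country); the table and the function name are illustrative only.

```python
# Hypothetical concept hierarchy for a location attribute:
# city -> province or state -> country.
PARENT = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
    "New York": "New York State",
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
    "New York State": "USA",
}

def generalize(value, levels=1):
    """Climb the hierarchy by `levels` steps, stopping at the top level."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

cities = ["Vancouver", "Toronto", "Chicago", "New York"]
by_country = [generalize(c, levels=2) for c in cities]
```

Mining on `by_country` instead of `cities` looks for patterns at the country level rather than the city level.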

3) Data Mining—On What Kind of Data?

Data mining can be performed on a number of different data repositories.

i) Relational Databases

ii) Data Warehouses

iii) Transactional Databases

iv) Advanced data and information systems

i) Relational Databases: A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

Consider the following example.

Example: A relational database for AllElectronics. The AllElectronics Company is described by the following relation tables: customer, item, employee, and branch. Fragments of the tables described here are shown in the figure below:

· The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), customer name, address, age, occupation, annual income, credit information, category, and so on.

· Similarly, each of the relations item, employee, and branch consists of a set of attributes describing their properties.

· Tables can also be used to represent the relationships between or among multiple relation tables. For our example, these include purchases (customer purchases items, creating a sales transaction that is handled by an employee), items sold (lists the items sold in a given transaction), and works at (employee works at a branch of AllElectronics).

Relational data can be accessed by database queries written in a relational query language. A query allows retrieval of specified subsets of the data. Suppose that your job is to analyze the AllElectronics data. Through the use of relational queries, you can ask things like “Show me a list of all items that were sold in the last quarter.”
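As a rough illustration of such a query, the sketch below uses Python's built-in sqlite3 module. The table layouts, column names, sample rows, and the dates chosen for "last quarter" are assumptions made for this example and are not the actual AllElectronics schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_ID TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (trans_ID TEXT PRIMARY KEY, cust_ID TEXT, date TEXT);
    CREATE TABLE items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER);

    INSERT INTO item VALUES ('I1', 'high-res TV'), ('I2', 'laptop'), ('I3', 'phone');
    INSERT INTO purchases VALUES
        ('T100', 'C1', '2004-11-03'),
        ('T200', 'C2', '2004-12-20'),
        ('T300', 'C1', '2004-06-15');
    INSERT INTO items_sold VALUES
        ('T100', 'I1', 1), ('T100', 'I2', 2), ('T200', 'I2', 1), ('T300', 'I3', 1);
""")

# "Show me a list of all items that were sold in the last quarter."
# The quarter boundaries are passed as parameters (here, Q4 of 2004).
rows = conn.execute(
    """
    SELECT DISTINCT i.name
    FROM item AS i
    JOIN items_sold AS s ON s.item_ID = i.item_ID
    JOIN purchases AS p ON p.trans_ID = s.trans_ID
    WHERE p.date BETWEEN ? AND ?
    ORDER BY i.name
    """,
    ("2004-10-01", "2004-12-31"),
).fetchall()
items_last_quarter = [name for (name,) in rows]
conn.close()
```

A query like this retrieves a specified subset of the data; it does not by itself discover trends or patterns.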

When data mining is applied to relational databases, we can go further by searching for trends or data patterns. For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.

Fig: Fragments of relations from a relational database for AllElectronics.

ii) Data Warehouses: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. The typical framework for construction and use of a data warehouse for AllElectronics is as shown below:

Fig: Data warehouse for AllElectronics

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

Example: A data cube for AllElectronics. A data cube for summarized sales data of AllElectronics is presented in Figure (a) below. The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales amount (in thousands). For example, the total sales for the first quarter, Q1, for items relating to security systems in Vancouver is $400,000, as stored in cell (Vancouver, Q1, security).

By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for on-line analytical processing, or OLAP. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in Figure (b) below.

Fig: A Multidimensional data cube
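A data cube and the roll-up operation can be imitated with a plain dictionary keyed by (city, quarter, item type). In the sketch below only the cell (Vancouver, Q1, security) = 400 comes from the example above; every other value, the city-to-country mapping, and the function names are hypothetical.

```python
from collections import defaultdict

# Base cells: (city, quarter, item_type) -> sales amount in thousands.
cube = {
    ("Vancouver", "Q1", "security"): 400,   # value taken from the example
    ("Vancouver", "Q1", "computer"): 825,   # remaining cells are made up
    ("Vancouver", "Q2", "security"): 512,
    ("Toronto", "Q1", "security"): 350,
    ("Chicago", "Q1", "computer"): 854,
}
COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_city_to_country(cells):
    """Roll-up by climbing the address hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), sales in cells.items():
        out[(COUNTRY[city], quarter, item)] += sales
    return dict(out)

def drop_dimension(cells, axis):
    """Roll-up by dimension reduction: sum out one dimension entirely."""
    out = defaultdict(int)
    for key, sales in cells.items():
        out[key[:axis] + key[axis + 1:]] += sales
    return dict(out)

by_country = roll_up_city_to_country(cube)
by_city_item = drop_dimension(cube, axis=1)  # aggregate over all quarters
```

Drill-down is the reverse direction: moving from `by_country` back to per-city cells requires the finer-grained data, which is why the base cells are kept.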

iii) Transactional Databases: A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction (such as items purchased in a store). The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson and of the branch at which the sale occurred, and so on.

Fig: Transactional Database for sales at AllElectronics

iv) Advanced Data and Information Systems:

a) Object-Relational Databases: The object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an object. Each object has associated with it the following:

· A set of variables that describe the objects. These correspond to attributes in the entity-relationship and relational models.

· A set of messages that the object can use to communicate with other objects, or with the rest of the database system.

· A set of methods, where each method holds the code to implement a message. Upon receiving a message, the method returns a value in response. For instance, the method for the message get_photo(employee) will retrieve and return a photo of the given employee object.

Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. Object classes can be organized into class/subclass hierarchies so that each class represents properties that are common to objects in that class.

b) Temporal Databases, Sequence Databases, and Time-Series Databases: A temporal database typically stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics. A sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences. A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).

c) Spatial Databases and Spatiotemporal Databases: Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps. For example, a 2-D satellite image may be represented as raster data, where each pixel registers the rainfall in a given area. Maps can be represented in vector format, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs such as points, lines, and polygons.

A spatial database that stores spatial objects that change with time is called a spatiotemporal database, from which interesting information can be mined. For example, we may be able to group the trends of moving objects and identify some strangely moving vehicles, or distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.

d) Text Databases and Multimedia Databases: Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.

Multimedia databases store image, audio, and video data. They are used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces that recognize spoken commands.

e) Heterogeneous Databases and Legacy Databases: A heterogeneous database consists of a set of interconnected, autonomous component databases. The components communicate in order to exchange information and answer queries. Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.

f) Data Streams: Many applications involve the generation and analysis of a new kind of data, called stream data, where data flow in and out of an observation platform (or window) dynamically. Such data streams have the following unique features: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast (often real-time) response time. Typical examples of data streams include various kinds of scientific and engineering data.

g) The World Wide Web: The World Wide Web provides information services in which data objects are linked together to facilitate interactive access. Users seeking information of interest traverse from one object via links to another. Such systems provide ample opportunities and challenges for data mining.

4) Data Mining Functionalities—What Kinds of Patterns Can Be Mined?

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive.

· Descriptive mining tasks characterize the general properties of the data in the database.

· Predictive mining tasks perform inference on the current data in order to make predictions.

Data mining functionalities, and the kinds of patterns they can discover, are described below.

i. Concept/Class Description: Characterization and Discrimination:

Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.

These descriptions can be derived as:

(1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms.

(2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes).

(3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of a target class of data. For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by executing a database query.

The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.

Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries. For example, the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.

The forms of output presentation are similar to those for characteristic descriptions.

ii. Mining Frequent Patterns, Associations, and Correlations:

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including item sets, subsequences, and substructures. A frequent item set typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
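To make "frequently appear together" concrete, the sketch below counts how often each pair of items co-occurs and keeps the pairs that meet a minimum support. The transactions, the threshold, and the function name are made up for illustration, and the brute-force counting is a stand-in for real frequent-pattern algorithms rather than an implementation of any of them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (each is the set of items in one purchase).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    Every candidate is counted by brute force, which is fine for toy data.
    """
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent_pairs = frequent_itemsets(transactions, size=2, min_support=0.4)
```

Here milk and bread appear together in 3 of the 5 transactions, so that pair is frequent at the chosen threshold.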

Example: Association analysis. Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions. An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) => buys(X, “software”) [support = 1%; confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule can be written simply as “computer => software [1%, 50%]”.

A data mining system may find association rules like

age(X, “20...29”) ^ income(X, “20K...29K”) => buys(X, “CD player”) [support = 2%, confidence = 60%]

Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule.

iii. Classification and Prediction:

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.

The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks (as shown in the figure below). A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.

Fig: A classification model can be represented in various forms, such as (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.
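The remark that decision trees are easily converted to classification rules can be shown with a small sketch. The tree below (attributes age and income, classes A, B, and C) is hypothetical, as are the nested-dictionary representation and the function names.

```python
# A hypothetical decision tree as nested dicts: an internal node tests one
# attribute and has one branch per outcome; a leaf is a class label (str).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "income",
            "branches": {"high": "class A", "low": "class B"},
        },
        "middle_aged": "class C",
        "senior": "class C",
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Convert every root-to-leaf path into one IF-THEN rule."""
    if not isinstance(node, dict):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {premise} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], value),))
    return rules

rules = to_rules(tree)
label = classify(tree, {"age": "youth", "income": "low"})
```

Each rule corresponds to exactly one path from the root to a leaf, so the rule set and the tree classify every record identically.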

iv. Cluster Analysis:

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.

Example: Cluster analysis. Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. The figure below shows a 2-D plot of customers with respect to customer locations in a city. Three clusters of data points are evident.

Fig: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster “center” is marked with a “+”.
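One way to obtain clusters and their centers from 2-D location data is the k-means procedure, sketched below. The text does not prescribe a particular clustering method, so k-means is used here only as a familiar example; the points and the starting centers are made up.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, clusters

# Hypothetical customer locations forming three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10), (0, 10)])
```

The returned `centers` play the role of the “+” marks in the figure, and each list in `clusters` holds the points grouped around one center.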

v. Outlier Analysis:

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.

Example: Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
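A simple way to flag unusually large charges on one account is to compare each charge with the account's typical charge. The sketch below uses the median and the median absolute deviation; this particular test, the threshold of 5, and the sample charges are assumptions chosen for illustration, not a method given in the text.

```python
from statistics import median

def flag_outliers(amounts, threshold=5.0):
    """Flag charges that lie far from the account's typical charge.

    Uses the median and the median absolute deviation (MAD), which are
    not distorted much by the very outliers we are trying to detect.
    """
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # Simplification: with no spread there is no scale to compare against.
        return []
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Hypothetical charges on a single account.
charges = [42.0, 18.5, 60.0, 35.0, 27.5, 51.0, 4800.0, 39.0]
suspicious = flag_outliers(charges)
```

The same idea extends to other attributes mentioned above, such as purchase location or frequency, by choosing a suitable distance from "regular" behavior.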

vi. Evolution Analysis:

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

Example: Evolution analysis. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to your decision making regarding stock investments.
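As a very small illustration of describing a trend over time, the sketch below smooths a price series with a moving average and labels its overall direction. Evolution analysis covers far more than this; the prices, the window size, and the function names are hypothetical.

```python
def moving_average(values, window):
    """Smooth a series with a trailing window of the given length."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def trend(values, window=3):
    """Label the series 'up', 'down', or 'flat' from its smoothed end points."""
    smoothed = moving_average(values, window)
    if smoothed[-1] > smoothed[0]:
        return "up"
    if smoothed[-1] < smoothed[0]:
        return "down"
    return "flat"

# Hypothetical daily closing prices for one stock.
prices = [10.0, 11.0, 10.5, 12.0, 12.5, 13.0]
smoothed = moving_average(prices, window=3)
direction = trend(prices)
```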

5) Are All of the Patterns Interesting?

“Are all of the patterns interesting?” This raises some serious questions for data mining. You may wonder, “What makes a pattern interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns?”

To answer the first question, a pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.

Several objective measures of pattern interestingness exist. An objective measure for association rules of the form X => Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y. Support and confidence are defined as

support(X => Y) = P(X ∪ Y)

confidence(X => Y) = P(Y | X)
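Both measures can be estimated directly from transaction counts. The sketch below assumes each transaction is a Python set of item names; the data set is constructed (hypothetically) so that the rule computer => software comes out with 1% support and 50% confidence.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(Y | X) as support(X and Y together) / support(X).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical data: 100 transactions, 2 contain a computer,
# and 1 of those 2 also contains software.
transactions = [{"computer", "software"}] + [{"computer"}] + [{"phone"}] * 98

s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
```

With these counts, `s` is 1/100 and `c` is (1/100) / (2/100), that is, one half.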

Subjective interestingness measures are based on user beliefs in the data. These measures find patterns interesting if they are unexpected (contradicting a user’s belief) or offer strategic information on which the user can act. In the latter case, such patterns are referred to as actionable. Patterns that are expected can be interesting if they confirm a hypothesis that the user wished to validate.

The second question—“Can a data mining system generate all of the interesting patterns?”—refers to the completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search. Association rule mining is an example where the use of constraints and interestingness measures can ensure the completeness of mining.

Finally, the third question—“Can a data mining system generate only interesting patterns?”— is an optimization problem in data mining. It is highly desirable for data mining systems to generate only interesting patterns. This would be much more efficient for users and data mining systems, because neither would have to search through the patterns generated in order to identify the truly interesting ones.

6) Classification of Data Mining Systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science (as shown in the figure below).

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Data mining systems can be categorized according to various criteria, as follows:

Fig: Data mining as a confluence of multiple disciplines.

i. Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria, each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.

For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, multimedia data mining system, or a World Wide Web mining system.

ii. Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

iii. Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on).

iv. Classification according to the applications adapted: Data mining systems can also be categorized according to the applications to which they are adapted. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods.

7) Data Mining Task Primitives

Each user will have a data mining task. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task primitives. The data mining primitives specify the following:

i. The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest.

ii. The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.

iii. The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction.

iv. The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence.

v. The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.

Fig: Primitives for specifying a data mining task.

A data mining query language can be designed to incorporate these primitives. There are several proposals on data mining languages and standards. We use a data mining query language known as DMQL (Data Mining Query Language), which was designed as a teaching tool, based on the above primitives. The language adopts an SQL-like syntax, so that it can easily be integrated with the relational query language, SQL.

Example: Mining classification rules. Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no less than $100. In particular, you are interested in the customer’s age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules. This data mining query is expressed in DMQL as follows:

use database AllElectronics_db

use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age

mine classification as promising_customers

in relevance to C.age, C.income, I.type, I.place_made, T.branch

from customer C, item I, transaction T

where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income >= 40,000 and I.price >= 100

group by T.cust_ID

having sum(I.price) >= 1,000

display as rules
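The from, where, group by, and having clauses of this query are ordinary relational selection, so the task-relevant customers can be found with plain SQL. The sketch below uses Python's built-in sqlite3 module on a few made-up rows. The sample data is invented, and the transaction table is renamed purchases here (an assumption) because TRANSACTION is an SQL keyword. The mining step itself (mine classification) is not implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_ID INTEGER PRIMARY KEY, age INTEGER, income INTEGER);
CREATE TABLE item (item_ID INTEGER PRIMARY KEY, type TEXT, place_made TEXT, price REAL);
CREATE TABLE purchases (cust_ID INTEGER, item_ID INTEGER, branch TEXT);

INSERT INTO customer VALUES (1, 34, 52000), (2, 45, 38000), (3, 29, 61000);
INSERT INTO item VALUES (10, 'TV', 'Japan', 900),
                        (11, 'camera', 'USA', 400),
                        (12, 'cable', 'China', 15);
INSERT INTO purchases VALUES (1, 10, 'Vancouver'), (1, 11, 'Vancouver'),
                             (2, 10, 'Toronto'),   (2, 11, 'Toronto'),
                             (3, 11, 'Toronto'),   (3, 12, 'Toronto');
""")

# Same conditions as the DMQL query: income >= 40,000, item price >= 100,
# and at least 1,000 spent in total on such items per customer.
qualifying = conn.execute("""
SELECT T.cust_ID, SUM(I.price) AS total
FROM customer C
JOIN purchases T ON C.cust_ID = T.cust_ID
JOIN item I ON I.item_ID = T.item_ID
WHERE C.income >= 40000 AND I.price >= 100
GROUP BY T.cust_ID
HAVING SUM(I.price) >= 1000
""").fetchall()
```

With this sample data only customer 1 qualifies: customer 2 fails the income condition, and customer 3 has bought only 400 worth of items priced at 100 or more.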

8) Integration of a Data Mining System with a Database or Data Warehouse System:

A critical question in the design of a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system. If a DM system works as a stand-alone system or is embedded in an application program, there are no DB or DW systems with which it has to communicate. This simple scheme is called no coupling. Possible integration schemes include no coupling, loose coupling, semitight coupling, and tight coupling.

· No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process data using some data mining algorithms, and then store the mining results in another file. Such a system has two drawbacks: 1) A DB system provides a great deal of flexibility and efficiency at storing, organizing, accessing, and processing data. Without using a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data. 2) Without any coupling of such systems, a DM system will need to use other tools to extract data, making it difficult to integrate such a system into an information processing environment. Thus, no coupling represents a poor design.

· Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of data stored in databases or data warehouses by using query processing, indexing, and other system facilities.

· Semitight coupling: Semitight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, standard deviation, and so on.

· Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of a DB or DW system. This approach is highly desirable because it facilitates efficient implementations of data mining functions, high system performance, and an integrated information processing environment.
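The difference between loose and semitight coupling can be seen in where the aggregation runs. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the DB system, and the sales table, its rows, and the function names stats_loose and stats_semitight are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Vancouver", 120.0), ("Vancouver", 80.0), ("Toronto", 200.0)],
)


def stats_loose(conn):
    """Loose coupling: the DB only hands over rows; the mining code aggregates."""
    totals = {}
    for branch, amount in conn.execute("SELECT branch, amount FROM sales"):
        count, total = totals.get(branch, (0, 0.0))
        totals[branch] = (count + 1, total + amount)
    return totals


def stats_semitight(conn):
    """Semitight coupling: count and sum are computed inside the DB engine."""
    rows = conn.execute(
        "SELECT branch, COUNT(*), SUM(amount) FROM sales GROUP BY branch"
    )
    return {branch: (count, total) for branch, count, total in rows}


loose = stats_loose(conn)
semitight = stats_semitight(conn)
```

Both functions return the same per-branch counts and sums. In the second, only one small result row per branch leaves the database, which is the kind of saving semitight coupling aims for on large tables.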

9) Major Issues in Data Mining:

The major issues in data mining are:

· Mining methodology and user interaction issues: These reflect the following:

i) Mining different kinds of knowledge in databases: Different users can be interested in different kinds of knowledge, so data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.

ii) Interactive mining of knowledge at multiple levels of abstraction: For databases containing a huge amount of data, appropriate sampling techniques can first be applied to facilitate interactive data exploration. Specifically, knowledge should be mined by drilling down, rolling up, and pivoting through the data space.

iii) Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction.

iv) Data mining query languages and ad hoc data mining: Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns.

v) Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms so that the knowledge can be easily understood and directly usable by humans. This requires the system to adopt expressive knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs, matrices, or curves.

vi) Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required.

vii) Pattern evaluation (the interestingness problem): A data mining system can uncover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user, because they either represent common knowledge or lack novelty.
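One common way to cut down the number of reported patterns is to apply the support and confidence measures mentioned earlier for association rules. The following is a minimal sketch on four made-up transactions. The function names and the threshold values are assumptions chosen for illustration.

```python
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule passes only if it meets both user-specified thresholds."""
    return (
        support(antecedent | consequent, transactions) >= min_support
        and confidence(antecedent, consequent, transactions) >= min_confidence
    )


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

For the rule computer => software, two of the four transactions contain both items, so the support is 0.5. Three transactions contain computer, so the confidence is 2/3. The rule passes a 0.6 confidence threshold and fails a 0.7 one.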

· Performance issues: These include efficiency, scalability, and parallelization of data mining algorithms.

i) Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable. In other words, the running time of a data mining algorithm must be predictable and acceptable in large databases.

ii) Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged. The high cost of some data mining processes promotes the need for incremental data mining algorithms that incorporate database updates without having to mine the entire data again “from scratch.” Such algorithms perform knowledge modification incrementally.
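The partition, merge, and incremental update steps can be sketched with simple item counting. This is a toy illustration, not a full mining algorithm: the function names are hypothetical, and a thread pool is used only to show the structure (in CPython, threads do not speed up CPU-bound counting).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, per item, how many transactions in this partition contain it."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def mine_partitioned(transactions, n_partitions=2):
    """Split the data, count each partition separately, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)
    return merged


def update_incrementally(counts, new_transactions):
    """Fold new transactions into existing counts without rescanning old data."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated


data = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
merged_counts = mine_partitioned(data)
updated_counts = update_incrementally(merged_counts, [["c"]])
```

Merging works here because item counts from disjoint partitions simply add up. Methods whose partial results do not combine this easily need a more careful merge step.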

· Issues relating to the diversity of database types:

i) Handling of relational and complex types of data: Because relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important.

ii) Mining information from heterogeneous databases and global information systems: Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous databases. The discovery of knowledge from different sources of structured, semistructured, or unstructured data with diverse data semantics poses great challenges to data mining.
