Today, in many physical phenomena, the underlying first principles may not be known with certainty, or the system under study is so complex that a mathematical formulation is difficult to develop. However, with the growing use of computers, it has become possible to generate a great amount of data for such systems. In such cases, instead of developing models from first principles, this readily available data can be used to derive models by extracting meaningful relationships between a system's variables (i.e., unknown input-output dependencies). Complex, information-rich data sets are becoming common to all fields of business, science, and engineering. The ability to extract useful knowledge hidden in these data, and to act on that knowledge, is becoming increasingly important in today's competitive world. The entire process of applying a computer-based methodology for extracting knowledge or information from data is called data mining. Data mining is an iterative process: a search for new, valuable, and nontrivial information in large volumes of data.
Basically, the objective of data mining is either prediction or description. Predictive data mining uses some variables or fields in the data set to classify, predict, or estimate unknown or future values of the variables of interest. Descriptive data mining, on the other hand, finds interpretable patterns and relationships in the data. Most data-mining activities therefore fall into one of these two categories.
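To make the distinction concrete, the following is a minimal sketch (assuming scikit-learn and NumPy are available, and using made-up data, so it is an illustration rather than a prescribed method): a decision tree stands in for predictive mining, since it learns from labelled records to predict new ones, while k-means clustering stands in for descriptive mining, since it groups records without using any label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up data: two input variables; a class label is known for the first 50 records.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Predictive mining: learn from labelled records, then predict labels for new ones.
clf = DecisionTreeClassifier(max_depth=3).fit(X[:50], y[:50])
predicted = clf.predict(X[50:])

# Descriptive mining: discover groupings in the data without using any label.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("predicted labels:", predicted[:10])
print("discovered clusters:", clusters[:10])
```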
Today, data-mining applications are available for computer systems of all sizes, viz., mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million per terabyte for the largest.
The success of a data-mining project depends largely on the skill, knowledge, and ingenuity of the analyst; it is essentially like solving a puzzle. Data mining is one of the fastest growing areas of science and engineering. Starting from computer science and statistics, it has quickly expanded into a field of its own. One of its greatest strengths is the wide range of methodologies and techniques it has embraced and applied to a host of problems.
Data mining has its origins in various disciplines, of which the two most important are statistics and machine learning. Statistics brings an emphasis on mathematical rigor, a need to establish that something is sensible on theoretical grounds before testing it in practice. The machine-learning community, on the other hand, has its origins in computer technology. This has led to a practical orientation, a willingness to test something out to see how well it performs, without waiting for a formal proof of effectiveness.
Basic modeling principles in data mining also have roots in control theory, which is primarily applied to engineering systems and industrial processes. The problem of determining a mathematical model for an unknown system by observing its input-output data pairs is generally referred to as system identification. The purpose of system identification, from the point of view of data mining, is to predict a system's behavior and to explain the interactions and relationships between the variables of a system. System identification generally involves two top-down steps:
- Structure identification - In this step, a priori knowledge about the system is applied to determine a class of models within which the search for the most suitable model can be conducted. Usually this class of models is denoted by a parameterized function y = f(u,t), where y is the model's output, u is an input vector, and t is a parameter vector.
- Parameter identification - In this step, once the structure of the model is known, we apply optimization techniques to determine the parameter vector t* such that the resulting model y* = f(u, t*) describes the system appropriately.
In general, system identification is not a one-pass process: both structure and parameter identification need to be applied repeatedly until a satisfactory model is developed, as illustrated in the sketch below.
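As a hedged illustration of the two steps (assuming NumPy and SciPy, and using synthetic observations in place of real measurements), the sketch below fixes the model structure to an exponential decay and then lets an optimizer identify the parameter vector:

```python
import numpy as np
from scipy.optimize import curve_fit

# Structure identification: a priori knowledge fixes the model class to an
# exponential decay y = f(u, t) with parameter vector t = (a, b).
def f(u, a, b):
    return a * np.exp(-b * u)

# Observed input-output pairs (synthetic here; normally measured on the system).
u = np.linspace(0.0, 5.0, 50)
y = f(u, 2.0, 0.7) + np.random.default_rng(1).normal(scale=0.05, size=u.size)

# Parameter identification: optimization finds t* so that y* = f(u, t*)
# reproduces the observed behaviour.
t_star, _ = curve_fit(f, u, y)
print("estimated parameter vector t* =", t_star)
```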
Several types of analytical software are available: statistical, machine-learning, and neural-network packages. The idea is to seek any of four types of relationships:
- Classes: Stored data is used to locate data in predetermined groups.
- Clusters: Data items are grouped according to logical relationships or the user's preferences.
- Associations: Data can be mined to identify associations between events or items, for example products that are frequently purchased together (see the sketch after this list).
- Sequential patterns: Data is mined to anticipate behavior patterns and trends.
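As a minimal, simplified sketch of association mining (plain Python with made-up market-basket data; full association-rule algorithms such as Apriori also compute confidence and prune candidate sets), one can count how often item pairs co-occur and keep those above a support threshold:

```python
from collections import Counter
from itertools import combinations

# Made-up market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Count how often each pair of items occurs together (its support count).
pair_counts = Counter()
for items in transactions:
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Keep pairs whose support meets a chosen threshold: candidate associations.
min_support = 3
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, "occur together in", count, "transactions")
```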
Data mining consists of five major elements, illustrated end to end in the sketch after this list:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data by application software.
- Present the data in a useful format, such as a graph or table.
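A minimal end-to-end sketch of these five elements, assuming pandas and Python's built-in sqlite3 module, and using a few fabricated transaction rows in place of a real warehouse, might look like this:

```python
import sqlite3
import pandas as pd

# Extract: transaction data would normally come from operational systems;
# here a few rows are fabricated for illustration.
raw = pd.DataFrame({
    "date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
    "store": ["A", "B", "A"],
    "sales": [120.0, 85.5, 97.2],
})

# Transform: parse dates and derive a month column.
raw["date"] = pd.to_datetime(raw["date"])
raw["month"] = raw["date"].dt.to_period("M").astype(str)

# Load / store: write the cleaned data into an (in-memory) database
# standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
raw.to_sql("sales", conn, index=False)

# Access / analyze: analysts query and aggregate the stored data.
summary = pd.read_sql(
    "SELECT store, SUM(sales) AS total FROM sales GROUP BY store", conn
)

# Present: output the result in a readable tabular form.
print(summary.to_string(index=False))
```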
The manual extraction of patterns from data has gone on for centuries. Early methods of identifying patterns in data included Bayes' theorem (1700s) and the well-known regression analysis (1800s). The proliferation, ubiquity, and increasing power of computers have expanded data collection and storage capabilities. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other techniques from computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1980s). Nearest-neighbour methods, rule induction (the extraction of useful if-then rules from data based on statistical significance), and data visualization are among the newer techniques used in data mining.
Data mining today is the process of applying these methods to data with the intention of uncovering hidden patterns. It has been used for many years by businesses, scientists, engineers, and governments to sift through volumes of data.
Realizing the importance of data mining to the field of reliability and risk, Professor Krishna B. Misra, Editor-in-Chief of IJPE, requested me to bring out a special issue on data mining as applied to reliability and risk. Invitations were sent out to several researchers active in this field, and the result of the exercise is that we received only four papers relating to data-mining principles for this issue. However, it is hoped that these papers will act as a catalyst, generating further interest among researchers and readers to augment the reliability and risk literature with data-mining principles.
Claudio M. Rocco received his B.Sc. in Electrical Engineering and M.Sc. in Electrical Engineering (Power Systems) from Universidad Central de Venezuela and his Ph.D. from The Robert Gordon University, Aberdeen, Scotland, U.K. He is a full professor at Universidad Central de Venezuela, where he teaches post-graduate courses in Operational Research. He is a member of the editorial board of the International Journal of Performability Engineering.