Classification, Regression, Clustering and Association Rules
Classification and regression models, which are used to predict future outcomes from existing data and are the most widely used data mining techniques, differ mainly in the type of dependent variable being estimated: it is either categorical or continuous [1]. In classification the predicted dependent variable is categorical, while in regression it is continuous.
For example, a classification model can be built to categorize bank loan applications as safe or risky, while a regression model can be built to estimate, given their profession, the income and spending of potential customers of computer products. However, with techniques such as multinomial logistic regression, which allow categorical values to be estimated, the two kinds of model increasingly converge, and as a result it is often possible to use the same techniques for both [1].
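As a minimal sketch of this distinction (scikit-learn with made-up values, used purely for illustration and not taken from the cited studies), the same predictor variables can feed either kind of model; only the type of the target changes:

```python
# Minimal sketch: same predictors, categorical vs. continuous target (made-up values).
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[25, 1500], [40, 5200], [35, 3100], [52, 7800]]   # e.g. age, monthly income

y_class = ["risky", "safe", "risky", "safe"]            # categorical target -> classification
y_cont = [300.0, 1250.0, 700.0, 2100.0]                 # continuous target -> regression

clf = LogisticRegression(max_iter=1000).fit(X, y_class) # logistic regression classifier
reg = LinearRegression().fit(X, y_cont)                 # linear regression

print(clf.predict([[30, 2000]]))   # predicts a class label ("risky"/"safe")
print(reg.predict([[30, 2000]]))   # predicts a numeric value
```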
Classification is a technique that partitions data according to predetermined outputs. Because the outputs are known in advance, classification learns from the data set in a supervised manner [2].
For example, financial services company A wants to find out whether its customers are interested in a new investment opportunity. It has previously sold a similar product, and its historical data show which customers responded to the earlier offer. The goal is to determine the characteristics of the customers who respond to such an offer and thus carry out marketing and sales activities more effectively. The customer records contain a “yes” / “no” field indicating whether the customer responded to the previous offer. This field is called the “target” or “dependent” variable. The aim is to analyze the effects of the customers’ other characteristics (income level, job type, age, marital status, number of years as a customer, other products and investment types purchased) on the target variable. The other features in the analysis are called “independent” or “predictor” variables [3].
The main techniques used in classification and regression models are as follows [3]:
- Decision Trees,
- Artificial Neural Networks,
- Genetic Algorithms,
- K-Nearest Neighbor,
- Regression Analysis,
- Naive-Bayes,
- Rough Sets.
Decision trees
Decision trees are the most widely used classification technique in data mining because they are inexpensive to build, easy to interpret, integrate easily with database systems, and offer good reliability. A decision tree is, as the name suggests, a predictive technique in the form of a tree. Decision trees are an easy-to-understand technique consisting of nodes and branches. Each branch in the decision tree carries a certain probability, so probabilities can be computed from the leaf branches back to the root or to any desired point.
It is also one of the most popular segmentation methods because its structure is easily converted into SQL queries. The most commonly used decision tree methods are as follows [4]; a minimal usage sketch follows the list:
- CHAID (Chi-Squared Automatic Interaction Detector, Kass, 1980),
- C&RT (Classification and Regression Trees, Breiman and Friedman, 1984),
- ID3 (Induction of Decision Trees, Quinlan, 1986),
- C4.5 (Quinlan, 1993),
- C5.0.
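To illustrate how such a tree is built and read, the following minimal sketch uses scikit-learn's CART-style `DecisionTreeClassifier` on made-up loan-application data; the feature names and values are illustrative assumptions, not taken from the references:

```python
# Minimal decision-tree sketch on made-up loan-application data (CART-style tree).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income (thousands), years_as_customer, has_other_products (0/1)]; illustrative only
X = [[20, 1, 0], [85, 7, 1], [40, 3, 0], [95, 10, 1], [30, 2, 1], [70, 6, 0]]
y = ["risky", "safe", "risky", "safe", "risky", "safe"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The fitted tree can be printed as readable rules (its nodes and branches).
print(export_text(tree, feature_names=["income", "years", "other_products"]))
print(tree.predict([[50, 4, 1]]))  # classify a new application
```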
Artificial neural networks
Artificial neural networks (ANN) are a technology developed essentially by modeling the human brain. Nerve cells are the basis of all human behaviors such as learning, remembering and thinking. The human brain is estimated to contain on the order of 10^11 nerve cells, with an enormous number of nerve-to-nerve connections between them, called synapses. ANNs are designed to build systems that, like the human brain, can generate new information, discover, think and observe without assistance. It can be said that an ANN’s computation and information-processing power derives from its parallel distributed structure and its ability to learn and generalize. Generalization is defined as the ANN producing appropriate responses to inputs that were not encountered during the training or learning process. These properties show the ability of ANNs to solve complex problems [5].
This method examines patterns in the data to see whether they match a certain profile, and improves the system through a learning process. The learning algorithms used in artificial neural networks compute the connection weights between units from the data. Unlike statistical methods, artificial neural networks do not assume a parametric model for the data; in other words, their field of application is wider, and they do not require as much processing and memory as memory-based methods [6]. An artificial neural network is created for a specific purpose and, like humans, learns through examples. Artificial neural networks change their own structure and weights through repeated inputs and have an adaptable structure, much like the nervous system of living things. Artificial neural networks can establish relationships between pieces of information as well as learn.
The basic functions of artificial neural networks can be specified as follows [7]:
- Prediction or forecasting: Future sales, weather forecasts, horse racing, environmental risk,…
- Classification and Clustering: Customer profiles, medical diagnosis, voice and shape recognition, cell types …
- Control: Sound and vibration levels in aircraft for early warning, …
It can also be used for Data Association, Data Conceptualization and Data Filtering.
Artificial neural networks also have specialized application areas such as industrial applications, financial applications, military and defense applications, medical and health applications, engineering applications, robotics, image processing, pattern recognition, the communication industry, entertainment, and forecasting [7].
An artificial neural network cell basically consists of inputs, weights, a summation function, an activation function and an output.
Artificial neural networks (ANN) were developed inspired by the human brain. They are information-processing structures connected by links of varying weights and composed of processing elements, each with its own memory. As described above, an ANN produces an output by applying the weights and an activation function to the input it receives [8].
In general, the weight values are adjusted automatically according to a specified learning rule as the output values corresponding to a given input set are presented. After training is completed, the trained network can predict the outcome for any given data set according to the final state of the weight values. An artificial neural network is formed by connecting a number of nerve cells in feed-forward and feedback connection patterns. Today, many artificial neural network models (such as Perceptron, Adaline, MLP, LVQ, Hopfield, Recurrent, SOM, ART and PCA) have been developed for specific purposes and for use in different fields. Among the learning types, supervised learning, unsupervised learning, reinforcement learning and semi-supervised strategies are used [7].
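As a minimal sketch of the cell described above, the following NumPy snippet applies made-up weights, a summation function and an activation function to a small input vector; no training or learning rule is shown:

```python
# Minimal sketch of a single artificial neuron: weighted sum of inputs passed
# through an activation function. Weights and inputs are made-up values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 0.2, 0.8])      # x1, x2, x3
weights = np.array([0.4, -0.6, 0.9])    # one weight per input
bias = 0.1

net = np.dot(inputs, weights) + bias    # summation function
output = sigmoid(net)                   # activation function
print(output)
```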
Genetic algorithms
The algorithm starts with a solution set (derived from the learning data set) called the population. The results from one population are used to create a new population that is expected to be better than the previous one. When the evolution process is completed, dependency rules or class models are produced [9].
It is a type of machine learning system inspired by biological systems, like artificial neural networks. These algorithms, which model the theory of evolution, simulate natural selection in a computer environment, based on the survival of the individuals that adapt best to natural conditions [10].
Genetic algorithms can be used for classification as well as optimization problems. They can be used in data mining to evaluate the suitability of other algorithms [11].
Because a genetic algorithm produces a solution set consisting of different solutions instead of a single solution to a problem, many points in the search space are evaluated at the same time, which increases the chance of reaching a global solution. The solutions in the solution set are completely independent of each other; each is a vector in a multidimensional space. The cluster representing the many possible solutions to a problem is called the population in genetic algorithm terminology. Populations consist of sequences of numbers called vectors or individuals. Each element of an individual is called a gene. The individuals in the population are determined by the genetic algorithm's operators during the evolutionary process [5].
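The evolutionary loop described above can be sketched in a few lines of plain Python. The toy fitness function (maximizing the number of 1-bits in an individual) and all parameter values below are illustrative assumptions, not a method from the cited works:

```python
# Minimal genetic-algorithm sketch: evolve bit-string individuals toward a toy
# fitness function (number of 1-bits). All parameters are illustrative.
import random

GENES, POP_SIZE, GENERATIONS = 12, 20, 40

def fitness(ind):                 # toy objective: count of 1s in the individual
    return sum(ind)

def crossover(a, b):              # single-point crossover of two parents
    point = random.randint(1, GENES - 1)
    return a[:point] + b[point:]

def mutate(ind, rate=0.05):       # flip each gene with a small probability
    return [1 - g if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # select the fitter half as parents, then refill the population with offspring
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print(max(population, key=fitness))   # best individual found
```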
K-nearest neighbor
Another technique used for classification in data mining is the k-nearest neighbor algorithm, which is based on learning by analogy. In this technique, all samples are stored in a pattern space. To determine which class an unknown sample belongs to, the algorithm searches the pattern space and finds the k samples closest to it. Proximity is defined by the Euclidean distance. The unknown sample is then assigned to the class it most resembles among its k closest neighbors. The k-nearest neighbor algorithm can also be used to estimate a real value for the unknown sample.
This method is preferred because it is fast and efficient for data sets whose size is known [12]. It is a classification technique used especially in large databases. It is based on the logic of assigning the object to be classified to the class that has the most members among its K nearest objects [13].
In this method, training examples are described by n-dimensional numerical attributes, so each example corresponds to a point in n-dimensional space and all training samples are stored in this n-dimensional sample space. The k training samples closest to the unknown sample are its nearest neighbors, where closeness between X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is measured by the Euclidean distance d(X, Y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²). The unknown sample is assigned to the most common class among its k neighbors; when k = 1, it is given the class of the single closest training instance in the sample space. Nearest-neighbor classifiers are instance-based or lazy learners: they store all training samples and do not construct a classifier until a new sample needs to be classified. Lazy learners can be computationally expensive when the number of potential neighbors to compare against is large, so effective indexing techniques are required. As expected, lazy-learning methods are faster than eager methods during training, but classification is slower because all computation is deferred until then. The nearest-neighbor classifier can also be used for estimation [11].
It is among the supervised learning methods that solve the classification problem. In this method, the similarity of the data to be classified to the normal behavior data in the learning set is calculated, and the data are assigned to classes according to a threshold value determined from the average of the k data points considered closest. What matters is that the properties of each class are clearly defined in advance. The performance of the method is affected by the number of nearest neighbors, the threshold value, the similarity measure, and whether the learning set contains an adequate number of normal behaviors [14].
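A minimal sketch of the procedure described above, in plain Python with made-up points: compute the Euclidean distance from the unknown sample to every stored sample, then take a majority vote among the k closest:

```python
# Minimal k-nearest-neighbor sketch: Euclidean distance + majority vote.
# Training points and the query point are made up for illustration.
import math
from collections import Counter

train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(query, k=3):
    # sort stored samples by distance to the query (lazy learning: no model is built)
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_classify((1.2, 1.9)))   # -> "A"
```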
Regression analysis
Regression analysis is a method that mathematically models the relationship between one or more independent variables and the target variable. In linear regression, one of the regression models commonly used in data mining, the target variable to be predicted takes a continuous value, whereas in logistic regression it takes a discrete value. Linear regression estimates the value of the target variable, while logistic regression estimates the probability that the target variable takes one of its possible values [15].
There are linear, nonlinear and logistic modeling alternatives. The value of the variable to be predicted, called the dependent variable, is determined by a weighted combination of the predictor variables, called independent variables [4].
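A minimal sketch of the two variants on made-up data: linear regression predicts a continuous target as a weighted combination of the independent variables, while logistic regression predicts the probability of a class:

```python
# Minimal regression sketch: continuous target (linear) vs. class probability (logistic).
# All numbers are made up for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[22, 1], [35, 3], [47, 8], [58, 12]]      # e.g. age, years as a customer

linreg = LinearRegression().fit(X, [400.0, 900.0, 1600.0, 2300.0])        # continuous target
logreg = LogisticRegression(max_iter=1000).fit(X, [0, 0, 1, 1])           # binary target

print(linreg.coef_, linreg.intercept_)        # learned weights of the independent variables
print(linreg.predict([[40, 5]]))              # predicted continuous value
print(logreg.predict_proba([[40, 5]]))        # probability of each class
```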
Naive-Bayes
The Naive Bayes algorithm is based on calculating, as probabilities, the effect of each criterion on the result. For example, it is frequently used in the health sector to evaluate the probability that a person has a disease, given the person’s test results.
Naive Bayes is a predictive and descriptive classification algorithm that analyzes the relationship between the target variable and the independent variables. It does not work with continuous data, so dependent or independent variables containing continuous values should first be categorized; for example, if age is one of the independent variables, its continuous values should be converted into age ranges such as “<20”, “21–30”, “31–40”. While learning the model, Naive Bayes counts how many times each output occurs in the learning set. This value is called the prior probability. For example, suppose a bank wants to group credit card applications into “good” and “bad” risk classes. If the good-risk outcome occurred 2 times out of a total of 5 cases, the prior probability for good risk is 0.4. This is interpreted as: “If nothing is known about someone applying for a credit card, that person is in the good risk group with probability 0.4.” Naive Bayes also finds the frequency of occurrence of each independent variable / dependent variable combination, and these frequencies are combined with the prior probabilities to make predictions [3].
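A minimal sketch of this counting, reusing the made-up five-case credit-risk setting from the text; the attribute values themselves are illustrative assumptions:

```python
# Minimal Naive Bayes sketch: prior probabilities and per-attribute frequencies
# counted from a tiny made-up credit-risk learning set (5 cases, 2 "good").
from collections import Counter, defaultdict

# (age_range, job_type) -> risk class; values are illustrative only
cases = [(("21-30", "salaried"), "good"),
         (("31-40", "salaried"), "good"),
         (("<20",   "student"),  "bad"),
         (("21-30", "student"),  "bad"),
         (("31-40", "self-emp"), "bad")]

class_counts = Counter(label for _, label in cases)
priors = {c: n / len(cases) for c, n in class_counts.items()}   # e.g. P(good) = 0.4
cond = defaultdict(Counter)                                     # counts of (attribute, value) per class
for (age, job), label in cases:
    cond[label].update([("age", age), ("job", job)])

def score(age, job, label):
    # naive assumption: attributes are independent given the class
    p = priors[label]
    for attr in [("age", age), ("job", job)]:
        p *= cond[label][attr] / class_counts[label]
    return p

print(priors)                                                   # {'good': 0.4, 'bad': 0.6}
print(score("21-30", "salaried", "good"), score("21-30", "salaried", "bad"))
```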
Rough sets
Rough set theory was developed by Pawlak in the early 1980s. In rough set theory there is an approximation space and the lower and upper approximations of a set. The approximation space partitions the area of interest into disjoint categories. The lower approximation contains the objects that are known with certainty to belong to a particular subset, while the upper approximation contains the objects that possibly belong to it. A set whose upper and lower approximations differ, that is, whose boundary region between them is non-empty, is called a “rough set” [16].
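A minimal sketch of these approximations, assuming a toy indiscernibility relation given directly as equivalence classes; the objects and the target set are made up:

```python
# Minimal rough-set sketch: lower and upper approximations of a target set X
# with respect to made-up equivalence (indiscernibility) classes.
equivalence_classes = [{1, 2}, {3, 4, 5}, {6}, {7, 8}]
X = {1, 2, 3, 6}                      # the set we want to approximate

lower = set().union(*(c for c in equivalence_classes if c <= X))   # certainly in X
upper = set().union(*(c for c in equivalence_classes if c & X))    # possibly in X
boundary = upper - lower

print(lower)     # {1, 2, 6}
print(upper)     # {1, 2, 3, 4, 5, 6}
print(boundary)  # {3, 4, 5}  -> non-empty boundary: X is a rough set
```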
Clustering
The process of grouping objects together with similar objects is called clustering. Cluster analysis is a group of multivariate techniques whose main purpose is to group objects (units) on the basis of their characteristics. Clustering analysis is of great importance for efficient and reliable analysis of the available data. For example, suppose a study is to be carried out to derive profiles of the cities in Turkey. It is questionable how reliable it is to compare a city whose income is based on agriculture with cities whose income is based on industry; likewise, it is wrong to compare cities with populations in the millions with cities with populations in the hundreds of thousands. Cities showing similar characteristics according to the determined criteria are therefore gathered into a group, and the analysis is made within each group.
For example, comparing Hakkari with cities whose profiles may be similar, such as Siirt, Batman and Muş, rather than with Ankara, will give much more reliable results [6].
While forming clusters, the similarity between objects within a cluster should be as large as possible and the similarity between clusters as small as possible. In other words, the purpose of cluster analysis is to separate the existing data into groups that are internally homogeneous and heterogeneous with respect to one another. Cluster analysis is often used in areas such as market research and gene research, where interesting qualitative relationships can be found in the data; for example, genes with similar characteristics can be placed in the same cluster in the medical field.
In clustering models, the aim is to find clusters whose members are very similar to each other but whose characteristics differ greatly from those of other clusters, and to divide the records in the database into these clusters. Which clusters the records will be divided into, and according to which variables, can be specified by a domain expert, or computer programs can be developed to determine this automatically.
Typical clustering applications include discovering distinct customer groups in markets and revealing the shopping patterns of these groups, deriving plant and animal classifications and grouping genes with similar functions in biology, and dividing houses into groups according to their type, value and geographic location in urban planning. Clustering can also be used to classify documents for information discovery on the Web [17].
Data clustering is developing rapidly. In proportion to the increasing amount of data collected in databases, cluster analysis has recently become an active topic in data mining research. There are many clustering algorithms in the literature, and the choice of algorithm depends on the type of data and the purpose of the analysis.
Generally, the main clustering methods can be classified as follows [1]:
- Partitioning methods,
- Hierarchical methods,
- Density-based methods,
- Grid-based methods,
- Model-based methods.
In partitioning methods, n denotes the number of objects in the database and k the number of clusters to be created. A partitioning algorithm divides the n objects into k clusters (k ≤ n). Since the clusters are formed according to an objective partitioning criterion, objects in the same cluster are similar to each other, while they differ from objects in other clusters. The most common partitioning method is the k-means method [1].
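A minimal sketch of k-means with scikit-learn on made-up two-dimensional points; the choice of k = 2 is an illustrative assumption:

```python
# Minimal k-means sketch: partition n made-up 2-D points into k clusters.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [2, 3],        # one apparent group
          [8, 8], [9, 10], [8, 9]]       # another apparent group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # the k cluster centroids
```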
Hierarchical methods are based on grouping data objects into a tree of clusters. Hierarchical clustering methods can be classified as agglomerative and divisive hierarchical clustering depending on whether the hierarchical decomposition is bottom-up or top-down [18].
In agglomerative hierarchical clustering, the hierarchical decomposition proceeds from the bottom up: first each object forms its own cluster, and then these atomic clusters are merged into larger and larger clusters until all objects are gathered in a single cluster. In divisive hierarchical clustering, the decomposition proceeds from the top down: first all objects are in one cluster, and clusters are split into smaller pieces until each object forms a cluster on its own [18].
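A minimal sketch of the bottom-up (agglomerative) variant with SciPy: each point starts as its own cluster, the closest clusters are merged step by step, and the resulting tree is then cut at a chosen number of clusters; the points and parameters are made up:

```python
# Minimal agglomerative (bottom-up) hierarchical clustering sketch with SciPy.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]   # made-up data

# linkage() records the sequence of merges from single points up to one cluster
merges = linkage(points, method="average", metric="euclidean")

# cut the tree so that at most 2 clusters remain
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```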
Association rules
Association rules uncover relationships of co-occurrence within large data sets. As the amount of collected and stored data grows day by day, companies want to discover the association rules hidden in their databases. Discovering interesting association relationships from large volumes of business transaction records makes companies’ decision-making processes more efficient.
The most typical example of where association rules are used is the market basket application. This analysis examines customers’ purchasing habits by finding associations between the products in their purchases. Discovering such associations reveals which products customers buy together, and market managers can develop more effective sales strategies in the light of this information.
For example, if a customer buys milk, what is the probability of buying bread with milk in the same purchase? Market managers who organize shelves in the light of this type of information can increase the sales rate of their products. For example, if a supermarket has a high percentage of customers who buy bread with milk, market managers can increase their bread sales by putting milk and bread racks side by side.
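The strength of such a rule is measured with support and confidence. The sketch below computes both for the rule “milk → bread” on made-up baskets; the numbers are illustrative and not taken from the text:

```python
# Minimal support/confidence sketch for the rule "milk -> bread" on made-up baskets.
baskets = [{"milk", "bread", "eggs"},
           {"milk", "bread"},
           {"milk", "cheese"},
           {"bread", "butter"},
           {"milk", "bread", "butter"}]

n = len(baskets)
milk = sum(1 for b in baskets if "milk" in b)
milk_and_bread = sum(1 for b in baskets if {"milk", "bread"} <= b)

support = milk_and_bread / n          # P(milk and bread) = 3/5
confidence = milk_and_bread / milk    # P(bread | milk)   = 3/4
print(support, confidence)
```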
The mathematical model of association rules was presented by Agrawal, Imielinski and Swami in 1993 [19]. In this model, the set I = {i1, i2, …, im} is called the set of items (products). D denotes the set of all transactions in the database, and T denotes an individual transaction, i.e. a set of products purchased together. TID is the unique identifier of each transaction [20].
Examples of association rules can be:
- “When customers buy beer, there is a 75% probability that they also buy diapers.”
- “Customers who buy low-fat cheese and skim milk are 85% likely to buy diet milk.”
Sequential analysis, on the other hand, is used to identify relationships that are related but occur in successive periods. Examples of sequential analysis are as follows:
- “10% of customers who buy tents buy backpacks within a month.”
- “If share A rises by 15%, within three days there is a probability of 60% that …”
Some of the algorithms developed for association rules are AIS, SETM, Apriori, Partition, RARM and CHARM.
Among these algorithms, AIS was the first, and the best known is the Apriori algorithm [20].
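As a minimal sketch of the core Apriori idea (counting candidate itemsets level by level and keeping only those whose support reaches a minimum threshold), the following plain-Python example uses made-up transactions and an illustrative support threshold:

```python
# Minimal Apriori-style sketch: find frequent itemsets level by level,
# extending only itemsets that were frequent at the previous level.
transactions = [{"milk", "bread"}, {"milk", "diapers", "beer"},
                {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"},
                {"milk", "bread", "diapers"}]
min_support = 0.6                                 # illustrative threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
level = frequent
while level:
    # candidates one item larger, built from frequent itemsets of the previous level
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

for itemset in frequent:
    print(set(itemset), support(itemset))
```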
REFERENCES
[1] İnternet: Hacettepe Üniversitesi “Veri Madenciliğine Giriş” http://yunus.hacettepe.edu.tr/~hcingi/ist376a/6Bolum.doc (2011)
[2] Giudici, P., “Applied Data Mining: Statistical Methods for Business and Industry”, John Wiley & Sons Inc., Chichester, 85–100 (2003).
[3] Akbulut, S., “Veri madenciliği teknikleri ile bir kozmetik markanın ayrılan müşteri analizi ve müşteri segmentasyonu”, Yüksek Lisans Tezi, Gazi Üniversitesi Fen Bilimleri Enstitüsü, Ankara, 1–25 (2006).
[4] Koyuncugil, A.S. and Özgülbaş, N., “Veri madenciliği: tıp ve sağlık hizmetlerinde kullanımı ve uygulamaları”, Bilişim Teknolojileri Dergisi, 2(2): 21–32 (2009).
[5] Ayık, Y.Z., Özdemir, A. and Yavuz, U., “Lise türü ve lise mezuniyet başarısının kazanılan fakülte ile ilişkisinin veri madenciliği tekniği ile analizi”, Sosyal Bilimler Enstitüsü Dergisi, 10(2): 441–454 (2007).
[6] Albayrak, M., “EEG sinyallerindeki epileptiform aktivitenin veri madenciliği süreci ile tespiti”, Doktora Tezi, Sakarya Üniversitesi Fen Bilimleri Enstitüsü, Sakarya, 56–70 (2008).
[7] Uğur, A. and Kınacı, A.C., “Yapay zeka teknikleri ve yapay sinir ağları kullanılarak web sayfalarının sınıflandırılması”, Inet-tr 2006, XI. Türkiye’de İnternet Konferansı, TOBB Ekonomi ve Teknoloji Üniversitesi, Ankara, (2006).
[8] Savaş, S., Topaloğlu, N., Kazcı, Ö. et al., “Classification of Carotid Artery Intima Media Thickness Ultrasound Images with Deep Learning”, Journal of Medical Systems, 43: 273 (2019). https://doi.org/10.1007/s10916-019-1406-2
[9] Shah, S. and Kusiak, A., “Data mining and genetic algorithms based gene/SNP selection”, Artificial Intelligence in Medicine, 31: 183–196 (2004).
[10] Çiftci, S., “Uzaktan eğitimde öğrencilerin ders çalışma etkinliklerinin log verilerinin analiz edilerek incelenmesi”, Yüksek Lisans Tezi, Gazi Üniversitesi Eğitim Bilimleri Enstitüsü, Ankara, 1–5 (2006).
[11] Han, J. and Kamber, M., “Data Mining: Concepts and Techniques”, Morgan Kaufmann, San Francisco, USA, 45–53 (2001).
[12] Larose, D.T., “Discovering Knowledge in Data: An Introduction to Data Mining”, John Wiley & Sons Inc., 42–70 (2005).
[13] Shah, S. and Kusiak, A., “Data mining and genetic algorithms based gene/SNP selection”, Artificial Intelligence in Medicine, 31: 183–196 (2004).
[14] Çalışkan, S.K. and Soğukpınar, İ., “KxKNN: K-Means ve k en yakın komşu yöntemleri ile ağlarda nüfuz tespiti”, 2. Ağ ve Bilgi Güvenliği Sempozyumu, Girne, 120–124 (2008).
[15] Hui, S. and Jha, G., “Application data mining for customer service support”, Information and Management, 38: 1–13 (2000).
[16] Pawlak, Z., “Rough sets, decision algorithms and Bayes’ theorem”, European Journal of Operational Research, 136: 181–189 (2002).
[17] Seidman, C., “Data Mining With Microsoft SQL Server 2000”, Microsoft Press, Washington, USA, (2001).
[18] Özekes, S., “Veri madenciliği modelleri ve uygulama alanları”, İstanbul Ticaret Üniversitesi Dergisi, 3: 65–82 (2003).
[19] Agrawal, R., Imielinski, T. and Swami, A., “Mining association rules between sets of items in large databases”, In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD ’93), Washington, USA, 207–216 (1993).
[20] Özçakır, F.C. and Çamurcu, A.Y., “Birliktelik kuralı yöntemi için bir veri madenciliği yöntemi tasarımı ve uygulaması”, İstanbul Ticaret Üniversitesi Fen Bilimleri Dergisi, 6(12): 21–37 (2007).