When you think of the word mining, the first thing that comes to mind is the extraction of gold, silver, or other valuable minerals from the ground. The data world has an equivalent: you can mine your data to discover and extract a hidden, valuable resource (insights and knowledge), which is one of the most valuable assets your business has. But unlike minerals, which are a finite resource, data is effectively infinitely durable and reusable.
Imagine that you have a whole petabyte of data. That is your mine: the insights and knowledge you’re looking for are the valuable minerals, and everything else is garbage.
In this post we will highlight the biggest benefits of applying data mining to your business through real-world examples of companies that have already done so. But first, let’s briefly discuss what data mining is.
According to Wikipedia, data mining “is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.”
Another definition holds that data mining (DM) is the process of analyzing data from different perspectives to discover anomalies, patterns and correlations within data sets that are insightful and useful for predicting outcomes, which in the end helps you make well-informed decisions.
Let’s go back to our mining example. When you plan to mine for gold or any other valuable mineral, you first have to locate a place that you think holds these resources, and only then do you start digging.
Data mining follows the same idea. Before you can mine the data, you have to collect it from different sources, prepare it and store it in one place, because data mining is not about searching for the data itself.
Nowadays, companies usually store data in what is called a data warehouse, which we will cover in detail in a later post.
Companies use data mining to increase revenue, optimize expenditure, target new customers, provide better customer service, listen to what others are saying and carry out competitive intelligence. And those are just some of the ways.
Although the term "data mining" wasn’t coined until the 1990s, the process of digging through data to discover hidden patterns and predict outcomes has a long history, and it is sometimes referred to as "knowledge discovery". Its foundation comprises three intertwined scientific disciplines: statistics (the numerical study of data relationships), artificial intelligence (human-like intelligence displayed by machines) and machine learning (algorithms that can learn from data to make predictions), combined with business domain knowledge.
As a result of the field’s growth in the 1990s, several sizable companies began working together to formalize and standardize an approach to data mining. The result of their work was CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining.
Everything in the mining process builds on determining what you are looking for. You have to understand the business requirements in order to formulate a problem statement. Once the problem statement is defined, we can gather the relevant data accordingly.
Example of a problem statement that you need to find a solution for using data mining:
From there, you begin to develop the more specific questions you want to answer.
We start with initial data collection and then proceed with activities that help us get familiar with the data, identify data quality, inconsistency and readability problems, and discover a first level of insights.
In this stage you must determine:
Data preparation covers all activities performed to construct the final dataset from the initial raw data. Bear in mind that converting raw data into an analytical dataset can make up as much as 90% of the project time.
Once we have identified the data sources, we need to select, clean, construct and format the data into the desired form. Data exploration has to be done in great depth in order to notice patterns in light of the business understanding.
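To make the select, clean, construct and format steps a little more concrete, here is a minimal sketch in plain Python. The records, the field names and the `COUNTRY_ALIASES` mapping are all made up for illustration; a real project would more likely use a dedicated data library and rules agreed with the business.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"customer": " Alice ", "country": "usa", "amount": "120.50"},
    {"customer": "Bob", "country": "U.S.A.", "amount": ""},
    {"customer": "alice", "country": "USA", "amount": "120.50"},
    {"customer": "Carla", "country": "Egypt", "amount": "80"},
]

# Assumed mapping from inconsistent spellings to one canonical value.
COUNTRY_ALIASES = {"usa": "USA", "u.s.a.": "USA", "egypt": "Egypt"}


def clean_rows(rows):
    """Drop unusable rows, normalise text, convert types and remove duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Select: skip rows where the amount is missing.
        if not row["amount"].strip():
            continue
        country = row["country"].strip()
        record = {
            # Clean: trim whitespace and unify capitalisation.
            "customer": row["customer"].strip().title(),
            "country": COUNTRY_ALIASES.get(country.lower(), country),
            # Format: store the amount as a number rather than text.
            "amount": float(row["amount"]),
        }
        # Remove exact duplicates that only differed in spelling or spacing.
        key = (record["customer"], record["country"], record["amount"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for record in cleaned_rows:
    print(record)
```

Of the four raw rows, one is dropped for a missing amount and one turns out to be a duplicate once the spelling is normalised, so two clean records remain.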
The quality of the cleaned data, that is, your final dataset, will impact model performance. Every data miner knows and works by the simple rule "garbage in, garbage out". Data preparation tasks are likely to be performed multiple times and not in any prescribed order, and they include a number of activities, for example:
Check for inconsistent data that needs to be handled, for example:
In data mining, a model is a computerized representation of real-world observations. Models are the application of algorithms to seek out, identify, and display any patterns or messages in your data. There are two kinds of models in DM:
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same type of data mining problem. Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often necessary. Some of the best-known algorithms are decision trees, random forests, KNN, naïve Bayes, K-means, linear regression and logistic regression.
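As an illustration of one of the algorithms named above, here is a small sketch of KNN (k-nearest neighbours) classification in plain Python. The training data, the two features and the label names are invented, and `k=3` is an arbitrary choice rather than a calibrated parameter.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up customers: (visits per month, average basket value).
train_points = [(1, 20), (2, 25), (3, 22), (10, 80), (11, 95), (12, 90)]
train_labels = ["occasional", "occasional", "occasional", "loyal", "loyal", "loyal"]

prediction_low = knn_predict(train_points, train_labels, (2, 21))
prediction_high = knn_predict(train_points, train_labels, (11, 85))
print(prediction_low, prediction_high)
```

KNN is also a good example of a technique with requirements on the form of the data: because it relies on distances, features measured on very different scales usually need to be rescaled first, which sends you back to data preparation.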
In this phase in particular, we have to evaluate the results in the context of the business goals or objectives. The aim is to determine whether any important business issue has not been sufficiently considered. At the end of this phase, a go or no-go decision must be made on moving to the deployment phase.
In this phase we need to determine how the results will be used. The knowledge gained will need to be organized and presented in a way that the stakeholders can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
CRISP-DM offers a uniform framework for guidelines and for documenting experience. In addition, CRISP-DM can be applied in various industries with different types of data.
Clustering analysis, such as K-means, essentially groups large amounts of data together based on their similarities into groups called clusters. The image below shows what a clustering analysis may look like.
There are a few ways to draw knowledge out of a clustering analysis. In marketing, for example, cluster analysis is used to segment consumers on the basis of the benefits they seek from purchasing a product.
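The grouping idea behind K-means can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the customer points are invented, the starting centroids are picked by hand instead of at random, and the helper names `assign_clusters` and `k_means` are hypothetical.

```python
import math


def assign_clusters(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
        clusters[nearest].append(point)
    return clusters


def k_means(points, initial_centroids, max_iterations=20):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = list(initial_centroids)
    clusters = assign_clusters(points, centroids)
    for _ in range(max_iterations):
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
        clusters = assign_clusters(points, centroids)
    return centroids, clusters


# Made-up customers described by two numeric features.
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (9, 8.5)])
print(centroids)
print(clusters)
```

On this toy data the two segments are obvious by eye; the value of the technique shows up when there are many customers and many features, where such groups are not visible in advance.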
An association rule might say, for example, that whenever a customer buys a dozen eggs, they are 80% likely to also purchase milk. If you’ve ever been suggested products on an e-commerce site based on what’s in your cart, then you’ve seen association rule mining at work.
In other words, it is a procedure that aims to find frequently occurring patterns, correlations, or associations in datasets.
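The "80% likely" figure in a rule like eggs → milk is what is usually called the rule's confidence. Here is a minimal sketch of how support, confidence and lift can be computed; the eight shopping baskets are invented so that the numbers come out round.

```python
# Made-up shopping baskets, one set of items per transaction.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "butter"},
    {"eggs", "milk", "bread"},
    {"eggs", "bread"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """How much more often the items occur together than if they were independent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"eggs", "milk"}, transactions)
rule_confidence = confidence({"eggs"}, {"milk"}, transactions)
rule_lift = lift({"eggs"}, {"milk"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

Eggs appear in five baskets and four of those also contain milk, which gives a confidence of 0.8. A lift above 1 suggests the two items are bought together more often than chance alone would explain.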
Walmart applied this data mining technique in 2004 ahead of Hurricane Frances. By mining transaction and inventory data, analysts discovered that strawberry Pop-Tart sales were seven times higher than usual right before a hurricane hit. Beer was also revealed as the top-selling pre-hurricane item. With this information at hand, Walmart made sure to stock up.
The example above is called a linear regression analysis, which basically means a straight line can be drawn to show how the variables relate to one another. In this case, we see that the higher the day’s high temperature, the more frozen yogurt is sold, and vice versa.
If a business is looking to make a prediction based on the effect one variable has on others, it may turn to a data mining technique called regression analysis. It is used across multiple industries for business and marketing planning, financial forecasting, environmental modeling and trend analysis. Regression is a data mining technique used to predict numerical (continuous) values, given a particular dataset.
For example, regression might be used to predict the cost of a product or service, given other variables.
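To show what drawing that straight line means in practice, here is a small sketch of ordinary least squares for one input variable, written in plain Python. The temperature and sales figures are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations: daily high temperature and frozen yogurt cups sold.
high_temperatures = [20, 25, 30, 35]
cups_sold = [105, 165, 205, 265]

slope, intercept = fit_line(high_temperatures, cups_sold)
predicted_sales = slope * 28 + intercept  # forecast for a 28-degree day
print(slope, intercept, predicted_sales)
```

The positive slope captures the relationship in numbers: on this toy data, each extra degree is associated with roughly ten more cups sold.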
Decision trees are one of the more visual data mining techniques and a popular method for important decision making. There are two types of decision tree analysis. One of them is called classification, which is what you see in the example above, determining whether or not a passenger would have survived on the Titanic. Classification is logic-based, using a series of if/then or yes/no conditions until all relevant data is mapped out.
The other type of decision tree is called regression, which is used when the target is a numerical value. For example, regression could be used to determine a house’s value. Both kinds of decision tree can be run through machine learning programs as well.
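A classification tree has to decide which yes/no question to ask first. One common way to score candidate questions is Gini impurity, and the sketch below uses it to pick a single best split. The passenger-like records are made up and are not real Titanic data, and `best_split` is a hypothetical helper rather than any library's API.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, score) for the best 'value <= threshold' question."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a question that separates nothing is useless
            # Weighted average impurity of the two groups; lower is better.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Made-up records: (age, ticket class) and whether the passenger survived.
rows = [(22, 3), (38, 1), (26, 3), (35, 1), (54, 3), (4, 1)]
labels = ["no", "yes", "no", "yes", "no", "yes"]

feature_index, threshold, score = best_split(rows, labels)
print(feature_index, threshold, score)
```

On this toy data the best first question is whether the ticket class is at most 1, because it separates the two outcomes perfectly. A full tree repeats the same search inside each resulting group, and a regression tree typically does the same while scoring splits by the spread of the numeric target rather than by Gini impurity.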
All of the data that businesses gather would serve no purpose without knowledge discovery. Data mining allows businesses to visualize patterns and trends hidden in their datasets that may not have been visible before. Whichever insights are revealed will lead to more informed decision making, which benefits businesses, the customers they serve, and stakeholders in general.
To wrap up, we have covered what data mining means, along with the stages of the cross-industry standard process for data mining (CRISP-DM). If you have any questions, feel free to ask in the comments section.