Data Mining

What is Data Mining?


Meaning of Data Mining


Data mining is the process of discovering meaningful patterns, trends, and insights in large volumes of data in order to inform decision-making. It involves analyzing data from different angles and summarizing it into information that can be used to increase sales and revenue, reduce costs, or both. It is also known as 'knowledge discovery' or 'data discovery'. Data mining software provides the analytical tools for this work: it lets a user examine data from various perspectives, group it according to different classifications, and identify the relationships that exist within it. In technical terms, data mining is the process of identifying trends and correlations among the many fields of a large relational database.

Data Mining Definition


According to William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus:
"Data Mining, or Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies".

According to Marcel Holsheimer & Arno Siebes (1994):
"Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database".

Data mining is chiefly concerned with the analysis of data and with the use of software to determine the trends and patterns in it. The computer is responsible for finding these trends and patterns by applying the underlying rules and characteristics of the data.

Major Components of Data Mining


The components of a typical data mining system are shown in the figure and explained below:

Diagram of data mining

1) Database, Data Warehouse, or Other Information Repository:
This consists of one or more databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be applied to the data.

2) Database or Data Warehouse Server:
The database or data warehouse server fetches the relevant data in response to the user's data mining request.

3) Knowledge Base:
The knowledge base holds the domain knowledge used to guide the search and to evaluate how interesting the resulting patterns are. Such knowledge can include concept hierarchies, which organize attributes or attribute values into different levels of abstraction.
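
As a rough illustration of a concept hierarchy, the sketch below rolls city-level values up to the country level. The hierarchy, the city names, and the `roll_up` helper are hypothetical examples and do not come from any particular data mining system.

```python
from collections import Counter

# Hypothetical concept hierarchy for a "location" attribute:
# each city (lower abstraction level) maps to a country (higher level).
CITY_TO_COUNTRY = {
    "Mumbai": "India",
    "Delhi": "India",
    "Chicago": "USA",
    "Boston": "USA",
}

def roll_up(values, hierarchy):
    """Replace each value with its parent concept; unknown values stay unchanged."""
    return [hierarchy.get(v, v) for v in values]

# One entry per sale, recorded at city level (made-up data).
sales_by_city = ["Mumbai", "Chicago", "Delhi", "Boston", "Mumbai"]

# Counting after the roll-up summarizes the same sales at country level.
sales_by_country = Counter(roll_up(sales_by_city, CITY_TO_COUNTRY))
print(sales_by_country)
```

Mining at the country level can reveal patterns that are too sparse to notice at the city level, which is one reason a knowledge base may store such hierarchies.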

4) Data Mining Engine:
This is the core of the data mining system. It contains a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

5) Pattern Evaluation Module:
This module typically applies interestingness measures and interacts with the data mining modules so that the search focuses on interesting patterns. It may use interestingness thresholds to filter the patterns that have been discovered. To confine the search to interesting patterns only, it is important to push the evaluation of pattern interestingness as deep as possible into the data mining process.

6) Graphical User Interface:
The graphical user interface (GUI) handles the interaction between the user and the data mining system. Through it, the user can specify a data mining query or task, provide information that helps narrow the search, and carry out exploratory data mining based on intermediate results. The GUI also lets the user browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize those patterns in different forms.

Types of Data Mining


Data mining encompasses several different types, each with its own specific focus and techniques. Here are the main types of data mining:

1) Descriptive Data Mining:
Descriptive data mining aims to summarize or describe the characteristics of a dataset. It doesn't make predictions or classifications. Instead, it helps in understanding the underlying structure and patterns within the data.

2) Predictive Data Mining:
Predictive data mining, as the name suggests, is focused on making predictions about future events or trends based on historical data. It often involves the use of techniques like regression, classification, and time series analysis.

3) Prescriptive Data Mining:
This type of data mining is concerned with providing recommendations or suggestions on what actions to take to achieve a desired outcome. It takes into account the predicted future scenarios and offers guidance on how to achieve specific goals.

4) Diagnostic Data Mining:
Diagnostic data mining seeks to identify the cause of a particular outcome or event. It looks at historical data and tries to determine why a certain event occurred.

5) Discovery-driven Data Mining:
Also known as exploratory data mining, this type focuses on finding new and previously unknown patterns or trends in the data. It's often used in cases where the goals are not clearly defined from the outset.

6) Text Mining:
Text mining involves extracting valuable information from unstructured text data. This could include techniques like sentiment analysis, topic modeling, and entity recognition.

7) Web Mining:
Web mining involves extracting information from websites and web pages. This can include tasks like web scraping, analyzing user behavior on a website, and extracting useful information for various purposes.

8) Spatial Data Mining:
This type deals with geographic or spatial data, looking for patterns and relationships in data with a spatial component. It's often used in fields like geography, urban planning, and environmental science.

9) Temporal Data Mining:
Temporal data mining deals with data that has a time component. It focuses on extracting patterns and trends over time, which is crucial for tasks like forecasting and trend analysis.

10) Streaming Data Mining:
This type is specialized for handling data that arrives continuously in real-time streams. It's used in applications like monitoring social media feeds, analyzing sensor data, and detecting anomalies in network traffic.

11) Multimedia Data Mining:
This involves mining data types like images, audio, video, and other multimedia formats. It's used in tasks like image recognition, video analysis, and audio classification.

12) Spatiotemporal Data Mining:
Combining aspects of spatial and temporal data mining, this type deals with data that has both a spatial and temporal component.

13) Graph Mining:
Graph mining focuses on analyzing and extracting information from graphs or networks, which are structures with nodes and edges.

Data Mining Techniques


There are various techniques and methods used in data mining to extract valuable information from raw data. Here are some common data mining techniques:

1) Association Rule Mining:
It identifies relationships between variables in a dataset. This is often used in market basket analysis, where you try to find associations between products that are frequently bought together.
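
Two measures commonly used to rate an association rule are support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the transactions and the helper names `support` and `confidence` are illustrative assumptions, not a full rule-mining algorithm such as Apriori.

```python
# Hypothetical market-basket transactions (one set of items per purchase).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule under test: customers who buy bread also buy milk.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule is usually kept only when both values exceed thresholds chosen by the analyst.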

2) Clustering:
This groups data points that are similar to each other. It helps in identifying natural groupings within a dataset.
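
One widely used clustering method is k-means. The following is a minimal sketch on one-dimensional data, assuming the starting centroids are supplied by the caller; the `k_means_1d` function and the sample ages are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up customer ages with two visible groups.
ages = [18, 21, 22, 45, 47, 50]
centroids, clusters = k_means_1d(ages, centroids=[20, 40])
print(centroids, clusters)
```

In practice the result depends on the starting centroids and on the chosen number of clusters, so both are usually tried with several values.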

3) Classification:
This involves assigning predefined labels or classes to new, unseen data points based on the patterns learned from the training data.
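
As one possible illustration, a k-nearest-neighbours classifier labels a new point by majority vote among the closest training points. The training data, the segment labels, and the `knn_predict` helper below are made up for the example.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((monthly spend, age), customer segment).
train = [
    ((25, 22), "basic"),
    ((30, 25), "basic"),
    ((28, 30), "basic"),
    ((80, 45), "premium"),
    ((85, 50), "premium"),
    ((90, 48), "premium"),
]

print(knn_predict(train, (27, 24)))
print(knn_predict(train, (88, 47)))
```

Because the method relies on distances, features on very different scales are normally rescaled before it is applied.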

4) Regression:
It predicts a continuous numeric value based on the relationships observed in the data.
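
The simplest case is a straight line fitted by ordinary least squares. The sketch below assumes a single input variable; the `fit_line` helper and the advertising figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on the line y = 2x + 1.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6 + intercept  # prediction for an unseen input of 6
print(slope, intercept, predicted)
```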

5) Anomaly Detection:
This technique identifies unusual patterns or outliers in the data. It's useful for fraud detection, network security, and quality control.
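
A basic way to flag outliers is the z-score, which measures how many standard deviations a value lies from the mean. The sketch below is one simple approach among many; the threshold of 2.0, the `z_score_outliers` helper, and the transaction amounts are assumptions chosen for illustration.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up card transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
print(z_score_outliers(amounts))
```

An extreme value inflates the mean and standard deviation it is compared against, so on small samples robust alternatives based on the median are often preferred.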

6) Sequence Mining:
It's used to discover sequential patterns in data. For example, it might be applied in analyzing a customer's browsing history on an e-commerce site.
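
A heavily simplified version of this idea counts which page-to-page transitions occur in several browsing sessions. The sketch below only considers consecutive pairs, whereas full sequential pattern mining algorithms handle longer and non-contiguous patterns; the sessions and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter

# Made-up browsing sessions: pages in the order they were visited.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["search", "product", "cart"],
]

def frequent_pairs(sessions, min_support=2):
    """Return ordered consecutive page pairs seen in at least `min_support` sessions."""
    counts = Counter()
    for pages in sessions:
        # Use a set so that a pair is counted at most once per session.
        pairs = set(zip(pages, pages[1:]))
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(sessions))
```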

7) Time Series Analysis:
It deals with data points that are collected over time. This is often used in forecasting future values based on historical trends.
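
A very simple baseline forecast is the moving average of the most recent observations. The sketch below assumes a window of three periods; the `moving_average_forecast` helper and the sales figures are made up.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Made-up monthly sales figures.
monthly_sales = [100, 110, 120, 130, 140]
forecast = moving_average_forecast(monthly_sales)
print(forecast)
```

A moving average lags behind a steady trend, as it does here, so it is mainly useful as a reference point against which more capable forecasting models are compared.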

8) Dimensionality Reduction:
This technique reduces the number of variables in a dataset while preserving important information. It's useful for visualization and speeding up computations.

9) Neural Networks and Deep Learning:
These are advanced machine learning techniques that involve multiple layers of interconnected nodes (neurons) in a network. They are particularly powerful for complex, non-linear relationships in data.

10) Ensemble Methods:
These combine multiple models to improve predictive performance. Examples include Random Forest, Gradient Boosting, and AdaBoost.
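
The basic idea of combining models can be shown with majority (hard) voting. In the sketch below, three hand-written rules stand in for trained models; this is not Random Forest or boosting, and the rule functions, field names, and transactions are all hypothetical.

```python
from collections import Counter

# Three deliberately simple "models", each returning "fraud" or "ok".
def rule_amount(tx):
    return "fraud" if tx["amount"] > 1000 else "ok"

def rule_foreign(tx):
    return "fraud" if tx["country"] != tx["home_country"] else "ok"

def rule_night(tx):
    return "fraud" if tx["hour"] < 5 else "ok"

def majority_vote(models, tx):
    """Return the label predicted by most of the models."""
    votes = Counter(model(tx) for model in models)
    return votes.most_common(1)[0][0]

models = [rule_amount, rule_foreign, rule_night]

# Made-up transactions: the first triggers two rules, the second only one.
tx_a = {"amount": 1500, "country": "FR", "home_country": "US", "hour": 14}
tx_b = {"amount": 50, "country": "US", "home_country": "US", "hour": 3}

print(majority_vote(models, tx_a))
print(majority_vote(models, tx_b))
```

Using an odd number of voters with two possible labels avoids ties.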

11) Social Network Analysis:
It focuses on analyzing relationships and interactions between entities in a network. This can be applied to social media data, communication networks, and more.

12) Web Scraping and Web Mining:
These techniques involve extracting data from websites and web pages. This is often done to gather information for analysis.

Data Mining Strategy


1) Define Objectives: 
Clearly state the goals and objectives of the data mining process, including what insights or knowledge you hope to gain.

2) Understand the Domain: 
Familiarize yourself with the industry or field the data pertains to, as well as any specific nuances or challenges associated with it.

3) Data Collection and Integration:
  • Identify relevant data sources.
  • Gather and aggregate data from different databases, spreadsheets, APIs, etc.
  • Ensure data quality and address any missing or inconsistent values.

4) Exploratory Data Analysis (EDA):
  • Perform initial analysis to understand the basic characteristics and structure of the data.
  • Identify potential patterns, outliers, and relationships.

5) Data Preprocessing:
  • Handle missing values through imputation or deletion.
  • Normalize or scale features to ensure uniformity.
  • Encode categorical variables.
  • Remove duplicates and irrelevant features.
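
The preprocessing steps listed above can be sketched in plain Python as follows. The records, the column names `age` and `city`, and the `preprocess` helper are hypothetical; mean imputation and min-max scaling are only one choice among several.

```python
# Made-up customer records with a missing value and a duplicate.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": "Pune"},
    {"age": 35, "city": "Pune"},
]

def preprocess(rows):
    # 1. Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Impute missing ages with the mean of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    for r in unique:
        if r["age"] is None:
            r["age"] = mean_age

    # 3. Min-max scale age into the range [0, 1].
    lo = min(r["age"] for r in unique)
    hi = max(r["age"] for r in unique)
    span = (hi - lo) or 1  # avoid division by zero when all ages are equal

    # 4. One-hot encode the categorical "city" column.
    cities = sorted({r["city"] for r in unique})
    result = []
    for r in unique:
        out = {"age": (r["age"] - lo) / span}
        for c in cities:
            out["city_" + c] = 1 if r["city"] == c else 0
        result.append(out)
    return result

clean = preprocess(rows)
print(clean)
```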

7) Select Data Mining Techniques:
  • Choose appropriate techniques based on the objectives (e.g., classification, regression, clustering, etc.).
  • Consider ensemble methods or combinations of techniques for improved results.

8) Model Building:
  • Apply selected techniques to the preprocessed data.
  • Fine-tune parameters and hyperparameters for optimal performance.

9) Model Evaluation:
  • Use appropriate metrics (accuracy, precision, recall, etc.) to assess the performance of the models.
  • Employ techniques like cross-validation to ensure robustness.
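
The metrics and the cross-validation split mentioned above can be sketched as follows for a two-class problem. The function names, the label encoding (1 as the positive class), and the sample predictions are assumptions made for the example.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n records."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Made-up true labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(actual, predicted)
print(accuracy, precision, recall)

folds = list(k_fold_indices(6, 3))
print(folds[0])
```

In cross-validation, the model is trained on each `train` split and scored on the matching `test` split, and the scores are then averaged. Records are usually shuffled before the folds are formed.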

10) Interpret Results:
  • Analyze the output of the models to extract meaningful insights.
  • Understand the implications of the patterns discovered.

11) Validate Findings:
  • Verify that the insights obtained align with domain knowledge and expectations.
  • Address any discrepancies or unexpected outcomes.

12) Implementation and Deployment:
  • Integrate the data mining results into practical applications or business processes.
  • Consider how the insights will be used to drive decision-making.

13) Monitor and Maintain:
  • Continuously assess the performance of the deployed models.
  • Update models as needed to account for changes in data or business conditions.

14) Document the Process:
  • Keep detailed records of all steps taken, including preprocessing techniques, model parameters, and results.
  • This documentation aids in reproducibility and knowledge sharing.

15) Feedback Loop:
Establish a feedback mechanism to incorporate new data and insights into the ongoing data mining process.

16) Ethical Considerations:
Ensure compliance with data privacy regulations and ethical guidelines throughout the process.

Need/Role of Data Mining in Business


Data mining is essential to many organizations for the following reasons:

1) Operational:
Data mining helps the operations of a business organisation run smoothly by monitoring overall operational activity and correcting the mistakes that monitoring reveals. The information derived from this process supports a high level of expertise and productivity.

2) Decisional:
Data mining lets managers make critical decisions on the basis of both current and historical data. Input data from customers, such as geographical or sales data, can support both long-term objectives and short-term adjustments.

3) Informational:
The information that different individuals need can be delivered in customised formats at exactly the time it is required.
Examples of such information include office locations, company profiles, training materials, organisational structure, service information, and organisational policies.

4) Specific Applications:
Data mining can be implemented as "a model for forecasting the consumer behaviour (For example, the probability of satisfaction of customers) depending upon the past data related to the communication with certain organisation". Such a model helps estimate how likely a customer is to interact with the business organisation, so that the organisation can adjust its approach accordingly.

Advantages of Data Mining 


The main advantages of data mining are described below:

1) Automated Forecasting of Trends and Behaviors:
Data mining automates the process of finding predictive information in a large database. Questions that would otherwise require extensive hands-on analysis can be answered quickly and directly from the data.

2) Automated Determination of Earlier Unknown Trends:
Data mining tools can sweep through an entire database and identify previously unknown trends in a single step.

3) Extensive Depth and Breadth of Database:
A database can contain a very large number of rows and columns. Because time is limited, an analyst performing hands-on analysis must restrict the number of variables examined. As a result, variables that are discarded because they do not appear significant may still hide useful information and patterns.
High-performance data mining methods let users analyse a database in depth without first choosing a subset of variables. Because data mining databases hold larger samples (more rows), they yield lower estimation errors and variance, and they allow users to draw conclusions about small but important segments of a population.

Disadvantages of Data Mining


The main limitations and disadvantages of data mining are described below:

1) Privacy:
Privacy has been widely discussed in recent years, and the rapid growth and reach of the internet has made it a critical issue, particularly in online shopping. Customers worry that their personal information may be accessed without authorization and used to harm them. They can also be harmed when their personal information is sold, because they do not know how other organisations will use it.

2) Security:
Although a large amount of information about customers is available online, the security protecting this personal information has many flaws. For example, hackers who breached the database of the Experian credit reporting agency obtained the addresses, social security numbers, account numbers, and payment histories of almost 13,000 customers of Ford Motor Credit Company. The example shows that business organisations are willing to share and disclose personal information while neglecting its security. With so much information available, identity theft can become a serious problem.

3) Misuse of Information/Inaccurate Information:
Trends identified through data mining for marketing purposes can be misused. Unethical organisations may exploit the information about individuals that data mining produces. In addition, data mining is not 100 percent accurate, so incorrect information obtained from it can have harmful consequences.

Application of Data Mining


Data mining is applied in a variety of fields. Some of these areas are described below:

1) Retail/Marketing:
  • Identifying customers' buying behavior.
  • Determining relationships among customers' demographic characteristics.
  • Predicting the success of marketing campaigns.
  • Market basket analysis.

2) Banking:
  • Detecting fraudulent use of credit cards.
  • Identifying loyal customers.
  • Predicting which customers are likely to change their credit card affiliation.
  • Predicting customers' credit card spending.
  • Finding hidden relationships between financial factors.
  • Deriving stock trading rules from past market data.

3) Insurance and Health Care:
  • Claims analysis, that is, determining which medical procedures are claimed together.
  • Predicting which customers will buy new policies.
  • Identifying the behavior patterns of risky customers.
  • Detecting fraudulent practices.

4) Transportation:
  • Determining distribution schedules among outlets.
  • Analysing loading patterns.

5) Medicine:
  • Characterizing patient behavior in order to predict office visits.
  • Identifying effective medical therapies for different diseases.