Data warehousing and data mining are two of the most important concepts in data science. At the most basic level, people work with raw data, and that data has to be stored and later reused for more important purposes. The storage depends on a data warehouse, and the processing depends on data mining. Let’s look at both in more detail.
Suppose you are a manager at a large retail company and you want the sales data from each of your stores. Each store holds its own records about customers, sales, and products, perhaps saved as Excel files on a local computer. You need this data to find out which products sell best across all your stores, so that you can improve those products and win a larger share of the market. To do that, you collect the data from every store's computer and organize it so that it can support decisions based on analytical processing. The place where you store the data gathered from the various stores is called a data warehouse. A data warehouse is a subject-oriented collection of data drawn from different databases.
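The retail scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the store names, products, and unit counts are made up, an in-memory SQLite database stands in for a real warehouse, and `build_warehouse` and `best_sellers` are hypothetical helper names.

```python
import sqlite3

# Hypothetical per-store exports. In practice these rows might be read
# from each store's Excel or CSV files.
store_exports = {
    "store_a": [("soap", 12), ("shampoo", 5)],
    "store_b": [("soap", 7), ("toothpaste", 9)],
    "store_c": [("shampoo", 4), ("toothpaste", 3)],
}


def build_warehouse(exports):
    """Load every store's rows into one shared sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, units INTEGER)")
    for store, rows in exports.items():
        conn.executemany(
            "INSERT INTO sales (store, product, units) VALUES (?, ?, ?)",
            [(store, product, units) for product, units in rows],
        )
    conn.commit()
    return conn


def best_sellers(conn):
    """Total units per product across all stores, best seller first."""
    query = """
        SELECT product, SUM(units) AS total_units
        FROM sales
        GROUP BY product
        ORDER BY total_units DESC, product ASC
    """
    return conn.execute(query).fetchall()


warehouse = build_warehouse(store_exports)
ranking = best_sellers(warehouse)
print(ranking)  # [('soap', 19), ('toothpaste', 12), ('shampoo', 9)]
```

Once the data sits in one table, the question "which product sells best across all stores?" becomes a single aggregate query instead of a manual comparison of separate files.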
The data that a data warehouse receives falls into two broad kinds: operational data and strategic data. Operational data covers things such as employees' daily reports, check-in and check-out times, and machine running times. This kind of data gives insight into day-to-day operations but does little to drive business decisions. Strategic data is the data the warehouse really needs: the number of sales of a particular product, the cost incurred in transporting products, and any similar data that helps shape a marketing strategy or develop business logic.
The need for data warehouses was recognized during the 1970s, when computers were becoming popular and readily available to businesses. These computers held large amounts of data, but the data sat in separate places and was never brought together. The data warehouse grew out of that problem and went on to transform how we view data. Formally, a data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.
These four terms are the basic characteristics of a data warehouse. Subject-oriented means the data is organized around the subjects that matter to the company, such as sales, products, and customers, and leaves out data that cannot help with a strategic decision. For example, data about a competitor may not help a product manager see where his own product needs improvement. Integrated means that data arriving from various sources ends up in one consistent format rather than several different ones. For example, one branch may keep its data in one SQL database system and another in an Oracle database, so both have to be converted into a common format before they are loaded. Time-variant means that the data is tied to a point or period in time and is kept for a long period, so the warehouse holds history rather than only the current state. Non-volatile means that once data has been loaded it is not changed or deleted, which makes it more reliable and stable as a basis for managerial decisions.
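To make the integrated, time-variant, and non-volatile properties concrete, here is a minimal Python sketch. It rests on assumptions: the two branch formats, the field names, and the `Warehouse` class are hypothetical and chosen only for illustration. Records from both branches are converted into one common shape, each record carries its date, and the store only ever appends.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded record cannot be modified (non-volatile)
class SaleRecord:
    product: str
    units: int
    sale_date: date  # every record carries its date (time-variant)


def from_branch_a(row):
    # Hypothetical branch A format: {"item": ..., "qty": ..., "date": "YYYY-MM-DD"}
    return SaleRecord(
        row["item"].strip().lower(), int(row["qty"]), date.fromisoformat(row["date"])
    )


def from_branch_b(row):
    # Hypothetical branch B format: (PRODUCT_NAME, "DD/MM/YYYY", units)
    name, raw_date, units = row
    day, month, year = (int(part) for part in raw_date.split("/"))
    return SaleRecord(name.strip().lower(), int(units), date(year, month, day))


class Warehouse:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def load(self, records):
        self._records.extend(records)

    def units_by_month(self, product):
        """Units sold per (year, month) for one product."""
        totals = {}
        for record in self._records:
            if record.product == product:
                key = (record.sale_date.year, record.sale_date.month)
                totals[key] = totals.get(key, 0) + record.units
        return totals


wh = Warehouse()
wh.load([from_branch_a({"item": "Soap ", "qty": "3", "date": "2024-01-15"})])
wh.load([
    from_branch_b(("SOAP", "20/02/2024", 5)),
    from_branch_b(("soap", "28/02/2024", 2)),
])
monthly_soap = wh.units_by_month("soap")
print(monthly_soap)  # {(2024, 1): 3, (2024, 2): 7}
```

Because both converters emit the same `SaleRecord` shape, the monthly query does not need to know which branch a record came from, and because records keep their dates, earlier months remain available for comparison.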
Data mining comes after the data has been stored in a data warehouse, and its processing is performed on that warehouse. Data mining is defined as extracting knowledge from large amounts of data.
Data mining is considerably more complex than data warehousing. Many tasks have to be performed on the data before it becomes useful and can be used by systems in real time. Data collected from the data warehouse, from other databases, and from the World Wide Web is first cleansed: missing details are filled in and inconsistencies are removed. The result is stored on the data warehouse server, which now holds consistent data that is fed to the data mining engine, whose task is to recognize patterns in the data. For example, when you search for a product on Amazon, it shows you related products, and those suggestions are the result of pattern recognition performed by a data mining engine, which can do this in real time. For this to work, we also need a knowledge base: a database that keeps frequently recurring patterns and data on a server, so that when a user runs the same query again the system does not have to search the full data again and again.
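The pipeline described above (cleanse, find patterns, reuse earlier answers) can be sketched in Python. This is a toy illustration, not how any real retailer implements recommendations: the orders are invented, co-purchase counting stands in for the pattern recognition step, and `functools.lru_cache` stands in for the store of already-answered queries. Here dirty entries are simply dropped rather than filled in.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations

# Hypothetical raw orders; None and inconsistent casing stand in for dirty data.
raw_orders = [
    ["Phone", "phone case", None],
    ["phone", "Charger"],
    ["PHONE ", "phone case", "charger"],
    ["laptop", "mouse"],
    [None],
]


def cleanse(orders):
    """Drop missing items, normalize names, and discard orders left empty."""
    cleaned = []
    for order in orders:
        items = sorted({item.strip().lower() for item in order if item})
        if items:
            cleaned.append(items)
    return cleaned


def count_pairs(orders):
    """Count how often each pair of products appears in the same order."""
    pairs = Counter()
    for items in orders:
        pairs.update(combinations(items, 2))
    return pairs


pair_counts = count_pairs(cleanse(raw_orders))


@lru_cache(maxsize=1024)  # repeated queries are answered from the cache
def related_products(product, top_n=2):
    """Products most often bought together with `product`."""
    scores = Counter()
    for (first, second), count in pair_counts.items():
        if first == product:
            scores[second] += count
        elif second == product:
            scores[first] += count
    # Highest count first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(name for name, _ in ranked[:top_n])


print(related_products("phone"))  # ('charger', 'phone case')
```

A second call to `related_products("phone")` returns the stored result without rescanning `pair_counts`, which mirrors the idea of not checking the server again for the same query.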