Data science is the process of taking in raw data and extracting insights from it. The raw data has a long journey to travel: it is prepared, transformed, analyzed with advanced techniques, and visualized, and only then do we arrive at meaningful insights that can help us make decisions. This complete journey is what data science covers.
In data science, we receive raw data and draw meaningful insights from it. Raw data is like a jumbled jigsaw puzzle: once you complete part of it, you start to see what the full picture might be. The parts you have finished also help you shape the remaining parts and fill in what is incomplete. When the puzzle is fully assembled and you can see what it shows, you can make decisions.
Collecting data, turning it into meaningful insights, and making business decisions based on it is what makes data science so valuable. For example, suppose a bank with thousands of customers is losing a few hundred of them every day. Over the past week, a great amount of data has accumulated about the people who left the bank. This data is handed to a data scientist, who extracts the meaningful parts and performs advanced analytics on them. The data scientist may find that married customers are considerably more likely to leave the bank than single ones, and the data may also make clear the characteristics of customers who stayed loyal. These findings help complete more of the puzzle: the product manager can reason that married couples may have higher consumption needs, may require more mortgage credit, and may not be well served by the bank's current schemes, so a new scheme targeting married customers might be needed. The point of this example is that data science is about identifying hidden patterns and relationships in data.
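The bank example above amounts to grouping customers by an attribute and comparing churn rates. A minimal sketch in Python, using a made-up sample of records (the customer data and field names here are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical sample of the bank's customer records: marital status
# and whether the customer has left the bank.
customers = [
    {"marital_status": "married", "left_bank": True},
    {"marital_status": "married", "left_bank": True},
    {"marital_status": "married", "left_bank": False},
    {"marital_status": "single",  "left_bank": False},
    {"marital_status": "single",  "left_bank": True},
    {"marital_status": "single",  "left_bank": False},
]

def churn_rate_by_group(records, key):
    """Fraction of customers who left, grouped by the given attribute."""
    totals, churned = Counter(), Counter()
    for r in records:
        totals[r[key]] += 1
        if r["left_bank"]:
            churned[r[key]] += 1
    return {group: churned[group] / totals[group] for group in totals}

rates = churn_rate_by_group(customers, "marital_status")
print(rates)  # in this toy sample, married customers churn more often
```

In practice the same comparison would be run over the full data set with a library such as pandas, but the logic is the same: group, count, compare.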
A data scientist’s work may involve both structured and unstructured data. Structured data can be collected and worked on easily, such as data already arranged in an Excel sheet or in a database management system like MySQL, which is built for structured data. In real life, however, a data scientist often ends up working on unstructured data: a huge amount of text on which sentiment analysis has to be performed, or a heat map from which user retention has to be calculated. In some cases the data is so large that a traditional database management system cannot process it at all; this is where big data comes in, and frameworks such as Hadoop help gather tremendous amounts of both unstructured and structured data.
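To make the sentiment analysis mention concrete, here is a deliberately tiny lexicon-based scorer: count positive words, subtract negative words. Real sentiment analysis uses trained models and NLP libraries; the word lists below are made up for illustration.

```python
# Toy lexicon-based sentiment scorer (illustrative only; real work
# would use a trained model or an NLP library).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text: str) -> int:
    """Positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this great service"))  # positive score
print(sentiment_score("terrible support and a bad app"))  # negative score
```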
After the data has been collected, it is time for the data scientist to work on it and start performing analytics. For example, given weather data, the job might be to predict what the future weather will be. To do this, the data scientist gathers years of historical records, geographical and climate-change data, and anything else that can influence the weather, and then designs a machine learning predictive model that processes all of this data and produces the required predictions.
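As a minimal sketch of the idea of a predictive model: fit a straight line to a handful of past temperature readings and extrapolate one day ahead. The readings are invented, and a real forecasting model would use far more features and a proper ML library; this only shows the "learn from the past, predict the future" shape of the task.

```python
# Fit y = slope * x + intercept by ordinary least squares,
# then extrapolate to the next day.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

days = [1, 2, 3, 4, 5]
temps = [20.0, 21.0, 21.5, 22.5, 23.0]  # hypothetical daily readings

slope, intercept = fit_line(days, temps)
forecast = slope * 6 + intercept  # predict day 6
print(round(forecast, 2))
```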
With the help of such prediction models, a data scientist can also end up designing data products such as recommendation systems. You have probably seen the recommendations on Netflix and Spotify: behind them, data science is at work, collecting what each user watches and likes and feeding recommendations back on that basis. Netflix has even gone beyond this and started to plan its new releases based on how its previous series performed, all with the help of data science.
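The core of a simple recommendation system is "people with similar taste liked this too". A toy sketch of that idea (user names and titles are made up; real systems like Netflix's use far more sophisticated models):

```python
# Each user's set of liked titles (hypothetical data).
likes = {
    "alice": {"Stranger Things", "Dark", "Ozark"},
    "bob":   {"Stranger Things", "Dark", "Narcos"},
    "carol": {"Ozark", "Narcos"},
}

def recommend(user):
    """Score unseen titles by how much taste their fans share with `user`."""
    seen = likes[user]
    scores = {}
    for other, titles in likes.items():
        if other == user:
            continue
        overlap = len(seen & titles)          # shared taste with this user
        for title in titles - seen:           # titles `user` hasn't seen
            scores[title] = scores.get(title, 0) + overlap
    return max(scores, key=scores.get)

print(recommend("alice"))
```

Both bob and carol liked "Narcos", and both share titles with alice, so it tops her recommendations in this toy data.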
There is also the role of the data engineer. The difference is that data science focuses on analytics, visualization, storytelling, and modeling, whereas data engineering focuses more on programming, database management, and the engineering aspects of the pipeline. The product manager works in close contact with the data scientist, and the data scientist works in close contact with the data engineer.
Data science also leans toward predicting the future from past behavior. Related fields such as business intelligence focus more on explaining the causes of past behavior, again using the data that has been gathered. In short, data science is more future-focused and business intelligence is more past-focused.
The first question that comes to mind is how much data counts as big data. The answer is that there is no fixed threshold. Data is called big data when it cannot be handled or processed by traditional methods: when it requires many computers to process, or when it outgrows file-based tools such as Excel sheets and database management systems such as MySQL.
People sometimes see big data in a negative light: companies gathering tonnes of data about them, with no privacy left. But that is not what big data means. Wikipedia defines big data as “any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”. Seen this way, it sounds less threatening: with so much data arriving from so many sources of people's activity, focusing on one individual becomes a waste of processing, and big data concerns itself far more with collective groups than with grilling the activity of a particular user. Big data is also not limited to tracking user activity or collecting behavioral information from websites; it is applied across many fields of science and technology, from atomic research data to hospital and patient records to the financial markets.
In big data, some problems are common to every big data engineer. These challenges were first formulated back in 2001 as the “Vs” of big data, and the list has kept growing since.
The first is volume, which signifies how big your data set is.
The second is velocity: today, Facebook or Google can generate as much data in a single day as they generated in the whole of 2010. All of this data has to be processed in real time, and for that we need appropriate techniques and methods ready.
The third is variety. The traditional form of data was structured: values stored in rows and columns. Today the variety of incoming data has increased, and it cannot always be stored the way it used to be. Data now arrives as web clicks, images, videos, thousands of lines of text, heat maps, and geographical patterns. You cannot imagine storing all of this with traditional rows and columns alone, which is why technologies such as Hadoop and Spark exist.
Next is value. It makes no sense to hold tremendous amounts of data with no ultimate use for it. If the data cannot provide value to its user, it may not be worth storing and processing; the end goal is for the data to become meaningful to the person who will use it. For example, the data generated by temperature sensors on trains has to be collected and visualized so that it comes in handy to the train managers, who can then decide to install a better cooling system. The data has to provide value to the domain it is used in: in weather forecasting, for instance, past data can feed machine learning algorithms that help predict near-future outcomes.
The last one is veracity, which tells us how trustworthy the data is. Sometimes faulty sensors produce off-the-charts readings, and we have to acknowledge that the source is not trustworthy; or missing values may have been filled in just to make the data usable, which also lowers the level of trust in the data being produced.
These are the main Vs of big data, but if you search you can find lists of up to 10 Vs.
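A veracity check of the kind described above can be sketched in a few lines: flag readings that sit far from the rest, and count values that are missing. The sensor readings here are invented, and the two-standard-deviation rule is just one simple choice of threshold.

```python
from statistics import mean, stdev

# Hypothetical temperature readings; 95.0 is an off-the-charts value
# from a faulty sensor, and None marks a missing reading.
readings = [21.0, 21.5, 20.8, 21.2, 95.0, None, 21.1]

present = [r for r in readings if r is not None]
m, s = mean(present), stdev(present)

# Flag readings more than two standard deviations from the mean.
suspect = [r for r in present if abs(r - m) > 2 * s]
missing = readings.count(None)

print(suspect, missing)
```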
One of the main challenges of big data is the sheer volume: the data cannot be handled by one computer, so many computers work on it simultaneously. The same data set is distributed across many machines, and when a task has to be performed on the complete data, the distributed pieces are processed in parallel and yield the same result as if the task had run on a single computer. If we performed the task on one computer and then needed more storage, we would have to upgrade its hard disk, which is costly and ineffective each time the situation recurs. To arrange this, companies build clusters of computers that store and process the data simultaneously, and this is where cloud computing becomes important.
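The claim that distributed pieces yield the same result as one machine is the split-apply-combine idea. A minimal single-process sketch, where each chunk stands in for a node in the cluster:

```python
from collections import Counter

# A tiny "data set" of words to be counted.
text = "big data needs many computers big data needs scale".split()

def count_words(chunk):
    """The 'map' step: count words within one partition."""
    return Counter(chunk)

# Pretend each chunk lives on its own node in the cluster.
chunks = [text[:5], text[5:]]
partials = [count_words(c) for c in chunks]

# The 'reduce' step: merge the partial counts from every node.
merged = Counter()
for p in partials:
    merged += p

print(merged["big"])
```

The merged result is identical to counting the whole list on one machine, which is exactly why the work can be spread across a cluster.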
There are very specific working frames such as Hadoop and Spark which can manage the distribution of the data and also take care of fault tolerance and provide reliability by taking of the data that if one node in the cluster of the computer goes down then there is a replication of that data that is present in some other node which may not hurt the real-time processing that is going on.
Some of the main stages of working with big data are:
Ingestion: data arrives from various sources, and Apache frameworks (Kafka, for example) accumulate it and push it down the line so that it can be stored on the systems.
Storage: as mentioned earlier, the data is stored with the help of Hadoop's distributed file system (HDFS) and is distributed across the cluster.
Processing: after storage, we have to deal with processing the data. This can happen in different ways: batch processing, where stored data is processed with the help of machine learning algorithms, or real-time processing, where the data is needed only to train some machine learning model and can then be discarded, with no reason to store it. Processing also involves removing repeated instances that occur many times in the data and stripping out useless noise that carries no value after all. All of this work is done in the processing stage.
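The cleaning work in the processing stage, dropping repeated instances and filtering obvious noise, can be sketched as follows. The event records and the "negative clicks are noise" rule are invented for illustration:

```python
# Hypothetical raw event records arriving from ingestion.
raw_events = [
    {"user": "u1", "clicks": 3},
    {"user": "u1", "clicks": 3},    # repeated instance
    {"user": "u2", "clicks": -7},   # impossible value: noise
    {"user": "u3", "clicks": 5},
]

seen, cleaned = set(), []
for event in raw_events:
    key = (event["user"], event["clicks"])
    if key in seen or event["clicks"] < 0:  # drop duplicates and noise
        continue
    seen.add(key)
    cleaned.append(event)

print(len(cleaned))
```

At scale this same logic runs as a distributed job (for example, a Spark transformation) rather than a Python loop, but the cleaning rules are decided the same way.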