In a data science project, the ultimate goal is to extract knowledge from a given set of data, knowledge we can use to predict and explain future and past observed events. Like other quantitative sciences, such as physics or chemistry, we want to get at the underlying causes of events in the world around us. But unlike physics or chemistry, the final knowledge comes out of modeling the data with mathematical tools such as machine learning or data mining. By its very nature, data science is purely empirical: it derives knowledge from data.
Being empirical, data science requires that we follow the scientific method. Yes, that's right, I said scientific method. We begin with a question of why or how a particular thing happens, followed by a hypothesis. . . null hypothesis. . . alternative hypothesis. . . Then we devise some way to test the hypothesis, to see whether it can explain the thing we want to explain, and finally we use the results to make predictions, then test again. In data science, we create models as tests of the hypothesis, based on a priori domain knowledge. These models connect one piece of data with another, revealing previously unknown relationships. Once satisfied that our models confirm our hypothesis, how we choose to use this knowledge is up to us. But one thing's for certain: at some point we are going to have to explain our results to others. Getting knowledge out of data is not always straightforward, and we may go through many hypotheses before finding something that works. Either way, the methodology for data science should be fairly general.

Here I just want to outline what I have come to call the pyramid of data science, which at its core is based on the scientific method and simply lays out the steps I have come up with to execute a data science project. You may also find it resembles the DIKW (Data, Information, Knowledge, Wisdom) pyramid in some ways. The base upon which the pyramid sits is the idea; data selection and gathering sit on the first tier, followed by data cleaning/integration and storage, then feature extraction, knowledge extraction, and finally visualization.
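The tiers of the pyramid can be read as stages of a pipeline, each feeding the next. A minimal sketch, purely illustrative: every function name and the toy dataset below are assumptions of mine, not code from the article, and the "knowledge" step is stood in for by a simple summary statistic.

```python
# Hypothetical sketch of the pyramid's tiers as a data pipeline.
# Names and data are illustrative, not from the original post.

def gather(raw_sources):
    # Tier 1: data selection and gathering -- pull rows from several sources
    return [row for src in raw_sources for row in src]

def clean(rows):
    # Tier 2: data cleaning/integration -- drop incomplete records
    return [r for r in rows if None not in r]

def extract_features(rows):
    # Tier 3: feature extraction -- here, a ratio of two raw fields
    return [a / b for a, b in rows if b != 0]

def extract_knowledge(features):
    # Tier 4: knowledge extraction -- stand-in: a summary statistic
    return sum(features) / len(features)

# Two toy "sources", one with a record missing a value
sources = [[(4, 2), (3, None)], [(9, 3)]]
result = extract_knowledge(extract_features(clean(gather(sources))))
print(result)  # mean of the features [2.0, 3.0] -> 2.5
```

In practice each stage would be far richer (storage between tiers, real models at the knowledge step, visualization at the top), but the shape, raw data narrowing toward knowledge, is the point of the pyramid.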
Source: Data Community DC