4 Components of a Data Science Project
Scott Plewes | October 10, 2019 | 7 Min Read
Understanding the structure of a typical data science project is the first step in the process of building your internal data science practice. Missing any of the core components of data science can result in the failure of your efforts to realize any true business value.
Data science, at its core, is a practice that involves finding patterns within data. From these patterns, insight can be derived and used for business intelligence purposes or as the basis for creating new product features. Both of these outcomes can be beneficial to product teams that are looking to differentiate their offerings in the market and provide customers with greater value. Before your team can begin to implement data science, however, they should be well versed in the core components of the domain. As in many fields, there is some variance in how these terms are defined, but the definitions below should help you understand the key concepts.
The four components of data science are:
- Data Strategy
- Data Engineering
- Data Analysis and Models
- Data Visualization and Operationalization
Data Strategy
Developing a data strategy is simply determining what data you are going to gather and why. As obvious as that seems, it’s often overlooked, given too little thought, or never formalized. To be clear, we’re not talking about the strategy for deciding which mathematical techniques or technologies you are going to use. We’re talking only about the data you need to address your business problem or opportunity, and why – the other considerations are important, but they’re not the first step.
Deciding on a data strategy requires you to make the connection between the data you’re going to gather and your business goals. Not all data is created equal. The effort you put into gathering data – formatting it correctly and discarding “garbage” data that doesn’t serve your business goals – should reflect both how hard that work is and how valuable the data might be. Your team will identify data that is mission-critical to your business goals and thus worth the time and energy to collect and sort. Other data may be “nice-to-have” but won’t contribute substantially to meeting your goals, so it may not be worth collecting if it requires a lot of additional time and effort.
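As an illustration, a data strategy can begin as nothing more than an inventory mapping each candidate data source to the business goal it serves. Every source, goal and priority below is a hypothetical example, not a recommendation:

```python
# A minimal sketch of a data-strategy inventory for a hypothetical
# patient-monitoring product. All names and priorities are illustrative.
DATA_STRATEGY = [
    # (data source, business goal it serves, priority)
    ("heart_rate_stream",   "detect adverse events early",  "mission-critical"),
    ("device_error_logs",   "reduce field-service costs",   "mission-critical"),
    ("ui_click_telemetry",  "improve workflow ergonomics",  "nice-to-have"),
    ("ambient_temperature", "no clear goal identified",     "do-not-collect"),
]

def worth_collecting(strategy):
    """Keep only data whose collection effort is justified by a business goal."""
    return [source for source, goal, priority in strategy
            if priority == "mission-critical"]

print(worth_collecting(DATA_STRATEGY))
```

Even a table this small forces the conversation the strategy step is really about: which data earns its collection cost, and which does not.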
Data Engineering
Data engineering is about the technology and systems that are leveraged to access, organize and use the data. It primarily involves the creation of software solutions for data problems. These solutions typically involve establishing a data system, then creating data pipelines and endpoints within that system. This can mean bringing together dozens of technologies, often at vast scale.
Data engineering is important to data science overall because you can’t actually do any science without it. In the end, data engineering allows data to flow from or to the product and through the ecosystem to various stakeholders. You can’t write an algorithm to improve image scheduling, for instance, unless data from the device can get to the person or “bot” that is going to analyze the data and make recommendations or decisions. Engineering is the “plumbing” that lets you make use of your data.
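As a sketch of that “plumbing”, here is a toy pipeline in which raw device readings are extracted, validated and loaded to an endpoint. The records and field names are invented for illustration:

```python
# A toy extract -> transform -> load pipeline. Raw device readings flow
# through validation into a store that downstream analysis can query.
# The readings and field names are hypothetical.

raw_readings = [
    {"device_id": "d1", "value": "72"},
    {"device_id": "d1", "value": "garbage"},  # bad record, will be filtered out
    {"device_id": "d2", "value": "65"},
]

def extract(records):
    yield from records

def transform(records):
    for r in records:
        try:
            yield {"device_id": r["device_id"], "value": int(r["value"])}
        except ValueError:
            continue  # drop records that fail validation

def load(records, store):
    store.extend(records)

store = []
load(transform(extract(raw_readings)), store)
print(store)  # only the clean records reach the endpoint
```

Real pipelines swap the lists and dicts for queues, warehouses and orchestration frameworks, but the shape – move, clean, deliver – is the same.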
To understand the difference between who does the data analysis (or codes the corresponding algorithms) and who does data engineering, it’s useful to look at the skills of a data engineer. A data engineer is typically a stronger programmer, and more of an expert in distributed systems, than a data scientist. Data engineering requires an in-depth understanding of a wide range of data technologies and frameworks, as well as how to combine them to create solutions that enable business processes with data pipelines.
Data Analysis and Mathematical Models
This is the “heart” of data science; it’s where much of what we associate with data science happens. We take data and, using math or an algorithm (arguably, in some form, it’s always both), try to model how a “system” works. The data analysis and mathematical modeling aspect of data science is anything that involves the combination of:
- Computing (could possibly be a person doing this, though it’s rare today)
- Math and/or Statistics
- A domain (like healthcare)
- The application of the scientific method or aspects of it
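As a tiny instance of that combination – computing, statistics, and a domain – consider a least-squares fit of hypothetical clinic wait times against staff on shift (the numbers are invented for illustration):

```python
# Computing (Python) + statistics (a least-squares line) + a domain
# (hypothetical clinic staffing). All data points are made up.
from statistics import mean

staff_on_shift = [2, 3, 4, 5, 6]
avg_wait_mins  = [41, 35, 30, 24, 19]

x_bar, y_bar = mean(staff_on_shift), mean(avg_wait_mins)
slope = (sum((x - x_bar) * (y - y_bar)
             for x, y in zip(staff_on_shift, avg_wait_mins))
         / sum((x - x_bar) ** 2 for x in staff_on_shift))
intercept = y_bar - slope * x_bar

# The "scientific method" part: use the model to predict, then check
# the prediction against new observations.
predicted = slope * 7 + intercept  # predicted wait with 7 staff on shift
print(round(slope, 2), round(predicted, 1))
```

The math here is centuries old; what data science adds is the scale of data and compute the same idea can now be applied to.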
To further break it down, we think of data analysis and mathematical models in terms of how you can use data:
- To describe, extract insights or make predictions about a service, product, person, business or technology or more likely – a combination of them (aka an “ecosystem”)
- To create a “tool” that replaces or supplements what a person does
- This is what most machine learning does – plays Go, reads an X-ray, schedules a patient and so on. Instead of being a mechanical robot replacing a person putting in lug nuts, it replaces a person “thinking about” and doing a task.
The first use case refers to what science has always done: obtain an understanding and, where possible, create a model that makes predictions from data. The second use case, likewise, refers to what engineers have always done with math and science: find a way to use their knowledge to create a tool that supports a human, or does something faster or better than a person could.
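To make the second use case concrete, here is a deliberately simple sketch of a “tool” that supplements a human judgment. The task, threshold and labels are hypothetical, not clinical guidance:

```python
# A "tool" that automates a judgment a person would otherwise make:
# triaging scans for review. The 0.3 threshold is an invented example,
# standing in for what a trained model would output.
def triage(bone_density_score):
    """Return a recommendation a technician would otherwise decide."""
    if bone_density_score < 0.3:
        return "flag-for-radiologist"
    return "routine-queue"

print(triage(0.25))
print(triage(0.80))
```

A production system replaces the hand-written threshold with a learned model, but the role is identical: it replaces a person “thinking about” one narrow task.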
What is new in the realm of data analysis and mathematical modeling is the computing power, the incredible amount of data available, and some new algorithms. Now that we have access to more advanced computing, we can finally build on a large body of existing mathematics and statistics that previously could not be utilized because of computational power limitations.
Visualization and Operationalization
We’ve lumped operationalization and visualization into one category because they occur hand-in-hand so often. Operationalization is the more general notion, though. Simply put, it is the idea that you’re going to do something with the data at hand (after analysis and modeling) – draw a conclusion or take an action, for instance. Quite often when it’s a human drawing that conclusion or taking an action, as opposed to a “bot”, the data or analysis of the data is visualized. The reason for this is simple – visualization is often the easiest way to convey the meaning of the data or analysis to the person whose job it is to interpret the output of the data science.
Visualization is not just about taking the data analysis and presenting it “correctly”. Sometimes it involves going back into the raw data and understanding what needs to be visualized based on the needs and goals of both the user and the operations.
If you are developing a device that visualizes any data, then your team will require a deep understanding of the following in order for your product to integrate into the existing ecosystem and stand out in the market amongst competitors:
- How the data will be used
- The needs and capabilities of the person consuming that data (e.g., do they understand enough math for a p-value to be meaningful to them?)
- Users’ context of use, including physical location (e.g., an operating room), devices being used (e.g., the connected device, a laptop, a phone), physical environment (e.g., a darkened room or a sterile environment), and situational context (e.g., does the user need to make an immediate decision based on the visualization they are being shown?)
- The complexity of the analysis (e.g., is it important to convey the number of variables that have been analyzed to create a prediction?)
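One way to reason about these considerations in code is a sketch that picks a presentation format from the consumer’s context. The context fields and rules below are purely illustrative assumptions:

```python
# Matching presentation to the consumer's context. The context keys
# and the output formats are hypothetical examples.
def choose_presentation(context):
    if context.get("needs_immediate_decision"):
        return "single large alert indicator"        # e.g., operating room
    if context.get("statistically_trained"):
        return "full chart with confidence intervals"  # e.g., analyst
    return "plain-language summary with trend arrow"   # default consumer

or_context = {"location": "operating room", "needs_immediate_decision": True}
print(choose_presentation(or_context))
```

The point is not the code itself but that the visualization decision is a function of the user and their situation, not of the data alone.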
Operationalizing is really about doing something with the data; someone (or occasionally a machine) has to make a decision and/or take an action based on the math and computing that has happened. This could be in the form of:
- A real-time human decision or action (e.g., intervention based on analysis of patient data gathered by a device);
- A longer-term response (e.g., the decision to restructure resource deployment in a hospital based on business operational efficiencies); or
- A recommendation on a very specific task (e.g., an “AI” diagnosing a broken leg on an X-ray).
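A minimal sketch of operationalization, assuming the three response types above; the routing rules, field names and findings are invented for illustration:

```python
# Route each analysis result to one of the three response types.
# The urgency labels and findings are hypothetical.
def operationalize(result):
    if result["urgency"] == "real-time":
        return f"ALERT clinician: {result['finding']}"
    if result["urgency"] == "long-term":
        return f"add to quarterly ops report: {result['finding']}"
    return f"recommendation for review: {result['finding']}"

print(operationalize({"urgency": "real-time", "finding": "irregular rhythm"}))
```

However sophisticated the model, the data science has not been operationalized until some branch like this connects its output to a decision or an action.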
If these ideas are relatively new to your organization and you’re planning, for instance, a new release or a new product in which you want to bring data science to bear, a simple start is to draw an ecosystem diagram.
Then use this tool to have conversations about what data you’re going to gather and why, and how you’re either going to optimize or transform a system with your product or service. This will naturally lead to the steps of data strategy, data engineering and so on.
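If your team prefers something executable alongside the drawing, the same ecosystem diagram can be sketched as an adjacency list of data flows. Every node name below is an assumption made for illustration:

```python
# An ecosystem diagram as an adjacency list: who sends data to whom.
# All nodes are hypothetical stand-ins for a real product's ecosystem.
ecosystem = {
    "monitoring_device": ["hospital_emr", "analytics_service"],
    "analytics_service": ["clinician_dashboard", "ops_reporting"],
    "hospital_emr":      ["clinician_dashboard"],
}

def data_consumers(ecosystem):
    """Every stakeholder that receives data from somewhere."""
    return sorted({dst for dsts in ecosystem.values() for dst in dsts})

print(data_consumers(ecosystem))
```

Walking the edges of this structure raises exactly the questions the diagram is for: what data crosses each arrow, why, and who is responsible for it.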
You could also take a look at your existing approach for defining and designing other product features and follow that if drawing out an ecosystem doesn’t interest you, although, based on our experience, we’d highly recommend you give it a go. When it comes to applying data science, treat it exactly as though you were creating a product feature, because, from a practical point of view, that’s what it is.