Our process begins before we even look at a single datapoint. We need an analytical framework, an instrument that guides our work. It includes framing the question we're trying to answer, putting the data into a larger business or operational context, and understanding the data's journey as it travels from inside the device into our AWS cloud for use in our ML pipeline.
For instance, business problems often arrive with ambiguous framing. A manufacturer may want to reduce downtime, but this still leaves us with multiple paths to follow. Should we look for correlations between sensor output and machine failures (the basis of anomaly detection) to enable predictive maintenance, or should we look for the conditions under which machines hold up well, giving our engineers a way to design more durable products? These are just two of many possible routes.
Another part of this framing is mapping out the data flow. To this end, we emphasize a DevOps approach with a tight-knit, interdisciplinary team, so that everyone stays on the same page. That way we understand where derived variables are calculated, what our input/output (I/O) requirements are, whether we're using synchronous or asynchronous data transmission, and other similar details.
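One lightweight way to make that shared data-flow map concrete is a small, readable spec the whole team can review together. This is just a sketch; the stream names, fields, and values below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class SensorStreamSpec:
    """One entry in the team's shared data-flow map (all values illustrative)."""
    name: str
    unit: str
    sample_rate_hz: float
    transmission: str   # "synchronous" or "asynchronous"
    derived_on: str     # "device" or "cloud": where derived variables get calculated

# A hypothetical map for two streams on an industrial machine.
streams = [
    SensorStreamSpec("vibration", "g", 100.0, "asynchronous", "device"),
    SensorStreamSpec("temperature", "celsius", 0.1, "synchronous", "cloud"),
]

for s in streams:
    print(f"{s.name}: {s.sample_rate_hz} Hz, {s.transmission}, derived on {s.derived_on}")
```

Writing these details down, rather than leaving them in people's heads, is what keeps the interdisciplinary team aligned.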
This all amounts to contextualizing our data. We’re answering fundamental questions like “What is this observation?”, “What am I trying to model?”, and “What am I trying to predict?”
Depending on how we choose to approach the data, we’ll prioritize certain types of data, conduct different analytics, and select the right ML algorithm for the job.
A Data Scientist’s Perspective on Models
Data scientists think about models a little differently than most people. For us, models take data as input and then give us an output. This input data can have tens or hundreds of variables, and the output is the answer to the question we posed during analytical framing.
Going back to our predictive maintenance example for industrial IoT, we input variables like uptime and frequency of past failures, environmental factors like humidity and temperature, and derived variables like the number of operations per second. The output is a probability that answers the question "How likely is our machine to fail within the next month?"
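As a minimal sketch of that idea, here is a toy logistic-style scorer that turns those features into a failure probability. The feature names and weights are invented for illustration; in practice the weights would be learned during training, not written by hand.

```python
import math

def failure_probability(features):
    """Toy model: map sensor-derived features to a failure probability.
    The weights are invented for illustration, not learned from real data."""
    weights = {
        "uptime_hours": 0.0004,    # longer continuous uptime -> slightly higher risk
        "past_failures": 0.9,      # each past failure raises risk sharply
        "humidity_pct": 0.02,      # humid environments accelerate wear
        "temperature_c": 0.01,
        "ops_per_second": 0.003,   # derived variable: workload intensity
    }
    bias = -6.0
    score = bias + sum(weights[k] * features[k] for k in weights)
    return 1.0 / (1.0 + math.exp(-score))  # squash the score into (0, 1)

machine = {
    "uptime_hours": 4200,
    "past_failures": 2,
    "humidity_pct": 55,
    "temperature_c": 38,
    "ops_per_second": 120,
}
print(f"{failure_probability(machine):.0%} chance of failure this month")
```

The single percentage the model emits is exactly the kind of answer the analytical framing asked for.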
For machine learning models, one of the key takeaways is that we don't write any rules for translating inputs into outputs. Instead, we give the model a large set of example inputs paired with their outputs, and it learns the mapping on its own. We call this training the model. From there, we can give it new inputs and it will predict the output. As we give it more data, or better data, it becomes more accurate.
That’s why an important part of any data scientist’s job is creating the training data. We almost never feed our model data straight from the sensors. The real world is noisy; there are errors, gaps, and a slew of other issues to overcome. Preparing the training data therefore requires us to aggregate it, augment it, and clean it up. The specifics of this process are in turn defined by our analytical framework.
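One slice of that cleanup can be sketched in a few lines: dropping blank readings, discarding physically implausible values, and filling small gaps by carrying the last good value forward. The raw values and plausible range below are invented; real pipelines would also aggregate and augment the data.

```python
def clean(readings, low, high):
    """Prepare raw sensor readings for training:
    - treat None as a blank spot (e.g. a dropped packet)
    - discard values outside the plausible [low, high] range (sensor glitches)
    - fill gaps by carrying the last good value forward
    """
    cleaned, last_good = [], None
    for r in readings:
        if r is None or not (low <= r <= high):
            r = last_good          # impute from the previous valid reading
        if r is not None:
            cleaned.append(r)
            last_good = r
    return cleaned

# Illustrative temperature trace: 999.0 is a glitch, None marks dropped readings.
raw = [21.4, None, 21.9, 999.0, 22.3, None, 22.1]
print(clean(raw, low=-40.0, high=85.0))
```

Even this tiny example shows why the analytical framework matters: what counts as "plausible" and how gaps get filled are decisions, not defaults.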
Preparing training data, selecting the right model, choosing the hyperparameters that govern how the model is trained, and writing Python code in AWS SageMaker are all iterative processes. Again, this demonstrates the importance of an agile approach. We’re constantly tweaking and refining our model, both to answer our initial questions and to adapt to changing business demands.
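The hyperparameter loop can be sketched without any ML library: train the same tiny model under several candidate settings and keep whichever setting yields the lowest error. Here the hyperparameter is a gradient-descent learning rate, and the data is invented; real tuning (e.g. in SageMaker) explores many more knobs the same way.

```python
def train_gd(pairs, lr, steps=200):
    """Fit y = a*x + b by gradient descent; lr is a hyperparameter
    that controls how the model is trained, not what it learns."""
    a = b = 0.0
    n = len(pairs)
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in pairs) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def mse(pairs, a, b):
    """Mean squared error of the fitted line on the given pairs."""
    return sum((a * x + b - y) ** 2 for x, y in pairs) / len(pairs)

data = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]  # roughly y = 2x + 1
# Iterate over candidate hyperparameters; keep the one with the lowest error.
best_lr = min([0.001, 0.01, 0.1], key=lambda lr: mse(data, *train_gd(data, lr)))
print(best_lr)
```

Too small a learning rate barely moves the model in the step budget; this loop discovers that empirically, which is why hyperparameter search is iterative rather than a one-shot choice.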
As expectations grow for our connected devices to become smarter, assume more autonomy, and make our organizations more efficient, IoT machine learning becomes increasingly important. That’s why the ML data pipelines we build take device data, train models on the cloud, and then ship these insights back to the user either through an IoT frontend, such as a mobile app, or by embedding it into the device itself via edge computing.
In the end, data science is all about creating knowledge. This powerful tool then fuels actionable predictions, recommendations, and insights. As a pioneer in the ML for IoT industry, Very’s cross-disciplinary team works closely together to foster a data ecosystem that’s rich in gold.