The evolution of machine learning 2/2

Attempting to stick it all together — tools from data to deployment

So when it comes to training a machine learning model, traditional methods work well. But the same does not apply to the infrastructure that holds together the machine learning pipeline. Using the same old software engineering tools for machine learning engineering creates greater potential for errors.

The first stage in the machine learning pipeline — data collection and processing — illustrates this. While big companies certainly have big data, data scientists or engineers must clean the data to make it useful — verify and consolidate duplicates from different sources, normalize metrics, design and prove features.

At most companies, engineers do this using a combination of SQL or Hive queries and Python scripts to aggregate and format up to several million data points from one or more data sources. This often takes several days of frustrating manual labor. Some of this is likely repetitive work, because the process at many companies is decentralized: data scientists or engineers often manipulate data with local scripts or Jupyter Notebooks.
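As a rough illustration of the kind of ad hoc cleaning script described above, here is a minimal sketch in plain Python. The record layout (`user_id`, `spend`), the two sources, and the choice of averaging duplicates followed by min-max scaling are all assumptions made for the example, not a description of any particular company's pipeline.

```python
from statistics import mean

def consolidate(records):
    """Merge records that share a user_id by averaging their 'spend' values."""
    by_id = {}
    for rec in records:
        by_id.setdefault(rec["user_id"], []).append(rec["spend"])
    return {uid: mean(vals) for uid, vals in by_id.items()}

def min_max_normalize(values):
    """Rescale a dict of numeric values to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All values identical: nothing to scale.
        return {k: 0.0 for k in values}
    return {k: (v - lo) / span for k, v in values.items()}

# Hypothetical data from two sources that overlap on user 1.
source_a = [{"user_id": 1, "spend": 10.0}, {"user_id": 2, "spend": 30.0}]
source_b = [{"user_id": 1, "spend": 20.0}, {"user_id": 3, "spend": 5.0}]

merged = consolidate(source_a + source_b)
normalized = min_max_normalize(merged)
```

Scripts like this tend to live on individual laptops, which is one reason the same consolidation logic gets rewritten by different people.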

Furthermore, the large scale of big tech companies compounds errors, making careful deployment and monitoring of models in production imperative. As one engineer described it, “At large companies, machine learning is 80 percent infrastructure.”

However, traditional unit tests — the backbone of traditional software testing — don’t really work with machine learning models, because the correct output of machine learning models isn’t known beforehand. After all, the purpose of machine learning is for the model to learn to make predictions from data without the need for an engineer to specifically code any rules. So instead of unit tests, engineers take a less structured approach: They manually monitor dashboards and program alerts for new models.
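A programmed alert of the sort mentioned above might look something like the following sketch. Rather than asserting an exact output, it checks an aggregate metric on labeled data against a threshold. The function names and the 0.8 accuracy threshold are hypothetical and chosen only for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def check_model(predictions, labels, min_accuracy=0.8):
    """Return an alert message if accuracy falls below the threshold, else None."""
    acc = accuracy(predictions, labels)
    if acc < min_accuracy:
        return f"ALERT: accuracy {acc:.2f} below threshold {min_accuracy:.2f}"
    return None

# A healthy batch produces no alert; a degraded batch does.
healthy = check_model([1, 0, 1, 1], [1, 0, 1, 1])
degraded = check_model([1, 0, 0, 0], [1, 1, 1, 1])
```

The check says nothing about whether any single prediction is right, only whether the model as a whole is behaving within an expected range.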

And shifts in real-world data may make trained models less accurate, so engineers re-train production models on fresh data on a daily to monthly basis, depending on the application. But a lack of machine learning-specific support in the existing engineering infrastructure can create a disconnect between models in development and models in production, because that infrastructure was built for normal code, which is updated much less frequently.
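One simple way to decide when a shift in the data warrants re-training is to compare a feature's recent values against the values seen at training time. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The scoring rule and the threshold of 2.0 are assumptions for the example; real systems often use more robust statistical tests.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training std devs."""
    sigma = stdev(train_values)
    shift = abs(mean(live_values) - mean(train_values))
    if sigma == 0:
        # Training data had no spread: any shift at all counts as drift.
        return float("inf") if shift > 0 else 0.0
    return shift / sigma

def needs_retraining(train_values, live_values, threshold=2.0):
    """Flag the model for re-training when drift exceeds the threshold."""
    return drift_score(train_values, live_values) > threshold

# Hypothetical feature values at training time and in production.
train = [10, 11, 9, 10, 10]
stable = needs_retraining(train, [10, 10, 11, 9])
shifted = needs_retraining(train, [14, 15, 13])
```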

Many engineers still rely on rudimentary methods of deploying models to production, like saving a serialized version of the trained model or model weights to a file. Engineers sometimes need to rebuild model prototypes and parts of the data pipeline in a different language or framework so that they work on production infrastructure. Any incompatibility at any stage of the machine learning development process (from data processing to training to deployment to production infrastructure) can introduce error.
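The file-based approach can be as basic as the following sketch, which writes the weights of a toy linear model to JSON and reads them back. The file layout, the `version` field, and the linear `predict` function are assumptions for illustration. Nothing in the file records how features were preprocessed or which code produced the weights, which is one way the incompatibilities described above creep in.

```python
import json
import os
import tempfile

def save_weights(weights, path):
    """Write model weights to a JSON file with a simple version tag."""
    with open(path, "w") as f:
        json.dump({"version": 1, "weights": weights}, f)

def load_weights(path):
    """Read model weights back from a JSON file."""
    with open(path) as f:
        payload = json.load(f)
    return payload["weights"]

def predict(weights, features):
    """Toy linear model: weighted sum of the input features."""
    return sum(w * x for w, x in zip(weights, features))

weights = [0.5, -1.0, 2.0]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_weights(weights, path)
    restored = load_weights(path)

prediction = predict(restored, [2.0, 1.0, 1.0])
```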

Making it presentable — the road forward

To address these issues, a few big companies, with the resources to build custom tooling, have invested time and engineering effort into creating their own machine learning-specific tools. Their goal is to have a seamless, end-to-end machine learning platform that is fully compatible with the company’s engineering infrastructure.

Facebook’s FBLearner Flow and Uber’s Michelangelo are internal machine learning platforms that do just that. They allow engineers to construct training and validation data sets with an intuitive user interface, decreasing time spent on this stage from days to hours. Then, engineers can train models with (more or less) the click of a button. Finally, they can monitor and directly update production models with ease.

Services like Azure Machine Learning and Amazon Machine Learning are publicly available alternatives that provide similar end-to-end platform functionality, but they only integrate with other Microsoft or Amazon services, respectively, for the data storage and deployment components of the pipeline.

Despite all the emphasis big tech companies have placed on enhancing their products with machine learning, at most companies there are still major challenges and inefficiencies in the process. They still use traditional machine learning models instead of more-advanced deep learning, and still depend on a traditional infrastructure of tools poorly suited to machine learning.

Fortunately, with the current focus on AI at these companies, they are investing in specialized tools to make machine learning work better. With these internal tools, or potentially with third-party machine learning platforms that are able to integrate tightly into their existing infrastructures, organizations can realize the potential of AI.