As companies are looking to unleash the potential of machine learning, many are finding that its benefits are hard to realize. The primary reasons for this are challenges in scaling training data, aligning production ers with the broader goals of the company, and realizing gains from an organization-wide machine learning strategy. No one factor is insurmountable but together they pose a significant challenge.
This article explores each of these factors, their causes, and some ways they can be resolved.
- Scaling training data is the biggest challenge facing most teams implementing ML at scale. As more users engage with a site or app, they generate more data that must be used to train new models that predict how likely those users are to churn, respond to a promotion, or purchase a product. In most cases, the amount of training data a company has is not sufficient for its needs and must grow by orders of magnitude. This is not as simple as uploading more files into a data lake. Training data needs to be annotated by humans at various levels so that machine learning models can extract high-value signals from it. These annotations are often done manually, which can take a long time and usually requires many annotators. For example, imagine having to label every image on your site with what’s in it: this approach would require hiring massive numbers of people who would spend all day labeling images instead of doing other productive work (like building new features for customers).
- To overcome the manual annotation bottleneck, companies need to invest in scalable systems that can automate the process of annotating their data. The annotations must be consistent and accurate so that models built on top of them yield reliable predictions. Once these systems are in place, companies can use their existing resources to scale training data, which enables them to build better predictive models. In other words, automation lets them get more out of their current engineering team.
- Companies also have a lot of data about how users interact with various features across different devices and channels. This includes click logs from phones, tablets, and web browsers; events tracked via SDKs placed in mobile apps or websites; signals from offline point-of-sale systems including purchases and returns; information from CRM systems about who has contacted a company and how often; and data from backend systems like databases, caches, and even search engines. In many cases, companies do not have the tools in place to collect or use much of this data when they are building new models for their users.
- The simplest way management can ensure that teams have all the necessary data when they build a model is by establishing a policy that every team must maintain a clean pipeline for collecting and processing their training data. This starts with having a clear understanding of what types of signals need to be captured from each source the business interacts with so that it can train relevant machine learning models. For example, if you’re doing work in computer vision you’ll want to capture images from various devices (phones, tablets, laptops). If you’re working on speech, you would want to capture spoken audio. Once you know what types of data need to be collected from which sources, you can invest in building the necessary infrastructure.
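
As a rough illustration of the automated annotation described in the list above, one possible approach (an assumption here, not something this article prescribes) is to write annotation rules as small functions and accept a label only when enough of them agree. The field names, thresholds, and label names below are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical label names for a churn-risk annotation task.
CHURN, ACTIVE = "churn", "active"

# A labeling function inspects one record and returns a label,
# or None when it has no opinion (it abstains).
LabelingFunction = Callable[[dict], Optional[str]]


def lf_inactive(record: dict) -> Optional[str]:
    # Assumed rule: no login for more than 30 days suggests churn.
    return CHURN if record["days_since_last_login"] > 30 else None


def lf_recent_purchase(record: dict) -> Optional[str]:
    # Assumed rule: any purchase in the last 30 days suggests an active user.
    return ACTIVE if record["purchases_last_30d"] > 0 else None


def lf_support_complaints(record: dict) -> Optional[str]:
    # Assumed rule: two or more open complaints suggest churn.
    return CHURN if record["open_complaints"] >= 2 else None


LABELING_FUNCTIONS: list[LabelingFunction] = [
    lf_inactive,
    lf_recent_purchase,
    lf_support_complaints,
]


def annotate(record: dict, min_agreement: float = 0.6) -> tuple[Optional[str], float]:
    """Return (label, agreement) using a majority vote over non-abstaining rules.

    The label is None when every rule abstains or when the winning label's
    share of the votes falls below min_agreement.
    """
    votes = [v for lf in LABELING_FUNCTIONS if (v := lf(record)) is not None]
    if not votes:
        return None, 0.0
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


if __name__ == "__main__":
    user = {"days_since_last_login": 45, "purchases_last_30d": 0, "open_complaints": 3}
    print(annotate(user))
```

Records that come back with no label are the ones worth routing to human annotators, which keeps manual effort focused on the ambiguous cases while the agreement score gives a simple consistency check.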
Q: What is the relation between training data and other data?
A: Training data is only one of several types of datasets that companies use. Other common types include validation/test, production, and cold start (i.e., new user) data. Each type has its own advantages and disadvantages, which should be considered when building models at scale. For example, production data tends to be the most accurate, but it’s not practical to collect an entirely new set each time a model needs to be retrained with updated inputs, since the live user experience has to keep generating revenue in the meantime.
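
To make the distinction between training and validation/test data concrete, here is a minimal sketch of one common way to keep them separate: assigning each user to a split by hashing their ID, so the same user always lands in the same split across retraining runs. The `user_id` field and the 80/10/10 proportions are assumptions for illustration; production and cold start data are not covered here.

```python
import hashlib


def assign_split(user_id: str, train: float = 0.8, validation: float = 0.1) -> str:
    """Deterministically map a user ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex characters into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


def split_records(records: list[dict], key: str = "user_id") -> dict[str, list[dict]]:
    """Group records by the split assigned to each record's user."""
    splits: dict[str, list[dict]] = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(str(record[key]))].append(record)
    return splits


if __name__ == "__main__":
    # Hypothetical records; real ones would carry features and labels as well.
    records = [{"user_id": i} for i in range(1000)]
    splits = split_records(records)
    print({name: len(rows) for name, rows in splits.items()})
```

Splitting by user rather than by row keeps all of a user's events in one split, so the model is evaluated on users it has not seen during training.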
Q: How does a company know which models to build?
A: They should develop a clear strategy for building machine learning-driven products that map to the business’s KPIs. This means first figuring out how each KPI can be turned into a good customer experience, then building a product that directly targets those experiences using machine learning. The best way to develop this strategy is through alignment across the whole business, including marketing, sales, and the PMs who represent key customer and user personas when new models are designed.
Q: What hierarchy should management set up regarding who builds what models?
A: There are three main types of teams that companies need to build when they want to scale model development: data science, engineering, and analytics.
By using the right technology and setting up the right organizational structure, companies can keep up with ever-changing user tastes and requirements. This is vital as they continue to develop new, innovative products that solve pain points for current and future customers.