Machine Learning challenges in legacy organizations
Fans of machine learning suggest it as a possible solution for everything. From customer service to finding tumours, any industry in which large volumes of data can be easily accessed, analyzed and organized is ripe for new and compelling use cases. This is especially attractive for legacy organizations, such as financial services firms, looking to gain an advantage.
These businesses are usually well embedded in their markets, fighting with competitors over small margins and looking for new ways to innovate and drive efficiency. They also have an abundance of historical and contemporary data to exploit. One asset any start-up lacks is owned historical data, which gives legacy firms an edge in the competitive landscape. The promise of machine learning is therefore particularly seductive – feed in your extensive customer and business insights along with your desired outcome and let algorithms work out the best path forward.
However, established businesses such as these can also face the biggest challenges in driving value through machine learning. Technical debt, poor infrastructure and low-quality data all lead to higher deployment and maintenance costs.
Take a legacy financial institution as an example. Though the organization may have extensive historical data, much of it may be held in old documents and unstructured formats. Without effective data mining capabilities, both in terms of expertise and technology, this data will remain largely unusable. It’s only when dedicated data science teams and tools are put to work that this value can be unlocked.
At a recent developer meetup, I heard from Teodor Popescu, a data engineer at the BBC, about how he deals with these issues for the nation’s biggest broadcaster.
Too much of a good thing
“There is so much hype about machine learning, but no one talks about the infrastructure behind it,” explained Teodor.
In every machine learning project, the raw material is high-quality data. In legacy businesses, while there may be a lot of data around, it is often unstructured, incomplete or hard to find. According to IBM, for most companies 80 percent of a data scientist’s time is spent simply finding, cleansing and organizing data, leaving only 20 percent to actually perform analysis and run the clean data through a model.
Large volumes also lead to issues with scaling, as Teodor found when training a machine learning model on three billion data points. Infrastructure struggles to keep up with the volume of information, while processes that deploy and track the results need to be scaled at the same time.
The power of pipelines
At the BBC, over 1.3bn events are generated every day. This requires machine learning teams to spend much of their time finding, maintaining and expanding sources of reliable data.
By working with third-party integrations, teams can mitigate some of the existing issues around data management by sourcing new, structured data. However, these integrations still require maintenance, with broken pipelines slowing down development and deployment.
Rather than focusing solely on how to bring more data in, organizations can focus on the infrastructure for managing the data internally.
There are two approaches to this problem: specific data infrastructure and specialized team structure.
One example of the infrastructure approach is personalization. To maximize speed, click data from iPlayer is channeled through a distributed streaming service (such as Apache Kafka, Amazon Kinesis or a collection of Amazon SQS queues) and then processed with Apache Spark, before being delivered to storage in AWS or back to iPlayer via an API, so that personalized options can be presented to the user.
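As a rough illustration of that flow, the sketch below uses plain Python in place of the real services: a list stands in for the event stream, one function stands in for the processing job, and another for the API response. The event fields (user_id, genre) and the function names are assumptions made for this example, not details of the BBC's actual pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical click events; a list stands in for a stream such as Kafka or Kinesis.
EVENTS = [
    {"user_id": "u1", "genre": "drama"},
    {"user_id": "u1", "genre": "drama"},
    {"user_id": "u1", "genre": "news"},
    {"user_id": "u2", "genre": "news"},
]


def count_clicks_by_user(events):
    """Processing stage (stand-in for a Spark job): count clicks per user and genre."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["user_id"]][event["genre"]] += 1
    return counts


def top_genres(counts, n=1):
    """Serving stage: the n most-clicked genres per user, as an API might return them."""
    return {
        user: [genre for genre, _ in genre_counts.most_common(n)]
        for user, genre_counts in counts.items()
    }


if __name__ == "__main__":
    print(top_genres(count_clicks_by_user(EVENTS)))
```

In a production system each stage would run as a separate, independently scaled service, but the shape of the data flow (ingest, aggregate, serve) is the same.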
The second approach is reflected in how data science teams are structured, with DataOps and MLOps teams introduced to take on specific roles.
These teams work behind the scenes to enable better performance across the data science teams. They focus on the robust implementation of data version control, ensure that adequate testing is conducted for both the models and the data, and work to accelerate the journey to machine learning deployment and reproducibility.
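To make data versioning and data testing concrete, here is a minimal sketch using only the Python standard library. It derives a version ID from a dataset's content and runs a simple completeness check. The function names and the row format are hypothetical, and real teams would typically rely on dedicated tooling rather than hand-rolled helpers like these.

```python
import hashlib
import json


def dataset_version(rows):
    """Derive a deterministic version ID from the dataset's content."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def validate_rows(rows, required_fields):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    for index, row in enumerate(rows):
        for field in required_fields:
            # Treat absent, None and empty-string values as missing.
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
    return problems


if __name__ == "__main__":
    rows = [
        {"user_id": "u1", "genre": "drama"},
        {"user_id": "", "genre": "news"},
    ]
    print(dataset_version(rows))
    print(validate_rows(rows, ["user_id", "genre"]))
```

Recording the version ID alongside each trained model is one way to make a training run reproducible: if the data changes, the ID changes with it.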
Given the specialized nature of many of the systems, developers can play a key role in determining what is possible, efficient and valuable for the organization. Legacy organizations looking to leverage their data therefore need to focus on specific issues and datasets in order to deliver targeted solutions efficiently, rather than taking a broad approach. Machine learning models are only as good as the data that feeds them, so establishing the use case for specific datasets is vital.
Finding the right problems
Despite the difficulties, machine learning can still be an incredibly valuable tool for legacy businesses. The key to success is tailoring your approach to data and tools to the specific needs of the organization.
The goal for internal development and data science teams in this process is to align the business on goals, methods and infrastructure. In essence, fully understanding a business problem is the key to creating the right use cases for any data project. This is the only way to ensure robust processes once in production, while managing scope and efficiency. In this way, teams can build incremental gains that drive long-term value throughout the organization.