Data Partnerships Are Crucial for the Data Moat

I was at a meeting recently with a machine learning (ML) data product company at a conference, and as we chatted about the benefits of partnering, I realized how little knowledge there is about data management and strategy when it comes to new product development. Coincidentally, I taught a class at NYU last semester on the same topic.

First of all, before diving into datasets, it is imperative to define the product strategy, user stories, use cases, and the competitive advantage, aka the data moat. "Moat," a term coined by Warren Buffett, refers to a business's ability to maintain a competitive advantage that protects long-term profits. While data is not the new oil, the right data strategy is critical for ML companies, since models and algorithms have become commoditized.

Data Acquisition 

Data acquisition requires a well-thought-out approach that depends on the goals, resources, and moat. At a minimum, companies should already have an initial inventory of the data sources that will power the product. A quick-and-dirty visual wireframe should be made available so that internal and external stakeholders can quickly buy into the direction.

The inventory and categorization of data sources should include internal/enterprise data as well as external data: publicly available/open data sources, data/API sources publicly listed for purchase, and closed data sources (not publicly available). While internal/enterprise-owned data can be a competitive advantage, it is usually homogeneous and carries bias limitations. For B2B use cases, a diverse mix of data sources produces better discovery, insights, and predictions. For example, even though Google and Facebook have significant internal customer data, they spend an inordinate amount of time building unique external data partnerships for new product innovation, increased monetization, and access to new markets.

Open Data or Purchased Data

Sourcing data from open (free) sources is great but comes with significant caveats. Initial access is usually straightforward. However, there may be limits on the speed of data ingestion and on data quality. Changes in access, API endpoint changes, ease of use, crawler limits, the introduction of fees, and outright shutdown are also limiting factors, especially over the long term. The Twitter API (restricted) and the Facebook API (shut down) are good examples of how data access can change.

Free data sources are complicated enough that a new category of startups has sprouted up, with business models focused solely on services that ease the use, improve the quality, and work around the limits of open data.

The category of purchasable data sources is growing. Long popular in financial services, where feeds from companies such as Reuters and Bloomberg power Wall Street trading desks, purchasable data is increasingly offered in industries such as life sciences. For example, IQVIA offers one of the largest datasets spanning genomics, wearables, and patient-reported outcomes from hospitals, payers, and pharmacies. Yet these datasets, available to anyone with a wallet, come at a significant cost in both financial and technical resources.

Data Partnerships

With that in mind, data sources that are not public should be taken very seriously. Arguably the most strategic way to build a data moat, partnerships require identifying and convincing other companies whose transaction data may be useful for ML product development. The process is exploratory in nature, and convincing other companies of the value of a data partnership requires tact and patience. Before any deal is signed, both parties need to demonstrate the tangible monetary value (commercial terms) of the partnership. There are tradeoffs to weigh when pursuing data partnerships, such as lead time to close, marketing expense, partnership duration, monetization value, potential sales cannibalization, and mutual trust.

Test Before You Sign

While data partnerships are great in principle, they can be challenging, particularly in due diligence. Contracts are signed based on a data sheet (or dictionary) and a demo of the partner's product showcasing how the data is used. There are limitations with this, as there isn't much clarity about the underlying data, its coverage, or how it addresses privacy and quality. There really isn't an established best practice here beyond "test before you buy." The process typically runs on goodwill and trust, especially since no one is 'purchasing' anything at this stage.

The stakes are high in partnerships: if the partner offering up competitive data to build the product moat is not in the data business, they may consider the talks a distraction and walk away at any time. Compounding the urgency, a competitor may be negotiating a deal at the same time. It is therefore imperative to come up with an approach to test the data as soon as possible. This is all the more challenging when multiple partners are required to form the best single source of truth and data moat.

The best way to test the value of any data in a partnership is to ask for a sample along with the data sheet. Assuming the company agrees to provide one, the sample should be tested for coverage (how complete is the data, and are there gaps?), duplicates, accuracy, inconsistency in formatting (e.g., dates, currency), privacy practices (was the data collected with consent, and can the personal information be de-identified?), and refresh rate (how old is the data, and how often is it updated?).
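
As a rough illustration, a minimal vetting pass over a partner sample might look like the sketch below. The file name and columns (customer_id, updated_at) are hypothetical stand-ins; the real fields would come from the partner's data sheet.

```python
import pandas as pd

# Load the partner-provided sample (file name and columns are hypothetical).
sample = pd.read_csv("partner_sample.csv")

# Coverage: what fraction of each column is missing?
print(sample.isna().mean().sort_values(ascending=False))

# Duplicates: are records repeated on the stated primary key?
print("duplicate records:", sample.duplicated(subset=["customer_id"]).sum())

# Formatting consistency: how many dates fail to parse cleanly?
parsed = pd.to_datetime(sample["updated_at"], errors="coerce")
print("unparseable dates:", parsed.isna().sum())

# Refresh rate: how recent is the newest record?
print("most recent record:", parsed.max())
```

None of this replaces accuracy or privacy review, but it surfaces coverage gaps, duplicates, and staleness within minutes of receiving the sample.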

In product development, post-contract, it is typical to further explore datasets and their value through an initial pilot.

Pilot: Quality and Testing  

As every data scientist and analyst has experienced, not all data is created equal. Some datasets have well-designed schemas, while others are by-products of a process. Most will have coverage limitations and require heavy cleaning. Quality testing takes two forms. First, check for completeness (coverage) and cleanliness (i.e., removing duplicate records, null values, and empty cells). This is best done through a data audit, which can be written as an automated script. A data audit takes the business context into consideration and ensures, for example, that content does not end up in a non-corresponding cell.
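
A sketch of such an automated audit is below. The column names and expectations are hypothetical; in practice each check encodes a business-context rule about what belongs in that column.

```python
import pandas as pd

# Hypothetical per-column expectations; each check flags content that has
# landed in a non-corresponding cell (e.g., text in a numeric amount field).
EXPECTATIONS = {
    "order_id": lambda s: s.astype(str).str.match(r"^\d+$"),
    "order_date": lambda s: pd.to_datetime(s, errors="coerce").notna(),
    "amount_usd": lambda s: pd.to_numeric(s, errors="coerce").notna(),
}

def audit(df: pd.DataFrame) -> None:
    """Report duplicates, nulls, and values violating column expectations."""
    print("duplicate rows:", df.duplicated().sum())
    for column, check in EXPECTATIONS.items():
        values = df[column]
        print(
            column,
            "| nulls:", values.isna().sum(),
            "| violations:", (~check(values.dropna())).sum(),
        )

audit(pd.read_csv("pilot_extract.csv"))  # hypothetical pilot file
```

Dedicated validation frameworks cover the same ground more thoroughly; the point is that the checks encode the business context rather than just generic null and duplicate counts.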

The second quality test is ensuring that business rules correspond to the data. This includes making sure the data is verified and authoritative, which is especially important for external/open data sources such as government data, self-reported data, and social media data. Misinformation and disinformation are real, especially as the sources of open data expand tremendously. For example, if one is building a chatbot for COVID-19 symptom checking, verifying the content sources is critical to having an algorithm consumers can trust.
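
For the COVID-19 example, a hedged sketch of such a business rule is an allowlist of authoritative publishers; the domains and field names below are illustrative, not a recommendation of specific sources.

```python
# Illustrative business-rule check: only keep records whose source appears on
# an allowlist of authoritative publishers (domains and fields are hypothetical).
AUTHORITATIVE_SOURCES = {"who.int", "cdc.gov", "nhs.uk"}

def is_trusted(record: dict) -> bool:
    """Reject self-reported or unverified content before it reaches the model."""
    return record.get("source_domain") in AUTHORITATIVE_SOURCES

records = [
    {"symptom": "fever", "source_domain": "cdc.gov"},
    {"symptom": "hiccups", "source_domain": "random-blog.example"},
]
trusted = [r for r in records if is_trusted(r)]
print(trusted)  # only the authoritative-source record survives
```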

Data In Production 

With the results of the pilot in hand, the product is ready to go to production. Companies underestimate the effort required to maintain data. Data scaling, cloud storage optimization, and content management are critical in this regard. Data scaling considerations ensure the database setup supports vertical or horizontal scaling. NoSQL databases are often the preferred option for growing data operations, and pairing that choice with cloud storage that aligns with horizontal scaling makes sense in that case. However, for smaller operations there may be business reasons why a SQL relational database still makes sense.
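
For illustration only, here is a minimal sketch of what a horizontally scaled setup can look like with a document store such as MongoDB. It assumes a sharded cluster with a mongos router and pymongo installed; the database, collection, and shard key are hypothetical.

```python
from pymongo import MongoClient

# Connect to a hypothetical mongos router for a sharded MongoDB cluster.
client = MongoClient("mongodb://localhost:27017")

# Enable sharding on the database, then shard the collection on a hashed key
# so writes and reads are distributed across nodes -- this is what horizontal
# scaling looks like in practice for a growing NoSQL data operation.
client.admin.command("enableSharding", "ml_product")
client.admin.command(
    "shardCollection", "ml_product.events", key={"partner_id": "hashed"}
)
```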

As data volumes grow, automated scripts are necessary to continuously test data latency, in addition to audits that check coverage, validity, accuracy, consistency, and availability.
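
A minimal latency-and-freshness monitoring sketch is below; the feed endpoint and the updated_at field are hypothetical, and ISO-8601 timestamps with explicit offsets are assumed.

```python
import time
from datetime import datetime, timezone

import requests

FEED_URL = "https://api.example-partner.com/v1/records"  # hypothetical endpoint

def check_feed() -> None:
    """Measure request latency and how fresh the newest record is."""
    start = time.monotonic()
    response = requests.get(FEED_URL, timeout=30)
    latency = time.monotonic() - start
    response.raise_for_status()

    records = response.json()
    # Assumes ISO-8601 timestamps with timezone offsets, e.g. "2024-05-01T12:00:00+00:00".
    newest = max(datetime.fromisoformat(r["updated_at"]) for r in records)
    staleness = datetime.now(timezone.utc) - newest

    print(f"latency: {latency:.2f}s, newest record is {staleness} old")
    # In production this would feed an alerting system rather than print.

check_feed()
```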

Content Management 

Just as building the moat by convincing new partners and curating open and purchased data was a challenge, practicing ongoing content management is equally important. With that in mind, here are the points to watch out for.

Cost of data increases: For open public data, it is important to confirm at regular intervals that the data is still free and widely available. Sometimes it's not so much a cost issue as a business issue: the data provider may have decided to close the data and work only with certified third-party aggregators. It is best to have a mitigation plan in advance. For paid public data sources this is more straightforward, since the costs are explicit, for example in a SaaS agreement.

Terms of the data partner relationship: Reviewing the fine print of the agreement is necessary. Does it require an annual license, and what happens after the first year? Planning to have this conversation well in advance of the anniversary is critical.

Technical consumption: Did the data sharing method change? With open public data sources there will typically be challenges. Data owners may not offer an API to ease ingestion, and may move from one complicated method to another. Data latency issues and rate limits per time period are typical problems. Paid sources usually won't have such problems, although that mostly holds for the most sophisticated data providers. Technical issues related to data partners should be an ongoing conversation informed by pilot results, and should be discussed as issues arise.
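
A hedged sketch of defensive ingestion against rate limits and transient failures is below; the endpoint and retry limits are hypothetical.

```python
import time

import requests

FEED_URL = "https://api.example-partner.com/v1/records"  # hypothetical endpoint

def fetch_with_backoff(max_attempts: int = 5) -> list:
    """Retry on rate limits (HTTP 429) and server errors with exponential backoff."""
    for attempt in range(max_attempts):
        response = requests.get(FEED_URL, timeout=30)
        if response.status_code == 429 or response.status_code >= 500:
            # Honor Retry-After if the provider sends it (seconds form assumed),
            # otherwise back off exponentially.
            wait = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("feed unavailable after retries")
```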

Privacy requirements: Lastly, staying in line with compliance requirements and data privacy protection is just as important. With GDPR, CCPA, HIPAA, and new state and local laws, consumer consent, the right to be forgotten, proper anonymization procedures, and timely notification in the event of a security breach all need attention as data is maintained.
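
As a closing sketch, one common anonymization step is dropping direct identifiers and pseudonymizing the join key with a salted hash. The column names and salt handling below are illustrative; real compliance work goes well beyond this.

```python
import hashlib

import pandas as pd

SALT = "load-from-a-secret-manager"  # placeholder; never hard-code a salt in practice

def pseudonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers and replace the join key with a salted hash."""
    out = df.drop(columns=["name", "email", "phone"], errors="ignore")
    out["customer_id"] = out["customer_id"].map(
        lambda value: hashlib.sha256((SALT + str(value)).encode()).hexdigest()
    )
    return out
```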