So says Marc Andreessen, and that phrase has since evolved into “Data is eating the world.” Companies with troves of data find themselves in a lucky position: the byproduct of their transactions and offerings has become a commodity in itself.
Data is that commodity. Direct mailing and delivery companies transitioned into data-first companies. Now, they are beginning to catalog and organize their siloed, disparate data. However, not all organizations or industries are equipped with the skills and capabilities to deal with the growing demand for data.
And so, a new field emerges – data management.
The Strata Data Conference gathers professionals and students of all skill levels. The goal is to exchange current best practices and explore new ways of cleaning, cataloging, and analyzing data, among other things.
This year’s conference saw greater adoption of Jupyter notebooks as the Swiss Army knife of both data analysts and data scientists. Vendors have adapted the popular web app’s interface to make it easier for data analysts and data scientists to transition to a Jupyter-based product.
Privacy is a growing concern, especially at the C-suite level. With GDPR in full swing, more companies are looking for ways to build GDPR compliance into their tech stack. For instance, Aldo partnered with Talend to track and organize its customers’ personal data. Such a process is now known as data lineage, and it’s part of a broader concept called data management. As a quick summary of these concepts:
1. Data Management is the development and execution of architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an enterprise.
2. Data Lineage is the organization’s ability to track and monitor the flow of data from its origin (which may be external) as it moves through the organization. Companies that track their data lineage know where their data comes from and where it currently resides.
3. Data Discovery is another new term, coined to refer to the ability to know an organization’s current data assets. In other words, it functions as a data catalog.
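To make the lineage and discovery definitions above more concrete, here is a minimal sketch in Python. It is not how any of the tools mentioned in this post work; the dataset names and the `LINEAGE` mapping are hypothetical, and a real system would store far richer metadata. The idea is simply that lineage can be modeled as a graph of "derived from" links, and discovery as a listing of known assets.

```python
from collections import deque

# Hypothetical lineage records: each dataset maps to the datasets it was derived from.
# An empty list marks an origin (which may be an external source).
LINEAGE = {
    "crm_export": [],
    "web_orders": [],
    "customers_clean": ["crm_export"],
    "orders_enriched": ["web_orders", "customers_clean"],
    "marketing_report": ["orders_enriched"],
}


def trace_origins(dataset, lineage):
    """Walk upstream from `dataset` and return the set of origin datasets feeding it."""
    seen, origins = set(), set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            # No recorded upstream: treat this dataset as an origin.
            origins.add(current)
        queue.extend(parents)
    return origins


def catalog(lineage):
    """Data discovery in its simplest form: list every known data asset."""
    return sorted(lineage)


print(trace_origins("marketing_report", LINEAGE))
print(catalog(LINEAGE))
```

Tracing `marketing_report` upstream answers the lineage question (where did this data come from?), while `catalog` answers the discovery question (what data do we have?).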
For the most part, organizations have built infrastructure in-house to tackle these problems. Bayer Crop Science (Haystack), LinkedIn (WhereHows), Lyft (Amundsen), Intuit (SuperGlue), and Uber (Databook) are just some of the companies that were at the Strata Data Conference to share what they built.
Streaming data is also now the norm, with B2C companies collecting data at millisecond granularity. New data formats are emerging as well, and they go beyond standard types like integers and text. They include complex formats like videos and images, along with richer information such as geospatial data.
Geospatial data requires a whole new tech stack. With that comes the need for technologies that are both reliable and powerful enough to cope with continuous, massive ingestion.
The open-source community heeded the call and came up with great projects such as Apache Kafka, Apache Spark, and Apache Flink. Based on conversations I had with attendees at Strata, Apache Kafka seems to be the most popular for ETL of streaming data.
With data cleaning often cited as taking up 80% of a data scientist’s time, it is no surprise that using machine learning for such tedious tasks has been a popular topic. HoloClean is considered the leading open-source tool, and a variety of vendors provide targeted solutions for the same problem.
These tools are meant to complement and improve the productivity of a dedicated team with domain expertise, and are by no means a silver bullet for all data cleaning and imputation problems. Data imputation and deduplication sessions at this year’s conference were very popular for this reason.
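For readers new to the terms, here is a small sketch of what deduplication and imputation mean in practice. It is a deliberately simple rule-based baseline in plain Python, not a machine-learning approach and not how HoloClean or any vendor tool works. The sample records and the choice of email as the matching key are assumptions made for illustration.

```python
from statistics import median

# Hypothetical sample records: one duplicate customer and two missing ages.
records = [
    {"name": "Ada Lovelace ", "email": "ADA@example.com", "age": 36},
    {"name": "ada lovelace", "email": "ada@example.com", "age": None},
    {"name": "Alan Turing", "email": "alan@example.com", "age": 41},
    {"name": "Grace Hopper", "email": "grace@example.com", "age": None},
]


def dedupe(rows):
    """Keep the first row seen for each email, compared case-insensitively."""
    seen, unique = set(), []
    for row in rows:
        key = row["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows, field):
    """Fill missing values of `field` with the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


cleaned = impute_median(dedupe(records), "age")
for row in cleaned:
    print(row)
```

Even this toy version shows why domain expertise matters: deciding that email identifies a customer, or that a median is a sensible fill value for age, is a judgment call that a tool cannot make on its own.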
For some companies with a substantial amount of data and thousands of KPIs, using machine learning for BI and visualization has been top of mind. Model development, governance, and operations are now becoming more important as companies aim to reduce their time to actionable insights.
There has been a lot of hype over the possibilities data can bring, and the industry has delivered tremendously. After attending the Strata Data Conference, there’s no doubt in my mind. However, there are still gaps to be filled, namely resolving data discrepancies and managing privacy better. IBM’s Robert Thomas says it best: “the hype phase is over and the work needs to start.”
Attending the Strata Data Conference was an incredible experience in its own right. I was able to meet intelligent minds from many different organizations and industries, share thoughts and perspectives, and discuss the future of data. It’s my pleasure to share what I learned with you!
In the next post of this series, we’ll dig further into the specifics.