Data

Dealing with Data Dilemmas

At Refinery Labs, data is the foundation of our business.

We thrive on making sense of the copious amounts of raw data available to us.

So much of our daily routine has been turned into aggregated and anonymized data. However, the abundance of this raw data doesn’t mean it’s easy to use. Raw data arrives as a messy, unstructured mountain – and it’s often a challenge to pull anything actionable out of that mess.

So how does one put this data to work? Well, we’re happy to help you get there.

Alternative Data

Alternative data is a substitute for data that you originally intended to collect but couldn’t obtain.

You could even think of it as complementary data.

As an example, Orbital would use satellite data to estimate the amount of oil in tankers around the world. However, like any other form of raw data, alternative data can arrive in rough shape and will require some work before it becomes anything actionable.

Two issues you may encounter in working with alternative data are noise and ephemerality.

Noise

The problem with too much noise in your data is that it becomes difficult to make the correct interpretation – there’s too much going on in the data set, often in the form of multiple outliers.

It’s a bit like the paradox of choice, where having too many options – data points, in our case – hinders us from making the proper deduction.

Ephemerality

Ephemerality refers to the lack of sufficient data to make a reliable conclusion.

For example, when trying to evaluate the profitability of a certain product, should a store owner look at a week’s worth of data or a year’s?

Just collecting a week’s worth of POS data on the item won’t be enough for the store owner to say that she should stock more of it. She’ll most likely need to observe it for a much longer period to see how the product performs through the seasons. Along the way she may learn a lot about the product, such as seasonality trends.

Small sample sizes are something to avoid whenever possible.

How We Avoid Both

At Refinery Labs, we make heavy use of the Pandas and NumPy libraries for data manipulation and analysis. To correct noise issues, you can rely on denoising, smoothing, or both.

Turn Down That Noise!

Denoising is the process of removing the unimportant details from data, making it easier for you to make a reliable interpretation. There are two approaches to denoising.

  1. Statistical Approach – if anyone on your team has domain expertise and understands what your data should look like, you can model it. Modelling your data helps you remove data points that fall where they are not expected. While out-of-the-box solutions exist for denoising data, the right library varies depending on the domain you’re working in. Once you have selected and implemented an algorithm, you can look at the signal-to-noise ratio for guidance on how much to denoise (a minimal sketch of this approach follows the list). The danger in denoising too much is that it can lead to a self-fulfilling bias – you’re modelling the data too far in the direction of what you expect it to be.
  2. Engineering Approach – if there is a way to collect a given data point multiple times through different methods, you should do it. Having duplicate data points allows you to cross-reference the collected sets of data and helps you discern what should or shouldn’t be included.
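
To make the statistical approach concrete, here’s a minimal sketch using Pandas and NumPy. It stands in a simple linear fit for the domain model (purely for illustration), flags points whose residuals fall outside an expected band, and reports a rough signal-to-noise ratio. The series, the three-sigma threshold and the variable names are illustrative choices, not a prescription.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Illustrative series: a slow trend plus noise and a few spurious spikes.
    t = np.arange(200)
    values = 0.05 * t + rng.normal(scale=0.5, size=t.size)
    values[[30, 90, 150]] += 8  # inject a few outliers
    series = pd.Series(values, index=t)

    # "Model" the expected shape of the data - here, a simple linear fit.
    # In practice this is where your domain knowledge comes in.
    coeffs = np.polyfit(series.index, series.values, deg=1)
    expected = np.polyval(coeffs, series.index)
    residuals = series - expected

    # Flag points that land where the model says they shouldn't be.
    threshold = 3 * residuals.std()
    is_noise = residuals.abs() > threshold
    denoised = series[~is_noise]

    # A rough signal-to-noise ratio to guide how aggressively to denoise.
    snr = expected.var() / residuals[~is_noise].var()
    print(f"Dropped {is_noise.sum()} suspect points, SNR ~ {snr:.1f}")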

However, if you’re like most of us, you probably don’t have the luxury of collecting data over long periods of time – let alone running multiple data collections simultaneously.

This stuff costs money!

Smooth Riding

Smoothing, on the other hand, tries to approximate the function that best describes the data.

Smoothing and denoising usually go hand in hand. Take caution when smoothing data, as it is more prone to losing valuable information and may bake in the biases you hold about the data.

Furthermore, you must be careful with the technique you select for smoothing data. Whether that’s a moving average or Holt-Winters, to name a couple, each method has a specific purpose and appropriate use case. The sketch below shows both.
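
As a hedged illustration, the sketch below applies a simple moving average with Pandas and a Holt-Winters (exponential smoothing) fit via statsmodels. The series is fabricated, and the window length and seasonal settings are placeholders; the right values depend entirely on your data.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    rng = np.random.default_rng(1)

    # Illustrative monthly series with a yearly seasonal pattern plus noise.
    idx = pd.date_range("2015-01-01", periods=60, freq="MS")
    seasonal = 10 * np.sin(2 * np.pi * idx.month / 12)
    series = pd.Series(50 + seasonal + rng.normal(scale=3, size=len(idx)), index=idx)

    # Moving average: cheap and easy to reason about, but it lags behind turning points.
    moving_avg = series.rolling(window=6, center=True).mean()

    # Holt-Winters: models level, trend and seasonality explicitly.
    model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
    smoothed = model.fit().fittedvalues

    print(moving_avg.tail(3))
    print(smoothed.tail(3))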

Imputation

For problems related to ephemeral data, imputation will be your friend.

Given data collected at regular intervals (non-continuous collection, in other words), you have to simulate what the data would have been during the periods when you weren’t collecting.

Essentially, you’d be like a detective at a crime scene, reconstructing the events that led up to it. Be aware that data can only be ephemeral if it is time-series data.

We go through this same problem almost daily at Refinery Labs.

You see, we regularly deal with census data – and censuses only occur every 5 years. I’m sure you can imagine how many educated guesses we would have to make if it weren’t for imputation techniques.
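
As a toy version of that census problem, here’s a minimal Pandas sketch that takes counts observed every five years and estimates the years in between. The figures are made up for illustration; the point is the reindex-then-interpolate pattern.

    import pandas as pd

    # Hypothetical population counts, observed only once every five years.
    census = pd.Series(
        [120_000, 131_500, 140_200, 152_800],
        index=[2005, 2010, 2015, 2020],
    )

    # Reindex to every year, then estimate the years we never observed.
    yearly = census.reindex(range(2005, 2021))
    estimated = yearly.interpolate(method="linear")

    print(estimated.loc[2011:2014])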

Imputation Computation

First, we determine how many variables we have.

If we’re only imputing a single variable in our dataset, consider these univariate imputation techniques (a short Pandas sketch follows the list):

  1. Mean or Median Imputation
  2. Carry Forward Last Observed Datapoint
  3. Linear Interpolation
  4. Kalman Filter
  5. Moving Average
  6. Random
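
To make a few of those options concrete, here’s a minimal Pandas sketch (the Kalman filter option needs a state-space model and is left out). The series and its gaps are fabricated purely for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)

    # Illustrative daily series with some values missing.
    idx = pd.date_range("2021-01-01", periods=30, freq="D")
    series = pd.Series(rng.normal(loc=100, scale=5, size=len(idx)), index=idx)
    series.iloc[[4, 5, 12, 20, 21, 22]] = np.nan

    # 1. Mean or median imputation
    mean_filled = series.fillna(series.mean())

    # 2. Carry forward the last observed data point
    locf_filled = series.ffill()

    # 3. Linear interpolation
    interpolated = series.interpolate(method="linear")

    # 5. Moving average of the surrounding observations
    rolling_filled = series.fillna(
        series.rolling(window=5, min_periods=1, center=True).mean()
    )

    # 6. Random draw from the observed values
    random_filled = series.copy()
    missing = random_filled.isna()
    random_filled[missing] = rng.choice(series.dropna().values, size=missing.sum())

    print(pd.DataFrame({
        "mean": mean_filled, "locf": locf_filled, "linear": interpolated,
        "rolling": rolling_filled, "random": random_filled,
    }).head(8))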

Otherwise, you can use multivariate time-series imputation techniques (see the sketch after this list):

  1. K-Nearest Neighbors
  2. Random Forest
  3. Multivariate Singular Spectrum Analysis
  4. Expectation-Maximization
  5. Multiple Imputation with Chained Equations
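
For the multivariate case, here’s a hedged scikit-learn sketch: KNNImputer covers the K-Nearest Neighbors option, and IterativeImputer is a common stand-in for chained-equations (MICE-style) imputation. The DataFrame, its column names and the missingness rate are invented for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    # IterativeImputer is still experimental and needs this explicit enable.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(3)

    # Three related series with scattered missing values.
    frame = pd.DataFrame(
        rng.normal(size=(100, 3)), columns=["foot_traffic", "sales", "returns"]
    )
    frame = frame.mask(rng.random(frame.shape) < 0.1)

    # K-Nearest Neighbors: fill each gap from the most similar complete rows.
    knn_filled = pd.DataFrame(
        KNNImputer(n_neighbors=5).fit_transform(frame), columns=frame.columns
    )

    # Chained equations: model each column from the others and iterate.
    mice_filled = pd.DataFrame(
        IterativeImputer(max_iter=10, random_state=0).fit_transform(frame),
        columns=frame.columns,
    )

    print(knn_filled.describe().loc[["mean", "std"]])
    print(mice_filled.describe().loc[["mean", "std"]])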

You can further narrow down your choice of technique by looking at where the data is missing.

Are you trying to fill in gaps? Do you have a bit of past and a bit of future values on either side?

Or are you planning to use past values to extrapolate future values?

If the former applies to you, pick from these techniques:

  1. Linear Interpolation
  2. Kalman Smoothing
  3. Moving Average
  4. K-Nearest Neighbors
  5. Random Forest
  6. Multivariate Singular Spectrum Analysis
  7. Expectation-Maximization
  8. Multiple Imputation with Chained Equations

If it’s the latter, the options are more limited and may even require a greater level of domain expertise:

  1. Carry Over Last Observed Data Point
  2. Mean or Median Imputation
  3. Random

Just to note – when you do impute, you have to evaluate and test.

You can choose between different statistical approaches to evaluate your imputation choices:

  1. Mean Absolute Error (MAE)
  2. Root Mean Squared Error (RMSE)
  3. Pearson’s Correlation Coefficient (r)
  4. Rank Product

Furthermore, the statistical properties of your data – its mean and variance, for example – should not change materially after adding your imputed data.
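
One simple way to do this, sketched below, is to hide some values you actually have, impute them, and score the result against the truth with MAE, RMSE and Pearson’s r, then compare the summary statistics before and after. The data, the masking fraction and the choice of linear interpolation are illustrative only.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)

    # A series we fully observe, so we can hide values and check ourselves.
    truth = pd.Series(
        50 + 10 * np.sin(np.linspace(0, 20, 365)) + rng.normal(scale=1, size=365)
    )

    # Hide 15% of the points, then impute with the method under test.
    mask = rng.random(truth.size) < 0.15
    imputed = truth.mask(mask).interpolate(method="linear").bfill().ffill()

    # Score only the positions we hid.
    errors = imputed[mask] - truth[mask]
    mae = errors.abs().mean()
    rmse = np.sqrt((errors ** 2).mean())
    pearson_r = imputed[mask].corr(truth[mask])
    print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  r={pearson_r:.2f}")

    # The overall statistical properties should not shift much either.
    print(truth.describe()[["mean", "std"]])
    print(imputed.describe()[["mean", "std"]])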

Supervised Machine Learning

So we just took a look at how Refinery Labs fills in the gaps when the desired data isn’t completely available. But what about when you have all of the data – so much that you don’t even know where to start?

Maybe you should consider supervised machine learning.

This is a pressing problem for those in traditional industries, like health and insurance, that are just undergoing digital transformation. These companies are realizing they can’t really bring their knowledge or their vast quantities of data into their new systems because their old data formats aren’t readable by machines – in effect, their data has become an artifact.

So how does this supervised machine learning work?

It sounds a bit utopian, but it’s a collaboration between machines and humans.

The process looks like this:

Given a small amount of labelled data, we humans create a machine learning model. We then apply this model to a large pool of unlabelled data over several iterations. At the end of each iteration, we identify the points that are difficult for the machine to label, and we label them ourselves.

In the process, we are making the model, or machine, smarter. Sounds easy, right?

Not.

There are many concerns to consider when conducting supervised machine learning.

As an example, how do you figure out which data points to label for the machine?

You need to really know your data to know which data points will accurately represent reality.
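
One standard heuristic for this kind of human-in-the-loop labelling is uncertainty sampling: ask the model which points it is least sure about and route exactly those to a human. The sketch below is a bare-bones version of that loop; the dataset, the model choice and the batch size are placeholders, not our production setup.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)

    # A stand-in dataset: a small labelled pool and a large unlabelled one.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=5)
    labelled = rng.choice(len(X), size=50, replace=False)
    unlabelled = np.setdiff1d(np.arange(len(X)), labelled)

    for iteration in range(5):
        # Train on whatever labels we have so far.
        model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])

        # Find the points the model is least confident about.
        probs = model.predict_proba(X[unlabelled])
        uncertainty = 1 - probs.max(axis=1)
        hardest = unlabelled[np.argsort(uncertainty)[-20:]]

        # In real life a human labels these; here the true labels stand in.
        labelled = np.concatenate([labelled, hardest])
        unlabelled = np.setdiff1d(unlabelled, hardest)
        print(f"iteration {iteration}: {len(labelled)} labelled points")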

Synthetic Data

The last way to deal with nearly or entirely non-existent data is synthetic data, a newly coined term.

It’s to raw data what a Beyond Meat burger is to a McDonald’s Quarter Pounder.

This approach to generating data is still in its infancy. For it to be credible, the foundational data used to produce the synthetic data needs to be representative of the real data. It also requires greater domain expertise, which takes time and maturity to build.

Closing

We hope this gives you a glimpse of how we turn messy and limited data into actionable insights.

Overwhelming? Don’t fret – our clients all feel the same way, and that’s why we work so well together. It seems like a lot to take in, but with a process in place and the right skill set at the table, it is definitely doable.

Not sure how “data” can help your organization? Refinery Labs is here to help.

We look forward to meeting you.