How do data science projects work?

A primer for managers, stakeholders and those who are interested

No, not that kind of science. But sort of.

Scaling deep learning

A few weeks ago, my attention was drawn to an excellent research paper via Adrian Colyer’s Morning Paper newsletter. The paper explores how we can improve the state of the art in deep learning, and whether we can make progress more predictable. Loosely speaking, applying deep learning to a problem means designing a network architecture and then:

  1. Evaluating it with a small amount of data to see whether it works.
  2. Iteratively training and adjusting the network based on the amount of error seen.
  3. Stopping training when the network performs well enough, or when it isn’t improving any more.
Deep learning with little data is rarely better than random. But once a suitable architecture has been found, throwing more data at it improves the results, up to a point.
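To make steps 2 and 3 concrete, here is a minimal sketch of that loop in PyTorch on synthetic data. The architecture, learning rate and patience value are placeholder choices for illustration, not recommendations from the paper; a real project would also monitor a held-out validation set rather than the training loss.

```python
# Toy training loop with early stopping, on synthetic data.
import torch
from torch import nn

X = torch.randn(1000, 20)                    # 1,000 examples, 20 features
y = (X.sum(dim=1) > 0).float().unsqueeze(1)  # a made-up binary target

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

best_loss, patience, stale = float("inf"), 5, 0
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # step 2: measure the error...
    loss.backward()
    optimizer.step()              # ...and adjust the network
    if loss.item() < best_loss - 1e-4:
        best_loss, stale = loss.item(), 0
    else:
        stale += 1
    if stale >= patience:         # step 3: stop when no longer improving
        print(f"Stopped at epoch {epoch}, loss {best_loss:.4f}")
        break
```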

Managing data science projects

If you’re used to running or observing software teams who aren’t doing data science, you’ll notice that the sequence of steps above sounds fairly peculiar and unpredictable compared with writing your typical CRUD applications, and you’d be right! If your business expects the same kind of predictability and time commitments from data science that it gets from regular software projects, then some uncomfortable conversations may be required: the two work very differently.

A data science project typically moves through six phases:

  1. Defining the problem: Can we translate the business problem into a scientific question that the team can actually work on?
  2. Finding a model: Assuming we think we can solve the problem, what sort of techniques may be likely to work? How can we prototype them and get some initial results that let us be confident enough to proceed?
  3. Training the model: How and where do we get more data from?
  4. Application to real data: Now that our model is trained and is giving acceptable results, how do we apply it to real data to prove that it works?
  5. Production: Now that we have a successful model, how does it get moved into production, and by whom?
  6. Maintenance: Will this model degrade over time and, if so, by how much? Does it need retraining at regular intervals in the future?

1. Defining the problem

Contrary to popular belief, this can be the riskiest, most difficult and most time-consuming part of the entire process. Business problems often aren’t easily translated into a scientific problem that can be worked on immediately. Stakeholders will want to deliver a feature to users, or to gain an insight into some data, but the team has to start from first principles and build the problem definition from the ground up.

2. Finding a model

Given a clear definition of the problem, such as predicting whether users are about to leave your website, or classifying images of credit cards, the project begins with finding a model. This stage is typically run in a time box whose length depends on the size and difficulty of the problem. So the first question the business has to answer is: how long are we willing to spend finding out whether this might be possible at all?
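As an illustration of what this time-boxed prototyping can look like, here is a minimal sketch using scikit-learn on a synthetic stand-in for the churn-prediction example; the dataset is invented for illustration. The question it answers is the cheap one you want answered first: does a simple model beat a trivial baseline at all?

```python
# Prototype check: does a simple model beat a trivial baseline?
# The "churn" data here is synthetic; real features would come
# from your own product analytics.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"baseline accuracy: {baseline.mean():.2f}")
print(f"simple model:      {model.mean():.2f}")
# If the model barely beats the baseline, that's a signal to stop
# (or rethink the features) before spending more of the time box.
```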

3. Training the model

If the first phase has been a success, it’s time to train the model. Like the first phase, this centers around time and money. In broad terms, the more training data that are available, the better the model will get. You will need to discuss, again, where to get the data, how to get them annotated, and how much you are willing to spend on that task. It will typically cost much more than the first phase, perhaps thousands of dollars if you need human annotation to be done.
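The claim that more data helps, up to a point, is easy to check empirically. This is a minimal sketch using scikit-learn’s learning_curve on synthetic data; on a real project you would run it with your own model and dataset.

```python
# Sketch: how does model quality change as training data grows?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> accuracy {score:.3f}")
# Typically the curve rises quickly and then flattens: each extra
# annotated example buys less improvement, which is exactly the
# cost/benefit conversation to have with the business.
```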

4. Application of the model to real data

Precision is the proportion of your model’s positive classifications that are actually correct. Recall is the proportion of all the truly positive cases that your model manages to find. (Interested parties might find it fun to read a more thorough definition.) There will always be errors in any model, but tuning it to balance precision and recall can greatly improve the model’s effectiveness: tuning for high recall typically lowers precision, which means your users may see more false positives. Is that acceptable? It could be for classifiers diagnosing illness, where you’d rather be safe than sorry, but it could be extremely irritating for users of text analysis software.
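In symbols, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN are true positives, false positives and false negatives. Here is a minimal sketch of the trade-off, sweeping the decision threshold of a toy classifier; the scores and labels are invented for illustration.

```python
# Sketch: precision/recall trade-off as the decision threshold moves.
# Scores and labels are made up; imagine them coming from a classifier.
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]  # ground truth
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.25, 0.55, 0.85):
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# Lowering the threshold catches more real positives (recall goes up)
# but lets in more false alarms (precision goes down).
```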

5. Production

If your team has made it to this phase, then it’s looking very likely that you’ll get that feature delivered. But they’re not done yet. Unless you’re extremely lucky, your data scientists will not be experts at production engineering, so at this point they’ll partner up with other engineers to move the project forward.
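What “moving into production” looks like varies wildly between teams. As one common pattern, here is a minimal sketch of wrapping a trained scikit-learn model in a Flask prediction endpoint; the file name model.joblib and the feature layout are assumptions for illustration, not a prescribed architecture.

```python
# Minimal model-serving sketch: a trained model behind an HTTP endpoint.
# Assumes the data science team saved their model with joblib beforehand.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact from training

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. a list of numbers
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=8000)
```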

6. Maintenance

Like regular software, the models you build will also need maintenance over time. Depending on the type of data you are processing, especially topical data such as social media feeds, the inputs will change: consider how widespread the use of emojis is today compared with five years ago, or the names of popular video games now compared to last year. Models won’t know about these changes if they are abandoned once shipped.
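One lightweight way to catch this degradation is to keep scoring the shipped model on a freshly labelled sample at regular intervals and alert when performance drops. A minimal sketch, assuming a hypothetical load_fresh_labelled_sample() that your team would implement against its own data pipeline:

```python
# Sketch: periodic check that a shipped model still performs.
# load_fresh_labelled_sample() is hypothetical: in practice it would
# pull recent production inputs that humans have labelled.
import joblib
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.90  # measured when the model first shipped
ALERT_THRESHOLD = 0.05    # how much degradation we tolerate

def check_model_health(load_fresh_labelled_sample):
    model = joblib.load("model.joblib")  # hypothetical artifact
    X_recent, y_recent = load_fresh_labelled_sample()
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy < BASELINE_ACCURACY - ALERT_THRESHOLD:
        print(f"accuracy fell to {accuracy:.2f}: schedule retraining")
    else:
        print(f"accuracy {accuracy:.2f}: still healthy")
```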

Cycle complete

And that’s it: the rough life cycle of a data science project. In many ways, these projects are harder to manage than traditional software projects.

