Probably the hottest topic most companies are talking about, and still struggling with: machine learning. AI, data science, algorithms, self-driving cars. What?! As a product manager, it can be hard to figure out what’s really going on in a field that will power a lot of use cases in the future.
This is the machine learning guide for product managers.
Want to end the confusion and get a kickstart as a manager on the basics of machine learning within 5 minutes? You’ve come to the right place!
What is Machine Learning?
Machine learning (ML) is the specific field that powers what the public knows as Artificial Intelligence (AI). Machine learning is the science of getting a computer to structure and classify data by learning from examples, so that the data can be understood and used for a specific use case.
Machine learning is more about data than about complex models. The best analogy is a parent trying to teach a small child what bunnies are. You’ve probably seen a child being taught from a picture book, with the parent pointing at each animal and naming it.
That’s basically what Machine Learning is: you try to teach an interpreter (= machine instead of a child) how to interpret given data. Next time it encounters such data, it can process it.
The biggest difference: machines can work through far more examples, far faster, than any human can.
For something to be artificially intelligent, it must machine-learn first.
What makes Machine Learning so unique?
This is where you need to understand a bit of computer science. All the code and traditional algorithms your developers write are strictly sequenced in execution. The outcome of the code is already determined when it is written, as with code for user login, showing a blog post or searching for a hotel. Code that always returns the same output for the same input is called deterministic code.
But what if I ask a computer to pick something randomly? That’s different, right? Wrong. Although we interpret it as random, a computer needs exact instructions. The algorithm behind that random picking is still very much sequenced, but has a few ways to pick something ‘random’ from a given number of options (computer scientists call this pseudo-random). The outcome is therefore determined, but may vary within a range of values. This is called semi-deterministic code.
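To make this concrete, here is a minimal Python sketch. The option names and the seed value are made up for illustration. Python’s standard `random` module is a pseudo-random generator: started from the same seed, it makes the same ‘random’ pick every time.

```python
import random

# Hypothetical options to pick from; any list would do.
OPTIONS = ["hotel A", "hotel B", "hotel C", "hotel D"]

def pick_with_seed(seed):
    """Pick one option using a generator started from a fixed seed."""
    generator = random.Random(seed)  # same seed -> same internal sequence
    return generator.choice(OPTIONS)

first = pick_with_seed(2024)
second = pick_with_seed(2024)

# The pick looks random, but it is fully determined by the seed.
print(first == second)   # True
print(first in OPTIONS)  # True
```

Without a fixed seed, the generator is seeded from something that changes between runs (such as the system clock), which is why the result looks unpredictable to us.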
That’s how everything worked for every program and computer for a long time. Everything was in our control because of the deterministic element.
Machine learning is unique because, from our point of view, it’s non-deterministic and goes beyond our direct control. Nobody writes the decision rules by hand; the model derives them from data. The sequence of decision making is not obvious upfront, the interpretation of data can change, and the outcome can only be predicted with very low (or zero) certainty upfront, based on the given dataset and the given computational footprint.
How does Machine Learning actually learn?
Back to the analogy of teaching a child about bunnies from a book. How does a machine actually learn? There are two ways to teach a machine here:
The first way is supervised learning. This type of learning looks like our analogy. You give the machine data (like pictures of bunnies) and tell it what it is (it’s a bunny), like a parent pointing at the picture and telling the child what animal it is. The computer doesn’t actually know what a bunny is; it classifies the picture as something that was labeled ‘bunny’. This kind of recognition is called classification.
When it gets new data, it can tell you whether unlabeled pictures show bunnies or not, based on the data it was taught/trained on.
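As a rough illustration of learning from labeled examples, here is a minimal Python sketch of a very simple classifier (a nearest-centroid classifier, chosen here only because it fits in a few lines). Instead of pictures it uses two made-up measurements per animal; the numbers, the feature names and the labels are all assumptions for the example.

```python
from math import dist

# Hypothetical labeled training data: (ear length in cm, weight in kg) -> label.
TRAINING_DATA = [
    ((10.0, 2.0), "bunny"),
    ((12.0, 2.5), "bunny"),
    ((11.0, 1.8), "bunny"),
    ((4.0, 4.5), "not bunny"),
    ((5.0, 5.0), "not bunny"),
    ((3.5, 3.8), "not bunny"),
]

def train(examples):
    """Learn one average point (centroid) per label from labeled examples."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(points) for column in zip(*points))
        for label, points in grouped.items()
    }

def classify(model, features):
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: dist(model[label], features))

model = train(TRAINING_DATA)
print(classify(model, (11.5, 2.2)))  # bunny
print(classify(model, (4.2, 4.9)))   # not bunny
```

The model never ‘knows’ what a bunny is. It only knows which group of labeled examples a new data point sits closest to.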
In the real world it is applied to more advanced use cases, like learning from historical stock-market prices to predict future events (such as possible price fluctuations).
Another example: Gmail also uses machine learning to recognize spam and keep future spam out of your mailbox, keeping your inbox clean. Thanks ML!
The second way, unsupervised learning, is like giving the kid a book without the parent saying a word about what the kid is looking at. The kid needs to figure it out alone. In this case, a machine will look for commonalities. That means it will look for features or attributes that the pieces of given data have in common, like the shape of the bunny.
This method of learning is probably preferred for one reason: most data in this world is not labeled in a way that tells a computer what it is looking at. The real value is when a machine can figure it out for you. That’s where this method comes in.
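Looking for commonalities without labels can be sketched in a few lines of Python. This example uses k-means clustering, one common unsupervised technique, picked here as an assumption because it is easy to show. The data points are made up, and the starting centroids are fixed so the result is repeatable.

```python
from math import dist

# Hypothetical unlabeled data: (ear length in cm, weight in kg). No labels given.
POINTS = [(10.0, 2.0), (12.0, 2.5), (11.0, 1.8), (4.0, 4.5), (5.0, 5.0), (3.5, 3.8)]

def k_means(points, centroids, rounds=10):
    """Group points around the given starting centroids (plain k-means)."""
    for _ in range(rounds):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 2: move each centroid to the average of its cluster.
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters

# Start from two of the points; real libraries choose starting points more carefully.
groups = k_means(POINTS, centroids=[POINTS[0], POINTS[3]])
for group in groups:
    print(group)
```

The machine ends up with two groups of similar data points, but it has no name for either group. Deciding that one of them means ‘bunny’ is still up to a human.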
Bonus: Transfer learning
This is the most complex method and I would consider it more experimental, as it is not that common in practice. Transfer learning is really worth mentioning because it solves a big problem when training an ML model:
What if I want to classify something (like pictures where birds should be recognized) but I don’t have a big-enough dataset (= a huge set of pictures already correctly labeled as birds) to teach a machine to recognize them?
The answer: you first train the model on a dataset for a problem that is similar to your actual problem (recognizing bunnies) and then apply it to your actual problem (recognizing birds). So you transfer what the model learned on one problem to another.
I consider this the most interesting method because most datasets companies have are not big enough to really do proper Machine Learning. And in some important cases you’ll need this method because you can’t use the ideal data due to privacy regulations (like medical data).
If you really want to know more, with an in-depth example, check out this video:
What data do you need for Machine Learning?
Once you’ve chosen a method of learning, you can go ahead and gather a dataset. A dataset is just a pile of data examples (also called data points) for the thing you want the model to learn. There are a few things you need to take into account as a manager:
- Without quality data, no Machine Learning. You need proper data engineering to provide quality input data. It needs to be preprocessed and undergo certain operations like data cleansing. You don’t want to give your model corrupt or incorrect data. It’s like giving the engine of your car the wrong type of fuel: it will sputter and eventually fail.
- Bigger is not automatically better. See ‘How much data does Machine Learning need?’ below.
- You’ll need to chop up your dataset into 2 specific sets. The first and biggest piece is meant to teach/train your machine learning model. The second piece is held back and run through the same model after training to verify its accuracy (= the percentage of correctly determined data examples). A common way to chop the dataset is 80/20: 80% for training and 20% for verification. Never use data from a different dataset as verification data, as the possibility of discrepancy is too high.
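The split and the accuracy check from the last point can be sketched in Python as follows. The function names, the seed and the example data are assumptions for illustration; in practice your data scientists will likely use a library helper for this.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=0):
    """Shuffle one dataset and cut it into a training and a verification part."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, actual_labels):
    """Percentage of data examples whose prediction matches the real label."""
    correct = sum(p == a for p, a in zip(predictions, actual_labels))
    return 100.0 * correct / len(actual_labels)

# Hypothetical dataset of 100 labeled examples: (example id, label).
dataset = [(i, "bunny" if i % 2 == 0 else "not bunny") for i in range(100)]
training_set, verification_set = split_dataset(dataset)
print(len(training_set), len(verification_set))  # 80 20

# Made-up predictions for four verification examples, three of them correct.
print(accuracy(["bunny", "bunny", "not bunny", "bunny"],
               ["bunny", "not bunny", "not bunny", "bunny"]))  # 75.0
```

Both pieces come from the same shuffled dataset, and no example ends up in both, which is what makes the accuracy number trustworthy.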
Legal note: Machine learning means processing data for a specific purpose. You’ll need explicit consent from the respective data owners (by user agreement or legally binding contract) to process the data for the defined purpose to remain compliant with privacy (GDPR) regulations.
How much data does Machine Learning need?
This is probably the hardest question to answer, but there is a rule of thumb that most people in machine learning already know: you need about 10 times as many data points as there are dimensions (the number of columns, a.k.a. features).
The reason why more is not automatically better: research has found that beyond a certain critical mass of data, extra data no longer gives you a corresponding gain in accuracy.
In the real world, most machine learning efforts won’t reach the 10x sweet spot anyway. Most will aim to get as close as possible to the rule of thumb to be on the safe side.
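The rule of thumb is simple arithmetic, sketched here in Python. The function names and the example numbers are made up; the multiplier of 10 is the rule of thumb from the text, not a hard limit.

```python
def minimum_data_points(feature_count, multiplier=10):
    """Rule-of-thumb dataset size: `multiplier` data points per feature."""
    return feature_count * multiplier

def meets_rule_of_thumb(data_point_count, feature_count):
    """True when the dataset is at least as large as the rule of thumb suggests."""
    return data_point_count >= minimum_data_points(feature_count)

# Hypothetical dataset: 25 columns (features) and 180 rows (data points).
print(minimum_data_points(25))       # 250
print(meets_rule_of_thumb(180, 25))  # False
```

In this made-up case the dataset falls 70 data points short, so you would either gather more data or reduce the number of features.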
Note: although there are some technical factors that really determine how much more or less data you need, the only thing you need to remember here is that nonlinear algorithms need more data than linear algorithms. This is because there is more complexity in the nonlinear relationship between data input and output features.
In layman’s terms: because a nonlinear model is inherently more flexible, it has more non-obvious factors to look at before it can determine any relationship. There are more of these so-called dimensions to work with, and the problem is therefore more complex.
Machine learning guide: how to start?
By now you know two things:
- How ML works on a high level and the foundational methods of learning.
- How to choose a learning method and what to look for in a dataset.
To begin you’ll need a few more ingredients (mostly people):
- Data engineer(s). These people are important because they are the ones that can get you the dataset on a technical level, prep it for ML and make sure the quality of data is on point.
- Data Scientist(s). These are the actual magicians behind the algorithms and machine learning models. They make sure the right ML approach is chosen, run the teaching/training process and are responsible for getting to the actual (golden) model that can be used for the given purpose.
- Developers. Although you don’t really need them for ML specifically, in most companies they build and maintain the data validation in the application that your company provides and that collects data from customers or users. To keep data control and quality up, they need to make sure the data validation is up to standard in production. If that’s not done correctly, you’ll piss off number one on this list: your data engineers.
- Your company credit card. Machine learning requires serious computational power. Depending on the computational difficulty, the cost can really add up fast. Make sure you reserve sufficient funding to finance your ML efforts. Your CFO will definitely notice the invoices.
- Machine learning defined in your company mission/vision. Don’t do machine learning because it’s cool; have a clear understanding of how Machine Learning fits into your company’s mission and vision.
As a product manager, your responsibilities within ML would be to:
- Delegate the correct responsibilities/roles within your team operations.
- Educate your stakeholders in your sprint demo. ML is an advanced topic and it’s important they have a basic understanding of why things happen. Giving information bit by bit and explaining in small segments will go a long way. Avoid going on a documentation spree; nobody will read all that star-wars kungfu science of yours.
- Negotiate and get access to the right data from within your company and third-parties.
- Make sure your dataset is legally compliant and provide reassurance from/to data vendors.
- Define Quality Control (QC) for your data process and execute regular checks on things like inconsistency, incompleteness, accuracy, precision, missing data.
- Create and maintain a space where your ML team can experiment and learn. ML is actual science. If you want to create the optimal space for machine learning, I would recommend separating the team doing this from normal operations, to give them the necessary room for experimentation and a more science-driven way of working. Executives breathing down their necks for results will definitely produce the opposite of results.
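To give an idea of what a regular quality check on your data could look like, here is a minimal Python sketch covering just two of the checks named in the list above: missing data and duplicated records. The function name, the field names and the records are all hypothetical; a real QC process would cover more checks and be agreed on with your data engineers.

```python
def quality_report(rows, required_fields):
    """Count rows with missing required values and exact duplicate rows."""
    missing = sum(
        1 for row in rows
        if any(row.get(field) in (None, "") for field in required_fields)
    )
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # same content -> same fingerprint
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}

# Hypothetical records; the field names are made up for illustration.
records = [
    {"id": 1, "label": "bunny", "weight_kg": 2.0},
    {"id": 2, "label": "", "weight_kg": 2.4},       # label missing
    {"id": 3, "label": "bird", "weight_kg": None},  # weight missing
    {"id": 1, "label": "bunny", "weight_kg": 2.0},  # duplicate of the first row
]
print(quality_report(records, ["label", "weight_kg"]))
# {'rows': 4, 'missing': 2, 'duplicates': 1}
```

Tracking numbers like these over time in your sprint demo is a simple way to show stakeholders whether data quality is improving.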
If you hold on to the above, you’ll have a solid foundation for your product operations and can incorporate it into your own product management style.
More to come
Did you enjoy this machine learning guide for product managers? There is more to come on this topic. Stick around by subscribing to the newsletter below this article (just an occasional e-mail) and don’t forget to forward to any of your colleagues or business contacts to share the knowledge!