S1E03 AI and Machine Learning

Welcome to Conversionations, Effective Experiments’ podcast discussing topics within the conversion rate optimization space. We’re here with Tim Stewart and Chad Sanderson discussing AI and machine learning. If you want to view this and other podcasts by Effective Experiments, check out Conversionations.

Who is Tim Stewart?

Tim is a consultant who works for SiteSpect. He’s been doing marketing and sales for more than 20 years, focusing on planning, selling, and delivering commercial solutions in a B2B and B2C process capacity.

Tim has a background in media and advertising sales with an emphasis on negotiation, analytics, and testing technologies, as well as budget and multi-channel campaign planning. He runs his own consultancy, TRS Digital, based on business optimization.

Who is Chad Sanderson?

Chad’s official role is the personalization and experimentation manager at Subway.

He is a digital optimization professional and data enthusiast focused on the strategy, design, implementation, and analysis of winning experiments. Chad focuses on UX and content strategy to create better websites and mobile apps.

Vishal Maini did a fantastic piece for Medium answering all of your need-to-know questions (image above included).

Transcript of “AI and Machine Learning”

Manuel: When A/B testing first became widely available, a lot of people were talking about how easy it was to set up tests.

In the last podcast, we talked about how AI and machine learning are going the same route of promising really quick wins and really easy setups.

What I want to do in this podcast is dig a bit more underneath all those promises. We’ll try to understand the truth there and how to separate the promise from the reality.

AI and machine learning is something that marketers and vendors have promised us, but have they delivered? How easy is personalization? Chad: how do you get pitched personalization by other vendors?

Chad: You have to understand the people usually making the decisions about these types of systems are people who are coming from the old way of marketing – they have an opinion, make a decision, and then supposedly magic happens.  Hopefully, you’re right, but you really have no way of knowing because you haven’t measured it.

Now, we’re using this very data-driven approach. In the case of A/B testing, it’s not just a data-driven approach, but an extremely scientific approach where you have measurable outcomes to form your opinion. AI and personalization are like the next step of that.

It’s still delivering the same sort of data-driven approach. With the controlled experiment side, it gets even more complex and more complicated. The tools are very powerful, there’s no question about that. Like with anything else, though, you have to understand what the tools are actually for and what you’re doing with them.

A lot of marketers and a lot of people making the decisions really want the tools to be a smarter version of themselves, to make the decisions and then money just happens. That’s not how it works. Personalization is a process in the same way A/B testing is a process. If more people were looking at it that way, we’d be in a better place. Sadly, that’s not currently the case.


Image courtesy of Colin Anderson/Getty Images

Tim: The fact that AI and optimization are kind of mentioned in the same breath indicates a lack of understanding. They’re related disciplines, but AI is more of a buzzword right now, the way “virtual reality” (VR) was in the ’80s and ’90s.

Something closer to what was promised 25 years ago with VR is now turning up. There’s a big gap between what is pitched as the correct application of machine learning and AI algorithms and what the appropriate application actually is.

There’s huge potential; part of the problem is the rocket ships we’ve been promised aren’t outside the realms of possibility. The stuff we can’t yet imagine, we can probably do with these tool sets by using computers to iteratively go through data.

You could essentially feed the machine initial states from data sources and then have it go through some supervised learning to come up with pattern matching we wouldn’t necessarily spot ourselves.

This way, we get multiple data sets and pattern matching across data sets that we wouldn’t be able to cohort group ourselves or it would take a skilled analyst to do it.

So I think AI will deliver what is being promised; it will democratize some of the insight that is typically gleaned from days of me crunching numbers and user experience and variations of progression and cohort techniques to pull some nuggets out. That time-burning effect will go down.  We’ll have computers doing more of that.

Image courtesy of Future of Life Institute

There will be more stuff that’s “plug in and play,” but as it stands at the moment, in terms of making it a “productized” thing, we can’t just turn it on and remove all of the responsibility of decision-making from you.  We’re a million miles away.

Part of the reason for that is the people using it don’t have the right criteria to feed to the machines, and the other part of it is that we haven’t built machines that can adapt to those responses and have yet to be fed decent criteria.

The maturity cycle is classic and gives perspective on this. You can look back at your tests and realize how much you’ve improved from 10 years ago. People’s “pie in the sky” ideals of companies already being at that level, to just turn the machine on and leave, is unrealistic. Maybe we’ll be there in ten years.

Image courtesy of Daily News.

I think you’re right to start off comparing it to when we first started testing and we’d say, “Just add one line of code and boom, you’re done!”  That was the testing ideal that people were selling.

But the people being sold to aren’t data scientists, no matter what their job title says. They aren’t math majors. They’re coming from the marketing and sales and commercial side. To sell to those people, you have to match a solution they recognize to the product you’re selling.

We’re not shown the full picture; we’re shown the highlight reel of the setup because that suits the sales objectives. To a degree, as an audience, we fall for that stuff, so they do it more often. It’s proven time and again – they get you hooked in.

You’re not using the potential of the system, you’re not making the ROI they claimed, but the vendor is making money.

Manuel: What is the reality in this situation? What is the ideal company that can benefit from this? What do they need to do on their end in order to get value out of tools like this?

Are they at the right stage? Should they wait and get the basics down?

Chad: It’s something that we say all the time when it comes to A/B testing. If you’re starting an A/B testing program, then you’re always going to need someone who knows A/B testing very well; specifically, not a marketer who’s been saddled with the tool or an analyst from another discipline who’s now doing something that he’s never done before.

Of course, that person won’t be able to achieve the same results as a fleet of data scientists or statisticians. It’s just not possible.

I think Tim’s point was exactly correct about a lot of the case studies that you see from large companies. A lot of the success with personalization is from the same companies that are also investing massively in the infrastructure of personalization.

They’re not just bringing in a tool and saying “go at it.” They’re bringing in people. They’re training people.

Oftentimes, the people that they’re bringing in with these programs are consultants at the tool they use. So it’s people who massively understand every single facet of the tool itself and can completely exploit it.

Image courtesy of Cleveroad

They also have access to all the other aspects of a gigantic business, all the creative resources, all the development resources. That enables them to do a lot more.

Success at testing is about scale. The largest companies that have massive amounts of success at testing can afford to operate at that huge scale. The more that you do, the more success you’re going to have, and the larger successes you’re going to have. It’s a numbers game.

Consultants who are skilled at this discipline try to offset that a little bit and say that “Okay, we have the skill, and are trying to maximize the amount of wins that we can get from smaller resources.” But big companies don’t have to worry about that.

Sure, they try to maximize wins, but they’re also trying to maximize scale. Sometimes, when you see personalization efforts like “personalizing ‘XYZ’ resulted in a 45% lift,” you’re only looking at things that won. What you don’t see is that it took 5,000 losses to get to that point.

A smaller company can’t always afford to have 5,000 losses before it has a major win. Google can – they can afford to lose because they’re big.
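To make the point about only seeing winners concrete, here is a minimal simulation sketch. Every number in it (500 experiments, 1,000 visitors per arm, a 5% conversion rate, the random seed) is an assumption chosen for illustration, not data from the podcast. Both arms always share the same true conversion rate, so any "win" is noise, yet the lifts reported by the winners alone look impressive.

```python
import math
import random


def simulate_conversions(rng, visitors, rate):
    """Count how many simulated visitors convert at the given true rate."""
    return sum(1 for _ in range(visitors) if rng.random() < rate)


def run_experiments(num_experiments=500, visitors=1000, true_rate=0.05, seed=7):
    """Run A/A-style experiments: both arms share the same true rate."""
    rng = random.Random(seed)
    all_lifts, winning_lifts = [], []
    for _ in range(num_experiments):
        conv_a = simulate_conversions(rng, visitors, true_rate)
        conv_b = simulate_conversions(rng, visitors, true_rate)
        if conv_a == 0:
            continue  # relative lift is undefined without control conversions
        lift = (conv_b - conv_a) / conv_a
        all_lifts.append(lift)
        # Pooled two-proportion z statistic for "B beats A".
        p_pool = (conv_a + conv_b) / (2 * visitors)
        se = math.sqrt(p_pool * (1 - p_pool) * 2 / visitors)
        z = ((conv_b - conv_a) / visitors) / se
        if z > 1.96:  # the kind of result that ends up in a case study
            winning_lifts.append(lift)
    return all_lifts, winning_lifts


all_lifts, winning_lifts = run_experiments()
print(f"experiments run: {len(all_lifts)}")
print(f"mean lift across all experiments: {sum(all_lifts) / len(all_lifts):+.1%}")
if winning_lifts:
    print(f"'winners' that would be published: {len(winning_lifts)}")
    print(f"mean lift among winners only: {sum(winning_lifts) / len(winning_lifts):+.1%}")
```

Averaged over every experiment the lift sits close to zero, while the subset that clears the significance bar reports large positive lifts. A case study is drawn from that second group.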

I think the first question that medium-to-small businesses need to think about is “What would it take to effectively run a program, not just have a program?” That’s going to take a lot of research. It’s not something that’s going to come overnight.

I think the best way to do it is to bring in a consultant or at least talk to somebody with some sort of outside knowledge that can tell you what to do.

Another major issue is that the only contact these businesses have is with the vendor.

For example, let’s say the vendor approaches them saying “With this year’s amazing tool, you can supercharge your ROI,” or something, and the small company thinks that’s great, so they do a little bit of Googling “who do I need to hire for this type of role?” They bring on one person, and that’s as far as they get.

It’s a lot deeper than that, though. It’s a lot more complicated and you need to operate beyond just the person who’s the salesman trying to get their foot in the door so you can spend money.

Manuel: So to summarize what you just said, Chad, the first thing is it’s not just about the tool, it’s the mindset of the people or the resources they need within the company to find success with the tool like that.

Most companies, when they pitch the tool, will paint it as a silver ROI bullet, but actually, there’s a lot of groundwork that they need to do first.

Image courtesy of Business Analyst Times

And the second point that you mentioned was the fact that companies should be talking to other people, not just the vendor or the salesperson because ultimately they need to understand the real value of the tool. Is that correct?

Chad: Yeah. People have it backward. They think that the value is in the tool. It’s not in the tool, it’s in the process. In these recent “A/B testing is dead” articles that have been coming out, they argue “A/B testing is really hard, there’s a lot of math involved, it takes a long time, it’s not as easy as a lot of people think it is.”

They’re one-hundred percent correct. But what they don’t seem to grasp is that A/B testing is not specific to marketing; it’s not something that we are exclusively using.

It’s not actually called A/B testing, it’s called “null hypothesis testing” – A/B testing is just a more digestible term. This type of testing is used in clinical research, it’s used in engineering, it’s used in developing artificial intelligence, it’s used everywhere.

As far as we’ve come in human history, there hasn’t been a better method of error reduction that we found than null hypothesis testing. So saying that it’s dead is one of the most ridiculous things ever because you’re saying that the best process we have of understanding how likely we are to be wrong is pointless.
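As a rough illustration of the null hypothesis testing Chad describes, the sketch below runs a two-sided, two-proportion z-test on an A/B result. The function name and the visitor and conversion counts are hypothetical, and this is only one common way to test a difference in conversion rates, not a method named in the podcast.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical result: 200 of 10,000 visitors convert on A, 250 of 10,000 on B.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The p-value is the probability of seeing a difference at least this large if the two versions truly converted at the same rate, which is the "how likely we are to be wrong" that the method quantifies.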

It’s the same with personalization – all these things are the same. It’s not the tool, it’s the underlying methodology to figure out how to best tackle a problem.

Tim: We work on the assumption of normal distribution. We work on the assumption of independent actions, whereas AI and machine learning will then look at changing the pattern from the back of that.

In web interactions, they’re not necessarily independent actions. If you’re looking at a user’s flow through a journey or repeated visits through a journey, then you may find that you’ve got a cumulative effect.

From there, you might go to ABC sequencing, in terms of which things should be happening in which order to get the greatest outcomes, and this is not an independent variable.

It depends on the prior choices as to which data from the next set they should be given.

Ultimately, a company needs maturity. The companies that are jumping to automate personalization are the ones who skipped past doing A/B testing because it’s hard.

The prerequisite for getting the most out of null hypothesis testing or a testing program is being at a maturity and scale where you can run a program suitable to company size.

When you get to the point when your ability to scale the team, your ability to run tests and the way you prioritize testing and choose value, is starting to become limited by the human factor in that, then you’ve got an opportunity cost.

You may have 10x as many tests you could have run as tests you did run, and running each of them with humans as a full null hypothesis test through to completion would be cost-prohibitive.

There would not be enough value in those other than the ability to use the scale you’ve got in terms of traffic – divide that up into giving it as data to a machine that can spot opportunities and look to do sets of many evolving tests to basically use up that spare capacity.

Ton Wesseling from Online Dialogue says you shouldn’t even start testing until you’ve got a thousand a month, then 10,000 a month. It’s a stepped explanation.

When you get to a point where your cadence of testing is limited by how fast you can hire and how much traffic you can go through, the gap between what you can currently test at versus what’s left to test in your audience that could be tested – that’s the ideal point to start plugging in your computers to do that.

What sort of company needs to be using it, what would be a good signifier? The need to have scale, both of people and of traffic. They need to have an established process for running a good cadence for testing.

Small and medium companies, or testing setups with large enough sites, could be running two or three hundred if they have their stuff lined up.

Some of these A/B tools will limit you to 2, 3, or 4 tests concurrently if you have the traffic for it, and then the tool taps out. If you’ve never run more than that, then that’s what your traffic can handle.
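How many tests a site's traffic can handle depends on how many visitors each test needs. The sketch below uses a standard normal-approximation formula for the sample size of a two-variant test. The baseline rate, target lift, monthly traffic figure, and function name are all assumptions for illustration, and it assumes each visitor is enrolled in only one test.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical site: 3% baseline conversion, looking for a 10% relative lift.
needed = visitors_per_variant(baseline_rate=0.03, relative_lift=0.10)
monthly_visitors = 500_000  # assumed traffic, not a figure from the podcast
tests_per_month = monthly_visitors // (2 * needed)
print(f"visitors needed per variant: {needed:,}")
print(f"two-variant tests that fit in a month: {tests_per_month}")
```

Smaller lifts or lower baseline rates push the required sample up quickly, which is why traffic, rather than the tool's concurrency limit, is usually the real ceiling.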

Turning around and saying “I’m going to subdivide this,” then plugging in all the stuff from the CRM and saying, “Turn it all off, I’m going to let the machines decide,” effectively forces you out of responsibility.

The companies that do well with it are the companies that have done well with A/B testing, and before that with user testing and usability. They are the companies who took testing seriously, took their database, their CRM, and direct mail seriously before online came along.

When given a new tool that fits with how they do stuff, these companies fill up that spare capacity, reducing the opportunity cost of tests they could’ve run but didn’t. It’s a magnifier that catalyzes the effect they’ve already got.

The shame of it is the sales pitch – it’s mostly the people that go “A/B testing is hard, let the computers do it.”

The people that buy that are buying magic beans. Every other prerequisite needed – in terms of scale, statistical ability, understanding how to interpret results, understanding how to take action on those results – isn’t there. They wouldn’t have built that up in the process of building a highly-skilled A/B team.

I’m not saying you couldn’t – you could come in cold, never having done any testing. You do it over a period of realistic timelines, like 2-3 years.

You could take time to build up a team that does a mixture of A/B testing and personalization tools to go from nothing to a full optimization team using all the toys in the toolbox within 2 years.

The amount of money you’d have to spend over those 2 years wouldn’t provide the ROI, because you’d have to go from 0 to 60 mph at a great cost, with no way of knowing what the ROI would be. You’re going in blind.

A company that spent the last 10 years A/B testing can quantify the ROI.

They’ve got at least some prerequisite data to understand how much more money and time they need to invest, when they realistically could see a return from that, and what their return would be. In those cases, it would fill the gap between what they’re already doing. It may free up time to let the machine do it.

However, you need to feed the machine some knowledge. You need to feed it some guided learning and see how well it does against that. Feed it some patterns that you established previously and see how well it does against those. Lastly, feed it the full dataset and see what it comes up with that you didn’t.

You’ve got to have all those prior pieces to feed the machine; otherwise, it’s just “turn on the machine and let it collect data.” The learning period with that would be so long as to be useless.

Even if it weren’t so long, the value you’d get out of it would not be human-interpretable, because it would just go “There are lots of things that might correlate.”

You couldn’t take action off of them, and without action, data collection is worthless. It’s just doing numbers for number’s sake. If you’re taking action on it, if you’re realizing better results from it, then it has a value.

At some point, you need to change the way you do business, change the way you present your website, change the way you get your audience. If you’re not changing based on the results you get, you’re going to get the same results over and over again.

Manuel: My concern as well is with humans shirking responsibility and putting it on machines to get those results. How do you really know whether those results are valid? Are we just looking at a black box?

Wikipedia offers a good explanation of the “black box” effect, illustrated by their image above.

Chad: I think that’s one of the main issues right now in the personalization community. So just one thing to sort of quickly go over – Tim already sort of nailed this point earlier – but all AI and machine learning is doing, depending on the model that it’s using, is looking at a set of data and asking the question “What correlations can we find so that we try to understand a particular person better?”

For example, someone who visits the Subway site during lunchtime on a desktop computer is a lot more likely to purchase than someone who visits at night time. Knowing this, maybe we’d show that person version A of the website versus version B. It’s just looking at that kind of data.

Sometimes it gets deeper. Sometimes you have classifications and decision trees, but at its essence, it’s really just correlation. Whenever you’re creating a correlation, you still have to train your machine to know what to look for.

If you train it incorrectly, or if you train it on data that isn’t representative of your total population, then you could get results that are possibly even worse than if you had done nothing at all.
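A small arithmetic sketch of the representativeness problem, using made-up segment names and traffic numbers (none of them come from Subway or the podcast): if the training sample contains only the high-converting lunchtime segment, the rate the machine learns is far from the rate across the whole population.

```python
# Hypothetical traffic by segment: name -> (visitors, conversions).
population = {
    "lunchtime_desktop": (2_000, 200),  # 10% convert
    "evening_mobile": (8_000, 160),     # 2% convert
}


def conversion_rate(segments):
    """Overall conversion rate across the given segments."""
    visitors = sum(v for v, _ in segments.values())
    conversions = sum(c for _, c in segments.values())
    return conversions / visitors


# Rate across everyone who actually visits the site.
true_rate = conversion_rate(population)

# Training sample that happens to contain only the lunchtime segment.
biased_sample = {"lunchtime_desktop": population["lunchtime_desktop"]}
biased_rate = conversion_rate(biased_sample)

print(f"rate in the full population: {true_rate:.1%}")
print(f"rate learned from the biased sample: {biased_rate:.1%}")
```

Any decision rule tuned to the second number would be applied to a population that mostly behaves like the first.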

Decision trees for machine learning. Image courtesy of Amr Barakat.

Take Google. It had a classification system using image recognition where the data was trained on a particular image set and the point was to figure out whether or not there was a way to properly identify a particular image.

Based on the image set it got, it identified anything involving a kitchen as a woman, even if there wasn’t even a woman in the image at all. Anything involved with baseball was a man, even if there wasn’t a man in the image.

That’s how AI and machines make decisions. A machine can make both good and bad decisions based off of data. It’s always based on a degree of logic that humans create.

When talking about a black box, in many cases, that’s true. There are some systems that are very transparent – Matt Gershoff of Conductrics is one example – but with many AI systems that are out right now, you have no idea what the logic is.

You don’t know what sort of decisions that machine is making based on the criteria that it’s seeing. It could be generating something worse off for you. Maybe not now, but potentially in the long run.

Another really interesting example is a credit loan company from the UK that was using a personalization system that would change the amount of money you’d be eligible to receive based off of data it received from you.

All of the variables that it can possibly analyze – it looks at all of them. It looked at things like your name, or your gender, or your location.

Image courtesy of Pixabay via WittySparks.com

In this example, the machine based its recommendations on the names of some customers that were also affiliated with really low scores, and asked that they pay more. Obviously, that caused massive outrage.

The question is, even if we weren’t talking about outrage or issues with media and backlash in that way, is that a better system? Is it a better system to give somebody a better price based on all these other factors? Does it lead to more ROI in the long run? I think that’s a question.

You can create this amazing and fantastic algorithm that, for usability purposes, performs worse than a traditional website.

That’s something a lot of people don’t think about in a lot of cases and you can’t think of if you’re using a black box system.

Manuel:  How do case studies feature in all of this? Case studies always portray the tool in a good light, but ultimately, if you scratch beneath the surface, the tool isn’t really the successful component of that case study.  

It’s the people behind it, the thought process behind it.  What are your thoughts on case studies that you read when vendors talk about their tool?

Chad: You’ve got to take case studies with a grain of salt. Pretty much all of them. There’s an official term for it, “the graveyard of unspoken ideas.” Another way you can look at it is “survivorship bias.”

Of course, the things that you see in the end are the good ones, because the bad ones are never going to make it into the case study. No one’s ever going to include a case study of all the times that they applied a program and it made things worse, or when they spent a lot of money and they got no ROI. No one’s ever going to mention that. So you have to look at a case study understanding that you’re only seeing a very small percentage of the overall effects.

Just like in the lottery. Every lottery commercial shows pictures of all the people that won the lottery. Well, yeah, of course, but tens of thousands of people didn’t win the lottery, too. When you look at it holistically, are you going to be one of the few that won the lottery, or one of the thousands or millions that lost?

That’s how you need to look at case studies. Don’t base your decisions off of that.

Base it off of your answer to the questions, “Is this tool something that I’m capable of using?” Really, it’s not about the tool, it’s about the process you have in place to execute the program. After you have at least the strategy and the skeleton of what the program is going to be fundamentally from the ground up, then say “What is the tool that best fits our needs?” Don’t try to build a program around a tool.

Tim: There’s an awful lot of box-ticking at the moment. Lots of companies say “We should have it, it’s hot right now!” And when you say “What are you going to do with it?” they reply “…We’ve got one?”

Manuel: I see this quite often – talking about your point, Chad. When people ask for a tool recommendation, loads of people will recommend one tool after the other, but no one actually stops to ask “What’s the criteria that you’re looking for? What are you looking to solve?”

Look at any thread on Facebook or LinkedIn, it’s just that, a lot of vendors saying “You should try us because we’re amazing!” None of the responders asked the person why they need that tool in the first place.  Especially with server-side testing and tools.

Tim:  No one asks “What could we do differently?”

Well, have you got to the limit of what you need to test or what you can test with your client-side tool? If you use your client-side tool to the maximum of your ability, you’ll get about 80% of what you need to do, done.

But if that remaining 20% is where the value lies – testing the algorithms, testing different recommendation engines, testing against user bases to increase lifetime value, pulling in data sources, adjusting who they are, using machine learning to group traffic and where it’s coming from – that’s where your server-side switches can come into play.

There are some speed advantages and technical pieces, but for the most part, if you’re good at coding JavaScript and you know what you’re doing it for, you can fake up an awful lot of that stuff. The value comes from the server side.

It’s a pain to get implemented because you basically have to embed it into the way your system works. There’s a fair bit of configuration.

Once you’ve done it, though, you can test that sort of stuff with every single test you do. And you can report on a person’s lifetime value as a correlated metric, etc.

Whereas if you wanted to do something like that with a tag tool, you’d have to build that sort of complicated kind of multi-system interaction into your JavaScript, which becomes prohibitive in the size, or so complex it might as well be.

A lot of people start off using the “Amazon’s recommended products” example. When A/B testing was coming out, they decided they were going to do a recommendation engine that would help them sell merchandise.

Image courtesy of WooCommerce

Even though it was 10 or 12 years ago, it was fairly simple. The first 10 products would tend to get more secondary ads than the second 10, so they showed more ads from that group, and it was a slow but simple learning curve. Personalization dictates what you should be showing based on prior behavior.

Server-side testing finally became hot in the last couple of years because some companies have reached a maturity level where they do want to test real business numbers and real business data sets. These require a different testing method.

We are starting to see some use cases where AI is used properly for machine learning tools and things that could kind of develop different approaches from that.

We are starting to see customers of a certain maturity who are making good advantages on top of an existing optimization program. The mass market is suddenly going “Me, too!” There’s bandwagon-jumping going on all over the place.

The problem is your sales life cycle. The early adopters, having learned from their mistakes, aren’t sharing those mistakes with other people.

The early adopters held up as examples for testing are big companies like Amazon and eBay. But people are only seeing what happened as a result of actions on thousands of tests over 10 years.

People think that if they have an A/B tool, they’ll be as big as Amazon immediately. That’s not necessarily the case; Amazon has been doing it for 10 years longer.

The good side is that there’s a demand now not because vendors are selling magic beans, but because there are genuine examples of people out there doing it.

There are companies that have got most of the ingredients we’ve discussed today that mean that they can make use of it and they have become those case studies.

Survivorship bias ignores testing for failure.

Image courtesy of StarTribune

Tim: There’s the classic example of planes back in World War II.  They kept losing too many bombers, so mechanics looked at where the holes were on the bombers that came back and reinforced those places.

The problem? If you want to reinforce the wings, that’s great, but you want to see the planes that didn’t make it home, not the ones that did.  They were obviously hit in different places, like the cockpit or the body, rather than the wings.

None of the ones that didn’t make it back were there to sample, though. That’s survivor bias. The takeaway here would be to reinforce the cockpit and body because that correlates to fatal crashes, rather than survival.

Testing to find things that you don’t see is just as important, and machines won’t spot the inverse all the time.

A failure can indicate where you need to focus rather than doubling down on a success, and this is the kind of data you can feed into a machine that the machine itself might not catch. We can observe a whole, whereas machine learning can’t, because it has no data points to work from.

Manuel: We almost need a Maslow’s hierarchy for optimization, where you start with what you need to have in place before A/B testing, then move on to A/B testing, and then to personalization and server-side testing.

Tim: Bryan Eisenberg has one of those. It doesn’t include AI, but includes functional steps to sort your basics first.


Image courtesy of Bryan Eisenberg – the hierarchy of optimization.

Most of the time, I go around sorting people’s businesses because that’s what I do. I’m not running a test. I fix the broken stuff – the things that aren’t working. Because with those things broken, nothing else works.

There’s no point to A/B testing if the basics are broken. You need to have that maturity model in a decent place to implement AI and machine learning. Yes, you can get good results, but you have to have your foundation set. You need to build a process that understands what optimization is, and then build a team around that.

That way, you can correctly do vendor selection.


Image courtesy of Kathy Sierra

When you talk to the people who built the AI tools, they talk about the learning period. It’s a bullet point on the sales deck. The guys who built this stuff, that’s what they’re doing – they’re A/B testing their own algorithms!

Before they let the machine go off and A/B test itself, they’re going to give it a premise. They plan out the expected outcome, which they know from previous experience, then see how long it takes to get to that point – or if it gets to that point at all. Will it come to an acceptable outcome?

If it comes back and it learns something wholly different from your personal knowledge and experience, you have to question whether the algorithm is correct or not.

Never just accept the argument just because the machine says yes to it. If the machine did find something, why? Would the money in the bank be more if you hadn’t touched anything?

Just keeping the boss man quiet is not the end goal. The end goal is improving the company.

Chad: I think one of the biggest issues is that the people who are listening to this podcast are probably not the people who most need to hear the message. The people that really need to hear the message are the people and the stakeholders who are making decisions.

That’s why, anytime I go into a new company that is starting A/B testing or CRO, one of the things I like to do is drive home the message that what we are doing is a scientific process, and the point is better decision-making.

Not just incrementally better decision-making. Fundamentally better decision-making than any method that any marketer has used up until this point.

It’s just a fact. It may not always come to the right conclusion, but as far as decision-making processes, it’s much, much better.

One of the things I was mentioning at the very beginning of this was that the marketers and the directors who are making these decisions are making these decisions based on their own experience.

I’ve seen this many times. People come in and say “Well, we don’t need to test, we just need to implement.” Why do they say that? Because their whole life has been implementing. That’s what they’ve done for the last 20 to 40 years.

They say “This is what I want, I’m paid to give you my opinion. My opinion I believe is right, that’s how I make my money.” We are proposing an inherently different system, and I think that if more stakeholders understood the value of the system, then a lot of these things would become common sense to them.

“Of course you don’t just invest in a tool right off the bat. You don’t start a personalization program without having a team built up. Obviously.”

Now you’re thinking in terms of validating your results, not just doing things and measuring them afterward.

What I hope is that the people who are listening to this, and who go back to bosses who need to answer questions about personalization or machine learning, first try to get across the importance of making decisions based on data and some scientific process.

Not just in the digital sphere, but anytime you’re making a decision. If you’re just saying something and that something has an outcome and you don’t know whether it’s going to work or not, you need to re-examine your decision-making process and ask yourself “How can I structure my decisions so that there’s some evidence to base them on?”

Manuel: That sounds good! I think we ran out of time. We shall just continue discussions on our next podcast next week, but it was a pleasure having you on, Chad and Tim!

Manuel da Costa

A passionate evangelist of all things experimentation, Manuel da Costa founded Effective Experiments to help organizations to make experimentation a core part of every business. On the blog, he talks about experimentation as a driver of innovation, experimentation program management, change management and building better practices in A/B testing.