Q: Tell me a little bit about Shopify's work in the generative AI space. What are the things that you guys have done that you're really excited about that you feel like have gone really well?
Spencer: Shopify has been working on generative AI for a while, at least when it comes to applied use cases. I remember some years ago, even before ChatGPT, text expansion was the big use case - you'd write the bullet points for an email, put them into the OpenAI Playground, and it would expand them into a full email. Then you'd do the reverse: take this email and turn it into bullet points.
We did some work on that and tested it with support advisors. Shopify's e-commerce platform has lots and lots of merchants, and we have a strong support motion - real customer support - so we're looking for ways to make those advisors more efficient.
One of the core use cases that is live in production today is something called the help center assistant. If you have a Shopify store, you just go to our help center and chat with an assistant that draws on all the help center articles. It's quite efficient - for many use cases, more so than talking to a human being.
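(For readers curious what an assistant like that looks like under the hood, here's a minimal retrieval-grounded sketch in Python. The article text, the word-overlap scoring, and the model choice are illustrative assumptions, not Shopify's actual implementation.)

```python
# Minimal retrieval-grounded help assistant (illustrative only).
from openai import OpenAI

client = OpenAI()

# A real help center would index thousands of articles with embeddings;
# three hard-coded articles and a word-overlap score stand in here.
ARTICLES = [
    "Refunds: to refund an order, open the order and click Refund.",
    "Shipping: set rates under Settings > Shipping and delivery.",
    "Domains: connect a custom domain under Settings > Domains.",
]

def search_articles(query: str, k: int = 2) -> list[str]:
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(ARTICLES, key=score, reverse=True)[:k]

def answer(question: str) -> str:
    context = "\n\n".join(search_articles(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided help articles."},
            {"role": "user",
             "content": f"Articles:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```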
The way that we're expanding that assistant is with something called Sidekick. The idea here is: how do we help merchants grow their businesses, how do we help them grow their stores, through a similar interface? "Why are my orders trending down? What could I do about that?" And ultimately, we're giving it the ability to take actions on the store and help merchants build their business.
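(The "take actions" part is typically exposed to the model as tool definitions it can choose to call. A hedged sketch follows - the tool names get_order_trend and create_discount are hypothetical, not Sidekick's real interface.)

```python
# Sketch of giving an assistant store "actions" via function calling.
# Tool names and schemas are hypothetical, not Sidekick's actual API.
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "function", "function": {
        "name": "get_order_trend",
        "description": "Return weekly order counts for the last n weeks.",
        "parameters": {"type": "object",
                       "properties": {"weeks": {"type": "integer"}},
                       "required": ["weeks"]}}},
    {"type": "function", "function": {
        "name": "create_discount",
        "description": "Create a storewide percentage discount.",
        "parameters": {"type": "object",
                       "properties": {"percent": {"type": "number"}},
                       "required": ["percent"]}}},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Why are my orders trending down?"}],
    tools=tools,
)
# The model may reply with a tool call (e.g. get_order_trend) that the app
# executes against the store before continuing the conversation.
print(response.choices[0].message)
```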
It's all about pursuing the cutting edge, and yet it's very mission-aligned. We never deviate from "make commerce better for everyone" - help merchants grow their businesses, help drive more GMV - and all these tools merely serve that. You can see it play out in what we build and ship.
Q: How does that feedback process work for you guys? Is it primarily thumbs up/thumbs down, or is there more to it than that?
Spencer: We'll use a pretty rigorous rubric. Let's say we're doing email generation - we've created a rubric and we have several LLM judges. We'll score general impression - what's the overall impression of the email - plus relevance, personalization, correctness, and safety, and we'll have quite a few different categories, ranging from hallucinations and preventing obviously dangerous behavior to just the quality of the email. We use a one-through-five scale, so we can get pretty rigorous metrics on this.
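(A minimal sketch of what a rubric-based LLM judge like this might look like. The category names come from the interview; the prompt wording, JSON output format, and model choice are assumptions.)

```python
# Rubric-based LLM judge sketch: score one generated email, 1-5 per category.
import json
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["general_impression", "relevance", "personalization",
              "correctness", "safety", "hallucinations"]

def judge_email(email: str) -> dict[str, int]:
    prompt = (
        "Rate the following generated email on a 1-5 scale for each of "
        f"these categories: {', '.join(CATEGORIES)}. "
        "Reply with a JSON object mapping each category to an integer.\n\n"
        f"Email:\n{email}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```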
Q: Let's say the LLM judge comes back and says this is a four - do you still have a human somewhere reviewing that?
Spencer: We actually have humans review independently, in parallel. As they're reviewing, we calibrate the judge accordingly.
Q: How does that calibration process work?
Spencer: We have the humans write down the reasons and rationale behind their ratings. The humans themselves need calibrating too, because sometimes they diverge - they're too strict, or they're not thinking about the problem the right way.
The core is the human annotators: they're constantly reviewing and providing context. We also measure inter-annotator agreement - multiple annotators rate the same example so we can see whether they agree. We use that to tune the LLM judge - not fine-tune it, but prompt engineer it with fast iteration cycles.
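(One common way to quantify inter-annotator agreement on a 1-5 rubric is Cohen's kappa; the sketch below uses scikit-learn. The interview doesn't specify Shopify's exact method, so treat this as one reasonable option, with made-up scores.)

```python
# Inter-annotator agreement on a 1-5 rubric via quadratic-weighted kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = [4, 3, 5, 2, 4, 4, 3, 5]
annotator_b = [4, 3, 4, 2, 5, 4, 3, 5]

# Quadratic weighting treats a 4-vs-5 disagreement as milder than 2-vs-5.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")

# Running the same comparison between human scores and the LLM judge's
# scores shows whether each prompt tweak moves the judge toward the humans.
```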
Q: How do you think about building this tooling in-house versus buying from AI evals companies?
Spencer: Just start with a spreadsheet. Just rip and run on a spreadsheet and go for it. There are lots of good eval companies, and I'm sure they make it very easy, but you can get a long way just doing 50 to 100 samples at a time and running them through your annotators - or even the ML engineer, if they have time. They should be looking at each example and dogfooding the product.
I don't think sophisticated eval software is needed for basic use cases. I'm sure you graduate at some point, but always start with a spreadsheet.
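(Taking "start with a spreadsheet" literally: a 50-to-100-row CSV of samples is enough to compute a mean score and flag the worst outputs for review. The file and column names below are assumptions.)

```python
# Spreadsheet-scale evals: read annotated samples, summarize, flag the worst.
import csv

# Assumed columns: input, output, human_score (1-5, blank if unannotated).
with open("eval_samples.csv", newline="") as f:
    rows = list(csv.DictReader(f))

scored = [r for r in rows if r["human_score"]]
mean = sum(int(r["human_score"]) for r in scored) / len(scored)
print(f"{len(scored)} annotated samples, mean score {mean:.2f}")

# Write the low scorers to a second sheet for the team to dig into.
flagged = [r for r in scored if int(r["human_score"]) <= 2]
with open("needs_review.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(flagged)
```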
Q: Do you guys view the content here - the eval story, the human annotations - as part of the production software lifecycle?
Spencer: We do emphasize speed, and in early iterations we're fine doing that offline and just working very, very quickly. Infra and tooling are always in a state of getting better, so there are always gaps, but we'll move fast initially and get as many iteration cycles under our belt as we can. We do one-week sprints across all of our work and then eventually graduate to a more production system as we see stability in the outputs.
Q: How do you actually go about herding the cats and figuring out what strategy you can actually execute on?
Spencer: Having a vision set out is number one. Our CEO Toby sets a very strong vision for AI - it's very important to us, this is the future, this is something everyone needs to think about every day. Starting there and having that as a company-wide focus really matters.
Let's use the email example again. We want to send personalized emails at scale, and we need to work with the legal team, which reviews marketing templates. This is very different from what they're used to, because they're used to "here's the template, here are its variables, it's going to go out to 10,000 people, I'll review the template." We come to them like, "hey, here's this email, we're generating these - what do you think?" And it's "no, they're all a little bit different - it's not just a template."
The key there is really educating people - here's what we're doing, understanding what their key objectives are, what do they want to ensure in terms of safety, brand guidelines, etc., and then sort of co-creating whatever product it is and making sure that it meets the requirements they have in their mind.
There are some people who start out with "well, it should never be wrong," and you have to explain that it can be wrong - there's no system that can never be wrong, just as there's no person who can never be wrong.
Top-down vision, lots of educating, and then the other piece organizationally is just execution speed. How do you move quickly, do one-to-two-week sprints, ship things, test with users, get the right evals in place? That's critical as well, because things are moving so fast that you have to actually ship things. There is no world of "okay, this is what we're going to do over the next six months."
Q: How do you approach figuring out what to build - letting everybody experiment versus focusing on specific business problems?
Spencer: The way I position this with my teams is pretty simple. For us, the barbell in AI has, on one end, things that lift all boats - lots of small use cases. There's value there, but it's not super high value; it's sort of table stakes. This is what you talked about earlier - how do we automate repetitive tasks? Can you revise my Slack message? Can you draft this email? Can you make this process a little bit easier for me? These are smaller, piecemeal, yet still valuable use cases.
Then there are bigger bets on the other end - larger efforts, full-scale automation, multimodal - how do we make a step change in the business? The way we think about it is: how do we enable people to self-serve on the lower end, and how do we make durable investments in the bigger bets?
Where people get lost is in the middle - lots of engineering investment to do something that's more or less trivial, that you get out of the box with GPT or a simple in-house implementation.
It's important to let people experiment a lot. I would recommend anybody do that because that enables people to understand what's possible. The problem with the deductive approach of "hey what are our big business priorities, how does AI fit into those" is that many people don't intuit how AI can add value to a particular business problem. They often just sort of write it off - "oh that thing, AI could never do that."
We have a joke which is if we want to figure out what to build, we just go around and ask everybody what AI can't do, and that's our things to build.
You have to enable both sides of it. Just picking out your biggest business problems and trying to throw AI at them is probably not going to work. We enable the low end and pick our bets. We do try to quantify everything, and the bar for how serious and reliable that quantification needs to be should scale with your actual engineering investment.
Q: Do you have your own internal framework beyond just asking people how much time they save for how to measure the ROI on this stuff?
Spencer: We understand all the levers of our business - how does Shopify help merchants succeed, how does Shopify help itself succeed in terms of how do we bring new merchants on, how do we cross-sell them. We understand what levers to pull, how far they can be pulled, what kind of lift we can see, and we have a framework for impact in all these areas.
Then there are things like time savings, which is not super reliable. The thing people forget with time savings is that you're essentially talking about opportunity cost - that time has to be filled with something else. I sometimes worry that we're saving people a bunch of time which they then go spend on YouTube Shorts or scrolling, as opposed to doing something more productive.
Q: How have you approached creating policies about what people can and can't do with AI?
Spencer: We have a policy in place. I think it's pretty good - I'm sure it answers 80 to 90% of people's questions. I mentioned chatting with legal teams, procurement teams and things like that - they're all very accessible. We're a fully remote company so we have a strong internal communication framework, the pathways are very clear.
It's a combination of a policy - you can put this type of information here, but not that type of information. We make it available and accessible in our docs, and then there's always room for people to ask questions.
Q: Do you have any kind of mechanism or initiatives to try and enforce those policies automatically through software?
Spencer: We do both parts of it. We do operate in a high trust environment, yet there are checks. For example, you can't just download any piece of software on your computer - it has to be approved and those approvals have to go through a rigorous process like security review and things like that.
We do tighten up there. We do have other automation in place to check, but then there is a layer of "let's trust people."
Q: How do you think AI will change expectations around work?
Spencer: I think there's one version of an AI future where jobs are automated and I roll in and now I work two hours a day instead of eight hours a day because of all these AI tools. I actually don't think that's going to happen.
A different theory I've been thinking about recently is I think the bar is just going to get much higher for everybody. I think you're going to be expected to do more. An easy example - a salesperson has a quota, I think that quota's going to go up. This concept of time savings - well, what are you doing with that time? I don't know, but I'm going to raise the bar because I know how long it takes to do your job now.
I think we'll see that sort of play out over time - yes, time savings, but how are we able to raise the bar for everybody across disciplines? Could be engineers, data, could be salespeople, anybody - how do we consistently raise that because we know all these tools are available? Much like "now we have Excel, I know you're not writing all these things by hand and spending hours a day calculating." I think productivity expectations will increase significantly.
Another way expectations will increase: something as simple as manager wingspan - how many people you can manage, typically six to eight depending on your function. Well, if you have this suite of AI tools... The other day I downloaded all my feedback on various projects, captured all of that, and made a little bot. I told my team, "hey, use this before you come to me for feedback, because it's going to do 95% of it." Now I have more capacity - so, to the other managers out there: you were doing six to eight before, why can't you have 15 directs now?
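(The feedback bot Spencer describes can be approximated in a few lines: load your past written feedback and use it as a system prompt for first-pass reviews. The file name and model are assumptions; this isn't his actual bot.)

```python
# First-pass feedback bot: condition the model on your own past reviews.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Assumed: a file of feedback you've previously written on projects.
past_feedback = Path("my_past_feedback.md").read_text()

def first_pass_review(draft: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
                "Give project feedback in the style and with the priorities "
                "shown in these past reviews:\n\n" + past_feedback},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content
```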
Q: For you personally, in your day-to-day usage of AI and experimentation with it, what are the things that you're actually most excited about for your life?
Spencer: What has always excited me the most - I don't care as much about the task level, like writing. I use it all the time for that; many of the messages I send out, I've gotten feedback on. That's all great, but not very exciting - I see it as just table stakes.
Where I've found a ton of leverage is using it as a strategic thought partner. One of the first things I played around with - I built a couple of bots that would just help me reason about things. I built what I called a steelman bot: if there was something I didn't agree with, I would put it in there, and it would tell me all the reasons why it was a good idea or a good point of view.
I had a similar devil's advocate bot - tell me all the reasons why this is a bad idea. I'll put in a dozen different books I like and then use it for navigating strategic situations in my life or at work. Many times it has good ideas, sometimes not, but it's a really interesting way to think better and to examine my own ideas. I can see the gaps - "okay, I see why it's recommending that, but it's missing this piece of context that I have" - but often it will also give me genuinely good, novel ideas.
If I have two options in mind of how I want to handle a situation, it might give me a third option. To me, AI and LLMs have a huge upside in strategic thought partnership, teaching you to be a better thinker and see many different sides of issues, challenges, strategic situations. I find it totally fascinating and use it all the time.
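(The steelman and devil's advocate bots reduce to two system prompts over the same helper. A sketch under that assumption - the exact prompts Spencer uses aren't public.)

```python
# Steelman / devil's advocate bots as two system prompts over one helper.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "steelman": ("Argue the strongest possible case FOR the idea below, "
                 "even if you would normally disagree with it."),
    "devils_advocate": ("List every serious reason the idea below could be "
                        "a bad one, as forcefully as you can."),
}

def debate(idea: str, mode: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": PROMPTS[mode]},
            {"role": "user", "content": idea},
        ],
    )
    return response.choices[0].message.content

print(debate("Pause hiring and invest the savings in internal AI tooling.",
             "devils_advocate"))
```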
Q: How do you manage the risk of becoming intellectually lazy with these tools?
Spencer: I'm highly engaged in the process. If I'm sending a longer message, I wrote it originally, put it in there, got some feedback, went back and forth with it, and then read it in detail to make sure it matches the tone and sends the right message.
Similarly, if I'm working on something more strategic, I never quite fully believe it. I'm always aware of what I'm talking to and that it doesn't have all the context, because I know what I put in there versus what's in my head and out in the world. I definitely approach it with a healthy degree of skepticism.
I like AIs that don't have names - I just think a name gives the wrong impression, which is why I like ChatGPT: the name is perfect because it's the name of a tool, not a person. Understanding what you're talking to - I'm never overly reliant on it, I never just send things it says verbatim, and I'm always checking in and asking my own questions of it.
Q: When will AGI replace your job?
Spencer: I'm working on it. I would love to see that. I think AI should replace every job - we should all have different jobs. I think about that often - how do I replace myself?
Even with feedback and performance things - we collect tons of data there, we do 360 reviews, and we have a system that does many of these things. You can see how my job as a people manager is not going to be summarizing 360s, and I love that. I don't want that to be my job, or anybody's job. I need to work on some other element where I add value.
If you think of a spectrum of what you work on, there's always going to be something being scooped out of the bottom, and you have to reinvest that time where you actually add value. Over time, you're constantly clearing out the low-yield junk - AI is going to keep creeping up - and reinvesting in higher and higher orders of thinking, adding value to the company or whatever you're working on.
When will it replace my job? As soon as possible, I truly cross my fingers. Bit by bit every day.