
Martin Larochelle


Creating an Alexa Skill for Amazon Echo

Challenges and Lessons Learned from Sending a Text Message 

Over the last several weeks, we went through the entire process of designing and developing a new Alexa Skill for Amazon’s Echo and then submitting it for approval by Amazon. It was a great experience that not only let us have some fun, but also taught us a few new things about Alexa. The skill in question? One we would likely use many times a day: sending text messages.

The invocation sentence dilemma

While discussing the concept for this skill with our Interaction Designers, we kept running up against the required sentence structures. You would expect a natural language interface to feel intuitive to users, but there are some barriers that force users to learn how to interact with Alexa.

  • Remember the “Alexa” name? This seems trivial, but it’s already an impediment for some user personas (e.g. the elderly, or those for whom English is not their first language).
  • Remember to use the invocation name of the skill. As users of Alexa ourselves, we often forgot the names of the skills we had installed. The lack of a visual home screen makes opening a skill harder than, for example, launching a mobile app.
  • The ask/tell sentence structure further complicates things. With Siri, you can say: “Siri, send a message to Sarah, please pick up milk on the way home.” With Echo, however, it needs to be something like: “Alexa, ask My Friend to tell Sarah, please pick up milk on the way home.”

To keep the sentence structure simple, we decided to support a single destination phone number. With an example skill name of “My Friend,” this allows a sentence like: “Alexa, tell My Friend to please pick up milk on the way home.” We are trying this approach so that the interaction is as easy as possible in the most common scenarios.

Understanding free form text

When we first looked at Echo and the Alexa Skills Kit, we expected it to do speech-to-text recognition and then text-to-intent mapping. As it turns out, Alexa does it all in one step. While this makes recognition more reliable for deterministic cases (when you have a fixed number of keywords you expect the user to say), it makes things more complicated for the skill designer when free-form interaction is needed.

As in this text message use case, or in a skill that adds items to a task list, allowing the user to send free-form text to a skill requires the developer to list sample sentences with an even distribution of likely word counts. Amazon recommends that you provide several hundred samples to address all the variations. Creating these samples is very tedious for the general case and is not skill specific. It would be nice if the Skills Kit provided a general-purpose slot type for these cases. This will be even more complex when Alexa skills need to support more than one language.
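To give a feel for the tedium, here is a sketch of what those sample utterances end up looking like, with phrases of every expected length repeated across hundreds of lines (SendMessageIntent and the Message slot are hypothetical names for this illustration):

    SendMessageIntent tell {hi|Message}
    SendMessageIntent tell {call me back|Message}
    SendMessageIntent tell {please pick up milk on the way home|Message}
    SendMessageIntent tell {I will be about twenty minutes late for dinner tonight|Message}

…and so on, for every word count you expect users to speak.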

Repeating the phone number

The first time we had Alexa repeat a phone number, it went on to say: “Six million, one hundred and….” This provided a good example of how a sentence should be structured and formatted differently depending on whether it is spoken or displayed. In the case of a phone number, we are displaying (613) 555-1234 on the card of the companion app, while we send “6, 1, 3, 5, 5, 5, 1, 2, 3, 4” to Alexa as the number spoken.

To give our skill a bit more personality and to make it sound more natural, we changed the spoken phone numbers from “6, 1, 3, 5, 5, 5, 1, 2, 3, 4” to “6 1 3, 5 5 5, 1 2 3 4.” This provides pauses between number groups, which sounds less like a text-to-speech robot.
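This formatting step is easy to express in code. A minimal sketch in Node.js, assuming a ten-digit North American number (our own illustration, not the skill’s exact code):

    // Display form for the companion app card: (613) 555-1234
    function formatForDisplay(digits) {
      return '(' + digits.slice(0, 3) + ') ' + digits.slice(3, 6) + '-' + digits.slice(6);
    }

    // Spoken form for Alexa: "6 1 3, 5 5 5, 1 2 3 4"
    // The commas create pauses between groups, so Alexa does not
    // read the digits as one very large number.
    function formatForSpeech(digits) {
      const groups = [digits.slice(0, 3), digits.slice(3, 6), digits.slice(6)];
      return groups.map(g => g.split('').join(' ')).join(', ');
    }

    formatForSpeech('6135551234'); // "6 1 3, 5 5 5, 1 2 3 4"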

In the next iteration, we intend to experiment with removing some of the spaces between digits at random, so that, for example, “3 4” becomes “34” and is spoken as “thirty-four.” These types of changes will make our skill better reflect common speech patterns.

Twilio integration

To send the SMS messages, we chose Twilio. The Twilio API makes it easy and relatively cheap to send text messages. To get started, you need a Twilio number at $1/month; after that, it is only $0.0075 per message sent. As long as the skill does not go viral, that is a manageable cost. Using the API is straightforward: specify the Twilio “from” number, set a “to” number, provide the message, and you are done.
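In Node.js, the whole send fits in a few lines. A rough sketch (the credentials and numbers are placeholders):

    // npm install twilio
    const twilio = require('twilio');

    // The account SID and auth token come from the Twilio console.
    const client = twilio('ACCOUNT_SID', 'AUTH_TOKEN');

    client.messages.create({
      from: '+16135551111', // the $1/month Twilio number
      to: '+16135552222',   // the skill's single destination number
      body: 'Please pick up milk on the way home'
    }).then(message => console.log('Sent, SID: ' + message.sid))
      .catch(err => console.error('Send failed: ' + err));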

No voice notifications from an Amazon Echo

An Alexa Skill limitation we kept running into is that we cannot send notifications to the users. It would be great if a skill could just notify the user that an event occurred and allow Alexa to speak to the user. As a user, I’d like to be able to ask Alexa to remind me to do something in 20 minutes and have a notification more descriptive than an alert sound. As a developer, I’d like to be able to build a specialized skill for a specific type of reminder, or notify the user that something happened as we do with push notifications to a mobile phone.

After living with the Echo a little longer, I’m starting to think that providing push notifications will not be so simple. I’d be interested in seeing research results from users experiencing voice push notifications. To share a specific experience: last weekend, my daughter and I were having a conversation when we experienced an Alexa false trigger. Out of the blue, Alexa began speaking and totally startled both of us. If I had been alone in a silent room, I am sure that it would have been even more startling had Alexa delivered a push notification. I think there is something unnatural about having something suddenly begin speaking to you. It’s like someone sneaking up behind you when you think you are alone.

I’ve been thinking that a smart voice timer would be very useful and have considered including a pre-voice announcement tone, but would that subtle notification be enough? Perhaps a two- to three-second tone that starts at low volume and then gently increases could eliminate the surprise effect.

In some instances it would also be useful to ask Alexa to repeat the push notification. However, we would need to consider and explore some of these scenarios and interactions:

  • How long should the notification play? Should Alexa only say it once? What if the user didn’t hear it entirely? What if there were other noises in the room? What if the user were too far away to hear the reminder and the timer was important? At some point, it is impossible not to notice the continuous timer tone. Perhaps, after the intro tone and the voice notification, and depending on the priority/importance level of the alert, it could be followed by the alert sound?
  • How loudly should Alexa speak? A person would automatically know at what level to speak to be heard without sounding rude or startling anyone. It gets more complicated with Alexa: there is no way for the device to judge the user’s distance and ability to hear. Looking at the Alexa companion mobile app, there is a volume control for the alert sound. I’m curious whether this control would be enough for a voice notification.

I’m sure Amazon is working diligently on enhancing the Echo and the Alexa experience, and I’m hoping that an updated Alexa Skills Kit will be available soon with extended flexibility for third-party developers. I’m also sure that the day will come when ambient computing can provide bidirectional interaction. The question is, how will Amazon approach this evolution?

Getting Alexa to chat with Azure

Our new Alexa Skill uses an Azure Web App as its Web service. The connection between the two works well, but we did have one issue that took some time to resolve.

When testing from the Developer Console, we received this generic error:

“The remote endpoint could not be called, or the response it returned was invalid.”

To get more details on this error, we spoke the command to an Echo and then checked the Alexa companion app. The app card provides a more useful error; in our case, it was about the SSL certificate:

“The certificate of the endpoint Resource [https://<appName>.azurewebsites.net/alexaWS], Type [HTTP] uses a wildcard domain name in its cname or subject alts: *.azurewebsites.net”

Azure Web Apps use a wildcard in the domain until you upgrade to a paid subscription and get a custom domain, but there is an easy way to make Alexa talk to your test Web service.

From the developer console, in the SSL Certificate step, select:

My development endpoint is a subdomain of a domain that has a wildcard certificate from a certificate authority

Then, it works like a charm. Amazon even provided NodeJS samples that we quickly integrated into our loopback.io server.

Publishing our SMS Skill

Once we were done polishing the Alexa SMS Skill, we pushed the submit button. We learned a lot while preparing this submission, and I suspect we have more to learn before we get final approval.

Validating the signature of Alexa requests

Our initial plan was to publish our skill, implemented in Node.js, as an Azure Web App. One Alexa requirement that made this a challenge is validating the signature of the request. Since our skill is not published as an AWS Lambda, our Web service needs to validate that each request actually comes from Amazon. The alexa-verifier library provides just that, including validating the certificate.
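The intended usage is simple. A sketch of the verification step in an Express-style handler, based on the library’s callback API (the route, the rawBody capture, and the handleAlexaRequest helper are our own names):

    // npm install alexa-verifier
    const verifier = require('alexa-verifier');

    app.post('/alexaWS', (req, res) => {
      verifier(
        req.headers.signaturecertchainurl, // URL of Amazon's signing certificate
        req.headers.signature,             // base64-encoded request signature
        req.rawBody,                       // the unparsed JSON body, exactly as sent
        (err) => {
          if (err) {
            // Not a genuine Alexa request: reject it.
            return res.status(400).json({ status: 'failure', reason: err });
          }
          // The signature checks out: handle the intent as usual.
          handleAlexaRequest(req, res);
        }
      );
    });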

It’s the certificate validation that has been a challenge. To do a validation, alexa-verifier first uses openssl-cert-tools, which then spawns an openssl process. Since we can’t start a process in an Azure Web App, that approach didn’t work.

Next, we tried node-x509, which worked in our dev environment. Unfortunately, it uses node-gyp, which comes with its own challenges on Windows when trying to deploy to Azure Web Apps. So, we gave up on that option as well.

As we work towards a continuous delivery cycle, these development-versus-production environment issues finally convinced us to start looking at Docker. Our skill is now published on Azure through docker-cloud. The Azure Container Service is an alternative that we are also considering for future deployments.
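Containerizing the skill is straightforward, since it is just a Node.js Web service. A minimal Dockerfile along these lines would do (a sketch, not our exact configuration; the port and entry file are assumptions):

    # Minimal container for a Node.js Alexa skill
    FROM node:6

    WORKDIR /app

    # Install dependencies first so the layer can be cached.
    COPY package.json .
    RUN npm install

    # Copy the source and start the Web service.
    COPY . .
    EXPOSE 3000
    CMD ["node", "server.js"]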

Completing the skill

To add a bit of additional polish to the skill, we did a few tweaks:

  • If the user initiates the skill with “Alexa, open SkillName,” we keep the conversation open by asking the user what they want to text. This occurs both in general use and after a secondary action such as setting the phone number. Once the main action of the skill is completed, sending a text in this case, we close the conversation (see the sketch after this list).
  • To comply with the interaction guidelines, we handle the Cancel and Stop commands in both cases by answering “Goodbye” and closing the conversation.
  • After looking at other skills, we added an answer for the “Who built you?” question which is like a spoken replacement for the about page of a mobile app.
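In the response JSON, keeping the conversation open versus closing it comes down to the shouldEndSession flag. A minimal sketch of the two cases (the prompt wording is just an example):

    // Keep the conversation open and prompt for the message text.
    {
      version: '1.0',
      response: {
        outputSpeech: { type: 'PlainText', text: 'What do you want to text?' },
        shouldEndSession: false
      }
    }

    // Once the text is sent, close the conversation.
    {
      version: '1.0',
      response: {
        outputSpeech: { type: 'PlainText', text: 'Message sent. Goodbye.' },
        shouldEndSession: true
      }
    }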

Saying “Macadamian” correctly

When we first implemented the “Who built you?” response, Alexa wasn’t saying Macadamian as we would have liked. This is fair, considering it is a brand name and not in the English dictionary. This became a great opportunity to try out the Speech Synthesis Markup Language (SSML) functionality of Alexa.

The simple approach to having Alexa say something is to send plain text and let Alexa do its standard text-to-speech conversion. When more control is needed, SSML is available; to make it work, you need an understanding of how phonemes work. After a few trial-and-error attempts, we settled on mækəˈdeɪmɪən and are sending our response as:

{type: 'SSML', speech: '<speak>I was built by the m 21 lab of <phoneme alphabet="ipa" ph="mækəˈdeɪmɪən">macadamian</phoneme></speak>'}

This allowed Alexa to say Macadamian just the way we wanted (or, at least close enough).

One issue we saw is that the Service Simulator in the developer console does not support Unicode: it displays the phoneme as mk??de?m??n and does not pronounce any of the vowels. In any case, the extent of the testing we can do with this console is limited. So, we did most of the testing with unit tests locally, then with an Echo once deployed.

Final testing of the app

Before the submission, our QA lead did a round of testing on the skill. The checklist provided by Amazon has proven to be really useful.

While doing our final testing against the guidelines from Amazon, a few things came up:

  • A skill needs to handle being invoked with partial intents. We handled this for changing the phone number, but not for text messages. While we didn’t expect many users to say “Alexa, ask SkillName to send a message,” followed by “The message is, I’m home,” we handled it to better cover every option.
  • The testing guidelines ask skills to handle Leave, Bye, and Goodbye like Stop and Cancel, but these words don’t trigger the built-in AMAZON.CancelIntent and AMAZON.StopIntent the way the other words do. So, following the same approach we used to make the Help interaction more natural, we added our own intent that extends how the user can end the conversation (see the sketch after this list).
  • The open prompt needs to include the skill’s name. In one of our design meetings, we had decided to make the prompt as short as possible and even considered adding a quick mode where it would be just a tone. But after reviewing the guidelines, we agreed that some added context can’t hurt, so we added a “Welcome to SkillName” intro.
  • Handling invalid data is tricky. When an intent is triggered by invalid data, handling it cleverly is a challenge, because Alexa often triggers the wrong intent. Since the skill does not receive what Alexa heard, even though that text is displayed in the Alexa companion app, it is hard to do any “smart handling.” If we knew that an intent such as “Who built you?” was triggered by something totally unexpected like “abc,” the skill could use that context, plus any previous partial intents the user triggered, to provide a more accurate response. For instance, instead of incorrectly saying our “About Us” message, we could ask the user something like: “Sorry, I didn’t understand what you meant. Could you phrase that another way?”
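As a sketch of that custom exit intent (ExitIntent is our hypothetical name; the schema format is the one the Skills Kit used at the time):

    // Intent schema: the built-in intents plus our own exit intent.
    {
      "intents": [
        { "intent": "AMAZON.StopIntent" },
        { "intent": "AMAZON.CancelIntent" },
        { "intent": "AMAZON.HelpIntent" },
        { "intent": "ExitIntent" }
      ]
    }

Sample utterances then map the extra words to it:

    ExitIntent leave
    ExitIntent bye
    ExitIntent goodbye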

Selecting the name of the skill

Selecting the name of the skill triggered some long discussions. One name we played with was “My Friend.” Since we focus on a single destination, this seemed like a good choice, but it does not cover all usage scenarios; for example, someone might want to use the skill to check in with a caregiver.

We decided to submit the skill with the name “Dash.” One reason we picked it is that it is easy to say in a fluid sentence and hints at the limitation of one-way communication. One concern we had is that the guidelines ask for the invocation name to be at least two syllables. We still went with it, knowing that if it was declined, we had a replacement that we didn’t like as much.

Sure enough, our Alexa Skill was not approved on the first iteration. Our choice of name was mentioned in the comments, although not for the expected reason. As it turns out, Amazon named its order button: Amazon Dash. So, we will re-submit with our alternate name: Scribe.

Next steps

Now that we have submitted our first skill, our exploration into Alexa is far from over. We have been ordering more Echo devices to get more designers and developers to experience it, and we are getting some shipped to our offices in Romania and Armenia to get the staff there on board too.

Once we get feedback from the certification of this skill, our priority will be to adapt it to get it through the review process. While we wait for their response, we are working on other ideas that will enable Alexa to control some IoT devices.

But wait, there’s more: What about having Alexa make phone calls?

As an extension of our Alexa Skill that sends SMS messages, we used Twilio to build a concept that allows Alexa to make phone calls. Well, to be clear, Alexa itself cannot initiate the phone call, but Alexa can make our phone call one of our contacts using a click-to-call flow.

The trigger

To trigger the flow, the user says something like:

Alexa, open My Skill

My number is 613-555-1111

The number of my friend is 613-555-2222

Call my friend

Establishing the call

Once the user says “call my friend,” Twilio places a call to the user’s phone. When the user answers, they hear a voice message, “I’m connecting the call for you,” followed by a ring-back tone. At this point, the friend’s phone rings, and the two are in a two-way call as soon as the friend answers.

Twilio API

The Twilio API that makes this happen is a bit more complex than the simple case of sending an SMS message. To control the call, our Alexa Skill needs to expose a webhook that Twilio calls to find out what to do with the call. In response to that request, we send a Twilio Markup Language (TwiML) XML document that specifies the voice message to be spoken, along with the instruction to dial the phone number of the contact.
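The TwiML itself is short. For this flow, it looks something like the following (the number is a placeholder):

    <?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <!-- Spoken to the user as soon as they answer. -->
      <Say>I'm connecting the call for you</Say>
      <!-- Then dial the friend's number and bridge the two calls. -->
      <Dial>+16135552222</Dial>
    </Response>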

Since we need our skill to connect the call to any number, it needs to remember where to connect each specific call. There is no session established during this flow, but Twilio provides the ID of the call we created, so we use Redis to store the context of the request. In this case, we kept it simple: we pre-build the TwiML response and store it in Redis. Our Twilio webhook then just uses the call ID as a key, gets the response, and sends it back to Twilio to make the click-to-call flow work.
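A rough sketch of that webhook in Node.js, assuming an Express app with form-body parsing and the node redis client (the route and key scheme are our own):

    // npm install redis
    const redis = require('redis');
    const store = redis.createClient();

    // When placing the call, we pre-build the TwiML and key it by call ID.
    // Expire after an hour: the call is handled long before then.
    function storeTwiml(callSid, twiml) {
      store.setex(callSid, 3600, twiml);
    }

    // Twilio posts here to ask what to do with the answered call.
    app.post('/twilioWebhook', (req, res) => {
      store.get(req.body.CallSid, (err, twiml) => {
        res.type('text/xml');
        res.send(twiml); // the pre-built <Response> with <Say> and <Dial>
      });
    });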

How could this be useful?

While we built this skill just as a fun concept, a skill like this could be useful, for example, for elderly users who still use a home phone but often forget the numbers of their primary contacts. Alexa would be a convenient voice-activated quick dial for their home phone.

Before making something like this publicly available, Twilio’s voice call pricing is something to consider. If users start using this functionality for long phone calls, at a price of about $1 per hour, a skill like this could quickly become expensive to operate. So, I’m not sure we will ever end up publishing this skill unless we have a way to monetize the usage to cover the costs.



Author Overview

Martin Larochelle

Martin Larochelle has been with Macadamian since 2005. In his ten years with the company, he has tackled projects both big and small as Chief Architect. An expert in C++ and VoIP, his focus has been on mobile platforms. Martin was instrumental for all things BlackBerry, providing technical leadership and project oversight. Martin now leads the Macadamian Innovation Lab, a team focused on developing concepts to solve the needs of small and medium businesses and key verticals such as healthcare. While we're all a little nuts at Macadamian, Martin counts himself as the biggest HeadBlade fan in Canada.
  • BajaBarry

    FYI – there’s a free (up to a limit) phone calling skill from Ooma. It is set up with your number. Then, when you ask it to dial a number, it asks if you want to assign a name, so you can use the name afterward.

  • Daniel Roy

    Good article! I am new to Alexa (Echo Dot, generation 2) and am trying to create custom Alexa skills integrated with Azure, so I am glad to hear that it’s possible! Good work making that setup work!