Bulk Reviewing Chatlogs in Excel

Follow these steps to learn how to review questions asked of the virtual assistant that it is unable to answer.

What to do with the conversation data that comes in regularly for training

First, log in to the dashboard and go to the home page.

Chatlog review - dashboard

At the top right corner, under “Daily Training Data”, select the date range and click on “Download” to download the file.

chatlog review - date download

The downloaded file looks like this:

The main columns to note are:

  • Prediction: The name of the intent predicted by the model (empty means the input was given a fallback response)
  • Question: The question that the user asked
  • appSource: The user’s entry point, i.e. which institution’s website or app the question was asked from
  • Confidence_level: The confidence level (between 0 and 1) of the model for the intent prediction (empty means the input was given a fallback response)
  • Custom_id: Timestamp when the user’s question was received
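
If you would like to inspect the download programmatically before starting, here is a minimal pandas sketch. The file name and the CSV format are assumptions (use pd.read_excel if the export is an Excel workbook), and the column headers are taken from the list above; check the actual spelling in your file.

```python
import pandas as pd

# Hypothetical file name; adjust to the file you downloaded from the dashboard.
# If the export is an Excel workbook, use pd.read_excel("daily_training_data.xlsx") instead.
df = pd.read_csv("daily_training_data.csv")

# Confirm that the columns described above are present (spelling may differ in your export).
print(df.columns.tolist())
# Expected to include something like:
# ['Prediction', 'Question', 'appSource', 'Confidence_level', 'Custom_id']

# Blank predictions correspond to fallback responses.
print("Fallback rows:", df["Prediction"].isna().sum())
```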

Here are the steps:

  1. Filter the “appSource” column by the department, institution, website or app chat entry point you are managing
  2. Add a column named “correct_intent”. This column is for you to fill in the correct intent for questions that were answered wrongly, or for questions that you would like to add to the training dataset
  3. In the “prediction” column, filter for the blank predictions. These are the questions that were given a fallback response
Chatlog review - 5
  4. Classify these fallback questions:
    • If a question is within scope and belongs to an existing intent, or is commonly asked, fill in the correct intent under the “correct_intent” column.
    • If a question is within scope but not commonly asked, label it “KIV” under the “correct_intent” column and ignore it for the time being.
    • If a question is out of scope but commonly asked, fill in the new intent name under the “correct_intent” column.
    • If a question is out of scope and not commonly asked, label it “KIV” under the “correct_intent” column and ignore it for the time being. *Or you may follow the decision tree after this section
Chatlog review - 4
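
If you prefer to prepare the file with a script instead of Excel, here is a minimal pandas sketch of steps 1–4. The file name, the appSource value and the exact column spellings are assumptions; the labelling in step 4 still has to be done by hand, so the sketch only exports the fallback rows for review.

```python
import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

# Step 1: keep only the entry point you are managing (hypothetical value).
df = df[df["appSource"] == "my-institution-website"]

# Step 2: add an empty column for the correct intent labels.
df["correct_intent"] = ""

# Step 3: fallback questions are the rows with a blank prediction.
fallback = df[df["Prediction"].isna() | (df["Prediction"] == "")]

# Step 4 is a manual decision per question (existing intent, new intent or "KIV"),
# so export the fallback rows for labelling.
fallback.to_csv("fallback_to_review.csv", index=False)
```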

When you are done with step 4, it is time to check the questions that were given an answer by the model.

  5. Unfilter the “prediction” column, then filter out the blank predictions (i.e. show only the rows that received a prediction)
Chatlog review - 5
  6. Sort the “confidence_level” column in ascending order (i.e. Sort Smallest to Largest)
Chatlog review - 6
  7. Check whether each prediction is wrong: if wrong, type the correct intent in the “correct_intent” column. If correct but the confidence level is low (e.g. below 0.4), still add the intent name in the “correct_intent” column. If correct and the confidence level is high, you may leave it as it is, i.e. there is no need to fill in the “correct_intent” column
Chatlog review - 7
Note: In step 4 or 5, if you encounter any gibberish, e.g. “Hhhhgg”, do not add it to the training data. You can ignore it.
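
The same goes for steps 5–7: the sketch below keeps only the rows that did receive a prediction, sorts them by confidence and flags the low-confidence ones for manual checking (the 0.4 cut-off follows the example in step 7 and is only a suggestion; the file name is the same assumption as above).

```python
import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

# Steps 5–6: keep only the rows with a prediction, sorted smallest to largest confidence.
answered = df[df["Prediction"].notna() & (df["Prediction"] != "")]
answered = answered.sort_values("Confidence_level", ascending=True)

# Step 7: low-confidence predictions deserve a closer look; whether a prediction
# is actually wrong still has to be judged manually.
low_confidence = answered[answered["Confidence_level"] < 0.4]
low_confidence.to_csv("low_confidence_to_review.csv", index=False)
```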

Tips to help you while working on the above:

  1. Have the dashboard open in the background while you are working on this exercise:
    1. If you are unsure of the intent name to classify to, search in the FAQ and copy-paste the intent name into the “correct_intent” column
    2. Before creating a new intent, search for the question/keywords to check whether they have been added before. This is extremely important; otherwise, you might confuse the model by adding the same or a similar intent
  2. Do not let your review pile up, as some older questions might already have been trained on and improved over time
  3. You can use the “Remove Duplicates” function in Excel to remove repeated questions and shorten the dataset to review (a scripted equivalent is sketched after this list)
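
For tip 3, the Excel “Remove Duplicates” step has a direct pandas equivalent; this is a minimal sketch assuming the file and column names used above.

```python
import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

# Equivalent of Excel's "Remove Duplicates" on the Question column:
# keep one row per distinct question to shorten the review.
deduplicated = df.drop_duplicates(subset=["Question"])
print(f"{len(df)} rows before, {len(deduplicated)} after removing duplicates")
```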

Thought process of whether to add as a new intent

When you encounter a new utterance or entity, you may wonder whether to add it to the training set as an example of an existing intent/entity, or to add it as a new intent/entity.

Here is the thought process flowchart:

chatlog review- thought process

Frequency

Depending on the volume, it is best not to review a few months’ or years’ worth of data at once, as it will be very hard to sort and monitor. Older data might also no longer be relevant, as the model may already have been trained on it.

We suggest reviewing on at least a monthly basis. Consistent review keeps the data current and easier to manage.

Background Reading

Go through the list of intents

  • When looking through the chats, you will then know which intents already exist and which should be created

Familiarize yourself with the entities

  • Look out for new synonyms to add

Additional observations to pick up

Monitoring the virtual assistant’s performance can be done in a few different ways. One is to monitor only the user questions and the bot’s answers. But in some cases, monitoring is not complete without reading through the full chatlogs, which give context to the user data.

The purpose of doing a chatlog analysis is to continuously improve our prediction model and keep it relevant. On top of the ongoing training, we should also pick up valuable insights as we monitor the chatlogs:

  • Any interesting user patterns?
    • Demand for new topics
    • Seasonal topics/intents that we need to address e.g. Covid policies / Visiting information to premises / Questions on new vaccinations or medical policy
    • Follow-up questions for certain intents
    • Interesting insights e.g. people say “life chat” for live chat
    • Conversations of high value to the company/industry
    • Pre-login / post-login observations (if any) – differences in the type of questions asked
  • Opportunities to improve empathy
    • E.g. missing “no” or “others” option
    • E.g. the bot falls back easily even though the predicted intent is actually correct > in this case, we should consider lowering the confidence threshold
    • E.g. the bot falls back straight away and the user is upset about it > in this case, should we have a fallback counter so that the user gets a chance to rephrase their question? Or can the virtual assistant provide a few more options for the user to select, attempting to solve their question first?
    • Any examples of empathy or ways of explaining that the live agent did well and that I can copy?
    • Any parts of the conversation where the user got upset or dropped off because of a lack of empathy?
  • Any useful answers given by live chat agents?
  • Where should I improve conversation design? (maybe the intent is correct, but the user still did not get what they wanted or was still confused)
  • Which areas are hard to solve, such that I decide not to solve them?
    • E.g. too many API calls that cannot be solved immediately
    • How should we handle such a situation?
    • Can I solve it by escalating to live chat, with a hard-coded trigger or condition, or by giving a menu (disabling user input) to force the user to select an option?
  • What tools do I use?
    • E.g. if a conflict is identified in the dataset, use the deconflicting tool to resolve it
    • Can we use a word cloud? Try it for one-, two-, three- and four-word phrases to watch out for patterns (a small phrase-counting sketch follows this list)
  • Comparison of topics asked about when agents were online vs. offline (either because it was after work hours or an agent was not available)
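
As a lightweight alternative to a word cloud, you can count one- to four-word phrases across the questions to spot recurring patterns. This is only a sketch: the tokenisation is deliberately simple, and the file and column names are the same assumptions as above.

```python
from collections import Counter
import re

import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

def top_ngrams(questions, n, k=20):
    """Return the k most common n-word phrases across the questions."""
    counts = Counter()
    for q in questions.dropna():
        words = re.findall(r"[a-z']+", str(q).lower())
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

# Look at 1-, 2-, 3- and 4-word phrases, as suggested above.
for n in range(1, 5):
    print(f"--- top {n}-word phrases ---")
    for phrase, count in top_ngrams(df["Question"], n):
        print(count, " ".join(phrase))
```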

Tips to find out if a question is commonly asked:

Observe the “user ID” column: when a question is asked repeatedly, have a look at the user ID. If it is the same person, you can ignore it, as it is likely someone spamming or repeating the same question. However, if many different people are asking the same question, there is a good chance that it is a valid, commonly asked question.
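
If you are scripting the review, this check can be automated: count how many distinct users asked each question, so that questions repeated by a single user (likely spam) are separated from genuinely common ones. The “user ID” column name is taken from the tip above and may be spelled differently in your export.

```python
import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

# For each question, count how many distinct users asked it and how often it appeared.
popularity = df.groupby("Question").agg(
    distinct_users=("user ID", "nunique"),
    times_asked=("user ID", "size"),
).sort_values("distinct_users", ascending=False)

# Questions asked by many different users are likely valid, commonly asked questions;
# questions repeated many times by a single user are probably spam and can be ignored.
print(popularity.head(20))
```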

Monitor Chatlogs from Ratings

Ratings are the best feedback, as they are direct feedback from the users of the virtual assistant. It is also important that we address these ratings: users are giving us a chance to improve, and it is better that they leave their feedback with the virtual assistant than post it elsewhere and blow the matter up.

Good feedback is also an indication that the bot is moving in the right direction, and it can be a great case study of good user experience.

The outcome that we want to achieve:

Scenario 1: If the bot has ratings collected for conversation flow

  • We can observe which conversation flows are good and helpful, and which can be improved on
  • Sort by user ratings – 1-2 stars (what needs to be improved) and 4-5 stars (what went right); see the sketch after this list
  • Possible scenarios:
    • Long flows and many drop-offs
    • Seemingly irrelevant questions where users’ expectations are not managed
    • Lack of empathy
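
If you can export the conversation ratings to a file, splitting the 1-2 star and 4-5 star conversations is straightforward to script. The export format, file name and column names below are assumptions, not the dashboard’s actual schema.

```python
import pandas as pd

ratings = pd.read_csv("conversation_ratings.csv")  # hypothetical export and file name

# 1-2 stars: conversations to improve; 4-5 stars: conversations that went right.
needs_improvement = ratings[ratings["rating"] <= 2].sort_values("rating")
went_right = ratings[ratings["rating"] >= 4].sort_values("rating", ascending=False)

print("To improve:", len(needs_improvement), "| Went right:", len(went_right))
```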

Scenario 2: If the bot has ratings for live chat

  • We can observe which conversations the virtual assistant cannot yet answer
  • Observe what people like about the live agent responses and how the virtual assistant can manage user expectations like a human

To monitor chats, whether for the virtual assistant only or for the virtual assistant with live chat, go to “Live chat” > “Monitor”.

Monitor Chatlogs from Fallback

From fallbacks (i.e. questions that do not get a predicted intent), we can search for the conversations that these fallbacks come from. Read through the conversation from the start to find out at which point the conversation “failed”.

  • Is this a common trend?
  • Is it a follow-up question for which the bot does not have context? – For this scenario, we should add guided options to guide users through the next point of the conversation
  • Did the person try multiple times and still not get the answer? – For this scenario, we can design better fallback handling, e.g. after two failed attempts, we offer live chat (a sketch for spotting such repeated failures follows this list)
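
To spot the conversations that failed repeatedly, you can count fallbacks per conversation in the same download. The “conversation_id” column name is hypothetical; use whatever conversation or session identifier your export provides.

```python
import pandas as pd

df = pd.read_csv("daily_training_data.csv")  # hypothetical file name

# Fallbacks are the rows with a blank prediction.
df["is_fallback"] = df["Prediction"].isna() | (df["Prediction"] == "")

# Count fallbacks per conversation ("conversation_id" is a hypothetical column name).
fallbacks_per_conversation = df.groupby("conversation_id")["is_fallback"].sum()

# Conversations with two or more fallbacks are good candidates for better fallback
# handling, e.g. offering live chat after the second failure.
repeated_failures = fallbacks_per_conversation[fallbacks_per_conversation >= 2]
print(repeated_failures.sort_values(ascending=False).head(20))
```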