Voice vs Speech

Voice vs. Speech… How vs. What

Bruce E. Peoples, Cognitive Scientist, RankMiner Inc.

RankMiner is a Predictive Analytics engine that enhances a company’s business performance by predicting future customer outcomes and prioritizing them to maximize value. Our strength is a layered, proprietary approach to voice analytics that optimizes business decisions.

This technology is ushering in a new wave of analytics focused on the emotional aspects of the voice, determining and understanding the cognitive behavior behind what is being said… a new paradigm that enables modern machine learning techniques to understand the true meaning behind the words used.

Speech Analytics is quite different from Voice Analytics. Speech Analytics focuses on WHAT is being said: keywords and key phrases. Speech Analysis Systems depend on converting speech to text and then performing the analysis on those keywords and phrases. The resulting information can be used to analyze live and recorded phone conversations in a variety of business domains.

Speech Analysis Systems have limitations. The process is costly, cumbersome, and requires human expertise and manual updates. Specifically, a dictionary must be created and maintained for each language and sub-language encountered. These dictionaries must contain the keywords and phrases germane to a particular business as well as the slang people actually use. Another limitation is the Word Error Rate (WER): the rate at which the system mis-transcribes speech into text. Depending on the speech-to-text components used, the WER can be quite large, making accurate analysis difficult. Lastly, Speech Analysis Systems cannot determine the cognitive state of human speakers, such as the emotional aspects of what is being said. Without cognitive state information, the context of what is being said cannot be determined accurately, which may invalidate assumptions about what is actually being communicated.
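To make the WER limitation concrete, here is a minimal sketch of how WER is conventionally computed: the word-level Levenshtein (edit) distance between the reference transcript and the system’s hypothesis, divided by the number of reference words. The function name and the example strings are illustrative only; they are not taken from any particular Speech Analytics product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A transcriber that hears "you're right" as "your write" gets
# 2 errors over 2 reference words: WER = 1.0.
print(word_error_rate("you're right", "your write"))
```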

Voice Analytics focuses on human vocal intonations to determine emotions… the moods, attitudes, and personality-based behaviors behind what is being said… the HOW aspect of WHAT is being said. According to psychology studies, about 90% of the total impact we make through our verbal communications has nothing to do with our choice of words. We depend on “our ear” to determine the context of what is being said, for example to determine what the phrase “yeah, right” really means in a conversation. Unlike Speech Analytics, Voice Analytics is language independent and requires no dictionaries.

Voice Analysis Systems rely on audio segmentation, acoustic feature extraction, and classification algorithms to determine the emotional context of what is being said. Emotion recognition starts with processing a speech signal and calculating features from the voice waveform contained in the signal. These features include pitch, energy, formants, amplitude, frequencies, duration, loudness, Mel-frequency cepstral coefficients (MFCCs), wavelets, and other spectral properties. Statistical functions such as mean, maximum, minimum, and variance are then applied to form the feature vectors used in the analytic process, as sketched in the first example below. An Emotional Voice Database, also known as an EMO Database, is then used to compare the features generated by a Voice Analytic System with known features associated with emotions such as happy, sad, angry, terrified, afraid, excited, and neutral, in order to classify the emotional states of the voices in the waveform, as sketched in the second example below.

This provides modern machine learning techniques the data needed to understand the true meaning of what is being said. Armed with this data, leading-edge technologies such as those patented by RankMiner can accurately determine speaker emotional states and use that information to predict human behavior in almost any domain and defined context, for example in call centers, financial institutions, schools, airlines, militaries, and governments.
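As a rough illustration of the feature-extraction step described above, the sketch below uses the open-source librosa library to pull pitch, energy, and MFCC tracks from a recording and reduce them with statistical functions (mean, minimum, maximum, variance). This is a generic pipeline under assumed choices (13 MFCCs, a hypothetical input file call.wav); it is not RankMiner’s proprietary feature set.

```python
import numpy as np
import librosa

def extract_emotion_features(path: str) -> np.ndarray:
    """Turn a recording into a fixed-length vector of statistical functionals."""
    y, sr = librosa.load(path, sr=None)                # keep native sample rate

    # Frame-level acoustic features: pitch (F0), energy, and 13 MFCCs.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # typical speech F0 range
    energy = librosa.feature.rms(y=y)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Collapse each feature track with the statistical functions named in the text.
    def stats(x):
        x = x[~np.isnan(x)]                            # pyin marks unvoiced frames as NaN
        return [x.mean(), x.min(), x.max(), x.var()]

    vector = stats(f0) + stats(energy)
    for row in mfcc:                                   # one set of stats per coefficient
        vector += stats(row)
    return np.array(vector)                            # 2*4 + 13*4 = 60 features

features = extract_emotion_features("call.wav")        # hypothetical input file
print(features.shape)                                  # (60,)
```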
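And to show how such feature vectors are matched against labeled emotional speech, here is a minimal classification sketch using scikit-learn. The arrays X and y are random placeholders standing in for feature vectors and emotion labels drawn from an EMO database; the classifier choice (a support vector machine) is an assumption for illustration, not the patented RankMiner approach.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one 60-dim feature vector per utterance (see the sketch above);
# y: the emotion label annotated for that utterance in the EMO database.
# Random placeholders here; a real system loads these from labeled recordings.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60))
y = rng.choice(["happy", "sad", "angry", "afraid", "neutral"], size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features, then fit a support vector classifier on the labeled emotions.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Classify the emotional state of new, unlabeled voice features.
print(model.predict(X_test[:3]))
print("held-out accuracy:", model.score(X_test, y_test))
```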