Implementing Emotions in Computer Systems with Text-to-Speech

Luis Bacca

School of Computing, DePaul University
243 South Wabash, Chicago, IL, 60604
baccal@icloud.com

ABSTRACT
The objective of this research is to learn more about Text-to-Speech (TTS) and how new technologies could make synthesized voices sound less robotic and more expressive, while exploring how TTS can change the way we interact with systems, in particular in e-commerce, education, and communications. For example, TTS with personality can support sales of products online or help users make educated decisions. In education, TTS can assist students in revising essays or help teachers lecture. Additionally, TTS can help robots and avatar technologies communicate better by developing expressions and movements that make them more persuasive communicators.

KEYWORDS
Text-to-Speech, Social Richness, Social Presence

LITERATURE REVIEW
Ever since I discovered that my MacBook Pro was able to speak, I have believed in the idea that one day users could interact with their systems in an interpersonal manner, much like in the movie "Her," where Samantha (an intelligent personal assistant) and the user engage in a relationship so deep and realistic that the user's external world and social skills begin to change in significant ways. Nowadays, with systems like Siri, Alexa, and artificial intelligence, I wonder how long it will be before a system similar to Samantha becomes part of our everyday lives. Systems with personality, expression, and emotions have been studied in the past. Consequently, my goal in this research is to understand what scientists in the field have been doing with text-to-speech in relation to e-commerce, emotional speech, and college essays. TTS can be a very technical topic, at times far surpassing my knowledge, but my fascination with this technology makes me wonder when society will be able to interact with a system the way we interact with a close friend. When this happens, I believe HCI will play an important role in how we communicate with such systems.

The problem with text-to-speech is that the voice sounds too synthetic or robotic, and as a consequence, listening to TTS can be tedious. Because it lacks emotion and expression, in many cases users cannot even identify where a new topic starts, since TTS makes no distinction when it reads a title. I firmly believe that once smart personal assistant technologies can make these distinctions, by overcoming the technical barriers to expression and building better artificial intelligence, computers will no longer be mere tools; they could become intelligent, expressive, and even emotional systems. Our relationships with systems could then change, since communication with these futuristic systems would be more efficient and perhaps more personal. Hence, the objective of the scientific community is to bring forward ideas and theories that overcome the obstacle of expression in systems. I am under the assumption that to improve smart personal assistants such as Alexa or Siri, scientists in the field will need to develop text-to-speech further so that systems can identify emotions through reading patterns.
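
One way developers can already make such distinctions explicit is SSML, the W3C Speech Synthesis Markup Language, which lets text be annotated with pauses and pitch changes so that, for example, a title is read differently from body text. The sketch below is a minimal, hypothetical illustration using Amazon Polly via the boto3 library; the voice, prosody values, and file names are my own assumptions, not recommendations from the literature.

```python
# Minimal sketch: marking up a title with SSML so TTS reads it distinctly.
# Assumes AWS credentials are configured; the voice and prosody values
# below are illustrative choices, not recommendations.
import boto3

ssml = """
<speak>
  <prosody pitch="+15%" rate="slow">Interface Gulfs</prosody>
  <break time="800ms"/>
  One of the technologies changing e-commerce is text-to-speech.
</speak>
"""

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",       # interpret the markup instead of reading it aloud
    OutputFormat="mp3",
    VoiceId="Joanna",      # hypothetical voice choice
)

with open("section.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```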

Interface Gulfs
One of the technologies that has been changing the e-commerce industry is text-to-speech software. These programs convert computer text into spoken words using a synthesized machine voice, reading from any text-based source such as HTML (Garrison, 2009). Norman's theory of the gulfs of execution and evaluation can be used to understand how developers are working to bridge the gap between physical systems and users' goals, since those gulfs involve intentions, action specification, and interface mechanisms. One study that applies this theory was conducted by Lingyun Qiu and Izak Benbasat, who found that websites with text-to-speech functionality can offer users a superior experience compared to sites without it. Consequently, integrating human-like assistance into websites could make them more pleasurable to use and could increase users' trust in them.
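
To make the basic mechanism concrete, the following is a minimal sketch of offline speech synthesis; I use the open-source pyttsx3 library here as a convenient assumption, since any comparable TTS engine would serve.

```python
# Minimal sketch: converting text into a synthesized voice with the
# open-source pyttsx3 library (install with `pip install pyttsx3`).
# pyttsx3 drives the platform's built-in speech engine, so it runs offline.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)    # speaking rate in words per minute
engine.setProperty("volume", 0.9)  # volume from 0.0 to 1.0

engine.say("Welcome back. Would you like help revising your essay?")
engine.runAndWait()                # block until the utterance finishes
```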

Social Presence
To improve these systems, Qiu and Benbasat proposed combining low-cost multimedia technologies: text-to-speech voice technology and 3D avatars. The idea was to enrich consumers' shopping experiences by adding "social presence," based on the idea that a communication medium that engages more human senses can generate stronger feelings of social presence (Shadiev, Hwang, & Huang, 2014). According to Biocca et al. (2003), social presence is useful for modeling how users sense the existence of people who are in separate locations. The impression of being with another person creates "social richness," the degree to which a medium is perceived as sociable, warm, sensitive, personal, or intimate when it is used to interact with other people (Shadiev, Hwang, & Huang, 2014).

The second objective of the study was to find methods to improve "flow," which the authors describe as a pleasant experience in which users are wholly absorbed in their activity, engaging with the system playfully and exploratorily. To accomplish this, it was critical to identify four dimensions of flow: control, attention focus, curiosity, and intrinsic interest. For example, when customers interact with customer service online, they have fewer social and emotional cues available for sensing the other person's character or integrity than in physical stores, where they can have a face-to-face conversation. It is therefore difficult to give users the impression that they are being attended to well. When using TTS and 3D avatars, on the other hand, extra sensory cues can be conveyed as users interact with a human-like virtual actor that talks and moves in a convincingly simulated setting (Gupta, Banville, & Falk, 2017). Because of the simulated environment, users could experience higher stimulation, which could increase the perceived level of social presence. The catch is that the technology will have to improve significantly first.

In HCI, embodied interaction theory seems to be an essential aspect of social presence, because it focuses on interaction that is realistic enough to become ubiquitous in its setting through physical and emotional expression. By improving the personal connection between websites and users, the bridge across the gulf of execution could become more universal, because systems that use TTS or an avatar can improve intentions, action specifications, and interface mechanisms. With media-rich information channels that closely approximate face-to-face communication added to an interface, the gap between the consumer and a website could be significantly diminished.

Technology as Experience
Much research has been conducted to document the benefits and limitations of text-to-speech. One area is telephone-based transactions, where the technology has saved businesses millions of dollars by using computerized voices to handle calls. TTS can also be useful for machine translation or for helping ESL students learn languages. Even more significantly, TTS has been extremely useful in the area of disability studies (Garrison, 2009).

Consequently, TTS can change users' experiences in significant ways, and designers will need to build better relationships between this technology and how people perceive it. In one example of this approach, TTS enabled students with disabilities to catch a greater share of their mistakes than they did with no assistance, or with the limited assistance of having someone else read their essays aloud (Garrison, 2009). Tammy Conard-Salvo (2004) discussed at the Computers and Writing conference how TTS compares to writing center tutorials at Purdue University. Her research suggested that incorporating TTS in a writing center could be positive because it captures students' reactions when they hear their own essays; programs like the Natural Reader software can thus help with composition and structure.

Emotions in Computer Systems
Nevertheless, it is important to recognize that TTS needs emotions, because emotional capability can be beneficial in numerous fields, such as text-to-scene processing, cognitive ergonomics, and game design, to name a few. Cognitive ergonomics in particular could benefit hugely from using machine learning methods to automate tasks and to develop situations that adjust to human behavior and sensations (Calix, Javadpour, & Knapp, 2012).

Tapus, Tapus, and Mataric (2008) conducted a study showing that subjects perform better in rehabilitation tasks when helped by computerized systems that respond appropriately to their practice. They also noticed that robots using TTS could improve behavior and cognitive processing in children with autism. But again, the lack of emotion in these systems was a significant barrier that needed to be overcome to create closer relationships with them.

Other studies have found that recognizing emotions from language is a tough task for programmers, since emotions are subjective and the set of features that capture emotion in language is not clearly defined; vocabulary-based methods alone are not sufficient. Acoustic elements are regularly related to feelings, but noise and other factors can degrade a system's ability to predict emotion, which makes building an electronic emotion detector very challenging. Chuang and Wu (2004) showed that emotions in speech are connected to prosodic qualities such as pitch and energy: the pitch for happiness or anger is higher than for sadness, and the energy in speech correlated with anger or surprise is higher than the energy associated with anxiety. Compounding the problem, people can hold multiple emotions simultaneously (Calix, Javadpour, & Knapp, 2012). Even so, developing emotion in systems could be powerful in areas such as call center queuing and security applications that detect people's emotions, and it could also help speech understanding (Nicholson et al., 1999).
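
To make these prosodic features concrete, the sketch below extracts pitch and energy contours from an audio file with the librosa library. It is a minimal illustration only: the file name is a placeholder, and the closing comment restates the Chuang and Wu finding rather than implementing a classifier.

```python
# Minimal sketch: extracting the two prosodic features Chuang and Wu link
# to emotion, pitch (fundamental frequency) and energy (RMS), with librosa.
import numpy as np
import librosa

# "utterance.wav" is a placeholder file name.
y, sr = librosa.load("utterance.wav", sr=None)

# Pitch contour via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Energy contour as frame-wise root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

mean_pitch = np.nanmean(f0)   # ignore unvoiced (NaN) frames
mean_energy = rms.mean()

# Illustrative summary only: higher mean pitch and energy are associated
# with anger or happiness, lower values with sadness (Chuang & Wu, 2004).
print(f"mean pitch:  {mean_pitch:.1f} Hz")
print(f"mean energy: {mean_energy:.4f}")
```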

On the other hand, I also believe that assimilating people's emotions, by combining intelligent machines with neuroimaging, could someday help us reach the goal of human-machine speech interaction, building on the components of Natural Language Processing (NLP). Some of the subfields involved are: Spoken Language Understanding (SLU), whose goal is to extract meaning from spoken words; Spoken Language Generation (SLG), which focuses on generating spoken words from semantic models; human-machine dialogue management, which focuses on systems that can sustain a dialogue the way humans do; and, finally, recognition and production of vocal emotions, which focuses on modeling human emotions and the features of emotional human speech (Delic, Secujski, Jakovljevic, et al., 2013). Factors such as culture and education make emotional speech recognition even more challenging to develop. A skeleton of how these components fit together is sketched below.
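
The following sketch shows how the SLU, dialogue management, and SLG stages might be wired together in code; every class, rule, and template here is hypothetical and serves only to illustrate the flow, not any system from the literature (a real TTS engine would voice the final string).

```python
# Hypothetical skeleton of the spoken-dialogue pipeline described above:
# SLU extracts meaning, the dialogue manager decides what to do, and SLG
# turns the decision back into words for a TTS engine to voice.

class SLU:
    """Spoken Language Understanding: map an utterance to an intent."""
    def parse(self, utterance: str) -> dict:
        text = utterance.lower()
        if "weather" in text:
            return {"intent": "get_weather"}
        return {"intent": "unknown"}

class DialogueManager:
    """Choose the system's next action from the parsed intent."""
    def next_action(self, frame: dict) -> str:
        if frame["intent"] == "get_weather":
            return "report_weather"
        return "ask_clarification"

class SLG:
    """Spoken Language Generation: render an action as text for TTS."""
    def realize(self, action: str) -> str:
        templates = {
            "report_weather": "It looks sunny today.",
            "ask_clarification": "Sorry, could you rephrase that?",
        }
        return templates[action]

# One turn through the pipeline.
slu, dm, slg = SLU(), DialogueManager(), SLG()
frame = slu.parse("What's the weather like?")
reply = slg.realize(dm.next_action(frame))
print(reply)  # -> "It looks sunny today."
```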

GOMS
My primary goal was to learn more about how the scientific community is working to improve our relationships with systems, knowing that people are unpredictable and difficult to model. With new technologies such as artificial intelligence, I believe it is possible to conclude that someday machines will be able to interact with people on a personal level, because artificial intelligence can learn from patterns. The issue is that this type of interaction could take some time to develop fully due to technical restrictions. When it happens, Norman's gulfs could narrow with respect to communication with computer systems by improving the human-machine dialogue, and in doing so, the technology could change the way people perceive and engage with communication mechanisms far more powerful than today's interactive visual systems.

Since GOMS (Goals, Operators, Methods, and Selection rules) builds on the Model Human Processor, it concentrates on the perceptual, cognitive, and motor systems, and it could help improve aspects of human expression. GOMS could be useful for developing new tactics for interactive systems by analyzing human communication, in particular TTS, body movements, and facial expressions. From the research, it was clear to me that, intuitively or not, some of the studies have been using GOMS methods by focusing on these three interconnected systems to create or improve interaction with computer systems.
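
As a concrete, if simplified, illustration of the GOMS family, the sketch below estimates task time with the Keystroke-Level Model, using the commonly quoted operator times from Card, Moran, and Newell; the example task is hypothetical.

```python
# Minimal Keystroke-Level Model (KLM) sketch: estimate how long a routine
# task takes by summing standard operator times. The operator values are
# the commonly quoted Card, Moran, and Newell estimates; the task is made up.

OPERATOR_SECONDS = {
    "K": 0.20,  # keystroke (skilled typist)
    "P": 1.10,  # point with the mouse
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_estimate(sequence: str) -> float:
    """Sum operator times for a KLM sequence such as 'MHPK'."""
    return sum(OPERATOR_SECONDS[op] for op in sequence)

# Hypothetical task: decide what to say (M), move hand to mouse (H),
# point at the "speak" button (P), click it (K).
print(f"{klm_estimate('MHPK'):.2f} s")  # -> 3.05 s
```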

CONCLUSION
The purpose of this research was to learn what designers and programmers have been doing to give systems personality and expression through text-to-speech, based on my idea that systems could learn expression patterns from text. Just as parents use emotion and expression when reading books to children, I believe artificial intelligence could learn from these expression patterns. To my surprise, I learned that this type of interaction is more complicated than I thought, because proper communication depends on many factors, such as pronunciation, expression, culture, memory, and much more. Making a system or piece of software that handles all of these aspects within seconds of interacting with people is challenging. It is therefore fair to conclude that it could take programmers some time to reach the level of Samantha from the movie "Her." It was also exciting to find how text-to-speech has been used in education, health, and language processing. I feel that I learned more about the factors of HCI and how designers work out issues with technology to improve our experiences with computerized systems. Ever since I had Siri on my computer, I have dreamed of a richer interaction with my Mac. It would be nice to one day consult my computer the same way I ask my wife for ideas on what to wear to a meeting, or even have it help me write an essay. A higher level of interaction that can stimulate our senses is essential if designers and programmers are to build an execution bridge that narrows the gap between physical systems and users' goals.

REFERENCES
Biocca, F., Harms, C., & Burgoon, J. K. (2003). Towards a more robust theory and measure of social presence: Review and suggested criteria. Presence: Teleoperators and Virtual Environments, 12, 456-480.

Calix, R. A., Javadpour, L., & Knapp, G. M. (2012). Detection of Affective States From Text and Speech for Real-Time Human–Computer Interaction. Human Factors, 54(4), 530-545. doi:10.1177/0018720811425922

Chuang, Z., & Wu, C. H. (2004). Multi-modal emotion recognition from speech and text. Computational Linguistics and Chinese Language Processing, 9(2), 45-62.

Conard-Salvo, T. (2004). Beyond disabilities: Text-to-speech software in the writing center. Computers & Writing Conference, Honolulu, Hawaii.

Delic, V., Secujski, M., Jakovljevic, N., Gnjatovic, M., & Stankovic, I. (2013). Challenges of Natural Language Communication with Machines. DAAAM International Scientific Book, 371-388. doi:10.2507/daaam.scibook.2013.19

Garrison, K. (2009). An Empirical Analysis of Using Text-to-Speech Software to Revise First-Year College Students’ Essays. Computers & Composition, 26(4), 288-301. doi:10.1016/j.compcom.2009.09.002

Gupta, R., Banville, H. J., & Falk, T. H. (2017). Multimodal Physiological Quality-of-Experience. IEEE Journal of Selected Topics in Signal Processing, 11(1), 22-36. Retrieved from http://ieeexplore.ieee.org/document/7781639/references

Nicholson, J., Takahashi, K., & Nakatsu, R. (1999). Emotion recognition in speech using neural networks. In Proceedings of the Sixth International Conference on Neural Information Processing (ICONIP '99) (Vol. 2, pp. 495-501). Perth, Australia.

Qiu, L., & Benbasat, I. (2005). An Investigation into the Effects of Text-to-Speech Voice and 3D Avatars on the Perception of Presence and Flow of Live Help in Electronic Commerce. ACM Transactions on Computer-Human Interaction (TOCHI), 12(4), 329-355.

Shadiev, R., Hwang, W., & Huang, Y. (2014). Investigating Applications of Speech-to-Text Recognition to Assist Learning in Online and Traditional Classrooms. International Journal of Humanities & Arts Computing: A Journal of Digital Humanities, 8, 179-189. doi:10.3366/ijhac.2014.0106

Tapus, A., Mataric, M., & Scassellati, B. (2007). The grand challenges in socially assistive robotics. IEEE Robotics and Automation Magazine, 14, 35–42.