TU Berlin

Quality and Usability Lab


Evaluation of Different Text Representations on Classification of Consumer Care Inquiries

LOCATION:  TEL, Room Auditorium 2 (20th floor), Ernst-Reuter-Platz 7, 10587 Berlin 

Date/Time: 04.02.2019, 17:00-17:45  

SPEAKER: Hao Li (TU Berlin)


Text classification is a vital step toward a natural language understanding system for a chatbot, yet little research is available on text classification of consumer care inquiry data, owing to limited data availability and privacy concerns. This paper explores different approaches to constructing text representations and how they affect multiclass text classification performance on consumer care inquiries. The word embedding models involved are three commonly used pre-trained models available online, and four models trained locally with different training corpora and algorithms. The word embeddings are then aggregated using three different featurization methods. The resulting text representations are fed to the same neural network classifier and evaluated on the text classification task using five-fold cross-validation. In testing, the term frequency-inverse document frequency (TF-IDF) weighted average of word embeddings from the spaCy model yielded the best performance on the text classification task, although it did not significantly outscore the baseline bag-of-words model. Increasing the size of the training corpus of a word embedding model, as well as including domain-related texts in the training corpus, helped to achieve better performance. Meanwhile, the TF-IDF weighted average of word embeddings slightly outperformed the unweighted average, while taking the extremes of word embeddings never performed on a par with either of these two methods. These findings may be valuable to later research on text classification of consumer care data.
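
The three ways of aggregating word embeddings compared in the abstract (unweighted average, TF-IDF weighted average, and extremes) can be illustrated with a minimal Python sketch. It is not the speaker's implementation: the toy embedding values, the function names, the smoothed IDF formula, the skipping of out-of-vocabulary tokens, and the reading of "extremes" as the concatenated per-dimension minimum and maximum are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy 3-dimensional embeddings; the values are made up for illustration.
EMBEDDINGS = {
    "reset": [1.0, 0.0, 2.0],
    "my": [0.0, 1.0, 0.0],
    "password": [3.0, -1.0, 1.0],
    "bill": [-1.0, 2.0, 0.0],
}
DIM = 3


def idf_weights(corpus):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_mean_embedding(tokens, weights=None):
    """Average the vectors of known tokens; unweighted when weights is None.

    Each occurrence of a token contributes once, so term frequency enters
    through repetition and the IDF enters through the weight.
    """
    total = [0.0] * DIM
    weight_sum = 0.0
    for tok in tokens:
        if tok not in EMBEDDINGS:
            continue  # out-of-vocabulary tokens are skipped (an assumption)
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        total = [t + w * x for t, x in zip(total, EMBEDDINGS[tok])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector for texts with no known tokens
    return [t / weight_sum for t in total]


def extremes_embedding(tokens):
    """Concatenate the per-dimension minimum and maximum over known tokens."""
    vecs = [EMBEDDINGS[tok] for tok in tokens if tok in EMBEDDINGS]
    if not vecs:
        return [0.0] * (2 * DIM)
    columns = list(zip(*vecs))
    return [min(c) for c in columns] + [max(c) for c in columns]


# A tiny tokenized corpus standing in for consumer care inquiries.
corpus = [["reset", "my", "password"], ["my", "bill"], ["my", "password"]]
idf = idf_weights(corpus)
inquiry = corpus[0]

mean_vec = weighted_mean_embedding(inquiry)
tfidf_vec = weighted_mean_embedding(inquiry, idf)
extremes_vec = extremes_embedding(inquiry)
print(mean_vec, tfidf_vec, extremes_vec)
```

In this sketch the word "my" occurs in every document and so receives the lowest weight, which shifts the TF-IDF weighted average toward the rarer, more topic-specific words. Any of the three resulting vectors could then serve as the fixed-length input to a classifier.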


