A CLUSTERING TECHNIQUE FOR THE VIETNAMESE WORD CATEGORIZATION

Nguyễn Minh Hiệp; Nguyễn Thị Minh Huyền; Ngô Thế Quyền; Trần Thị Phương Linh

Abstract

In natural language processing, part-of-speech (POS) tagging plays an important role because its output is the input to many other tasks (syntactic analysis, semantic analysis, and so on). One problem related to POS tagging is defining the POS set itself, which can be addressed with unsupervised machine learning methods. This paper presents an application of the DBSCAN clustering algorithm to categorize Vietnamese words from a large corpus. The features that characterize each word are defined by the context in which the word occurs in a sentence. The corpus consists of sentences automatically extracted from the online Nhan Dan newspaper.
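
The idea of clustering words by their sentence context with DBSCAN can be illustrated with a minimal sketch. The abstract does not specify the exact features, distance measure, or parameters, so everything below is an assumption made for illustration: the features are counts of the immediate left and right neighbours, the distance is cosine distance, and the names `context_vectors`, `cosine_distance`, `dbscan`, `eps`, and `min_pts` are hypothetical. The toy corpus is invented and assumes words are already segmented and separated by spaces.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Map each word to counts of its immediate left and right neighbours.

    Assumption: one left and one right neighbour per occurrence; the
    paper's actual feature definition is not given in the abstract.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vectors[tokens[i]]["L:" + tokens[i - 1]] += 1
            vectors[tokens[i]]["R:" + tokens[i + 1]] += 1
    return dict(vectors)


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (assumed metric)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(items, vectors, eps, min_pts):
    """Plain DBSCAN. Returns {item: cluster_id}; -1 marks noise."""

    def neighbours(p):
        # The eps-neighbourhood of p, including p itself.
        return [q for q in items if cosine_distance(vectors[p], vectors[q]) <= eps]

    labels = {}
    cluster = -1
    for p in items:
        if p in labels:
            continue
        region = neighbours(p)
        if len(region) < min_pts:
            labels[p] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in region if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # noise reachable from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_region = neighbours(q)
            if len(q_region) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_region)
    return labels


# Invented toy corpus, for illustration only.
corpus = ["tôi ăn cơm", "tôi ăn phở", "tôi nấu cơm", "tôi nấu phở"]
vectors = context_vectors(corpus)
labels = dbscan(sorted(vectors), vectors, eps=0.1, min_pts=2)
for word in sorted(labels):
    print(word, labels[word])
```

On this toy corpus the two verbs share identical contexts, as do the two nouns, so each pair falls into its own cluster, while the pronoun has no close neighbour and is marked as noise. A density-based method suits the task of defining a POS set because the number of clusters does not have to be fixed in advance.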

BỘ KHOA HỌC VÀ CÔNG NGHỆ - MINISTRY OF SCIENCE AND TECHNOLOGY OF VIETNAM

CỤC THÔNG TIN KHOA HỌC VÀ CÔNG NGHỆ QUỐC GIA - NATIONAL AGENCY FOR SCIENCE AND TECHNOLOGY INFORMATION