The growth of the web can be seen as an expanding public digital library collection. Online digital information extends far beyond the web and its publicly available information. Huge amounts of information are private and are of interest to local communities, such as the records of customers of a business. This information is overwhelmingly text and has its record-keeping purpose, but an automated analysis might be desirable to find patterns in the stored records. Analogous to this data mining is text mining, which also finds patterns and trends in information samples but which does so with far less structured-though with greater immediate utility for users-ingredients. This book focuses on the concepts and methods needed to expand horizons beyond structured, numeric data to automated mining of text samples. It introduces the new world of text mining and examines proven methods for various critical text-mining tasks, such as automated document indexing and information retrieval and search. New research areas are explored, such as information extraction and document summarization, that rely on evolving text-mining techniques.
Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong me- ods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured - merical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for pred- tive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modi?ed to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
Hinweis: Dieser Artikel kann nur an eine deutsche Lieferadresse ausgeliefert werden.
Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong me- ods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured - merical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for pred- tive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modi?ed to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
Hinweis: Dieser Artikel kann nur an eine deutsche Lieferadresse ausgeliefert werden.