JINR (ISSN 1916-7423) is an electronic journal, with a printed version to be negotiated with a major publisher once we have established a steady presence. The journal will bring to the fore research in Natural Language Processing and Machine Learning that uncovers interesting negative results.
- anand
Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.
- anand
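As a rough illustration of one technique named in the abstract above, here is a minimal sketch of single-linkage clustering. The function name `single_linkage`, the choice to stop once `k` clusters remain, and the caller-supplied `dist` function are all assumptions made for this example; they are not taken from the paper.

```python
from itertools import combinations


def single_linkage(points, k, dist):
    """Merge the closest pair of clusters until only k clusters remain.

    Hypothetical helper for illustration. 'dist' is any symmetric
    distance function supplied by the caller.
    """
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single linkage merges clusters in order of the shortest connecting edge.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: dist(points[e[0]], points[e[1]]),
    )
    clusters = len(points)
    for i, j in edges:
        if clusters <= k:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            clusters -= 1

    groups = {}
    for index, point in enumerate(points):
        groups.setdefault(find(index), []).append(point)
    return list(groups.values())


if __name__ == "__main__":
    data = [0, 1, 2, 10, 11, 20]
    print(single_linkage(data, 3, lambda a, b: abs(a - b)))
```

Stopping at a fixed number of clusters is only one possible stopping rule; stopping at a distance threshold is another, and the choice changes which properties the resulting clustering function has.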
Common Tag is an open tagging format developed to make content more connected, discoverable and engaging. Unlike free-text tags, Common Tags are references to unique, well-defined concepts, complete with metadata and their own URLs. With Common Tag, site owners can more easily create topic hubs, cross-promote their content, and enrich their pages with free data, images and widgets.
- anand
phpSyntaxTree is a web application that creates syntax tree graphs from phrases entered in labelled bracket notation. Graphs generated by phpSyntaxTree can be used in linguistics homework, assignments, and other documents.
- anand
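For readers unfamiliar with labelled bracket notation, the sketch below parses a phrase such as `[S [NP John] [VP [V runs]]]` into nested `(label, children)` tuples. It is written in Python for illustration only and does not reflect how phpSyntaxTree itself is implemented; the function name `parse_brackets` is hypothetical.

```python
def parse_brackets(text):
    """Parse labelled bracket notation into (label, children) tuples.

    Each child is either a nested (label, children) tuple or a word string.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError(f"expected '[' at token {pos}")
        pos += 1  # consume '['
        if pos >= len(tokens) or tokens[pos] in ("[", "]"):
            raise ValueError(f"expected a label at token {pos}")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing closing ']'")
        pos += 1  # consume ']'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected input after the closing ']'")
    return tree


if __name__ == "__main__":
    print(parse_brackets("[S [NP John] [VP [V runs]]]"))
```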
In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint.
- anand
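The abstract above does not say which deterministic streaming algorithm is used. As a sketch of the general idea, the example below applies lossy counting, one well-known deterministic algorithm for approximate frequency counts, to a stream of n-grams. The names `ngrams` and `lossy_count` and the parameter `epsilon` are assumptions made for this illustration.

```python
import math


def ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def lossy_count(stream, epsilon):
    """Approximate frequency counts with error at most epsilon * N.

    Returns (counts, N), where counts maps item -> (count, delta) and
    the true frequency lies between count and count + delta.
    """
    width = math.ceil(1 / epsilon)  # items per bucket
    counts = {}
    seen = 0
    for item in stream:
        seen += 1
        bucket = math.ceil(seen / width)
        if item in counts:
            count, delta = counts[item]
            counts[item] = (count + 1, delta)
        else:
            # delta bounds how many occurrences may have been missed earlier.
            counts[item] = (1, bucket - 1)
        if seen % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            counts = {
                key: (count, delta)
                for key, (count, delta) in counts.items()
                if count + delta > bucket
            }
    return counts, seen


if __name__ == "__main__":
    text = ("the cat sat on the mat " * 20 + "a rare phrase here").split()
    table, total = lossy_count(ngrams(text, 2), epsilon=0.05)
    for gram, (count, delta) in sorted(table.items()):
        print(gram, count, delta)
```

Memory use is governed by `epsilon`: a smaller value keeps more entries and gives tighter counts, while a larger value prunes more aggressively.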
Boosting is a general method for producing a very accurate classification rule by combining rough and moderately inaccurate "rules of thumb." While rooted in a theoretical framework of machine learning, boosting has been found to perform quite well empirically. This tutorial will introduce the boosting algorithm AdaBoost, and explain the underlying theory of boosting, including explanations that have been given as to why boosting often does not suffer from overfitting, as well as some of the myriad other theoretical points of view that have been taken on this algorithm. Some recent applications and extensions of boosting will also be described.
- anand
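To make the idea of combining "rules of thumb" concrete, here is a minimal sketch of AdaBoost for one-dimensional data, using threshold rules (decision stumps) as the weak classifiers. The function names, the restriction to a single numeric feature, and the clamping of the weighted error are assumptions made for this example rather than details from the tutorial.

```python
import math


def stump(x, threshold, polarity):
    """Weak rule: predict 'polarity' when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def train_adaboost(xs, ys, rounds):
    """Train AdaBoost on numbers xs with labels ys in {+1, -1}.

    Returns a list of (alpha, threshold, polarity) weighted stumps.
    """
    n = len(xs)
    weights = [1.0 / n] * n
    thresholds = sorted(set(xs))
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error on the current weights.
        best = None
        for threshold in thresholds:
            for polarity in (1, -1):
                error = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if stump(x, threshold, polarity) != y
                )
                if best is None or error < best[0]:
                    best = (error, threshold, polarity)
        error, threshold, polarity = best
        # Clamp so that a perfect stump does not cause division by zero.
        error = min(max(error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [
            w * math.exp(-alpha * y * stump(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Classify x by the sign of the weighted vote of all stumps."""
    score = sum(alpha * stump(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels
    model = train_adaboost(xs, ys, rounds=3)
    print([predict(model, x) for x in xs])
```

No single stump can label this small data set correctly, since the negative examples sit between two groups of positives, but the weighted vote of a few stumps can.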
Taking a New Look at Health: What are the major health issues facing Americans today? What are some of the most common conditions, and how are they related to one another? What can we do to improve our health?
- anand
This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing are considered from both a linguistic and an algorithmic perspective. The focus is on modern quantitative techniques in NLP: using large corpora and statistical models for acquisition, disambiguation, and parsing. The course also examines and constructs representative systems.
- anand