Natural Language Processing and Text Mining
Natural Language Processing and Text Mining not only discusses applications of Natural Language Processing techniques to certain Text Mining tasks, but also the converse: the use of Text Mining to assist NLP. It assembles diverse views from internationally recognized researchers and emphasizes caveats in attempts to apply Natural Language Processing to text mining. This state-of-the-art survey is a must-have for advanced students, professionals, and researchers.
[Table: token-level analysis of the sentence "Appointed commander of the Continental Army in 1775, George Washington molded a fighting force that eventually won independence from Great Britain," listing for each token its root form, POS tag, head, role, label, antecedent, and semantic attributes (e.g., Person/title, Organization/name, Numeric/date, Person/name).]
the input sequence, find the most likely output sequence (sequence of document types), and thus effectively separate the sequence of input pages by document type.

8.4.1 Sequence Model

Formally, the procedure described above can be modeled as a Markov chain. Denoting the input sequence of ordered pages pc by P = (pc_1, ..., pc_n) and the output sequence of document types by D = (d_1, ..., d_n), the probability of a specific sequence of document types D given the input sequence of pages can be approximated as

p(D|P) ≈ ∏_{j=1}^{n} p(d_j | d_{j−1}, pc_j)    (8.3)

and finally

p(D|P) ≈ ∏_{j=1}^{n} p(d_j | d_{j−1}, d_{j−2}, pc_j).    (8.4)

Instead of trying to approximate the probability p(D|P) ever more accurately by relaxing the independence assumptions, one can also describe the pages in more detail by breaking up the document types based on the page position within a document. Functionally, this is achieved by altering the output language. In the extreme, this would lead to a model of the data in which the symbols of the output
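Finding the most likely sequence of document types under the first-order model of Eq. (8.3) is a standard dynamic-programming problem. Below is a minimal sketch using Viterbi decoding; the function `toy_prob` and its probability table are hypothetical stand-ins for a trained conditional model p(d_j | d_{j−1}, pc_j), not part of the original chapter.

```python
from math import log

def viterbi(pages, doc_types, cond_prob):
    """Decode the most likely document-type sequence for `pages`.

    pages: list of observed page features pc_1 .. pc_n
    doc_types: the output alphabet of document types
    cond_prob(d, d_prev, page): p(d_j | d_{j-1}, pc_j); d_prev is None at j=1
    """
    # best[d] = (log-probability of best path ending in d, that path)
    best = {d: (log(cond_prob(d, None, pages[0])), [d]) for d in doc_types}
    for page in pages[1:]:
        new_best = {}
        for d in doc_types:
            score, path = max(
                (prev_score + log(cond_prob(d, d_prev, page)), prev_path)
                for d_prev, (prev_score, prev_path) in best.items()
            )
            new_best[d] = (score, path + [d])
        best = new_best
    return max(best.values())[1]

# Hypothetical conditional probabilities: document types tend to persist
# across consecutive pages. Page features are ignored in this toy model.
def toy_prob(d, d_prev, page):
    table = {
        (None, "letter"): 0.6, (None, "invoice"): 0.4,
        ("letter", "letter"): 0.7, ("letter", "invoice"): 0.3,
        ("invoice", "letter"): 0.2, ("invoice", "invoice"): 0.8,
    }
    return table[(d_prev, d)]
```

With these toy probabilities, a three-page input decodes to three "letter" pages, since the initial preference for "letter" and its strong self-transition outweigh the alternatives.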
two, and the odds ratio does not perform as well as the other TFFVs. Classic global schemes, i.e., TFIDF, 'ltc', and the normalized 'ltc' form, do not work well for either MCV1 or Reuters-21578. A close look at their performance reveals that classifiers built for minor categories, e.g., composites manufacturing, electronics manufacturing, and so on, do not produce satisfactory results, and this largely depresses the overall performance. Among the TFFVs, odds ratio does not work very well. This is a
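For reference, the 'ltc' scheme mentioned above follows SMART notation: logarithmic term frequency (l), inverse document frequency (t), and cosine normalization (c). A minimal sketch, with illustrative terms and counts rather than data from the experiments:

```python
import math

def ltc_weights(doc_tf, doc_freq, n_docs):
    """'ltc' TF-IDF weights for one document.

    doc_tf: {term: raw term frequency in the document}
    doc_freq: {term: number of documents containing the term}
    n_docs: total number of documents N in the collection
    """
    # l-component: 1 + log(tf); t-component: log(N / df)
    raw = {
        t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
        for t, tf in doc_tf.items() if tf > 0
    }
    # c-component: cosine-normalize so the document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}
```

A rare term (low document frequency) thus receives a much higher weight than a common one with the same raw count, and normalization removes the effect of document length.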