The official release of ICTCLAS is a powerful Chinese word segmentation system. The latest version supports Chinese word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and user dictionaries, which help users analyze and study Chinese at the lexical level. ICTCLAS also offers a choice of part-of-speech tagging standards, keyword extraction, and an extensible interface to meet the needs of different users.
Introduction to ICTCLAS software
Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main functions are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it also supports user dictionaries. The system was refined over five years, during which the kernel was upgraded seven times, and it has now reached ICTCLAS2009. This release extends the user dictionary interface: users can dynamically add and delete words in the user dictionary to adjust the segmentation results, which makes the user dictionary more flexible to use.
Since 2009, the ICTCLAS lexical analysis system has been called the NLPIR word segmentation system, both to distinguish it from earlier work and to promote the NLPIR natural language processing and information retrieval sharing platform. Dr. Zhang Huaping has worked on the system for more than ten years and upgraded the kernel more than ten times. The work won first prize in the Qian Weichang Chinese Information Processing Science and Technology Award in 2010, first place overall in the international SIGHAN word segmentation competition in 2003, and first place overall in the domestic 973 evaluation in 2002. The number of users worldwide has exceeded 300,000, including enterprises such as China Mobile, Huawei, China Sou, 3721, NEC, China Business Network, Silicon Valley Power, and Yunnan Daily, and institutions such as Tsinghua University, Xinjiang University, South China Institute of Technology, and the University of Massachusetts. ICTCLAS has also been covered by many media outlets, including Science Times, People's Daily Overseas Edition, and Science and Technology Daily. You can search on Google to learn more about applications of ICTCLAS.
ICTCLAS software functions
1. Fingerprint extraction
The system analyzes an article's content, structure, and the relationships between its words to derive a semantic fingerprint that represents the article, expressed as a numerical sequence.
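How ICTCLAS computes its fingerprint is not described here. As a rough illustration of the general idea (reducing a segmented article to a compact number that can be compared with others), the following sketch uses SimHash, a different and well-known technique that looks only at the words and ignores document structure. The function names are hypothetical and are not part of the ICTCLAS API.

```python
import hashlib

BITS = 64  # width of the fingerprint in bits


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint for a list of word tokens."""
    weights = [0] * BITS
    for token in tokens:
        # Stable 64-bit hash of the token (md5 is used only for determinism).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # a bit is set when most tokens voted for it
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two articles whose fingerprints differ in only a few bits share most of their vocabulary, so the Hamming distance can serve as a cheap near-duplicate check.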
2. Adjustable word segmentation granularity
You can control the granularity of word segmentation results. The shared version provides two word segmentation granularities, standard granularity and coarse granularity, to meet the needs of different users.
3. User dictionary interface extension
Users can dynamically add and delete words in the user dictionary to adjust the segmentation results, which makes the user dictionary more flexible to use.
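To show why adding or deleting a dictionary word changes the segmentation, here is a toy segmenter based on forward maximum matching. It is only a sketch under simplifying assumptions: ICTCLAS itself uses a statistical model rather than this greedy rule, and the class and method names below are hypothetical, not the ICTCLAS interface.

```python
class ToySegmenter:
    """Forward maximum matching over a mutable word set (illustrative only)."""

    def __init__(self, words=()):
        self.words = set(words)

    def add_word(self, word):
        self.words.add(word)

    def delete_word(self, word):
        self.words.discard(word)

    def segment(self, text):
        max_len = max((len(w) for w in self.words), default=1)
        result, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in self.words:
                    result.append(piece)
                    i += size
                    break
        return result


seg = ToySegmenter(["中文", "分词"])
before = seg.segment("中文分词系统")   # "系统" is unknown, so it is split
seg.add_word("系统")
after = seg.segment("中文分词系统")    # "系统" is now kept as one word
```

Adding a longer entry such as "分词系统" would make the same text come out as coarser units, which is also one simple way to think about segmentation granularity.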
4. Enhanced part-of-speech tagging function
Several tag sets are available. The system supports the Institute of Computing Technology first-level tag set, the Institute of Computing Technology second-level tag set, the Peking University first-level tag set, and the Peking University second-level tag set.
5. Keyword extraction
The system automatically extracts several words or phrases that best represent the topic of a document. Keyword extraction is widely used in intelligent text processing tasks such as information retrieval, text classification and clustering, information filtering, and document summarization.
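The page does not say which ranking method ICTCLAS uses for keywords. One common baseline is TF-IDF, sketched below on already segmented text: a word scores highly when it is frequent in the document but rare in the rest of the corpus. The function name and the smoothing formula are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Rank the tokens of one document by TF-IDF against a small corpus.

    `corpus` is a list of token lists, one per document.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many documents contain the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


corpus = [
    ["分词", "系统", "的", "分词"],
    ["的", "天气"],
    ["的", "新闻", "系统"],
]
keywords = tfidf_keywords(corpus[0], corpus, top_k=2)
```

In this small corpus the function word "的" appears in every document, so it ranks below the content words.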
6. New word discovery and adaptive word segmentation function
New words and expressions are discovered automatically from longer texts based on information cross-entropy, and the system adapts to the language probability distribution of the corpus being processed in order to achieve adaptive word segmentation.
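The cross-entropy method mentioned above is not specified in detail. As a stand-in, the sketch below uses a simpler heuristic that is often used for new word discovery: a character string is a plausible word if it recurs and its left and right neighbors vary a lot (high neighbor entropy). The function names and thresholds are hypothetical choices for this example.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(counter):
    """Shannon entropy (in bits) of a neighbor-character distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def candidate_new_words(text, known_words, n=2, min_count=2, min_entropy=1.0):
    """Return n-character strings that look like words missing from the dictionary."""
    counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] += 1
        # "^" and "$" stand for the start and end of the text.
        left[gram][text[i - 1] if i > 0 else "^"] += 1
        right[gram][text[i + n] if i + n < len(text) else "$"] += 1
    found = []
    for gram, count in counts.items():
        if gram in known_words or count < min_count:
            continue
        freedom = min(neighbor_entropy(left[gram]), neighbor_entropy(right[gram]))
        if freedom >= min_entropy:
            found.append(gram)
    return sorted(found)


text = "我发微博了。他刷微博呢，她用微博吗"
new_words = candidate_new_words(text, known_words=set())
```

Candidates found this way could then be added to the user dictionary so that later segmentation keeps them as single words.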
ICTCLAS software advantages
1. Publicly evaluated by domestic and international authorities and recognized by 30,000 customers
For commercial reasons, some companies test their systems behind closed doors and claim an accuracy of 99.50% without describing the test environment or test method. In a closed test or a small-scale open test, even 100% accuracy would not be surprising. ICTCLAS 1.0 took first place in the evaluation organized by the domestic 973 expert group, and ICTCLAS 2.0 took several first places in the evaluation organized by SIGHAN, the leading international research body for Chinese language processing. For details, see the system evaluation section. These results come from large-scale, on-site open tests run by authoritative organizations, so they are genuine and credible.
ICTCLAS has issued more than 30,000 licenses to enterprises and academic institutions in China and abroad, including enterprises such as 3721, NEC, China Business Network, Silicon Valley Power, and Yunnan Daily, and institutions such as Xinjiang University, Tsinghua University, South China Institute of Technology, and the University of Massachusetts. ICTCLAS has also been covered by many media outlets, including Science Times, People's Daily Overseas Edition, and Science and Technology Daily. You can search on Google to learn more about applications of ICTCLAS.
2. Best overall performance
Whether a word segmentation system is usable in practice depends mainly on two factors: segmentation accuracy and analysis speed. The two constrain each other and are hard to balance, so most systems end up "fast but not accurate, or accurate but not fast". We developed the PDAT large-scale knowledge base management technology, which achieves a major breakthrough in combining high speed with high accuracy. It can manage dictionary knowledge bases with millions of entries, a single machine can query 1 million entries per second, and memory consumption is less than 1.5 times the size of the knowledge base. Based on this technology, ICTCLAS 3.0 reaches a segmentation speed of 996 KB/s on a single machine with a segmentation accuracy of 98.45%. The API is no larger than 200 KB, and all dictionary data takes less than 3 MB after compression. It is currently the best Chinese lexical analyzer in the world.
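The page does not say how the 98.45% accuracy figure was measured. Segmentation evaluations such as the SIGHAN bakeoff score a system by comparing its output words with a hand-segmented reference and reporting precision, recall, and F1. The sketch below shows that calculation; the function names are chosen for this example.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_prf(gold, predicted):
    """Precision, recall, and F1 of a predicted segmentation against a reference."""
    g, p = to_spans(gold), to_spans(predicted)
    correct = len(g & p)  # words with identical boundaries in both
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["中文", "分词", "系统"]
predicted = ["中文", "分词", "系", "统"]
precision, recall, f1 = segmentation_prf(gold, predicted)
```

Here two of the four predicted words match the reference, so precision is 2/4, recall is 2/3, and F1 is 4/7.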
3. Unified theoretical framework for computational linguistics
Chinese lexical analysis involves several tasks, including word segmentation, unknown word recognition, part-of-speech tagging, and the handling of language-specific special cases. Most systems lack a unified approach and combine loosely coupled modules, so the resulting model cannot accurately and effectively capture the wide variety of lexical phenomena. ICTCLAS uses a Hierarchical Hidden Markov Model, which brings all aspects of Chinese lexical analysis into one complete theoretical framework and achieves the best overall results. The related theoretical research has been published at top international conferences and in journals, confirming the strength of the model in both theory and practice.
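To give a feel for how a hidden Markov model segments text, the sketch below runs Viterbi decoding over the four character tags B (word begin), M (middle), E (end), and S (single-character word). It is a flat, first-order simplification with made-up toy probabilities, not the hierarchical model that ICTCLAS uses, and all names are hypothetical.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")


def _log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi(chars, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely B/M/E/S tag string for `chars` under a first-order HMM."""
    if not chars:
        return ""
    # score[s] = best log-probability of any tag path ending in state s
    score = {s: _log(start_p.get(s, 0)) + _log(emit_p[s].get(chars[0], floor))
             for s in STATES}
    back = []
    for ch in chars[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + _log(trans_p[r].get(s, 0)))
            new_score[s] = (score[prev] + _log(trans_p[prev].get(s, 0))
                            + _log(emit_p[s].get(ch, floor)))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # A sentence can only end on E (word end) or S (single-character word).
    tags = [max("ES", key=lambda s: score[s])]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return "".join(reversed(tags))


def tags_to_words(chars, tags):
    """Cut the character string after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Toy parameters, invented for this example only.
start_p = {"B": 0.6, "S": 0.4}
trans_p = {"B": {"M": 0.2, "E": 0.8}, "M": {"M": 0.3, "E": 0.7},
           "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}
emit_p = {"B": {"中": 0.5, "分": 0.5}, "M": {},
          "E": {"文": 0.5, "词": 0.5}, "S": {"的": 1.0}}

tags = viterbi("中文的分词", start_p, trans_p, emit_p)
words = tags_to_words("中文的分词", tags)
```

In a real system the probabilities are estimated from a tagged corpus, and a hierarchical model adds further layers for tasks such as named entity recognition and part-of-speech tagging on top of this kind of decoding.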
4. Comprehensive support for application development in various environments
ICTCLAS is written entirely in C/C++, runs on Linux, FreeBSD, and the Windows family of operating systems, and can be called from mainstream development languages such as C, C++, C#, Delphi, and Java.
5. Customizable to your needs
All functional modules can be taken apart and recombined. ICTCLAS comes in GB2312 and BIG5 versions, which handle Simplified and Traditional Chinese respectively. It supports the widely recognized word segmentation and part-of-speech standards, including the Institute of Computing Technology tag set ICTPOS3.0, the Peking University standard, the University of Pennsylvania standard, the National Language Commission standard, the Taiwan "Academia Sinica" standard, and the Hong Kong "City University" standard. Users can choose the part-of-speech standard for the output, define the output format, and assemble a word segmentation system tailored to their own needs.
ICTCLAS update log
1. Optimized some functions
2. Fixed many bugs