Development of environmental management lexicon based on new word discovery and its empirical application
Abstract: As the number of environmental policies and regulations in China keeps growing, collating, summarizing, analyzing and interpreting them by purely manual means has become increasingly difficult. Using computer technologies such as text mining to support information extraction, content analysis and intelligent management of environmental policies and regulations is therefore of great significance. Accurate word segmentation (tokenization) is a prerequisite for every text mining function. To improve the segmentation of policy texts, the environmental policies and regulations published on the official websites of China's ecological and environmental departments at all levels were collected as a corpus, and an environmental management professional lexicon was built by combining a new word discovery algorithm with manual supplementation and correction. The empirical results showed that adding the lexicon raised the segmentation accuracy of policy texts from 72.6% to 94.1% and reduced the misjudgment rate of automatic policy text classification based on a support vector machine model by 22.7%. In addition, the word frequency statistics and keyword extraction results obtained after adding the lexicon provided more comprehensive and more timely statistical information for environmental policy analysis.
Key words:
- new word discovery
- environmental policy
- lexicon
- text mining
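The abstract does not say which new word discovery algorithm was used, so the following is only a minimal sketch of one common family of approaches: scoring character n-grams by internal cohesion (pointwise mutual information) and boundary freedom (left and right adjacency entropy). The function name `discover_new_words`, the thresholds, and the sample phrases in `DEMO_TEXTS` are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample phrases, invented for illustration only.
DEMO_TEXTS = [
    "加强排污许可管理",
    "完善排污许可制度",
    "落实排污许可要求",
    "推进排污许可改革",
]


def _entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def discover_new_words(texts, max_len=4, min_count=2,
                       min_pmi=1.0, min_entropy=0.5):
    """Return {candidate: (count, pmi, freedom)} for n-grams passing all thresholds.

    Thresholds are assumed values; a real corpus would need tuning.
    """
    counts = Counter()
    left = defaultdict(Counter)   # characters seen just before each n-gram
    right = defaultdict(Counter)  # characters seen just after each n-gram
    total_chars = 0

    for text in texts:
        total_chars += len(text)
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                if n >= 2:
                    if i > 0:
                        left[gram][text[i - 1]] += 1
                    if i + n < len(text):
                        right[gram][text[i + n]] += 1

    def prob(gram):
        return counts[gram] / total_chars

    results = {}
    for gram, count in list(counts.items()):
        if len(gram) < 2 or count < min_count:
            continue
        # Cohesion: PMI of the weakest binary split of the candidate.
        pmi = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Boundary freedom: the smaller of left and right adjacency entropy.
        freedom = min(_entropy(left[gram]), _entropy(right[gram]))
        if pmi >= min_pmi and freedom >= min_entropy:
            results[gram] = (count, pmi, freedom)
    return results


if __name__ == "__main__":
    for word, (count, pmi, freedom) in discover_new_words(DEMO_TEXTS).items():
        print(word, count, round(pmi, 3), round(freedom, 3))
```

On the toy phrases, fragments such as "排污" or "许可" always have the same neighbor on one side, so their adjacency entropy is zero and they are rejected, while "排污许可" has varied neighbors on both sides and is kept. Candidates that survive such a filter would still need the manual review the abstract describes before being added to a segmenter's user dictionary.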