Chinese Word Segmentation Source Code in C++ (6 pages).doc

Uploaded by: 飞****2

Document ID: 13864573

Upload date: 2022-05-01

Format: DOC

Pages: 6

Size: 25 KB


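// Header names were stripped in extraction; these are inferred from the code's
// use of cout, ifstream/ofstream, string, vector, and system().
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
using namespace std;

const int START1 = 0XB0, START2 = 0XA1, END1 = 0XF8, END2 = 0XFF;
const int MAXWORDLEN = 48;
ifstream fin("segdict.txt");
ofstream out("out1.txt");

//---------- Tree construction ----------
struct Node3
{
    string S;
    bool IsWord;
    Node3 *L, *R;
    Node3(string s = "", bool isWord = 0, Node3 *l = 0, Node3 *r = 0)
        : S(s), IsWord(isWord), L(l), R(r) {}
};

struct Node2
{
    string S;
    bool IsWord;
    Node3 *Child;
    Node2(string s = "", bool isWord = 0, Node3 *child = 0)
        : S(s), IsWord(isWord), Child(child) {}
};

struct Node
{
    string S;
    vector<Node2> v;
};

vector<Node> Dic;
int HASH[END1 - START1][END2 - START2];

void Begin()  // initialization
{
    for (int i = 0; i < END1 - START1; i++)
        for (int j = 0; j < ...
    // [Text lost in extraction: the rest of Begin() and the opening of
    //  BuildTree(), which takes a string s and a Node3 pointer and evidently
    //  defines len, t and LAST. The surviving tail of BuildTree() follows.]
    //     ... ->L != 0) LAST = LAST->L;
    if (LAST->S != t)
    {
        LAST->L = new Node3(t, (len == 2), 0, 0);
        LAST = LAST->L;
    }
    if (len > 2) BuildTree(s.substr(2, MAXWORDLEN), LAST->R);
}

void Dictionary()  // build the whole structure
{
    Begin();
    string s;
    int N, k = 0;
    while (fin >> s)
    {
        Node n;
        n.S = s.substr(0, 2);
        int m1 = (unsigned char)s[0] - START1;
        int m2 = (unsigned char)s[1] - START2;
        HASH[m1][m2] = k++;
        out << s << HASH[m1][m2];  // [output separators lost in extraction]
        fin >> N;                  // inferred: N must be read before use
        out << N << endl;
        for (int i = 0; i < N; i++)
        {
            fin >> s;
            out << s;
            // [Text lost in extraction. It evidently defined Len, t and SIZE
            //  and opened a condition that ends as follows.]
            //     ... > 0 && n.v[SIZE - 1].S != t)
            n.v.push_back(Node2(t, (Len == 4), 0));
            SIZE = n.v.size();
            if (Len > 4) BuildTree(s.substr(4, MAXWORDLEN), n.v[SIZE - 1].Child);
        }
        Dic.push_back(n);
    }
    out << "END HASH" << endl << endl;
}

//---------- Query ----------
vector<string> Dest;

int BinarySearch(int x, string Sec)  // binary search for the second character
{
    int L = 0, R = Dic[x].v.size() - 1;
    // The loop condition, the midpoint, and the two branch bodies were damaged
    // in extraction and are restored as a standard binary search; the -1
    // result matches the B2 >= 0 test in Segment().
    while (L <= R)
    {
        int mid = (L + R) >> 1;
        if (Dic[x].v[mid].S == Sec) return mid;
        else if (Dic[x].v[mid].S < Sec) L = mid + 1;
        else R = mid - 1;
    }
    return -1;
}

// [Text lost in extraction: the header of the third-level sequential search,
//  which works on a Node3 *p and a string cc. Its surviving body:]
//     ... if (p->S == cc) return p; else p = p->L; ... return 0;

unsigned CharToInt(char c)
{
    return unsigned((unsigned char)c);
}

bool IsCC(char c)  // lead byte of a Chinese character
{
    unsigned val = CharToInt(c);
    return val >= START1 && val < END1;
}

bool IsEC(char c)  // single-byte (ASCII) character
{
    unsigned val = CharToInt(c);
    return val < 0x80;
}

void FindNum(string src, vector<string> &dest, int &StarPos, int &EndPos)
{
    int Strlen = src.length();
    while (EndPos < ...
    // [Text lost in extraction: the scan that advances EndPos, and the start
    //  of the condition that ends in "> StarPos)".]
    {
        dest.push_back(src.substr(StarPos, EndPos - StarPos));
        StarPos = EndPos;
    }
}

void Segment(string src, vector<string> &dest)
{
    int StrLen = src.length();
    int StartPos = 0, EndPos;
    while (StartPos < ...
    // [Text lost in extraction, ending in "... >= StrLen) return;"]
    {
        unsigned SegLen = 2;
        string HeadCC = src.substr(StartPos, 2);
        cout << HeadCC << endl;
        // [Text lost in extraction: evidently the definition of HeadIndex,
        //  ending in "... >= 0);"]
        string SecCC = src.substr(StartPos + 2, 2);
        if (SecCC.length() > 0 && IsCC(SecCC[0]))
        {
            int B2 = BinarySearch(HeadIndex, SecCC);
            if (B2 >= 0)
            {
                if (Dic[HeadIndex].v[B2].IsWord) SegLen += 2;
                EndPos = StartPos + 4;
                Node3 *p = Dic[HeadIndex].v[B2].Child;
                while (EndPos < ...
                // [Text lost in extraction: the third-level lookup, ending in
                //  "... ->IsWord) SegLen = EndPos - StartPos;"]
                p = p->R;
            }
        }
        dest.push_back(src.substr(StartPos, SegLen));
        StartPos += SegLen;
    }
}

int main()
{
    Dictionary();
    ofstream out2("out2.txt");
    // Sample inputs stay in Chinese because they are data for the segmenter.
    // (A short passage of literary prose.)
    // string SS = "有时，我会抬头，看一看这喧嚣的人群，有没有我想见得身影，若是有那身影，或许我会看着她，看她慢慢的融入人群，直到不见。然后我会低下头，走着我的道。";
    // ("Long live the Chinese people.")
    // string SS = "中华人民万岁";
    // ("The code is basically correct and implements the two segmentation
    //  strategies mentioned in the program design; the results are as expected.")
    string SS = "程序编码基本正确，实现了程序设计中提到的两种分词策略，分词结果就在预料之中。";
    // ("For a given first character, few dictionary entries share their first
    //  two characters, and fewer still share their first three. With the
    //  dictionary organized this way, every node outside the first level of the
    //  sub-table has few siblings, so sequential search suits them.")
    // string SS = "在词典中对于特定的首字，前两字相同的词条很少，前三字相同的词条更少。当我们以这种形式组织词典后，除子表的第一层外，各个节点的兄弟数目都很小，对它们的查找采用顺序查找方法较为适宜。";
    // ("Two main modules: tree construction and query. The tree has three
    //  levels: a hash table, an array used for binary search, and a binary tree.
    //  A query looks up the first-level hash table directly, binary-searches the
    //  second level (the text cites an average figure of 26 for that level),
    //  searches the third level sequentially, and also picks out digits and
    //  Chinese punctuation in the sentence.")
    // string SS = "主要分为两大模块：一个建立一棵树，一个是查询。建树有三个层次，第一层是HASH表，第二层是数组，用于二分查找使用，第三层是二叉树。查询分为直接查询第一层的HASH表，第二层用二分查找（第二层汉子相同的平均概率是26，一般第二字成词切相同），第三层直接顺序查找，以及查找句子中的数字和汉子标点。";
    Segment(SS, Dest);
    int LEN = Dest.size();
    for (int i = 0; i < LEN; i++)
        out2 << Dest[i] << endl;
    system("Pause");
    return 0;
}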
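The sample strings in the listing describe a three-level dictionary: a hash table on the first character, a sorted array on the second that is binary-searched, and a tree on the remaining characters that is searched sequentially. Because parts of the listing did not survive, the following is only a minimal sketch of that idea, not the author's code. It makes several assumptions of its own: it works on UTF-32 code points (std::u32string) instead of two-byte GB2312 pairs, it uses std::unordered_map in place of the fixed HASH table, and the names TailNode, SecondEntry, SketchDict, longestMatch and segment are hypothetical. The segmentation loop is a greedy longest match, which is what the surviving Segment fragment appears to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Level 3 node (hypothetical name). Mirrors the listing's Node3:
// sibling plays the role of L, child plays the role of R.
struct TailNode {
    char32_t ch;
    bool isWord = false;
    std::unique_ptr<TailNode> sibling;
    std::unique_ptr<TailNode> child;
    explicit TailNode(char32_t c) : ch(c) {}
};

// Level 2 entry (hypothetical name): one per distinct second character.
struct SecondEntry {
    char32_t ch;
    bool isWord = false;
    std::unique_ptr<TailNode> child;
    explicit SecondEntry(char32_t c) : ch(c) {}
};

inline bool entryLess(const SecondEntry& e, char32_t c) { return e.ch < c; }

struct SketchDict {
    // Level 1: first character -> second-level entries kept sorted by ch.
    // Assumption: a hash map stands in for the listing's HASH[][] table.
    std::unordered_map<char32_t, std::vector<SecondEntry>> heads;

    void add(const std::u32string& w) {
        if (w.size() < 2) return;  // single characters need no entry
        auto& row = heads[w[0]];
        auto it = std::lower_bound(row.begin(), row.end(), w[1], entryLess);
        if (it == row.end() || it->ch != w[1])
            it = row.insert(it, SecondEntry(w[1]));
        if (w.size() == 2) { it->isWord = true; return; }
        std::unique_ptr<TailNode>* slot = &it->child;
        for (std::size_t i = 2; i < w.size(); ++i) {
            // Sequential search along the sibling chain at this depth.
            while (*slot && (*slot)->ch != w[i]) slot = &(*slot)->sibling;
            if (!*slot) slot->reset(new TailNode(w[i]));
            if (i + 1 == w.size()) (*slot)->isWord = true;
            slot = &(*slot)->child;
        }
    }

    // Length of the longest dictionary word starting at text[pos],
    // or 1 when no word of two or more characters matches.
    std::size_t longestMatch(const std::u32string& text, std::size_t pos) const {
        assert(pos < text.size());
        std::size_t best = 1;
        auto h = heads.find(text[pos]);                       // level 1: hash
        if (h == heads.end() || pos + 1 >= text.size()) return best;
        const auto& row = h->second;
        auto it = std::lower_bound(row.begin(), row.end(),    // level 2: binary search
                                   text[pos + 1], entryLess);
        if (it == row.end() || it->ch != text[pos + 1]) return best;
        if (it->isWord) best = 2;
        const TailNode* p = it->child.get();
        for (std::size_t i = pos + 2; i < text.size() && p; ++i) {
            while (p && p->ch != text[i]) p = p->sibling.get();  // level 3: sequential
            if (!p) break;
            if (p->isWord) best = i - pos + 1;
            p = p->child.get();
        }
        return best;
    }
};

// Greedy longest-match segmentation from left to right.
std::vector<std::u32string> segment(const SketchDict& d, const std::u32string& text) {
    std::vector<std::u32string> out;
    for (std::size_t pos = 0; pos < text.size();) {
        std::size_t len = d.longestMatch(text, pos);
        out.push_back(text.substr(pos, len));
        pos += len;
    }
    return out;
}
```

To try it, add a few words with d.add(U"...") and pass a U"..." literal to segment(); the source file must be compiled as UTF-8 for the literals to decode correctly. Characters that start no dictionary word come out as single-character tokens, so digits and punctuation would still need the separate handling that the listing's FindNum hints at.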

About this document

Title: Chinese Word Segmentation Source Code in C++ (6 pages).doc
Link: https://www.taowenge.com/p-13864573.html