Application Research of Computers, Vol. 29 No. 3, Mar. 2012
CLC number: TP391; Document code: A; Article ID: 1001-3695(2012)03-0833-04; doi: 10.3969/j.issn.1001-3695.2012.03.008
Chinese text similarity method research by combining semantic analysis with statistics
HUA Xiu-li1,2, ZHU Qiao-ming2, LI Pei-feng2
(1. School of Computer Science & Technology, Soochow University, Suzhou Jiangsu 215006, China; 2. Provincial Key Laboratory of Computer Information Processing Technology of Jiangsu, Suzhou Jiangsu 215006, China)
Abstract: Statistical text similarity measures use TF-IDF to model text documents as term frequency vectors and compute the similarity between documents with cosine similarity. Because this approach ignores the semantic information in the documents, the resulting similarity values can be inaccurate. Semantics-based methods make up for this drawback, but they need a knowledge base to construct the relationships between words. After studying the advantages and disadvantages of both kinds of methods, this paper presents a novel text similarity method that first pre-processes the text, then chooses the terms with higher TF-IDF values as feature items, next combines a semantic dictionary with TF-IDF to compute text similarity, and finally uses several K-means clustering methods to evaluate the new similarity measure. Experimental results show that the method achieves a higher F-measure than the others, which indicates that the proposed method is effective.
Key words: vector space model; semantic analysis; term frequency; probability distribution; text similarity
0 Introduction
[ 1 ] [ 2 ] [ 3 ] [4]
Jaccard
: ; ;
[56] HowNet[7] [ 8 ] [9] WordNet[10]
Received: 2011-08-23; Revised: 2011-10-15. Funding: (60970056, 61070123, 61003155); (BK2008160); (20093201110006)
Authors: (1986-) (huaxiuliemail@126.com); (1963-); (1970-).
1
1.1 Vector space model (VSM)
TF-IDF TF-IDF :
a)
b) [11] 100 A 50 B 5 B A
TF-IDF :
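$$TFIDF(i) = tf(i) \times idf(i) = tf_j(i) \times \log\frac{N}{df(i)} \tag{1}$$
where TFIDF(i) is the TF-IDF weight of term i; tf_j(i) is the frequency of term i in document j; N is the total number of documents; and df(i) is the number of documents that contain term i.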
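As an illustration of Eq. (1), the following minimal sketch computes TF-IDF weights for a small tokenized corpus. The function name `tf_idf`, the toy corpus, the use of raw counts for tf, and the natural logarithm are assumptions made for this example; the paper does not state the logarithm base or any normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, following Eq. (1)."""
    n_docs = len(docs)
    # df(i): number of documents that contain term i
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_j(i): raw count of term i in document j
        # Natural logarithm is an assumption; the base is not given in the text.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already segmented into terms
docs = [["text", "similarity", "text"],
        ["text", "cluster"],
        ["semantic", "similarity"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight 0 under this formula, which is why very common terms contribute nothing to the vector.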
1.2
WordNet [10] HowNet [7] HowNet [7]
[1] HowNet HowNet
:
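$$sim(S_1, S_2) = \frac{\alpha}{\alpha + dist(S_1, S_2)} \tag{2}$$
where S_1 and S_2 are two sememes; dist(S_1, S_2) is the path length between S_1 and S_2; and \alpha is the distance at which the similarity equals 0.5.
Taking the depth of the sememes in the HowNet hierarchy into account, Eq. (2) is refined as:
$$sim(S_1, S_2) = \frac{depth(S_1) + depth(S_2)}{(depth(S_1) + depth(S_2)) + dist(S_1, S_2)} \tag{3}$$
where depth(S) is the depth of sememe S in the hierarchy.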
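To make Eq. (3) concrete, here is a small sketch on a hand-made tree. The hierarchy `PARENT`, the helper names, the convention that the root has depth 0, and the value returned when both arguments are the root are all assumptions for illustration; they are not taken from HowNet or from the paper.

```python
# Hypothetical toy hierarchy: child -> parent (the root has parent None).
PARENT = {"entity": None, "thing": "entity", "animal": "thing",
          "plant": "thing", "dog": "animal", "cat": "animal"}

def path_to_root(node):
    """List of nodes from the given node up to the root, inclusive."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Number of edges from the root to the node (root depth assumed 0)."""
    return len(path_to_root(node)) - 1

def dist(a, b):
    """Path length between two nodes through their lowest common ancestor."""
    steps_from_a = {n: steps for steps, n in enumerate(path_to_root(a))}
    for steps_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_b
    raise ValueError("nodes are not in the same tree")

def sememe_sim(a, b):
    """Eq. (3): (depth(a) + depth(b)) / (depth(a) + depth(b) + dist(a, b))."""
    d = depth(a) + depth(b)
    total = d + dist(a, b)
    # Root compared with itself gives 0/0; returning 1.0 is an assumption.
    return 1.0 if total == 0 else d / total
```

With this tree, two deep siblings such as `dog` and `cat` score higher than `dog` and `plant`, because the same path length counts for less when the nodes sit deeper in the hierarchy.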
HowNet
HowNet
2
2.1
ICTCLAS (http://ictclas.org/):
a) PER LOC ORG
b)
c)
2.2
TF-IDF TF-IDF ?
TF-IDFTOPP(P ) TF-IDF 1 P
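The feature selection step, which keeps the terms with the highest TF-IDF weights as feature items, can be sketched as follows. The function name `top_features`, the integer `percent` parameter, rounding the cut-off up, and breaking ties alphabetically are assumptions for this example.

```python
import math

def top_features(weights, percent):
    """Keep the top `percent` % of terms, ranked by TF-IDF weight."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Rounding up is an assumption; the text does not say how to round.
    keep = math.ceil(len(weights) * percent / 100)
    # Sort by descending weight, then by term so that ties are deterministic.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])

# Hypothetical TF-IDF weights for one document
doc_weights = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3, "e": 0.7}
features = top_features(doc_weights, 60)
```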
3
Let texts i and j be represented by their feature terms, $i = (i_1, i_2, \ldots, i_m)$ and $j = (j_1, j_2, \ldots, j_n)$.
( 8)
$$cosSim(i, j) = \frac{\sum_{k=1} TFIDF(i_k)\, TFIDF(j_k)}{\sqrt{\sum_{k=1}^{m} (TFIDF(i_k))^2}\ \sqrt{\sum_{l=1}^{n} (TFIDF(j_l))^2}} \tag{5}$$
( 5) vi vj vi vj ( vi vj )
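A minimal sketch of the cosine similarity in Eq. (5), with each text stored as a `{term: TF-IDF weight}` dict. Summing the numerator over the terms shared by both texts and returning 0.0 for an all-zero vector are assumptions of this example, as is the function name `cos_sim`.

```python
import math

def cos_sim(vec_i, vec_j):
    """Cosine similarity between two {term: weight} vectors."""
    # Only terms present in both vectors contribute to the dot product.
    shared = vec_i.keys() & vec_j.keys()
    dot = sum(vec_i[t] * vec_j[t] for t in shared)
    norm_i = math.sqrt(sum(w * w for w in vec_i.values()))
    norm_j = math.sqrt(sum(w * w for w in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumption: an empty or all-zero vector matches nothing
    return dot / (norm_i * norm_j)
```

Two texts that share no terms score 0 here even if their terms are synonyms, which is the limitation the semantic component is meant to address.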
TF-IDF vecSim( i j ) TF-IDF TF-IDF ( 6) :
( 3) ; vecSim( vi vj ) vi vj
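Algorithm 1
Input: v_i, v_j. Output: the similarity of v_i and v_j.
a) For the first term i_1 of i, use Eq. (3) to compute its similarity to every term of j, and keep the largest value, max{sim(i_1, j_k)}, as the similarity between i_1 and j.
b) Repeat step a) for the remaining terms of i to obtain sim(i, j); apply steps a) and b) to j in the same way to obtain sim(j, i).
c) Combine sim(i, j) and sim(j, i) to obtain vecSim(i, j).
d) Use Eq. (1) to compute the TF-IDF weights of the terms in i and j, and Eq. (5) to compute the cosine similarity of i and j.
e) Compute the weighting factor wf of i and j with Eq. (6).
f) Compute the text similarity of i and j with Eq. (4).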
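$$textSim(i, j) = wf \times vecSim(i, j) + (1 - wf) \times cosSim(i, j) \tag{4}$$
where wf is the weighting factor of v_i and v_j, and vecSim(i, j) is the similarity between the feature vectors v_i and v_j.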
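Eq. (4) is a convex combination of the two similarity scores. The sketch below shows the arithmetic; the function name `text_sim` and the range check on `wf` are assumptions added for this example.

```python
def text_sim(vec_sim, cos_sim, wf):
    """Eq. (4): wf * vecSim + (1 - wf) * cosSim."""
    if not 0.0 <= wf <= 1.0:
        raise ValueError("wf must lie in [0, 1]")
    return wf * vec_sim + (1.0 - wf) * cos_sim
```

For instance, with vecSim = 0.8, cosSim = 0.4 and wf = 0.5 the combined score is about 0.6; wf = 0 falls back to the pure cosine score and wf = 1 to the pure vecSim score.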
4
4.1
$$wf = \frac{1}{2}\left(\frac{\sum_{k \in \Lambda_i} TFIDF(i_k)}{\sum_{k=1}^{m} TFIDF(i_k)} + \frac{\sum_{l \in \Lambda_j} TFIDF(j_l)}{\sum_{l=1}^{n} TFIDF(j_l)}\right) \tag{6}$$
500 1
Table 1
6    110    8    20    16
8    105    10   15    13
8    106    7    20    13
5    69     8    16    12
5    60     10   15    11
5    50     8    14    10
ICTCLAS ; TF-IDF TOP ; TF-IDF
[12]( SemanticSim) ( [12] WordNet HowNet SemanticSim : 5 )
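The CLUTO toolkit [13] is used for clustering, with three K-means variants: direct K-means (DKM), bisecting K-means (BKM), and agglomerative K-means (AKM).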
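where TFIDF(i_k) is the TF-IDF weight of term i_k in v_i, and the sum over $k \in \Lambda_i$ adds up the TF-IDF weights of the terms of v_i selected by $\Lambda_i$. The index sets used in Eq. (6) for i and j are:
$$\Lambda_i = \{k : 1 \le k \le m,\ \max_{1 \le l \le n}\{sim(i_k, j_l)\} \ge \theta\}$$
$$\Lambda_j = \{l : 1 \le l \le n,\ \max_{1 \le k \le m}\{sim(j_l, i_k)\} \ge \theta\} \tag{7}$$
where \theta is the similarity threshold. Each term i_k of v_i is compared with every term j_l (l = 1, 2, \ldots, n) of v_j, and likewise for the terms of v_j.
$$vecSim(v_i, v_j) = \frac{1}{2}\left(\frac{1}{m}\sum_{k=1}^{m}\max_{1 \le l \le n}\{sim(i_k, j_l)\} + \frac{1}{n}\sum_{l=1}^{n}\max_{1 \le k \le m}\{sim(j_l, i_k)\}\right) \tag{8}$$
where sim(j_l, i_k) is the similarity between terms i_k and j_l.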
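The following sketch shows one way to read Eqs. (6) to (8): each term is matched to its most similar term in the other text, Eq. (8) averages those best-match scores in both directions, and Eq. (6) measures how much TF-IDF weight sits on terms whose best match reaches a threshold. The `>=` comparison, the parameter name `threshold`, the helper names, and the lookup-table `toy_sim` (standing in for a HowNet-based word similarity) are all assumptions, since the corresponding symbols did not survive in the text.

```python
def best_matches(terms_a, terms_b, word_sim):
    """For each term in terms_a, its highest similarity to any term in terms_b."""
    return [max(word_sim(a, b) for b in terms_b) for a in terms_a]

def vec_sim(terms_i, terms_j, word_sim):
    """Eq. (8): mean of the two directional average best-match similarities."""
    best_i = best_matches(terms_i, terms_j, word_sim)
    best_j = best_matches(terms_j, terms_i, word_sim)
    return 0.5 * (sum(best_i) / len(best_i) + sum(best_j) / len(best_j))

def weighting_factor(tfidf_i, tfidf_j, word_sim, threshold):
    """Eq. (6): average share of TF-IDF weight on terms with a close match.

    Both inputs are non-empty {term: weight} dicts with a positive total weight.
    """
    def share(w_a, w_b):
        # Eq. (7), assumed form: keep terms whose best match is >= threshold.
        matched = sum(w for t, w in w_a.items()
                      if max(word_sim(t, u) for u in w_b) >= threshold)
        return matched / sum(w_a.values())
    return 0.5 * (share(tfidf_i, tfidf_j) + share(tfidf_j, tfidf_i))

def toy_sim(a, b):
    """Hypothetical word similarity used only for this example."""
    if a == b:
        return 1.0
    return 0.8 if {a, b} == {"car", "auto"} else 0.1

tfidf_i = {"car": 2.0, "road": 1.0}
tfidf_j = {"auto": 1.0, "tree": 1.0, "road": 2.0}
v = vec_sim(list(tfidf_i), list(tfidf_j), toy_sim)
wf = weighting_factor(tfidf_i, tfidf_j, toy_sim, threshold=0.7)
```

In this reading, wf grows when most of the heavily weighted terms have a close semantic counterpart in the other text, so Eq. (4) leans on the semantic score exactly when that score is well supported.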
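F-measure
For class i and cluster j, the precision P(i, j) and recall R(i, j) are
$$P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}$$
where n_j is the number of documents in cluster j, n_i is the number of documents in class i, and n_{ij} is the number of documents of class i in cluster j. The F-measure is
$$F(i, j) = \frac{2\, P(i, j)\, R(i, j)}{P(i, j) + R(i, j)}$$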
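The F-measure of one class and cluster pair can be computed directly from the three counts. This is a minimal sketch; the function name `f_measure` and the choice to return 0.0 when the class and cluster share no documents are assumptions.

```python
def f_measure(n_ij, n_i, n_j):
    """F(i, j) for class i (n_i docs) and cluster j (n_j docs) sharing n_ij docs."""
    if n_ij == 0:
        return 0.0  # assumption: no overlap means precision = recall = 0
    precision = n_ij / n_j  # P(i, j)
    recall = n_ij / n_i     # R(i, j)
    return 2 * precision * recall / (precision + recall)
```

For example, a class of 50 documents and a cluster of 80 documents that share 40 documents give P = 0.5 and R = 0.8, so F is about 0.615.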
TF-IDF
5 Conclusion
TF-IDF
References:
[1] KUMAR N. Approximate string matching algorithm[J]. International Journal on Computer Science and Engineering, 2010, 2(3): 641-644.
[2] COELHO T A S, CALADO P P, SOUZA L V, et al. Image retrieval using multiple evidence ranking[J]. IEEE Trans on Knowledge and Data Engineering, 2004, 16(4): 408-417.
[3] KO Y, PARK J, SEO J. Improving text categorization using the importance of sentences[J]. Information Processing and Management, 2004, 40(1): 65-79.
[4] THEOBALD M, SIDDHARTH J. SpotSigs: robust and efficient near duplicate detection in large Web collection[C]//Proc of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 2008: 563-570.
[5] .[C]// .2002: 59-76.
[6] . [J]. 2002, 38(7): 75-78.
[7] .[EB/OL]. (2003). http://www.keenage.com.
[8] . [C]//. : 2003: 81-88.
[9] PATWARDHAN S, BANERJEE S, PEDERSEN T. Using measures of semantic relatedness for word sense disambiguation[C]//Proc of the 4th International Conference on Intelligent Text Processing and Computational Linguistics. 2003: 301-308.
[10] MILLER G. WordNet: a lexical database for English[J]. Communications of the ACM, 1995, 38(11): 39-41.
[11] SALTON G. The SMART retrieval system-experiments in automatic document processing[M]. Upper Saddle River: Prentice-Hall, 1971: 207-214.
[12] HOTHO A, STAAB S, STUMME G. WordNet improves text document clustering[C]//Proc of SIGIR Semantic Web Workshop. New York: ACM Press, 2003: 505-514.
[13] KARYPIS G. CLUTO: a clustering toolkit[R]. Minneapolis: University of Minnesota, 2002.
4.2
1
TOP = 0 DKM 1 TOP 60% TOP ;
4.3 2
TOP 60% TOP DKM
2 [0.7, 0.75] 0.75 HowNet 0.75 F-
3
TOP 60% = 0.70
4.4
DKM
BKM, AKM 3~5
3~5 TF-IDF SemanticSim F-