Analysis of the Baidu word segmentation algorithm, part three

Three: further analysis of the Baidu word segmentation algorithm

As noted above, our analysis concluded that Baidu's word segmentation system uses bidirectional maximum matching. We later found a loophole in that reasoning, however, and the algorithm steps we derived for Baidu were still too cumbersome. So we analyze further to see whether the earlier deduction was wrong.

So what was the loophole in the previous analysis? Our evidence that Baidu's segmentation rests on reverse maximum matching was that Baidu segments "北京华烟云" as <北, 京华烟云>. That looks like reverse maximum matching, because forward maximum matching should yield <北京, 华, 烟云>. But inferring from this that Baidu uses bidirectional maximum matching was too hasty. As we also said in an earlier article, Baidu has two dictionaries, an ordinary dictionary and a proprietary dictionary. The proprietary dictionary segments first, and the remaining fragments are then handed to the ordinary dictionary. So there is another possible reason why "北京华烟云" was cut into <北, 京华烟云>: the word "京华烟云" is stored in the proprietary dictionary and is matched first. That yields "京华烟云" and leaves "北", which cannot be split any further, so the output is <北, 京华烟云>.
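The difference between the two matching directions can be sketched as follows. This is only an illustration, not Baidu's code: the dictionary `TOY_VOCAB` and the window size `max_len` are assumptions chosen so that the example above can be reproduced.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # words were collected from the end of the string
    return tokens


# Hypothetical single dictionary that contains every candidate word.
TOY_VOCAB = {"北京", "京华烟云", "烟云"}

print(forward_max_match("北京华烟云", TOY_VOCAB))  # ['北京', '华', '烟云']
print(reverse_max_match("北京华烟云", TOY_VOCAB))  # ['北', '京华烟云']
```

With one dictionary, only the reverse scan produces <北, 京华烟云>, which is why the result first looked like evidence for reverse matching.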


This is only a hypothesis. Is "京华烟云" really in the proprietary dictionary? Consider another example: for "山东北京华烟云", Baidu's segmentation result is <山东, 北, 京华烟云>. If "京华烟云" were in the ordinary dictionary, reverse matching should give <山, 东北, 京华烟云> and forward matching should give <山东, 北京, 华, 烟云>. Neither can ever produce <山东, 北, 京华烟云>. What does this tell us? It tells us that "京华烟云" is in the proprietary dictionary: "京华烟云" is cut off first, and the remaining "山东北" is handed to the ordinary dictionary, where the output <山东, 北> is clearly the result of forward maximum matching. Of course, the algorithm we derived in the first article would also segment "山东北" as <山东, 北>, but it clearly needs several more decision steps than forward maximum matching. Since the result is the same and a simpler method can do the job, the simpler method is the natural choice. So our preliminary judgment is that Baidu uses forward maximum matching.
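The two-dictionary hypothesis can be sketched as below: proprietary words are cut out first, and each leftover fragment goes through forward maximum matching against the ordinary dictionary. The function `two_stage_segment`, the contents of both dictionaries, and `max_len` are all assumptions made for illustration; nothing here is confirmed as Baidu's actual implementation.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def two_stage_segment(text, proprietary, ordinary):
    """Cut out proprietary words first, then segment the leftover fragments."""
    terms = sorted(proprietary, key=len, reverse=True)  # prefer longer terms
    tokens, start, i = [], 0, 0
    while i < len(text):
        hit = next((t for t in terms if text.startswith(t, i)), None)
        if hit is None:
            i += 1
            continue
        # Segment the fragment that precedes the proprietary word.
        tokens += forward_max_match(text[start:i], ordinary)
        tokens.append(hit)
        i += len(hit)
        start = i
    tokens += forward_max_match(text[start:], ordinary)
    return tokens


# Hypothetical dictionaries, chosen only to reproduce the example above.
PROPRIETARY = {"京华烟云"}
ORDINARY = {"山东", "东北", "北京", "烟云"}

print(two_stage_segment("山东北京华烟云", PROPRIETARY, ORDINARY))
# ['山东', '北', '京华烟云']
print(forward_max_match("山东北京华烟云", ORDINARY | PROPRIETARY))
# ['山东', '北京', '华', '烟云']
```

Under these assumptions, only the two-stage scheme reproduces the observed <山东, 北, 京华烟云>. Plain forward matching over a merged dictionary gives the four-word result instead.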


We continue testing to determine which segmentation algorithm is actually in use. To reduce the effect of the proprietary dictionary segmenting first, the query must not contain any relatively special vocabulary, so we construct the query "

