Experiments show that the proposed method effectively detects approximately duplicate records in Chinese-language data.
To address the problems in current methods for detecting approximately duplicate records, an improved method is proposed.
The method partitions the record set according to the values of the relation table's determining attribute, and then detects approximately duplicate records within each determining-attribute value class.
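As a rough illustration of partitioning before comparison, the following Python sketch groups records by one attribute and compares pairs only inside each group. The field names (`id`, `city`, `name`), the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example; the sentence above does not specify them.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicates_by_class(records, key_attr, text_attr, threshold=0.8):
    """Group records by key_attr, then compare pairs only inside each group."""
    # Partition the record set: one class per distinct value of key_attr.
    classes = defaultdict(list)
    for rec in records:
        classes[rec[key_attr]].append(rec)

    pairs = []
    for group in classes.values():
        # Pairwise comparison is restricted to records of the same class.
        for a, b in combinations(group, 2):
            ratio = SequenceMatcher(None, a[text_attr], b[text_attr]).ratio()
            if ratio >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical sample data, invented for illustration only.
records = [
    {"id": 1, "city": "Beijing", "name": "Zhang Wei Trading Co."},
    {"id": 2, "city": "Beijing", "name": "Zhang Wei Trading Co"},
    {"id": 3, "city": "Shanghai", "name": "Zhang Wei Trading Co."},
    {"id": 4, "city": "Beijing", "name": "Northern Steel Works"},
]
print(find_duplicates_by_class(records, "city", "name"))
```

Because record 3 falls in a different class, it is never compared with records 1 and 2, which is the point of partitioning: fewer comparisons, at the cost of missing duplicates whose partitioning attribute itself differs.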
Identifying and eliminating approximately duplicate records in a data warehouse is a hot topic in data cleaning, and address information plays a very important role in recognizing records that refer to the same entity.
This paper designs and implements a data acquisition system, focusing on two key technologies in data acquisition: incremental data acquisition from data sources and approximately duplicate record detection.
To mark the approximately duplicate records in a data table, the common approach is to first index all records by some key and then compare records pairwise within a fixed-length window.
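The windowed comparison described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the window size, the 0.85 threshold, and the choice of `difflib.SequenceMatcher` as the comparison function are assumptions, and the sample strings are invented.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Sort by key, then compare each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    matches = []
    for i, current in enumerate(ordered):
        # Only records inside the fixed-length window are compared.
        for other in ordered[i + 1 : i + window]:
            ratio = SequenceMatcher(None, key(current), key(other)).ratio()
            if ratio >= threshold:
                matches.append((current, other))
    return matches


# Hypothetical sample records; here each record is its own sort key.
names = [
    "data cleaning",
    "record linkage",
    "data cleanning",
    "window scan",
    "record linkege",
]
print(sorted_neighborhood(names, key=lambda s: s))
```

Sorting brings similar keys next to each other, so a small window catches most near-duplicates with far fewer comparisons than checking every pair. The trade-off is that an error near the start of the key can push a duplicate outside the window.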
Some input errors are resolved by looking up a "Similar Chinese Characters Table", and a similarity function is computed to decide whether the compared records are duplicates.
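One way to picture this two-step idea is the Python sketch below: easily confused characters are first mapped to a canonical form, and a similarity score is then compared against a threshold. The table contents, the 0.9 threshold, and the use of `difflib.SequenceMatcher` are assumptions chosen for illustration; the actual table and similarity function are not given in the sentence above.

```python
from difflib import SequenceMatcher

# Hypothetical, tiny stand-in for a "similar Chinese characters table":
# each easily confused character maps to one canonical representative.
SIMILAR_CHARS = {"已": "己", "巳": "己", "末": "未", "戌": "戊", "戍": "戊"}


def normalize(text):
    """Replace each look-alike character with its canonical form."""
    return "".join(SIMILAR_CHARS.get(ch, ch) for ch in text)


def similarity(a, b):
    """Similarity in [0, 1] computed on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(a, b, threshold=0.9):
    """Treat two field values as duplicates when similarity reaches threshold."""
    return similarity(a, b) >= threshold


# Invented addresses: the second one mistypes 未 as the look-alike 末.
print(is_duplicate("北京市未来路5号", "北京市末来路5号"))
print(is_duplicate("北京市未来路5号", "上海市南京路8号"))
```

Normalizing before scoring means a look-alike typo does not lower the similarity at all, while genuinely different values are still separated by the threshold.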