然后,在搜索引擎关键技术的基础上,基于一个轻量级的架构设计了搜索引擎的三个主要模块:网页爬虫、索引器与搜索器。
Then, building on the core technologies of search engines, the three main modules of the search engine (crawler, indexer, and searcher) were designed on a lightweight architecture.
例如,如果您的服务器停止向网页爬虫提供网页,或者您更改了网站大部分页面的网址,网站的排名就可能不佳。
For example, a site may not rank well if your server stops serving pages to Googlebot, or if you've changed the URLs for a large portion of your site's pages.
当爬虫程序爬过不同的Web站点时,它将建立一个数据库,该数据库中包括它所爬过的站点和网页、每一页所包含的链接、每一页的分析结果等数据。
As the program crawled the various Web sites, it would build a database of the sites and pages crawled, the links each page contained, the results of the analysis of each page, and so on.
每个搜索引擎都有自己爬行网页的自动化程序,叫做“网络蜘蛛(web spider)”或“网络爬虫(web crawler)”。
Each search engine has its own automated program called a "web spider" or "web crawler" that crawls the web.
beacon也称为“网络臭虫(Web bug)”和“像素”,是可以在网页上运行的小段软件。
Beacons, also known as "Web bugs" and "pixels," are small pieces of software that run on a Web page.
对SEO的影响:URL中的关键词有助于告诉爬虫网页与哪些内容有关。
SEO impact: Keywords in the URL help tell the spider what the page is about.
目前这些数据集可以通过异构(heterogeneous)方式访问,比如通过语义网页浏览器,或者通过语义搜索引擎爬虫收录。
The data sets currently can be accessed in heterogeneous ways; for example, through a semantic web browser or by being crawled by a semantic search engine.
meta robots标签是如何影响搜索引擎爬虫抓取、索引并显示网页的?
How can the meta robots tag impact how search engines crawl, index and display content on a web page?
聚焦网络爬虫并不追求大的覆盖,而将目标定为抓取与某一特定主题内容相关的网页,为面向主题的用户查询准备数据资源。
A focused web crawler does not aim for broad coverage; instead, it targets web pages relevant to a specific topic, preparing data resources for topic-oriented user queries.
然而目前的主题爬虫所采用的两种基本抓取网页的方式效率比较低下。
However, the two basic page-fetching methods used by current topic crawlers are relatively inefficient.
在此基础上设计并实现了一个主题爬虫系统,该系统利用主题敏感HITS来计算网页优先级。
On this basis, a topic crawler system was designed and implemented that uses topic-sensitive Hyperlink-Induced Topic Search (HITS) to compute the priority of web pages.
传统的聚焦爬虫抓取的目标是与某一特定主题内容相关的网页,而在有些应用中,如网络目录,更多的是给用户提供主题相关网站。
A traditional focused crawler targets web pages relevant to a specific topic, whereas some applications, such as web directories, are more concerned with providing users with topic-relevant websites.
本文提出了一种维护WAP网站的网络爬虫系统,该系统可以自动遍历WAP网站,并对网页进行分析,检查语法和语义的错误。
This paper proposes a web crawler system for maintaining WAP sites. The system automatically traverses a WAP site, analyzes each page, and checks for syntactic and semantic errors.
网络爬虫是一个可以从因特网上自动提取网页的系统,它为搜索引擎从万维网上下载网页,是搜索引擎的重要组成。
A web crawler is a system that automatically retrieves web pages from the Internet. It downloads pages from the World Wide Web for a search engine and is an important component of the search engine.
通过网络爬虫技术实现对互联网上的网页内容进行提取,并对提取的网页进行文本和图像识别。
Web crawler technology is used to extract the content of web pages on the Internet, and text and image recognition is then performed on the extracted pages.
即对爬虫程序在网站内获取的超链接采用URL比较法进行先过滤,去掉不满足匹配条件的网页。
That is, the hyperlinks the crawler obtains within the site are first filtered by URL comparison, discarding pages that do not satisfy the matching condition.