How search engines crawl the most important pages first


So how do search engines crawl the most important web pages first?

5) Collect website home pages first and give them a high weight. The number of websites is far smaller than the number of web pages, and important pages are necessarily reached through links starting from these home pages, so collection should give priority to obtaining as many home pages as possible.

When faced with a massive number of web pages, search engines treat important pages as having the following basic characteristics. These are not always accurate, but they hold most of the time:

3) The content of the page is widely reproduced and spread across the web.

4) The page's URL directory depth is small, so users can reach it easily by browsing. "URL directory depth" is defined here as the number of directory levels left in the URL after the domain name part is removed: if the URL is domain, the directory depth is 0; if it is domain/cs, the directory depth is 1, and so on. Note that pages with a small URL directory depth are not always important, and pages with a large directory depth are not all unimportant; the URLs of some academic papers have a very large directory depth. Most highly important pages will have characteristics 1 to 4 at the same time.
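The directory depth in characteristic 4 is just a string computation on the URL, which a short sketch can make concrete. The function name `url_directory_depth` is hypothetical, and the sketch assumes that every non-empty path segment (including a trailing file name) counts as one level, which the text does not state explicitly.

```python
from urllib.parse import urlsplit


def url_directory_depth(url: str) -> int:
    """Count the path levels that remain after the domain part is removed."""
    # urlsplit only recognises the host when a scheme or a leading "//"
    # is present, so bare inputs such as "domain/cs" get "//" prepended.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    path = urlsplit(url).path
    # Assumption: each non-empty segment of the path is one directory level.
    return sum(1 for segment in path.split("/") if segment)


print(url_directory_depth("domain"))     # 0
print(url_directory_depth("domain/cs"))  # 1
```

Under the same assumption, a deeply nested URL such as a paper stored at `/papers/2004/ir/p1.html` would get a depth of 4.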

These characteristics come from analyzing the features of massive numbers of web pages.

Here a problem arises. When a search engine starts crawling, it knows neither how many links point to a page nor whether the page has been reproduced elsewhere. In other words, at the outset it cannot know characteristics 1 to 3; these can only be determined after almost all pages, or the link structure of the Web, have been obtained. So how is this problem solved? Characteristics 4 and 5 can be known at crawl time, and only characteristic 4 can be evaluated for a URL without knowing the content of the page (that is, before the page is crawled).
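One way to act on characteristics 4 and 5 at crawl time is to order the crawl queue by URL directory depth, so that home pages (depth 0) come out first. The following is a minimal sketch under that assumption, not a description of how any particular search engine works; `CrawlFrontier` and `directory_depth` are hypothetical names, and the example URLs are made up.

```python
import heapq
from itertools import count
from urllib.parse import urlsplit


def directory_depth(url: str) -> int:
    """Number of non-empty path segments after the domain part."""
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return sum(1 for segment in urlsplit(url).path.split("/") if segment)


class CrawlFrontier:
    """Queue of URLs to crawl that returns the shallowest URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: earlier additions come out first

    def add(self, url: str) -> None:
        if url in self._seen:  # skip URLs that were already queued
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (directory_depth(url), next(self._order), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = CrawlFrontier()
for url in [
    "http://a.example/docs/guide/intro.html",
    "http://b.example/",
    "http://a.example/news",
    "http://a.example/",
]:
    frontier.add(url)

crawl_order = [frontier.pop() for _ in range(len(frontier))]
# The two home pages come first, then the depth-1 page, then the deep page.
```

Because the priority is computed from the URL string alone, no page has to be downloaded before it can be ranked in the queue.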

2) The page's "parent" page is linked to many times or is linked to by important pages. For example, a page may be an internal page of a website, but if the site's home page is linked to many times and the home page also links to this page, then this page is relatively important as well;

1) The page is linked to by other pages: if it is linked to many times or is linked to by important pages, it is an important page;
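Characteristics 1 and 2 both depend on inbound links, so they can only be computed once part of the link structure has been collected. The sketch below assumes a small, already crawled link graph stored as a dictionary; the function names and page names are hypothetical, and "importance" is reduced to a plain inbound-link count for illustration.

```python
from collections import Counter


def inbound_link_counts(links: dict) -> Counter:
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # each linking page is counted once
            if target != source:     # ignore links from a page to itself
                counts[target] += 1
    return counts


def best_parent_count(page: str, links: dict, counts: Counter) -> int:
    """Highest inbound-link count among the pages that link to `page`."""
    parents = [src for src, targets in links.items()
               if page in targets and src != page]
    return max((counts[p] for p in parents), default=0)


# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["about", "deep"],
    "blog1": ["home"],
    "blog2": ["home"],
    "blog3": ["home", "about"],
    "about": [],
    "deep": [],
}

counts = inbound_link_counts(links)
print(counts["home"])                            # 3
print(counts["deep"])                            # 1
print(best_parent_count("deep", links, counts))  # 3
```

Here `deep` has only one inbound link, but that link comes from `home`, which is itself linked to three times; this is the situation characteristic 2 describes.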

Faced with a massive number of web pages, search engines do not crawl every page in parallel, because no matter how much the database is expanded, it cannot keep up with the growth of the web. Instead, a search engine gives priority to crawling the most important pages and saving them to its database. On the one hand this saves database space; on the other hand it also helps ordinary users, because users do not need a massive number of results, only the most important ones. A good collection strategy therefore collects important pages first, so that the most important pages are crawled in the shortest time.