For webmasters, content collection (scraping) is nothing unfamiliar: when a site first goes up, they collect content to fill it out, and develop the site further later on. Collection is how a site begins; its own character forms gradually afterwards. Collecting by hand is too tedious, so collection software has become more and more common, but how to collect high-quality information is what webmasters care about most.
On November 12, the A5 forum (bbs.admin5.com) hosted a discussion session on the topic "How webmasters can improve their collection skills" and invited a guest from Locomotive to discuss collection problems. This article is an edited summary of the highlights, in the hope that it helps webmasters who are interested in collection.
Question: How can I collect high-quality resources, given that the current target sites are all similar and repetitive?
Answer: Develop in multiple directions and across the board: look for unique resources, free resources, or paid resources offered at a reduced rate.
Question: Can you briefly explain the principle behind how Locomotive collects content?
Answer: The Locomotive collector has always supported extracting web content with regular expressions. For practical reasons, though, not every webmaster knows regular expressions, so the common approach is to set the start and end of a region and then extract the needed information from it. You can also choose to extract it with regular expressions.
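The two approaches described in this answer can be sketched roughly as follows. This is only an illustrative Python sketch, not Locomotive's actual implementation: the function names `extract_between` and `extract_with_regex`, the sample page, and its markers are all assumptions made up for the example.

```python
import re
from typing import Optional


def extract_between(html: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first `start` marker and the next `end` marker.

    Hypothetical helper: mimics the "set a start and end region" approach.
    """
    begin = html.find(start)
    if begin == -1:
        return None  # start marker not found
    begin += len(start)
    finish = html.find(end, begin)
    if finish == -1:
        return None  # end marker not found after the start marker
    return html[begin:finish].strip()


def extract_with_regex(html: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern`, or None if there is no match.

    Hypothetical helper: mimics the regular-expression approach.
    """
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up sample page, used only for illustration.
page = '<html><h1>Sample title</h1><div class="content"> Sample body text. </div></html>'

title = extract_between(page, "<h1>", "</h1>")
body = extract_with_regex(page, r'<div class="content">(.*?)</div>')
print(title)
print(body)
```

The marker-based version needs no regular-expression knowledge, which matches the practical point made in the answer; the regex version is more flexible when the surrounding markup varies.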
Question: The simplest principle of collection is the regular expression. Pseudo-original content requires analyzing the content, but ordinary collection tools cannot do that. I would like to be shown that this is not the case.
Answer: A machine is different from the human brain. In Chinese, a difference of a single word, combined with the meaning of the surrounding words and the tone in which it is read, can give a completely different meaning. So pseudo-original content is not advisable, and it is also a rather unscrupulous approach.
Question: Pseudo-original content requires analyzing the content, with lexical analysis like a search engine's. Can the guest's Locomotive do this?
Answer: Not yet, but we are also researching vertical search engines, and I believe our pseudo-original feature will be more intelligent in the future.
Question: If everybody builds sites by collecting, none of them has any character of its own. What degree of collection should one aim for? Could the guest please comment?
Answer: This is the key point of collection, and the degree does have to be judged, because it affects whether search engines index the site. For a new domain, it is recommended to update no more than 200 articles a day. For a site with more than ten thousand pages indexed by Baidu, keep updates under 500.
Question: With collection software, I think one can only build junk sites. Do portal sites use such software?
Answer: Of course. Don't you know that Sina, QQ, and 163 all collect too? It starts as one piece of news, and the content is exactly the same, typos included, except that some keywords have been replaced. Haha, guess how they do it.
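The keyword replacement mentioned in this answer might look something like the sketch below. It is a guess at the general technique only, not how any of the named portals actually work; the function name `replace_keywords`, the substitution table, and the sample sentence are all hypothetical.

```python
import re
from typing import Dict


def replace_keywords(text: str, substitutions: Dict[str, str]) -> str:
    """Replace each keyword in `substitutions` with its mapped value.

    Hypothetical helper for illustration. Everything outside the listed
    keywords, including any typos, is left untouched.
    """
    if not substitutions:
        return text
    # Try longer keywords first so a phrase wins over a shorter keyword inside it.
    keys = sorted(substitutions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda m: substitutions[m.group(0)], text)


# Made-up sentence with a deliberate typo ("compnay") to show that it survives.
original = "The compnay released a new phone on Monday."
rewritten = replace_keywords(
    original,
    {"released": "launched", "new phone": "new handset"},
)
print(rewritten)
```

Because only the listed keywords change, the original typo carries over unchanged, which is exactly the telltale sign the answer describes.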