Software features
(1) This software uses Peking University's Tianwang MD5 fingerprint deduplication algorithm, so identical or near-identical web pages are not saved more than once.
(2) Meaning of the markers in the collected information: [[HT]] is the title of the web page, [[HA]] is the title of the article, [[HC]] is the 10 weighted keywords, [[UR]] is the image link in the web page, and [[TXT]] is followed by the main text.
(3) Spider performance: This software runs 300 threads to keep collection efficient. It was stress-tested by collecting 1 million essential articles. Taking the Internet-connected computer of an ordinary user as the reference standard, a single computer can traverse 2 million web pages and collect 200,000 essential articles in one day, so collecting 1 million essential articles takes only 5 days.
(4) The difference between the official version and the free version is that the official version can automatically save the collected essential article data to an ACCESS database. To purchase the official version, please contact QQ (970093569).
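The fingerprint deduplication mentioned in item (1) can be illustrated with a minimal Python sketch. This is not the Tianwang algorithm itself and not this software's code: it only hashes whitespace-normalized, lowercased text with MD5, so it catches identical pages but not merely similar ones. The names fingerprint and SeenPages are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return the MD5 hex digest of whitespace-normalized, lowercased text.

    Assumption: this normalization is illustrative only; the real
    software may normalize pages differently before hashing.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class SeenPages:
    """Remember fingerprints of pages that have already been saved."""

    def __init__(self) -> None:
        self._seen = set()

    def add_if_new(self, text: str) -> bool:
        """Return True and record the page if its fingerprint is new."""
        fp = fingerprint(text)
        if fp in self._seen:
            return False  # duplicate: do not save again
        self._seen.add(fp)
        return True


pages = SeenPages()
for body in ["Hello  World", "hello world\n", "Another page"]:
    status = "saved" if pages.add_if_new(body) else "skipped as duplicate"
    print(repr(body), status)
```

Keeping only the 32-character digest rather than the full page text keeps the lookup set small, which matters when millions of pages are traversed.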
How to operate
(1) Before use, you must ensure that your computer can connect to the network and that the firewall does not block this software.
(2) Run SETUP.EXE and setup2.exe to install the support libraries into the operating system's system32 directory.
(3) Run spider.exe, enter the URL entry, click the "Manual Add" button, and then click the "Start" button to start the collection.
Things to note
(1) Crawling depth: Enter 0 for no limit on the crawling depth; enter 3 to crawl down to the third layer.
(2) The difference between the general spider mode and the classified spider mode: Assume that the URL entry is "http://youxi.baidu.com/". If you select the general spider mode, every web page in "baidu.com" is traversed; if you select the classified spider mode, only the web pages in "youxi.baidu.com" are traversed.
(3) The "Import from MDB" button imports URL entries in batches from TASK.MDB.
(4) This software never crosses sites while collecting. For example, if the entry is "http://youxi.baidu.com/", it crawls only within the Baidu site.
(5) During collection, one or more "error dialog boxes" may occasionally pop up. Ignore them and leave them open: if you close an "error dialog box", the collection software will hang.
(6) How to choose which topics to collect: For example, to collect "stock" articles, simply use "stock" sites as the URL entries.
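The depth and mode rules in items (1), (2) and (4) can be sketched in Python as a single scope check. This is an illustration under stated assumptions, not the software's actual logic: the function names, the mode strings "general" and "classified", and counting the entry page as layer 1 are all hypothetical, and the site is guessed naively from the last two host labels (which is wrong for suffixes such as ".com.cn").

```python
from urllib.parse import urlsplit


def site_key(host: str) -> str:
    """Naive site guess: the last two dot-separated labels of the host."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def in_scope(entry_url: str, candidate_url: str, mode: str,
             level: int, max_depth: int = 0) -> bool:
    """Decide whether candidate_url may be crawled.

    mode      -- "general" (whole site) or "classified" (entry host only)
    level     -- layer of the candidate page (assumed: entry page is layer 1)
    max_depth -- 0 means no depth limit, as in item (1)
    """
    if max_depth != 0 and level > max_depth:
        return False
    entry_host = (urlsplit(entry_url).hostname or "").lower()
    cand_host = (urlsplit(candidate_url).hostname or "").lower()
    if not entry_host or not cand_host:
        return False
    if mode == "classified":
        return cand_host == entry_host
    if mode == "general":
        return site_key(cand_host) == site_key(entry_host)
    raise ValueError("mode must be 'general' or 'classified'")


entry = "http://youxi.baidu.com/"
print(in_scope(entry, "http://news.baidu.com/a.html", "general", 2))
print(in_scope(entry, "http://news.baidu.com/a.html", "classified", 2))
```

With the entry "http://youxi.baidu.com/", a link to "news.baidu.com" is accepted in general mode and rejected in classified mode, while a link to any other site is rejected in both.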