Load Test Output
In this tutorial, we will show you how to use Neocortix Cloud Services BatchRunner to crawl news websites with Scrapy.
First, follow the steps in the tutorial Setting Up For Batch Jobs. Then, in your home directory, issue the following command:
python3 -m pip install scrapy
and then:
cd ~/ncsexamples/scrapy
In the subdirectory
~/ncsexamples/scrapy
you will find the runBatchScrapy.py script. It crawls the news websites ABC News, Chicago Tribune, NPR, and USA Today.
For simple use, just issue this command:
./runBatchScrapy.py --scrapyProj newsProject
To scrape only a subset of sites:
./runBatchScrapy.py --scrapyProj newsProject --spiders abcnews npr
By default, output goes to a date-stamped directory
data/scrapy_<datestamp>
, but you can pass
--outDataDir
to change that. The output files are named after the spider that produced them, e.g.:
abcnews_out.csv, chicagotribune_out.csv, npr_out.csv, usatoday_out.csv
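Because each output file is named after its spider, the spider name can be recovered from the file name. The following is a minimal Python sketch of that idea, not part of the NCS examples: the helper name spider_outputs is hypothetical, and the demo uses a throwaway directory standing in for data/scrapy_<datestamp>.

```python
import tempfile
from pathlib import Path

SUFFIX = "_out.csv"  # naming pattern described in the tutorial: <spider>_out.csv

def spider_outputs(out_dir):
    """Map each spider name to its CSV file (hypothetical helper)."""
    return {
        path.name[: -len(SUFFIX)]: path
        for path in sorted(Path(out_dir).glob("*" + SUFFIX))
    }

# Demo against a temporary directory standing in for data/scrapy_<datestamp>.
with tempfile.TemporaryDirectory() as tmp:
    for spider in ("abcnews", "npr"):
        (Path(tmp) / (spider + SUFFIX)).write_text("url,title\n")
    found = spider_outputs(tmp)
    names = sorted(found)
    print(names)
```

For a real run, pass the actual output directory (or the path you gave to --outDataDir) instead of the temporary one.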
To use your own scrapy project:
Copy your scrapy project to
~/ncsexamples/scrapy
, then run
./runBatchScrapy.py --scrapyProj <yourProjectDir>
By default, the script auto-scales, so you don't have to set the number of workers with
--nWorkers
.
Official help text:
usage: runBatchScrapy.py [-h] [--authToken AUTHTOKEN] [--filter FILTER]
                         [--outDataDir OUTDATADIR] [--timeLimit TIMELIMIT]
                         [--unitTimeLimit UNITTIMELIMIT]
                         [--scrapyProj SCRAPYPROJ]
                         [--spiders [SPIDERS [SPIDERS ...]]]
                         [--nWorkers NWORKERS]

runs scrapy on a new batch of instances

optional arguments:
  -h, --help            show this help message and exit
  --authToken AUTHTOKEN
                        the NCS authorization token to use (default empty, to use NCS_AUTH_TOKEN env var
  --filter FILTER       json to filter instances for launch (default: { "storage": ">=4000000000", "dar": ">=95" })
  --outDataDir OUTDATADIR
                        a path to the output data dir for this run (default: empty for a new timestamped dir)
  --timeLimit TIMELIMIT
                        amount of time (in seconds) allowed for the whole job (default: 1800)
  --unitTimeLimit UNITTIMELIMIT
                        amount of time (in seconds) allowed for each spider (default: 300)
  --scrapyProj SCRAPYPROJ
                        the name of the scrapy project directory
  --spiders [SPIDERS [SPIDERS ...]]
                        list of spiders to run (from the given project) (default: run all)
  --nWorkers NWORKERS   the # of worker instances to launch (default: 0 for autoscale)
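The --filter option takes a JSON string, which is easy to misquote on the command line. As a hedged illustration, the Python sketch below assembles an argument list from the flags shown in the help text and serializes the filter with json.dumps. The helper name build_command is hypothetical; the flag names and the default filter values come from the help text above.

```python
import json
import shlex

def build_command(project, spiders=None, filter_spec=None, n_workers=None):
    """Assemble a runBatchScrapy.py argument list (hypothetical helper)."""
    cmd = ["./runBatchScrapy.py", "--scrapyProj", project]
    if spiders:
        cmd += ["--spiders", *spiders]
    if filter_spec is not None:
        # json.dumps produces valid JSON with double-quoted keys and values
        cmd += ["--filter", json.dumps(filter_spec)]
    if n_workers is not None:
        # 0 means autoscale, per the help text
        cmd += ["--nWorkers", str(n_workers)]
    return cmd

cmd = build_command(
    "newsProject",
    spiders=["abcnews", "npr"],
    filter_spec={"storage": ">=4000000000", "dar": ">=95"},
)
# shlex.join (Python 3.8+) quotes the JSON so it can be pasted into a shell
print(shlex.join(cmd))
```

The list form can also be handed to subprocess.run(cmd) directly, which avoids shell quoting altogether.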

Example Command

Simply run
python3 ./runBatchScrapy.py --scrapyProj newsProject
When the program is done, the output files will be put in a directory
./data/scrapy_<datestamp>
with names like
abcnews_out.csv, chicagotribune_out.csv, npr_out.csv, usatoday_out.csv
Here is a partial example of data from a successful run, showing the URL and title of each article, from
abcnews_out.csv
:
url,title
https://abcnews.go.com/Politics/live-updates/afghanistan-withdrawal-live-updates/?id=79482353,Afghanistan updates: Generals say they opposed Biden decision to withdraw all troops
https://abcnews.go.com/Politics/key-takeaways-us-military-leaders-afghanistan-withdrawal/story?id=80286958,Key takeaways from US military leaders on Afghanistan withdrawal
https://abcnews.go.com/US/wireStory/kentucky-man-life-prison-rape-children-80283935,Kentucky man gets life in prison for the rape of 2 children
https://abcnews.go.com/Business/wireStory/explainer-uk-experiencing-fuel-crisis-80282811,EXPLAINER: Why and how the UK is experiencing a fuel crisis
https://abcnews.go.com/Politics/milley-defends-calls-china-amid-concerns-trump/story?id=80279037,Milley defends calls to China amid concerns about Trump
https://abcnews.go.com/US/wireStory/daughter-barbara-bush-birth-baby-girl-80283855,Former first daughter Barbara Bush gives birth to baby girl
https://abcnews.go.com/US/wireStory/police-georgia-shoot-suspect-bow-arrow-carjacking-80282897,Police in Georgia shoot suspect in bow and arrow carjacking
https://abcnews.go.com/Entertainment/wireStory/prince-andrew-acknowledges-faces-us-sex-assault-lawsuit-80283058,Prince Andrew acknowledges he faces US sex assault lawsuit
https://abcnews.go.com/US/capital-gazette-shooter-sentenced-life-prison-possibility-parole/story?id=80276939,"Capital Gazette shooter sentenced to life in prison without the possibility of parole"
https://abcnews.go.com/Politics/wireStory/florida-sues-biden-administration-immigration-policy-80287924,Florida sues Biden administration over immigration policy
https://abcnews.go.com/US/atlanta-spa-gunman-robert-long-pleads-guilty-murder/story?id=80275534,Atlanta spa gunman Robert Long pleads not guilty to 4 murder charges in Fulton County
https://abcnews.go.com/Weird/wireStory/housing-market-hot-burned-house-400k-80282601,"Housing market so hot, burned house going for almost $400K"
https://abcnews.go.com/Sports/wireStory/ohio-state-sex-abuse-survivors-plan-appeals-defend-80286615,"Ohio State sex abuse survivors plan appeals, defend motives"
https://abcnews.go.com/Politics/wireStory/jan-trials-slowed-mounting-evidence-us-capitol-riot-80275216,Jan. 6 trials slowed by mounting evidence in US Capitol riot
https://abcnews.go.com/Technology/wireStory/sequoia-national-parks-giant-forest-unscathed-wildfire-80152055,Sequoia National Park's Giant Forest unscathed by wildfire
https://abcnews.go.com/GMA/Culture/video/oscars-2021-predictions-peter-travers-win-win-77220775,"Video Oscars 2021 predictions: Peter Travers on who will win, who should win"
https://abcnews.go.com/GMA/photos/fabulous-50-16695746,Fab over 50 Photos
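Some titles contain commas and are therefore quoted, so splitting lines on commas would break them. A minimal sketch of reading the output with Python's standard csv module follows. The function name read_articles is hypothetical, and two rows from the sample above are inlined so the sketch runs without a real output file.

```python
import csv
import io

def read_articles(csv_file):
    """Return a list of (url, title) tuples from a spider's CSV output."""
    reader = csv.DictReader(csv_file)  # uses the url,title header row
    # strip() drops stray whitespace or newlines around a title
    return [(row["url"], row["title"].strip()) for row in reader]

# Two rows from the sample data, inlined so the sketch is self-contained.
sample = io.StringIO(
    "url,title\n"
    "https://abcnews.go.com/GMA/photos/fabulous-50-16695746,Fab over 50 Photos\n"
    "https://abcnews.go.com/Weird/wireStory/housing-market-hot-burned-house-400k-80282601,"
    '"Housing market so hot, burned house going for almost $400K"\n'
)
articles = read_articles(sample)
for url, title in articles:
    print(title, "->", url)
```

To read a real file, open it with newline="" as the csv module documentation recommends, for example open(path, newline="") where path points at abcnews_out.csv in your output directory, and pass the file object to read_articles.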