Hello,
A small disclaimer: I have little to no coding experience, so please bear with me.
I made a small calendar scraper that scrapes the marketwatch.com calendar.
The HTML structure is very basic (see attached): it's essentially table/tbody, and from there all the information is in separate td cells.
I can scrape the table (code in attachment) into a CSV file, then use pandas to turn the whole thing into a formatted, usable DataFrame.
My issue is that I want Scrapy to run at the time an announcement is supposed to come out. For example, if there's a release on the 13th of October at 10:00 AM, I want to schedule my spider to launch 2 minutes later and get the latest number.
My thinking is as follows:
1/ Extract all the days and times and store them in some kind of scheduler table.
2/ Write a script that reads that table and launches the scraper at the wanted times.
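
The two steps above could be sketched roughly like this. It is only a minimal sketch under assumptions: the date and time strings are assumed to already be cleaned into forms like "2020-10-13" and "10:00 AM" (the real page may format them differently), and the spider name "calendar" passed to `scrapy crawl` is hypothetical. The helper names `build_schedule`, `run_pending` and `launch_spider` are made up for illustration.

```python
import subprocess
import time
from datetime import datetime, timedelta


def build_schedule(rows, delay_minutes=2):
    """Step 1: turn (date, time) string pairs into sorted, de-duplicated run times."""
    run_times = set()
    for date_str, time_str in rows:
        # Assumed input format, e.g. "2020-10-13" and "10:00 AM".
        release = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %I:%M %p")
        run_times.add(release + timedelta(minutes=delay_minutes))
    return sorted(run_times)


def run_pending(schedule, launch, now=datetime.now, sleep=time.sleep):
    """Step 2: wait until each run time, then call launch(). Past times are skipped."""
    for run_at in schedule:
        wait = (run_at - now()).total_seconds()
        if wait < 0:
            continue  # this release time has already passed
        sleep(wait)
        launch()


def launch_spider():
    # "calendar" is a hypothetical spider name; replace it with the real one.
    subprocess.run(["scrapy", "crawl", "calendar"], check=False)
```

A call such as `run_pending(build_schedule(rows), launch_spider)` would then block until each release time plus two minutes and start the spider once per distinct time. The script has to stay running for this to work; an operating-system scheduler such as cron is another option.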
But since the page's HTML isn't really structured, I can't tell Scrapy to look at one particular location only. So the only way I can think of is to scrape the whole table again, which is probably not efficient.
What could be the workaround? I might be doing this wrong by using pandas when I don't have to.
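
One possible workaround, sketched under assumptions: fetching the page downloads the whole table in a single request anyway, so scraping everything and then picking out one row afterwards is likely cheap. The sketch below does that on plain dicts, without pandas. The field names ("date", "time", "report", "actual") and the sample rows are made-up placeholders, not the real MarketWatch data, and `find_release` is a hypothetical helper.

```python
def find_release(items, date, time, report):
    """Return the first scraped row matching date, time and report name, else None."""
    for item in items:
        if (item["date"], item["time"], item["report"]) == (date, time, report):
            return item
    return None


# Placeholder rows standing in for whatever the spider actually yields.
items = [
    {"date": "2020-10-13", "time": "08:30 AM", "report": "Report A", "actual": "1.0"},
    {"date": "2020-10-13", "time": "10:00 AM", "report": "Report B", "actual": "2.0"},
]

row = find_release(items, "2020-10-13", "10:00 AM", "Report B")
latest_number = row["actual"] if row else None
```

Matching on the cell contents rather than on a position in the HTML means the lookup does not depend on the table having stable ids or classes.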
I'd love it if you guys could point me in the right direction.
Best regards
Submitted October 12, 2020 at 08:19AM by crashbandishocks