Writing A Data Extraction To Web Page Program
-
In my area, there are a few different law enforcement agencies that post real-time data on car accidents. One is http://www.flhsmv.gov/fhp/traffic/crs_h501.htm. They post the accidents by county, and then in the location heading they add the intersection and the city. For most of these counties and cities, our website, http://www.kempruge.com/personal-injury/auto-and-car-accidents/, has city- and county-specific pages. I need to figure out a way to pull the information from the FHP site and other real-time crash sites so that it will automatically post on our pages. For example, if there's an accident in Hillsborough County on I-275 in Tampa, I'd like to have that immediately post on our "Hillsborough County car accident attorney" page and our "Tampa car accident attorney" page.
I want our pages to have something comparable to a stock ticker widget, but for car accidents specific to each page's location, and one that combines all the info from the various law enforcement agencies. Any thoughts on how to go about creating this?
As always, thank you all for taking time out of your work to assist me with whatever information or ideas you have. I really appreciate it.
-
-
Write a Perl program (or a script in another language) that will: a) read the target webpage, b) extract the data relevant to your geographic locations, and c) write a small HTML file to your server that formats the data into a table that will fit on the webpage where you want it published.
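A minimal sketch of those three steps, assuming Perl with the LWP::Simple module installed. The output path, the CSS class, and the county/city matching are placeholders; the real extraction pattern depends entirely on how the FHP page's HTML is actually structured.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Placeholder values for illustration only.
my $source_url  = 'http://www.flhsmv.gov/fhp/traffic/crs_h501.htm';
my $output_file = '/home/youraccount/public_html/includes/accidents_hillsborough.html';
my @locations   = ('HILLSBOROUGH', 'TAMPA');

# a) read the target webpage
my $page = get($source_url) or die "Could not fetch $source_url\n";

# b) keep only the table rows that mention one of your locations
#    (the real pattern depends on the FHP page's actual markup)
my @rows;
while ($page =~ m{<tr[^>]*>(.*?)</tr>}gis) {
    my $row = $1;
    push @rows, "<tr>$row</tr>" if grep { $row =~ /\Q$_\E/i } @locations;
}

# c) write a small HTML table your page can include
open my $fh, '>', $output_file or die "Cannot write $output_file: $!\n";
print $fh qq{<table class="accident-ticker">\n}, join("\n", @rows), "\n</table>\n";
close $fh;
```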
-
Save that Perl program in your /cgi-bin/ folder. (You will need to change file permissions so that the Perl program can execute and the small HTML file can be overwritten.)
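On a typical Linux host the permissions might be set like this; the paths are placeholders. 755 lets the script execute, and 644 is enough if the cron job runs under your own account; if it runs as the web server's user instead, the HTML file may need 664 or 666 to be overwritable.

```
chmod 755 /home/youraccount/public_html/cgi-bin/update_accidents.pl
chmod 644 /home/youraccount/public_html/includes/accidents_hillsborough.html
```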
-
Most servers allow you to execute files from your /cgi-bin/ on a schedule such as hourly or daily. These are usually called "cron jobs". Find this in your server's control panel. Set up a cron job that will execute your Perl program automatically.
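If the control panel exposes a raw crontab, an hourly schedule might look like the entry below; the Perl path and script path are placeholders, and cPanel-style panels let you enter the same fields in a form instead.

```
# Fetch the FHP data and rewrite the include file at the top of every hour
0 * * * * /usr/bin/perl /home/youraccount/public_html/cgi-bin/update_accidents.pl
```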
-
Place a server-side include, sized and shaped to match your data table, on the webpage where you want the information to appear.
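Assuming your host has Apache-style server-side includes enabled for that page (often this means a .shtml extension or an SSI option in the control panel), the include pointing at the file the script writes would look something like this; the path is a placeholder matching the earlier sketch.

```
<!--#include virtual="/includes/accidents_hillsborough.html" -->
```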
This setup will work until the URL or the format of the target webpage changes. When that happens, your script will produce errors or write garbage, and you will need to update the URL in the script and/or the way the script parses the page.
-
-
You need a developer who understands a lot about HTTP requests. You will need one who knows how to run a spidering program that pings the website, looks for changes, and scrapes the data off those sites. You will also need the program to check whether the coding on the page has changed, because if it has, the scraping program will need to be rewritten to account for it.
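A cheap way to catch that situation automatically, sketched in Perl under the same assumptions as the earlier example (placeholder paths, LWP::Simple and Digest::MD5 available): reduce the page to its tag skeleton, checksum it, and warn when the checksum changes between runs. This is only a heuristic; cosmetic markup changes can trigger it, and some template changes may slip past it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use Digest::MD5 qw(md5_hex);

# Placeholder URL and path for illustration only.
my $source_url    = 'http://www.flhsmv.gov/fhp/traffic/crs_h501.htm';
my $checksum_file = '/home/youraccount/data/fhp_layout.md5';

my $page = get($source_url) or die "Could not fetch $source_url\n";

# Keep only the HTML tags so routine data changes are ignored;
# if the skeleton's checksum changes, the template has probably changed.
my $skeleton = join '', $page =~ m{(<[^>]+>)}g;
my $current  = md5_hex($skeleton);

my $previous = '';
if (open my $in, '<', $checksum_file) {
    chomp($previous = <$in> // '');
    close $in;
}

warn "Page layout appears to have changed; the scraper may need updating.\n"
    if $previous && $previous ne $current;

open my $out, '>', $checksum_file or die "Cannot write $checksum_file: $!\n";
print $out "$current\n";
close $out;
```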
Ideally, those sites would have some sort of data API or XML feed to pull from, but odds are they do not. It would be worth asking, as the programming/programmer would then have a much easier time. It looks like the site is using CMS software from http://www.cts-america.com/ - they may be the better group to talk to about this, as you would potentially be interfacing with the software they develop rather than with some minion at the help desk for the dept of motor vehicles.
Good luck and please do produce a post here or a YouMoz post to show the finished product - it should be pretty cool!