Way to spider Wordpress site
-
I have an old WordPress site that I want to move to a new server and take off WordPress (too many hacks). I am trying to spider the site to get static, non-WordPress pages.
I am having trouble doing this. When I spider the site, it changes the URLs. For instance, if the URL is www.domain.com/page/, the URL I get out of the spider is /page/index.html, and those are not the URLs in the search engine indices. There are about 2000 pages on this site, so it is not feasible to set up 301 redirects page by page.
I tried these spidering programs: WinHTTrack Website Copier and PageNest.
Does anyone know of another method of turning a WordPress site into a non-WordPress site?
-
Hi Dan
Hmm, that's a little strange. Two things:
- Is WordPress updated? Do you get the normal URLs when viewing in your browser?
- Have you tried Screaming Frog SEO Spider? It's free to crawl up to 500 pages. Although it won't get the actual HTML of the pages, it could perhaps solve the URL issue.
This blackhat world thread has a few options too.
-Dan
-
Hi Dan, I'm not very experienced in migrating a WP site to non-WP, but I understand that the issue you're having is that the spider is returning index.html files for URLs like domain/page/.
That's normal: whichever spider you use, you'll always get an index.html file. Every directory has its index.html, which is the default file shown unless you establish something different with rewrite rules.
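To check that the spidered files still line up with the indexed URLs, the mapping described above can be sketched in a few lines of Python. This is only an illustration: it assumes the new server treats index.html as the default document for a directory, and the function name url_for_mirrored_file is made up for this example.

```python
from pathlib import PurePosixPath


def url_for_mirrored_file(relative_path: str) -> str:
    """Return the URL path at which a mirrored file would normally be served.

    Assumes the server uses index.html as the default document for a
    directory, so page/index.html answers requests for /page/.
    """
    path = PurePosixPath(relative_path)
    if path.name == "index.html":
        parent = path.parent.as_posix()
        # The parent of a top-level index.html is ".", meaning the site root.
        return "/" if parent == "." else f"/{parent}/"
    # Any other file keeps its own name in the URL.
    return f"/{path.as_posix()}"
```

Under that assumption, page/index.html maps back to /page/, which is the URL the search engines already have, so the file name produced by the spider does not by itself change the public URL.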
If you request /page/, the server will serve that directory's index.html file. What you have to make sure of is that you set up a 301 redirect so that no index.html URL shows up and each one redirects to its matching / URL (with wildcards it's a one-line rule), and that your internal links all point to the / pages and not to the index.html versions. You can just find and replace the /index.html" string in the HTML code with /" (Dreamweaver or any HTML editor will do that in bulk).
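If you would rather script the bulk find and replace than use an editor, a minimal Python sketch could look like the following. It assumes the mirror sits in a local folder and that the internal links use double-quoted href values ending in /index.html"; the function names and the folder name in the usage note are hypothetical. Work on a copy of the mirror first.

```python
from pathlib import Path


def strip_index_html(html: bytes) -> bytes:
    """Replace every /index.html" with /" so links point at the directory URL."""
    return html.replace(b'/index.html"', b'/"')


def rewrite_mirror(root: str) -> int:
    """Rewrite all .html files under root in place; return how many changed."""
    changed = 0
    for file in Path(root).rglob("*.html"):
        # Work on raw bytes so files in unknown encodings are left untouched
        # apart from the replaced string.
        original = file.read_bytes()
        updated = strip_index_html(original)
        if updated != original:
            file.write_bytes(updated)
            changed += 1
    return changed
```

Calling rewrite_mirror("mirror") on a folder named mirror would rewrite every page in it. Links written with single quotes, or relative links such as href="index.html" without a leading slash, are not matched by this simple string and would need their own pass.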
One comment on your idea: you may find it useful to build a PHP-driven site, using includes for the header, footer and nav/sidebar. Thinking ahead, if you ever want to change a portion of the page that repeats throughout the site, with fully static pages you'll have to change every page and upload them all, which is a lot of work and leaves room for many human/machine errors.
Hope that helped you out!