Way to spider Wordpress site
-
I have an old Wordpress site and I want to move it to a new server and take it off Wordpress (too many hacks). I am trying to spider the site so as to get static, non-Wordpress, pages.
I am having trouble doing this. When I spider the site, it changes the URLs. For instance, if the URL is www.domain.com/page/ the URL I get out of the spider is /page/index.html And those are not the URLs in the search engine indices. There are about 2000 pages on this site, so it is not feasible to set up 301 redirects.
I tried using these spidering programs: WinHTTack Website Copier and PageNest
Does anyone know of another method of turning a Wordpress site into a non Wordpress site?
-
Hi Dan
Hmm that's a little strange. Two things;
- is WordPress updated? Do you get the normal URLs when viewing in your browser?
- have you tried Screaming Frog SEO Spider? It's free to crawl up to 500 pages Although it won't get the actual HTML on the pages, it could solve the URL issue perhaps.
This blackhat world thread has a few options too.
-Dan
-
Hi Dan, I'm not so experienced in migrating a WP to non -wp but I understand that the issue you're having is that the spider is returning index.htmlfiles for urls like domain/page/.
IT's normal, any spider you will use you'll always have and index.html file. Every directory has it's index.html which is the default file to show if you're not establishing something different with rewrite rules.
If you write /page/ the browser will read the index.html file. What you have to be sure is that you'll set up a 301 redirect to avoid any index.html url to show and have it redirected to the main / page (with wildcards is a one line rule) and that your internal links are pointing all to / pages and not to index.html version of it. You can jsut find and replace the /index.html" string into the html code with the /" text (dreamweaver or any html editor will do that in bulk.
Only one commentary on you idea is that you may consider useful to build a php driven site, using includes for header, footer and nav/sidebar, jsut because thinking ahead if you're willing to make changes to a portion of the page repeating throughout the site you'll have to make changes in all pages and uplaod them all which is quite huge to do and also let space for many human/machine errors.
Hope that helped you out!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Unexplainable 404 in my Wordpress Monitor
Hi Guys, We started a new website (far from complete) and i'm fixing some SEO problems. I have a SEO 404 monitor plugin and he gives me strange 404's that i can't find: https://compleetverkleed.nl/feestkleding/www.compleetverkleed.nl/feestkleding
Technical SEO | | Happy-SEO
https://compleetverkleed.nl/rode-peper-pak/www.compleetverkleed.nl/feestkleding
https://compleetverkleed.nl/new-kids-kleding/www.compleetverkleed.nl/feestkleding
https://compleetverkleed.nl/fee-kostuum/www.compleetverkleed.nl/feestkleding The 404 error code comes due to a double domain URL, and it happens only on the /feestkleding subdomain. When i check my original page, for example https://compleetverkleed.nl/rode-peper-pak/ there is only 1 internal link to /feestkleding but when i click on it, it works fine. Does anyone know how to fix this problem or where to look to see why this is happening? (it happens on every page, not the 4 above > starting to come through in webmastertools also) 2.
https://compleetverkleed.nl/pa_categorie-sitemap.xml
https://compleetverkleed.nl/product_shipping_class-sitemap.xml
https://compleetverkleed.nl/featured_item_category-sitemap.xml
https://compleetverkleed.nl/post_tag-sitemap.xml
https://compleetverkleed.nl/featured_item_tag-sitemap.xml This 404's are completely new for me and i don't know where to find this 😞 Thanks!2 -
If I want clean up my URLs and take the "www.site.com/page.html" and make it "www.site.com/page" do I need a redirect?
If I want clean up my URLs and take the "www.site.com/page.html" and make it "www.site.com/page" do I need a redirect? If this scenario requires a 301 redirect no matter what, I might as well update the URL to be a little more keyword rich for the page while I'm at it. However, since these pages are ranking well I'd rather not lose any authority in the process and keep the URL just stripped of the ".html" (if that's possible). Thanks for you help! [edited for formatting]
Technical SEO | | Booj0 -
Redirects in site map
I have a site with the ace/sef ( creates friendly URLS) in a large data base site. It creates a site map dynamically. Yet I realize one issue which I am trying to think through. I recently changed my urls to include an ID number example: homepage/houses/1134-big-blue-house The prior url was: homepage/houses/big-blue-house the original url above redirects to the new one with the ID like I want. However the site map has both URLS in it which go to same page I am not sure but it seems rather stupid to have the new URL and OLD redirected URL in the site map. Yet beside stupid I am wondering if this is duplicate content and will cause a penalty from the google bot. What is your opinion ?
Technical SEO | | aimiyo0 -
Want to Target Mobile site for Google Mobile Version and Desktop Site for Google Desktop Version
I have ecommerce site with both mobile version and desktop version. Mobile version starts with m.example.com and full version starts with www.example.com I am using same content through out both site and using 301 redirection by detecting user agent vice-versa. My both sites are accessible to crawl by any google spider. I have submitted both sites's sitemap to GWT and mobile site having mobile sitemap xml, so google can easily recognize my mobile site. Is it going to help to rank my both sites as per my expectation? I need to rank for mobile site in Google mobile and ranking for desktop site in Google desktop version. Some of pages of my mobile site are started to appearing in Google desktop version. So how I can stop them to appear in Google desktop? Your comments are highly welcome.
Technical SEO | | Hexpress0 -
I Bought AD on a site, am i being penalized for it
Hello... I have bought two ads on the site : http://goo.gl/nXnRA If you look on the right hand column, you will see the top two images under the word sponsors are mine.... Am i being penalized for this? are these ads frowned upon by Google? I need guidance as my ranking for the words being used as that alt text have tanked, badly and i also received a letter from WMT about unnatural links. If that site, and the ads (not links) i bought are fine with google, please advise what you think it may be... my site is : http://goo.gl/JgK1e Please help... Thanks
Technical SEO | | Prime850 -
About Google Spider
Hello, people! I have some questions regarding on Google spider. Many people are saying that "Google spiders only have US IP address." Is this really true? But I also saw video from Google's offical blog and it said "Google spider come from all around the world." At this point I am really confused. Q1) I researched and it seems like Google spiders have only US IP address. THen what does exactly mean by "Google spider come from all around the world."? Q2) If Google spider have only US IP address, what happen to site which use IP delivery? Is this means that Google spider always redirect to us site since they only have US IP? Can anyone help me to understand?? One more questions! When Google analyzing for cloaking issue, do you think Google analyze when spider crawls the site or after they crawled the site?
Technical SEO | | Artience0 -
How to setup tumblr blog.site.com to give juice to site.com
Is it possible to get a subdomain blog.site.com that is on tumblr to count toward site.com. I hoped I could point it in webmaster tools like we do www but alas no. Any help would be greatly appreciated.
Technical SEO | | oznappies0 -
Disabling Wordpress Attachment Posts
having issues with every media (photo) we upload creating an additional page on the blog that is being indexed in Google. ie. post: www.mysite.com/ creates this for the photo used: www.mysite.com/blog/category/image-name/attachment/800px/
Technical SEO | | bozzie3110