How I could steal all your content from your RSS
As I documented in my tutorial on scraping website content with PHP using curl, it's very easy to get the full generated source of a website. Take the script discussed in the tutorial, load in a URL (http://www.thepcspy.com for example) and you have the full code for that page. Content and all.
Yes, a full RSS feed does make life easy for the thief because you're providing all your content right there for them. However, by definition your RSS feed has a URL in it and a description - that's probably enough.
RSS feeds are structured in a specific way (theoretically). Each item in your feed will have a "link" element as well as a limited "description" element, and these are where we can do the damage. Your description is almost certainly the start of your content and the link is where we can find it.
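To make that concrete, here is a rough sketch of pulling the link and description out of each feed item. It is written in Python rather than the PHP from the tutorial, and the feed contents, the example.com URLs and the `feed_items` name are all made up for illustration. It assumes a plain RSS 2.0 feed with no namespaces on the item fields.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, inlined so the sketch runs without a network call.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <description>This is how the first post begins</description>
    </item>
  </channel>
</rss>"""


def feed_items(feed_xml):
    """Return a list of (link, description) pairs, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        link = item.findtext("link", default="").strip()
        description = item.findtext("description", default="").strip()
        items.append((link, description))
    return items


for link, description in feed_items(SAMPLE_FEED):
    print(link, "->", description)
```

Real feeds are messier than this (hence the "theoretically"), so anything relying on it would need to cope with missing fields and HTML inside the description.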
Using the script outlined in the tutorial listed above I can curl the contents of the URL you've supplied. All I need to do then is look through the page for something that indicates the start of the article (such as the text provided in the description element of the RSS feed).
I can now happily curl your content, search for a specific string of characters, and find the start of your content. The real trick is knowing where it ends - but that's not necessarily a big issue because a thief doesn't really *need* to steal ALL your content. If I can take half of it and get it indexed by the search engines before you, then I win.
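The "search for a specific string" step could look something like the sketch below, again in Python rather than PHP. The sample page, the `article_from` name and the 40-character probe length are assumptions of mine for illustration. It flattens the page to text, then looks for the opening characters of the feed description.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(html):
    """Flatten HTML to text with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def article_from(html, description, probe_len=40):
    """Return the page text from where the description starts, or None."""
    text = page_text(html)
    # Only the first few characters are used, since feed descriptions
    # are often truncated part-way through a sentence.
    probe = " ".join(description.split())[:probe_len]
    if not probe:
        return None
    start = text.find(probe)
    return text[start:] if start != -1 else None


# A made-up page standing in for the curled source.
SAMPLE_PAGE = (
    "<html><body><div id='nav'>Home | About</div>"
    "<div><p>This is how the <em>first</em> post begins, "
    "and then it carries on.</p></div></body></html>"
)

print(article_from(SAMPLE_PAGE, "This is how the first post begins"))
```

Note that this only finds the start. Everything after the article (comments, footer, and any script or style text) comes along too, which is exactly the "where does it end" problem.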
This is just a passing thought about how thieves could be a bit more "intelligent" about their theft. Most of the ones I see just copy and paste from my site. It'd be more flattering if they used their brains and wrote a PHP script to scrape the URLs from my feed, tokenize the content, match the start against the description, pull the content into a database and publish it out. They'd still get a cease and desist but at least I'd have a little more respect for them.