One element of content scraping is legality. If you are using the material that you scrape for private, internal purposes, then there isn’t much to worry about. However, if you seek to re-post the material that you scrape, such as material from blogs, you must keep copyright law in mind.
Almost every country has its own copyright laws, and usually you will be subject to the laws of the country in which your site’s server is located. So, if your server is located in the United States, you are subject to the Digital Millennium Copyright Act (DMCA), regardless of where you live.
For the purposes of this article, we’ll focus on the DMCA, as a large share of servers are located in the United States. However, the same general principles apply in most other places.
The main concern you need to keep in mind is fair use. For example, you can’t re-post someone’s entire article, because the author can have it removed through a DMCA takedown notice, causing potential problems for you, including the loss of your site. The best rule of thumb is to post an excerpt, usually no more than 60 words, and to include a link to the source blog or website. This way, the website operator enjoys the benefit of added traffic, and you greatly reduce the risk of running afoul of copyright law.
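The 60-word rule of thumb above is easy to enforce in software. The following is a minimal sketch in Python; the function name make_excerpt and the trailing " ..." marker are illustrative choices, not part of any particular scraping tool.

```python
def make_excerpt(text: str, max_words: int = 60) -> str:
    """Return at most max_words words of text.

    Whitespace is normalized to single spaces. If the text had to be
    cut, " ..." is appended so readers can see it is only an excerpt.
    """
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."
```

Calling make_excerpt(article_text) on a long post yields the first 60 words followed by the marker, while a post that is already short passes through unchanged.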
If your Web scraping campaign pulls from feeds, much of the hard work may already be done for you, because most blog authors set their feeds to include only excerpts. However, if you use another method, or if your target blog publishes entire articles in its feed, you will need to adjust your scraping software accordingly or edit the articles manually, the latter obviously being the less desirable option.
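As one way to adjust the software rather than edit by hand, the sketch below reads an RSS 2.0 feed and trims any entry that is longer than the excerpt limit. It assumes the feed uses the standard item, title, link and description elements and that the description is plain text; Atom feeds and descriptions containing HTML markup would need extra handling. The function name excerpts_from_rss and the "trimmed" flag are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

MAX_WORDS = 60  # the rule-of-thumb excerpt length used in this article


def excerpts_from_rss(rss_xml: str, max_words: int = MAX_WORDS) -> list[dict]:
    """Parse RSS 2.0 text and return one record per <item>.

    Each record holds the title, the link back to the source, an
    excerpt of at most max_words words, and a flag telling whether
    the feed entry had to be trimmed.
    """
    root = ET.fromstring(rss_xml)
    records = []
    for item in root.iter("item"):
        words = item.findtext("description", default="").split()
        trimmed = len(words) > max_words
        excerpt = " ".join(words[:max_words]) + (" ..." if trimmed else "")
        records.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "excerpt": excerpt,
            "trimmed": trimmed,
        })
    return records
```

Entries whose "trimmed" flag is False were already excerpts in the feed, so only full-article feeds are actually shortened.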
You may be tempted to say that there’s not much risk in violating someone’s rights under copyright law. However, you could expose yourself to civil penalties, as well as the loss of goodwill. The author of the content in question could accuse you of stealing his or her intellectual property, causing a possible backlash of negative publicity. Simply put, it’s not worth it.
By adhering to copyright law, you put yourself in a win-win situation. The author of the content wins, because he or she gets more exposure. You win, because you have content for your site, and you can collect revenue from banner ads and the like in good conscience. Besides, posting only an excerpt from each article leaves you room for more articles, allowing you to present a wider variety of content to your site’s visitors.
Remember to always give credit. No matter how much of the content you choose to post, always include a link back to the original article at the very least. Not only is this common courtesy, but it also reduces the chances of your being sued. When you re-post the content you scrape, you always put yourself at risk of legal action unless you secure permission ahead of time. By giving proper credit, and by posting only a small excerpt, you can reduce the chances of negative results considerably.
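One way to make sure the credit is never forgotten is to generate it together with the excerpt. The sketch below builds a small HTML snippet that quotes the excerpt and links back to the original article. The function name render_credited_excerpt and the exact markup are assumptions for illustration; adapt them to your own site’s templates.

```python
from html import escape


def render_credited_excerpt(excerpt: str, title: str, url: str) -> str:
    """Return an HTML snippet: the quoted excerpt plus a source link.

    All values are HTML-escaped so that scraped text cannot inject
    markup into the page.
    """
    safe_url = escape(url, quote=True)
    return (
        f'<blockquote cite="{safe_url}">{escape(excerpt)}</blockquote>\n'
        f'<p>Source: <a href="{safe_url}">{escape(title)}</a></p>'
    )
```

Because the link is produced by the same function that outputs the excerpt, every re-posted item carries its attribution automatically.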