Thursday 30 May 2013

Extract data from web pages

Web data extraction is the process of taking data from web pages and converting the unstructured results into an Excel file or a database.

Using its automatic navigation, WinTask can load the target URL, send a UserId and an encrypted password (if it's a secure site), conduct searches, and navigate to the different pages whose field contents are to be extracted. Internet Explorer, Mozilla Firefox and Google Chrome are the supported browsers.

Examples:

    Aggregate Real Estate Info
    Automate your favorite searches on eBay and extract prices
    Clip News Articles
    Extract Gambling Odds
    Create Alerts from Trading Sites
    Build your personal Product Catalog gathering information from several websites
    Automate Search Ad Listings
    Collect Data from Competitors

Are you wasting too much time and money manually collecting information from the Internet? WinTask is able to collect data from anywhere on the Internet and extract selected content from the targeted web pages.

Loops, conditions, and error handling allow tremendous flexibility and ensure reliable extraction even when running unattended, such as overnight.

In addition, you can use WinTask in the opposite way: it can read the content of an Excel file or a database, automatically fill a Web form, click the Submit button and keep looping until the last record is read.
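
WinTask's own script language is not shown on that page, but as a rough illustration of the same read-a-record, fill-the-form, submit loop, here is a hypothetical Python sketch using openpyxl and Selenium (my assumptions, not WinTask's API; the URL, file name and field names are placeholders):

    # Sketch: read rows from an Excel file and submit each one through a web form,
    # looping until the last record is read. Everything named here is a placeholder.
    from openpyxl import load_workbook
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()                 # Chrome or Edge drivers work too
    ws = load_workbook("records.xlsx").active

    for name, email in ws.iter_rows(min_row=2, max_col=2, values_only=True):
        driver.get("https://example.com/contact-form")
        driver.find_element(By.NAME, "name").send_keys(name)
        driver.find_element(By.NAME, "email").send_keys(email)
        driver.find_element(By.NAME, "submit").click()

    driver.quit()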


Source: http://www.wintask.com/web-data-extraction.php

Monday 27 May 2013

What is Web Scraping, and What are the Legal Issues in it?

Spiders, bots and crawlers: Oh my! Almost everyone who has a website or an online social media profile is a victim of web scraping, whether they know it or not.


What is web scraping?

Web scraping is a technique used by web crawlers, or bots, to extract information from websites. Collected information is transformed into structured data that can be stored and analyzed, typically in a database or spreadsheet. This technology drives a substantial amount of business, and many companies’ viability relies on it. However, controversy can arise when commercial companies use scraping software to collect substantial amounts of data from websites for their own profit.


When does web scraping violate a website’s terms of use?

Web scraping is used for many reasons, including price comparisons and targeted advertising. Websites often prohibit scraping through their terms of use. There are two types of terms of use online: clickwrap and browsewrap. Clickwrap terms require the user to click to agree to the terms of use. Browsewrap terms are simply listed on the website, without requiring any action. With browsewrap terms, then, if the user never saw the terms of use, no contract was formed because there was no ‘meeting of the minds’.


What are legal risks for businesses that use web scraping?

Companies using web scraping can be subject to legal risks, but under current law it is unclear what crawlers can and cannot do. A major open question is whether breaking a website’s terms and conditions while scraping information amounts to trespass to chattels or breach of contract. Some website owners’ claims have been viable in these situations, so there is a risk. A scraper that uses the scraped information commercially will likely be subject to more liability. Additionally, if the scraper collects copyrighted information, its operators could be liable for infringement.


Can web scraping give rise to a trespass to chattels action?

Trespass to chattels is a tort claim arising when a party has intentionally interfered with another person's lawful possession of movable personal property. Because trespass to chattels has traditionally included dispossession of the property by taking it, destroying it, or barring the owner's access to it, it has been argued in the digital age that websites themselves can be treated as chattels.

In eBay v. Bidder’s Edge, a notable case that treated scraping as “illegal data mining”, a California court held that Bidder’s Edge’s thousands of queries a day, electronic signals retrieving information from eBay’s system, were sufficiently tangible to support a trespass claim. eBay had not actually suffered any injury or harm from the trespass; while the court acknowledged this, it stated that eBay was not required to wait until it suffered harm before seeking an injunction.

The Supreme Court of California interpreted the eBay decision further in Intel v. Hamidi, stating that showing a risk of future harm substantiated claims of Internet trespass to chattels. Accordingly, to determine if there is a substantial likelihood of future harm, a court should look to the volume or frequency of interferences. Subsequent courts in other jurisdictions have applied this analysis, requiring that the plaintiff demonstrate damage, or substantial risk of future damage, to their computer system. Thus, the degree of protection for online content is not settled, and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner’s system, and the types and manner of prohibitions on such activity.


Can web scraping give rise to a breach of contract action?

In regard to breach of contract claims for violating a site’s terms of service, the United States Court of Appeals for the Second Circuit held in Specht v. Netscape Communications Corp. that terms of use are not enforceable if there is not reasonable notice of the existence of the terms and unambiguous consent to that license. Merely clicking on a button does not show assent to license terms if those terms were not obvious and if it was not explicit to the consumer that clicking meant agreeing to the license. California courts went on to determine in Ticketmaster Corp v. Tickets.com, Inc. that a hyperlink to the terms of use placed in the footer of a web page does not constitute prominent notice of those terms. However, if the terms are prominent, then a user will be held to the terms on inquiry notice.

Clickwrap agreements seem to carry more weight, as a Texas state court found grounds of trespass to chattels and breach of contract in American Airlines, Inc. v. Farechase, Inc. The court enforced an 'if you use this site, you agree' terms of service statement on American Airlines websites, and enjoined software company Farechase from accessing and scraping data to redistribute and sell it to travel agents and online travel systems.


Can web scraping give rise to a copyright infringement action?

Scraping, generally, raises some copyright law issues. Visiting a copyrighted website temporarily for the purpose of extracting factual information and reproducing it does not violate the website owner’s copyright; factual information extracted in this way falls under the fair use provisions of the Copyright Act. However, if the extracted information is itself a copyrighted work, the scraper may be subject to copyright liability.

The Digital Millennium Copyright Act of 1998 was enacted to control and regulate copyright issues in a technological world. Section 1201(a)(1) of the DMCA states “No person shall circumvent a technological measure that effectively controls access to a work protected under this title.” This provision speaks to web scraping, particularly when bots avoid measures that website owners make to protect their content.


What types of damages have been awarded in previous web scraping cases?

Typically, injunctions are the only remedies sought: to receive damages, a plaintiff in a case against a scraper must show substantial harm, and courts have generally awarded injunctions rather than damages in web scraping cases.


It seems that legal liability turns on the design and nature of the crawled websites more than on the actual activity of the crawler itself. If your business operates scraping technology, be wary of what you crawl! If you operate a website, check out these tips to create an effective user policy.


Source: http://www.attorneyfee.com/what-web-scraping-and-what-are-legal-issues-it/2012

Friday 24 May 2013

Extract Web Data

More and more Internet users are asking how to extract web data. Plenty of services have appeared on the Internet offering to extract web data for customers, but only a few software products (like the WebSundew web extraction tool) can fully meet customers' needs in scraping web data.

Extracting web data becomes necessary if you work in e-commerce, or if you want to build your own e-shop and need to collect lots of information on some category of products from the World Wide Web. How does a data extraction service work? You send the target website you need data extracted from and receive the data in Excel (or some other) format. If you use a web scraping software tool instead, you can run it on your own computer and use lots of advanced features which make the extraction process easy, simple and fully secure.

With the WebSundew web scraping tool you can extract web data with high accuracy and speed. You can automate the whole copy-and-paste process by simply setting up an agent which will do all the work for you. You can change the agent's settings and collect different kinds of information from a great variety of websites. It is easy to extract web data and store it in Excel, CSV, XML, a database or any text format. It is also possible to extract images and files, perform multi-level extraction, schedule extractions, etc.
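
WebSundew itself is a point-and-click tool, but as a rough idea of what such an agent does under the hood, here is a minimal Python sketch (requests and BeautifulSoup are my assumptions; the URL and CSS selectors are hypothetical) that collects product names and prices into a CSV file:

    # Sketch: fetch a page, pull out product names and prices, save them as CSV.
    # The URL and the CSS classes are hypothetical placeholders.
    import csv

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/products", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        for item in soup.select("div.product"):
            name = item.select_one("h2.title").get_text(strip=True)
            price = item.select_one("span.price").get_text(strip=True)
            writer.writerow([name, price])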

So if you have the WebSundew data extraction tool, you no longer need to ask how to extract web data; instead you ask which data you want and in what format, how many times a day or week it should be extracted automatically, whether you want a notification when the extraction process completes, and whether you want programmatic access to WebSundew so you can embed it in some other software product. In short, the most effective way to extract web data is to purchase your own web scraping tool, which will always be with you in any place and at any time.

Source: http://www.websundew.com/extract-web-data

Friday 17 May 2013

How to extract and restore website files, databases manually from Plesk backup

Steps to extract and restore website files, databases manually from Plesk backup
I. FIRST WAY:

If the dump file is not too big, for example 100-200 MB, you can unzip it and open it in any local email client. The parts of the dump will be shown as attachments. Choose and save the one you need, then unzip it.

II. SECOND WAY:

It can be done using the mpack tools for working with MIME files. This package is included in Debian:

    apt-get install mpack

For other Linux systems you can try the RPM from ALT Linux: ftp://ftp.pbone.net/mirror/ftp.altlinux.ru/pub/distributions/ALTLinux/Sisyphus/files/i586/RPMS/mpack-1.6-alt1.i586.rpm

or compile mpack from the sources: http://ftp.andrew.cmu.edu/pub/mpack/.

Create an empty directory in which to extract the backup file:

     mkdir recover
     cd recover

and copy the backup into it. By default the Plesk backup is gzipped (if not, use cat instead of zcat), so run zcat to decompress it and pass the data to munpack, which extracts the contents of the directories from the backup file:

    zcat DUMP_FILE.gz > DUMP_FILE
    cat DUMP_FILE | munpack

As a result you get a set of tar and SQL files that contain the domains’ directories and databases. Untar the one you need. For example, if you need to restore the httpdocs folder for the DOMAIN.TLD domain:

    tar xvf DOMAIN.TLD.htdocs

NOTE: the ‘munpack’ utility may not work with files larger than 2 GB, and during dump extraction you may receive an error like:

    cat DUMP_FILE | munpack
    DOMAIN.TLD.httpdocs (application/octet-stream)
    File size limit exceeded

In this case try the next way below.

III. THIRD WAY:

First, check if the dump is compressed or not and unzip if needed:

    file testdom.com_2006.11.13_11.27
    testdom.com_2006.11.13_11.27: gzip compressed data, from Unix

    zcat testdom.com_2006.11.13_11.27 > testdom.com_dump

The dump consists of an XML part that describes what is included in the dump, plus the data itself. Every piece of data can be located by its CID (Content ID), which is listed in the XML part.

For example, if the domain has hosting, all parts that are included in the hosting are listed like this:

<phosting cid_ftpstat="testdom.com.ftpstat" cid_webstat="testdom.com.webstat" cid_docroot="testdom.com.htdocs" cid_private="testdom.com.private"
cid_docroot_ssl="testdom.com.shtdocs" cid_webstat_ssl="testdom.com.webstat-ssl" cid_cgi="testdom.com.cgi" errdocs="true">

If you need to extract the domain’s ‘httpdocs’, look for the value of the ‘cid_docroot’ parameter; it is ‘testdom.com.htdocs’ in our case.

Next, cut the content of ‘httpdocs’ from the whole dump using the CID you found. To do this, find the line number where the content begins and the line number where it ends, like this:

    egrep -an '(^--_----------)|(testdom.com.htdocs)' ./testdom.com_dump | grep -A1 "Content-Type"
    2023:Content-Type: application/octet-stream; name="testdom.com.htdocs"
    3806:--_----------=_1163395694117660-----------------------------------------

Add 2 to the first line number and subtract 1 from the second, then run:

    head -n 3805 ./testdom.com_dump | tail -n +2025 > htdocs.tar

As a result, you get the tar archive of the ‘httpdocs’ directory.

If you need to restore a database, the procedure is similar. Find the database’s XML description for the domain you need, for example:

<database version="4.1" name="mytest22" cid="mytest22.mysql.sql" type="mysql">
<db-server type="mysql">
<host>localhost</host>
<port>3306</port>
</db-server>
</database>

Find the database content by CID:

    egrep -an '(^--_----------)|(mytest22.mysql.sql)' ./testdom.com_dump | grep -A1 "Content-Type"
    1949:Content-Type: application/octet-stream; name="mytest22.mysql.sql"
    1975:--_----------=_1163395694117660-----------------------------------------

Again, add 2 to the first line number and subtract 1 from the second, then run:

    head -n 1974 ./testdom.com_dump | tail -n +1951 > mytest22.sql

As a result you get the database in SQL format.
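
As an alternative to munpack and the manual head/tail arithmetic, the dump is an ordinary MIME multipart file, so it can also be walked with Python's standard email module. A minimal sketch, using the same example file and CID as above:

    # Sketch: pull one named part (here a domain's htdocs tarball) out of a
    # MIME-formatted Plesk dump using only the Python standard library.
    # Note: this reads the whole dump into memory, so it suits smaller dumps.
    from email import message_from_binary_file, policy

    WANTED = "testdom.com.htdocs"        # the CID / attachment name to extract

    with open("testdom.com_dump", "rb") as f:
        msg = message_from_binary_file(f, policy=policy.default)

    for part in msg.walk():
        name = part.get_filename() or part.get_param("name")
        if name == WANTED:
            with open("htdocs.tar", "wb") as out:
                out.write(part.get_payload(decode=True))
            break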

Source: http://linuxadministrator.pro/blog/?p=436

Monday 6 May 2013

Houses For Sale | How Import.io Will Help Reporters Extract Data From Web Pages

The free online tool called import.io (pronounced import-eye-oh) will let you extract large amounts of data from a web page into an Excel spreadsheet.

For example, you could go to an estate agent’s website, find details on houses for sale, and extract the data to a table, defining which column headers, such as house price and location, you want to collect the data for.

The tool will also allow you to aggregate various sources of data. For example, you could extract data on house prices from 10 different estate agent websites, pulling the information into a single table.

The idea is to “democratise” data, Sally Hadadi from import.io told Journalism.co.uk. “Big data is extremely messy and hard to get a hold of in a simple, easy manner. import.io aims to solve this problem and make big data available to everyone with a simple, easy-to-use interface. We turn the web into a database, allowing you to extract data from websites into rows and columns, normalising the selected information.”

She added: “We want journalists to get the best information possible to encourage and enhance unique, powerful pieces of work and generally make their research much easier.”

import.io is currently in private developer testing and set to launch at some point this year.

The tool will be offered free of charge with import.io looking to monetise by charging those who pull in high volumes of data.

The London-based team behind import.io first created a tool aimed at banks that allows for the searching and analysis of online and internal data.

Chris Alexander, a developer from import.io, gave a lightning pitch at last week’s Hacks/Hackers London.

Disclaimer: I help organise Hacks/Hackers London monthly meet-ups.

Source: http://houses.mystery-shopper.org/houses-for-sale/houses-for-sale-how-import-io-will-help-reporters-extract-data-from-web-pages/

Wednesday 1 May 2013

Extract Data from a Web Page into an Excel Spreadsheet

Web Queries are a simple but extremely powerful feature of Microsoft Excel that lets you import live data from external websites into your Excel sheets: all you have to do is visually select portions of a web page in the browser and Excel will do the rest.

With Excel web queries, you can import information like Google search results, the latest CNN headlines, stock quotes, currency exchange rates or even monitor regular websites for changes.

Getting data from Web Pages into Excel

Click the “From Web” command in the Data -> Get External Data group. A new Web Query dialog pops up; type the web URL here (for example, a Google News URL).

Click the yellow arrows next to the tables you would like to bring into Excel, then click Import.

You can do a similar thing using IE: just navigate to the web page that has the data and select “Export to Excel” from the Internet Explorer context menu.

Once the data is inside Excel, you can do all sorts of complex things like conditional formatting, sorting, creating charts, etc. You can either keep the data static or set it to auto-refresh so that Excel automatically updates the worksheet whenever the source web page changes.

Microsoft also provides another Web Data add-in for Excel 2007 (link) that’s even more intelligent. You make a few selections on a web page and it will automatically recognize other portions that match your pattern.

For instance, I just selected the top result on a Google Web search page and clicked the “Select Similar” button. It recognized the titles of all the other web pages appearing in the search results and imported them into Excel.

Now I can just open that Google-to-Excel spreadsheet, click refresh and know almost instantly whether the rankings of web pages have changed. As with RSS and Yahoo! Pipes, there could be many creative uses of Excel Web Queries.
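
If you would rather get the same result from a script than from the Excel UI, a minimal Python sketch using pandas (my assumption; the article itself relies only on built-in Excel features) pulls the tables from a URL into a spreadsheet much as a web query does:

    # Sketch: fetch every HTML table on a page and save the first one to Excel.
    # The URL is a placeholder; read_html needs lxml (or html5lib) installed,
    # and to_excel needs openpyxl.
    import pandas as pd

    tables = pd.read_html("https://example.com/quotes")   # one DataFrame per table
    print(f"Found {len(tables)} table(s) on the page")

    tables[0].to_excel("quotes.xlsx", index=False)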

Source: http://www.labnol.org/software/tutorials/extract-data-from-web-pages-into-excel/1979/

Note:

Jazz Martin is an experienced web scraping consultant and writes articles on web data scraping, website data scraping, data scraping services, web scraping services, website scraping, eBay product scraping, forms data entry, etc.

Extract Options From Dropdown List Extracted From Website

Search results:

Data Extraction - iMacros
Text to be extracted: [EXTRACT] Salary: 33,000.00 per year ... In order to extract all options in a drop down list use TAG POS=1 TYPE=SELECT ATTR=TXT:*&&NAME: ...
wiki.imacros.net/Data_Extraction

JSP: How to Extract values from dropdown | DaniWeb
Hi, i want to know how to extract values from drop down box for processing in next field. In my file first drop down has list of countries taken from a database. I ...
www.daniweb.com/web-development/jsp/threads/48968

Excel Magic Trick 698: Extract Unique Items w Formula For ...
See how to create an Expanding Data Validation Drop-down List from Table ... Unique List Data Extract Formula ... design--->table style options---->and ...
www.youtube.com/watch?v=IhuURsu0jdI

Extraction Rule: Extract the value for an option in the ...
Hi All, I have a databound dropdown control in my page. How do I extract the value corresponding to a specific option in the dropdown list? I need to write ...
social.msdn.microsoft.com/Forums/.../vstswebtest/thread/...

Use jQuery to extract data from HTML lists and tables » Encosia
... (of dropdown, textbox and cell text) ... I am using the online API for data extraction. It enables me to extract data with less and simpler code.
encosia.com/use-jquery-to-extract-data-from-html-lists...

Chrome Web Store - Export HTML List Options
Exports the options for all HTML select elements (dropdown lists and list boxes) as text from a web page.
chrome.google.com/webstore/detail/export-html...

need help with complex extraction rule - Visual Studio
I'm attempting to create a performance test that performs a search on the website, then changes the sort options from a dropdown menu randomly. Cur ...
www.go4answers.com/...complex-extraction-rule-59766.aspx

Excel :: Extract List From Combobox
... only the second word found would be extracted and pasted in the new list. ... criteria & extract ranges) There are multiple options ... (dropdown list) ...
excel.bigresource.com/Extract-list-from-Combobox...

How do I extract excel data from multiple worksheets and put ...
... , I want to extract rows which have the word CHEQ in ... (this is a dropdown list with two options - CHEQ/PAID) ... start on cell A12 and the extracted records from ...
www.mrexcel.com/forum/...do-i-extract-excel-data-multiple...

python - Extract Options From Dropdown List Extracted From ...
I've been trying (unsuccessfully) to solve this problem for a few hours and need some help. I used Firebug to extract a couple hundred lines of HTML that look like this:
stackoverflow.com/questions/14448925/extract-options...
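
All of the results above circle one basic task: pulling the <option> entries out of a <select> element. As a minimal illustration of that task (using BeautifulSoup, my own assumption rather than anything taken from the results above):

    # Sketch: list the value and label of every option in a dropdown.
    # The HTML snippet stands in for whatever Firebug or a fetched page gives you.
    from bs4 import BeautifulSoup

    html = """
    <select name="country">
      <option value="us">United States</option>
      <option value="uk">United Kingdom</option>
      <option value="in">India</option>
    </select>
    """

    soup = BeautifulSoup(html, "html.parser")
    for option in soup.select("select[name=country] option"):
        print(option.get("value"), option.get_text(strip=True))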

Source: http://vibhu-university.blogspot.in/2013/04/extract-options-from-dropdown-list.html

Note:

Jazz Martin is an experienced web scraping consultant and writes articles on web data scraping, website data scraping, web scraping services, data scraping services, website scraping, eBay product scraping, forms data entry, etc.