Tuesday 30 December 2014

Web Data Scraping Services At Lowest Rate For Business Directory

We are the world's most trusted provider of business directory data scraping and email scraping, and we deliver the data you need. We can scrape an entire directory database of doctors, lawyers, brokers, financial advisers, and so on, or scrape a particular industry category by category, and the resulting data can be adapted to your needs.

We are pioneers in worldwide web scraping and data services. We understand the value of a customer database, and we make every effort to collect email IDs. We scrape data on lawyers, doctors, brokers, realtors, schools, students, universities, IT managers, pubs, bars, nightclubs, dance clubs, financial advisers, liquor stores, Facebook, Twitter, pharmaceutical companies, mortgage brokers, accounting firms, car dealers, artists, health shops and job portals.

Our business database development services aim to deliver real quality at the lowest rates in the industry. We have a quick turnaround time for building a business mailing database.

Have you tried, with little success, to gather specific information or content and organize it in a directory yourself? You no longer need to worry: our website search and data processing services are the best solution for your problem.

We are currently in an "information explosion" phase, where there is a huge amount of information and content about even a single event or a small group of channels.

Without order, that information is of little benefit to you and your customers. You need the information and material organized in a way that is easy to use. For something as small as a business guide, a separate directory can be created in less than an hour.

Our technology builds a web database specific to you, with a configuration suited to your database development. In addition, our services can help you identify the web pages to follow as sources of information. This is a cost-effective way to create a database.

For a directory database, we capture company name, address, state, country, phone, email and website URL, as in recent projects we have completed. We have a quick turnaround time for building a business mailing database, and our business database development services aim to deliver real quality at the lowest rates in the industry.

Source:http://www.articlesbase.com/outsourcing-articles/web-data-scraping-services-at-lowest-rate-for-business-directory-5757029.html

Sunday 28 December 2014

What Kind of Legal Problems Can Web Scraping Cause

Web scraping software is readily available and has been used by many for legitimate purposes. It has also been used for illegal purposes. A website that engages in this practice should know the legal dangers of the activity.

The idea of web scraping is not new. Search engines have used this type of software to determine which results appear when someone conducts a search. They use special software to extract data from a website, and this data is then used to calculate the rankings of the website. Websites work very hard to improve their ranking and their chance of being found by anyone making a search. This use of the practice is understood and is considered to be a legitimate use for the software. However, there are services that provide web scraping and screen scraping prevention and help the webmaster stay safe from attacks by bad bots.

The problem with scraping is that it is often used for less than legitimate reasons. Since the software responsible can collect all sorts of data from websites and store the information that is collected, it represents a danger to anyone who might be affected by it. The information that can be collected can be used for many practices that are not so legitimate and may even be illegal. Anyone who is involved in duplicating content this way should be aware of the legal issues implicated with this practice. It may be wise for anyone who has a website to find ways to prevent a site from being scraped or to use professional services to block site scraping.

Legal problems

The first thing to worry about, if you have a website or are using web scraping software, is when you might run into legal problems. Some of the issues that web scraping can cause include:

•    Access. If the software is used to access sites it does not have the right to access and takes information that it is not entitled to, the owner of the web scraping software may find themselves in legal trouble.

•    Re-use. The software can collect and reuse information. If that information is copyrighted, that might be a legal problem. Any information that is reused without permission may create legal issues for anyone who uses it.

•    Robots. Some states have enacted laws that are designed to keep people from using scraping robots. These automatically search out information on websites and using them may be illegal in some states. It is up to the user of the web scraping software to comply with any laws in the state in which they are operating.

Who is Responsible

The laws and regulations surrounding this practice are not always clear. There are many grey areas that allow this practice to occur. The question is, who is responsible for determining whether the use of web scraping software is legal?

Websites collect the information, but they may not be the entity using the web scraping software. If they are using this type of software, it is not always enough to inform the website's visitors that this practice is occurring. Putting this information into the user agreement may or may not protect the website from legal problems.

It is also partly the responsibility of a site owner to prevent a site from being scraped. There is software that can be used that will do this for a website and will keep any information that is collected safe and secure. A website may or may not be held legally responsible for any web scraper that is able to collect information they have. It will depend on why the data was collected, how it was used, who collected it, and whether precautions were taken.

What to expect

The issue of content copying and the legal issues surrounding it will continue to evolve. As more courts take on this issue, the lines between legal and illegal web scraping will become clearer. Many of the cases that have been brought to court have occurred in civil court, although there are some that have been taken up in a criminal court. There will be times when such practice may actually be a felony.

Before you use scraping software, you need to realize that the laws surrounding its use are not clear. If you operate a website, you need to know the legal issues that you may face if scraping software is used on your website. The best step is to use the software available to protect your website and stop web scraping, and to be honest on your site if web scraping is used.

Source: http://www.articlesbase.com/technology-articles/what-kind-of-legal-problems-can-web-scraping-cause-6780486.html

Wednesday 24 December 2014

Central Qld Coal: Mining for Needed Investments

The Central Qld Coal Project is situated in the Galilee Coal Basin, Central Queensland, with the purpose of establishing a mine to service international export markets for thermal coal. The estimated cost of the project is around $7.5 billion, an amount that shows how serious a business the mining industry is.

In addition to the mine, the Central Qld Coal Project also proposes to construct a railway, potentially in excess of 400km depending on the final option: either to transport processed coal to an expanded facility at Abbot Point or to a new export terminal to be established at Dudgeon Point. However, this would require major new water and power supply infrastructure to service the mine and port, hence the extremely high cost. Because mines are usually located in remote areas, away from the developed regions where the populace thrives, setting up major new water and power supplies is expensive, but it is not the only major item in the Central Qld Coal Project's budget.

The Central Qld Coal Project is located 40km northwest of Alpha, approximately 450 km west of Rockhampton, and contains more than three billion tons of coal. The proposed open-cut mine is expected to be developed in stages. It will have an initial export capacity of 30 million tons per annum with a mine life expectancy of 30 years.

In terms of employment, around 2,500 people will be employed during construction, and 1,600 permanent positions will be filled in the operation stage of the Central Qld Coal Project.

Australia is a major coal exporter: the largest exporter of coal and the fourth largest producer of coal. Australia is also the second largest producer of gold, second only to China. As for opal, Australia is responsible for 95% of production, making it the largest producer worldwide. Australia also holds its own in commercially viable diamond deposits, ranking third after Russia and Botswana. This pretty much explains the significance of the mining industry to Australia. It is like the backbone of the economy: an industry focused on claiming the blessings the earth has given her lands. The Central Qld Coal Project was made to further exports and improve trade. However, it requires quite a large sum. Only through the financial support of investments, both local and international, can it achieve its goals and begin reaping the fruits of the land.

Source: http://ezinearticles.com/?Central-Qld-Coal:-Mining-for-Needed-Investments&id=6314576

Monday 22 December 2014

Scraping table from any web page with R or CloudStat

If you need data from the internet, you don’t have to retype it; you can extract or scrape it if you know the web URL.

This is thanks to the XML package for R, which provides the handy readHTMLTable() function.

As a case study, I want to scrape two tables:

    US Airline Customer Score.
    World Top Chess Players (Men).

A. Scraping US Airline Customer Score table from

http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines

Code:

library(XML)

airline = 'http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines'

airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)

Result:

> library(XML)

Warning message:

package "XML" was built under R version 2.14.1

> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table

                     Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1          Southwest        78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2         All Others        NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3           Airlines        72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4        Continental        67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5           American        70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6             United        71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7         US Airways        72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8              Delta        77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines        69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61

  11 PreviousYear%Change FirstYear%Change

1 81                 2.5              3.8
3 65                -1.5             -9.7
4 64                -9.9             -4.5
5 63                 0.0            -10.0
7 61                -1.6            -15.3
8 56                -9.7            -27.3
9  #                 N/A              N/A

>

B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men

Code:

chess = 'http://ratings.fide.com/top.phtml?list=men'
chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)

Result:

> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table

     Rank                       Name Title Country Rating Games B-Year

1      1           Carlsen, Magnus    g    NOR  2835   17  1990
2      2            Aronian, Levon    g    ARM  2805   25  1982
3      3         Kramnik, Vladimir    g    RUS  2801   17  1975
4      4        Anand, Viswanathan    g    IND  2799   17  1969
5      5         Radjabov, Teimour    g    AZE  2773    9  1987
6      6          Topalov, Veselin    g    BUL  2770    9  1975
7      7          Karjakin, Sergey    g    RUS  2769   16  1990
8      8         Ivanchuk, Vassily    g    UKR  2766   16  1969
9      9     Morozevich, Alexander    g    RUS  2763    6  1977
10    10           Gashimov, Vugar    g    AZE  2761    9  1986
11    11       Grischuk, Alexander    g    RUS  2761    8  1983
12    12          Nakamura, Hikaru    g    USA  2759   17  1987
13    13            Svidler, Peter    g    RUS  2749   17  1976
14    14    Mamedyarov, Shakhriyar    g    AZE  2747    9  1985
15    15       Tomashevsky, Evgeny    g    RUS  2740    0  1987
16    16            Gelfand, Boris    g    ISR  2739    9  1968
17    17          Caruana, Fabiano    g    ITA  2736   19  1992
18    18       Nepomniachtchi, Ian    g    RUS  2735   16  1990
19    19                 Wang, Hao    g    CHN  2733    6  1989
20    20              Kamsky, Gata    g    USA  2732    0  1974
21    21  Dominguez Perez, Leinier    g    CUB  2730    6  1983
22    22         Jakovenko, Dmitry    g    RUS  2729    0  1983
23    23        Ponomariov, Ruslan    g    UKR  2727   13  1983
24    24          Vitiugov, Nikita    g    RUS  2726    1  1987
25    25            Adams, Michael    g    ENG  2724   17  1971
26    26               Leko, Peter    g    HUN  2720    9  1979
27    27            Almasi, Zoltan    g    HUN  2717    8  1976
28    28               Giri, Anish    g    NED  2714   15  1994
29    29            Le, Quang Liem    g    VIE  2714    0  1991
30    30             Navara, David    g    CZE  2712    8  1985
31    31            Shirov, Alexei    g    LAT  2710   13  1972
32    32             Polgar, Judit    g    HUN  2710    0  1976
33    33     Riazantsev, Alexander    g    RUS  2710    0  1985
34    34       Wojtaszek, Radoslaw    g    POL  2706    8  1987
35    35      Moiseenko, Alexander    g    UKR  2706    7  1980
36    36   Vallejo Pons, Francisco    g    ESP  2705   15  1982
37    37        Malakhov, Vladimir    g    RUS  2705    0  1980
38    38            Jobava, Baadur    g    GEO  2704   23  1983
39    39           Bacrot, Etienne    g    FRA  2704   14  1983
40    40          Laznicka, Viktor    g    CZE  2704    8  1988
41    41            Sutovsky, Emil    g    ISR  2703    8  1977
42    42        Naiditsch, Arkadij    g    GER  2702   14  1985
43    43         Movsesian, Sergei    g    ARM  2700    9  1978
44    44       Sasikiran, Krishnan    g    IND  2700    9  1981
45    45   Vachier-Lagrave, Maxime    g    FRA  2699   13  1990
46    46            Dreev, Aleksey    g    RUS  2698    6  1969
47    47           Efimenko, Zahar    g    UKR  2695    8  1985
48    48         Volokitin, Andrei    g    UKR  2695    0  1986
49    49                 Wang, Yue    g    CHN  2694    6  1987
50    50        Fressinet, Laurent    g    FRA  2693   17  1981
51    51                Li, Chao b    g    CHN  2693    6  1989
52    52            Grachev, Boris    g    RUS  2693    0  1986
53    53      Nielsen, Peter Heine    g    DEN  2693    0  1973
54    54            Van Wely, Loek    g    NED  2692   13  1972
55    55    Bruzon Batista, Lazaro    g    CUB  2691   19  1982
56    56           McShane, Luke J    g    ENG  2691    8  1984
57    57            Eljanov, Pavel    g    UKR  2690   10  1983
58    58      Kasimdzhanov, Rustam    g    UZB  2689   14  1979
59    59         Inarkiev, Ernesto    g    RUS  2689    6  1985
60    60         Zvjaginsev, Vadim    g    RUS  2688    8  1976
61    61         Andreikin, Dmitry    g    RUS  2688    0  1990
62    62    Areshchenko, Alexander    g    UKR  2688    0  1986
63    63         Rublevsky, Sergei    g    RUS  2686    0  1974
64    64         Akopian, Vladimir    g    ARM  2685    8  1971
65    65          Potkin, Vladimir    g    RUS  2684    0  1982
66    66       Sargissian, Gabriel    g    ARM  2683   15  1983
67    67            Berkes, Ferenc    g    HUN  2682   16  1985
68    68           Bologan, Viktor    g    MDA  2680   15  1971
69    69          Bauer, Christian    g    FRA  2679   24  1977
70    70          Tiviakov, Sergei    g    NED  2677   22  1973
71    71            Short, Nigel D    g    ENG  2677   15  1965
72    72        Motylev, Alexander    g    RUS  2677    6  1979
73    73         Gharamian, Tigran    g    FRA  2676    0  1984
74    74          Kobalia, Mikhail    g    RUS  2673    0  1978
75    75              Meier, Georg    g    GER  2671    9  1987
76    76       Onischuk, Alexander    g    USA  2670   13  1975
77    77              Bu, Xiangzhi    g    CHN  2670    6  1985
78    78          Alekseev, Evgeny    g    RUS  2670    0  1985
79    79            Azarov, Sergei    g    BLR  2667    0  1983
80    80        Kryvoruchko, Yuriy    g    UKR  2666    0  1986
81    81             Balogh, Csaba    g    HUN  2665    8  1987
82    82           Harikrishna, P.    g    IND  2665    6  1986
83    83       Khismatullin, Denis    g    RUS  2664    8  1984
84    84   Nguyen, Ngoc Truong Son    g    VIE  2662    6  1990
85    85           Fridman, Daniel    g    GER  2660   11  1976
86    86              Smirin, Ilia    g    ISR  2660    7  1968
87    87               Ding, Liren    g    CHN  2660    6  1992
88    88         Sadler, Matthew D    g    ENG  2660    3  1974
89    89            Korobov, Anton    g    UKR  2660    0  1985
90    90          Cheparinov, Ivan    g    BUL  2659   18  1986
91    91          Timofeev, Artyom    g    RUS  2659    0  1985
92    92           Georgiev, Kiril    g    BUL  2658   17  1965
93    93           Bartel, Mateusz    g    POL  2658    9  1985
94    94          Zhigalko, Sergei    g    BLR  2658    8  1989
95    95         Feller, Sebastien    g    FRA  2658    0  1991
96    96            Ragger, Markus    g    AUT  2655   17  1988
97    97         Jones, Gawain C B    g    ENG  2653   27  1987
98    98                So, Wesley    g    PHI  2653    5  1993
99    99              Milov, Vadim    g    SUI  2653    0  1972
100  100           Gupta, Abhijeet    g    IND  2652    9  1989
101  101            Postny, Evgeny    g    ISR  2652    8  1981
102  102             Roiz, Michael    g    ISR  2652    6  1983
103  103           Gyimesi, Zoltan    g    HUN  2652    4  1977
104  104          Nikolic, Predrag    g    BIH  2652    2  1960

>

Done. You have successfully scraped data from a web page with R or CloudStat.

Then you can analyze it as usual, with no need to retype the data. Enjoy!

Source: http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/

Thursday 18 December 2014

Extracting Wisdom Teeth Tips

It is believed that due to evolution, our jaws are now smaller than our ancient ancestors'. For this reason, our mouths often do not have adequate room to accommodate the third molars, making them basically useless and in some cases detrimental. Even if they are not impacted, wisdom teeth may be hard to clean, and therefore require removal to reduce the probability of caries and infection.

As part of your routine dental visits, your dentist will likely take X-rays to monitor the development of your third molars. Your dentist will likely recommend removing them as soon as possible to avoid any complications. The extraction of wisdom teeth can sometimes be a costly and daunting procedure; for these reasons many patients delay having them extracted. However, if the impacted teeth become infected, it is important to see your dental professional at once. Symptoms of infection due to impacted wisdom teeth include:

•    Pain in the gums and surrounding areas
•    Red or inflamed gums
•    Tender or bleeding gums
•    Inflammation around the face and jaw
•    Bad breath (halitosis)
•    Frequent headaches

If a single molar needs to be extracted, local anesthetic will be used. In the case where several or all the teeth need extraction, the patient will usually be "put under" using a general anesthetic. If you have an infection or medical complications that put you at a higher than normal risk, the surgery may be performed at a hospital. Extraction of the wisdom teeth is a day surgery, and patients are usually able to return to normal activities in a day or so. You may be prescribed antibiotics prior to the surgery, and you will likely be asked not to eat or drink the night before the surgery.

During the surgery, your dentist makes an incision in the gum tissue covering the tooth. Once the tooth is exposed, the dentist may cut the tooth into smaller pieces to make extraction easier. After the extraction you will be given stitches to mend the gum tissue. You may need to return a few days later to have the stitches removed. You will be monitored after the surgery to ensure that you are not bleeding excessively.

The best time for extraction is when the patient is in their late teens to avoid unnecessary complications. Wisdom teeth extractions performed later in life are still beneficial, but the removal may be more difficult and healing may take longer. Therefore it is wise to have a conversation with your dentist regarding your wisdom teeth as early as possible.

Most people will experience the emergence of their wisdom teeth at some point in their life, and extraction is sometimes necessary, either as a preventative measure or to fix an actual problem. It is best to deal with any problems regarding your wisdom teeth as soon as possible to avoid unnecessary difficulties.

Source:http://ezinearticles.com/?Extracting-Wisdom-Teeth-Tips&id=7788863

Tuesday 16 December 2014

Importance of Data Mining Services in Business

Data mining uses algorithms to uncover hidden information in data. It helps extract useful information from the data, which can then support practical interpretations for decision making.

It can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making. Although data mining is a relatively new term, the technology is not. It is also known as knowledge discovery in databases, since it involves searching for implicit information in large databases.

It is primarily used today by companies with a strong customer focus: retail, financial, communication and marketing organizations. It is highly valued because of its wide applicability. It is being used increasingly in business applications for understanding and then predicting valuable data, like consumer buying actions and buying tendencies, profiles of customers, industry analysis, etc. It is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, e-commerce, customer relationship management and financial services.

However, the use of some advanced technologies makes it a decision making tool as well. It is used in market research, industry research and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, scientific tests, genetics, financial services and utilities.

Data mining consists of five major elements:

•    Extract and load operation data onto the data store system.
•    Store and manage the data in a multidimensional database system.
•    Provide data access to business analysts and information technology professionals.
•    Analyze the data by application software.
•    Present the data in a useful format, such as a graph or table.
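
As a toy illustration of those five steps, here is a minimal sketch using Python's standard sqlite3 module. The sales records, table name and column names are invented for the example, a single SQLite table stands in for the multidimensional store a real system would use, and a plain aggregate stands in for the analysis step.

```python
import sqlite3

# 1. Extract and load: made-up transaction records go into the data store.
records = [('north', 'tea', 12.0), ('north', 'coffee', 30.0),
           ('south', 'tea', 8.0), ('south', 'coffee', 22.5)]
conn = sqlite3.connect(':memory:')

# 2. Store and manage: one table, with region and product as dimensions.
conn.execute('CREATE TABLE sales (region TEXT, product TEXT, amount REAL)')
conn.executemany('INSERT INTO sales VALUES (?, ?, ?)', records)

# 3 and 4. Provide access and analyze: total sales per region.
totals = conn.execute(
    'SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region'
).fetchall()

# 5. Present: print the result as a small table.
for region, total in totals:
    print('%-6s %8.2f' % (region, total))
```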

The use of data mining in business makes the data more relevant to its application. There are several kinds of data mining: text mining, web mining, relational databases, graphic data mining, audio mining and video mining, which are all used in business intelligence applications. Data mining software is used to analyze consumer data and trends in banking as well as many other industries.

Source:http://ezinearticles.com/?Importance-of-Data-Mining-Services-in-Business&id=2601221

Monday 15 December 2014

Git workflow for Scrapy projects

Our customers often ask us what’s the best workflow for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders: project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded on each environment. This works reasonably well, until you have to make changes to common code used by many spiders (i.e., code outside the spiders folder), for example common base spiders.
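
For illustration, that older split might look roughly like this in a project's settings.py. This is only a sketch: the SCRAPY_ENV environment variable and the spider_modules helper are hypothetical names chosen for the example, and loading both folders in development is an assumption; only the SPIDER_MODULES setting itself comes from Scrapy.

```python
import os

def spider_modules(env):
    """Return the spider module paths to load for the given environment."""
    # Production loads only the vetted spiders.
    if env == 'prod':
        return ['project.spiders_prod']
    # Development (assumed here) loads both folders.
    return ['project.spiders_prod', 'project.spiders_dev']

# Scrapy reads SPIDER_MODULES as a list of module paths to search for spiders.
# SCRAPY_ENV is a hypothetical variable, not something Scrapy defines.
SPIDER_MODULES = spider_modules(os.environ.get('SCRAPY_ENV', 'dev'))
```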

Nowadays, DVCS (in particular, git) have become more popular and people are quite used to branching, so we recommend using a simple git workflow (similar to GitHub flow) where you branch for every change you make. You keep all changes in a branch while they’re being tested and finally merge to master when they’re finished. This means that master branch is always stable and contains only “production-ready” spiders.

If you are using our Scrapy Cloud platform, you can have 2 projects (myproject-dev, myproject-prod) and use myproject-dev to test the changes in your branch. scrapy deploy in Scrapy 0.17 now adds the branch name to the version name (when using version=GIT or version=HG), so you can see which branch you are going to run directly on the panel. This is particularly useful with large teams working on a single Scrapy project, to avoid stepping on each other's toes when making changes to common code.

Here is a concrete example to illustrate how this workflow works:

•    the developer decides to work on issue 123 (could be a new spider or fixes to an existing spider)
•    the developer creates a new branch to work on the issue:
         git checkout -b issue123
•    the developer finishes working on the code and deploys to the panel (this assumes scrapy.cfg is configured with a deploy target, and using version=GIT – see here for more information):
         scrapy deploy dev
•    the developer goes into the panel and runs the spider, where he’ll see the branch name (issue123) that will be run
•    the developer checks the scraped data looks fine through the item browser in the panel
•    whenever issues are found, the developer makes more fixes (always working on the same branch) and deploys new versions
•    once all issues are fixed, the developer merges the branch and deploys to production project:
         git checkout master
         git merge issue123
         git pull # make sure to pull latest code before deploying
         scrapy deploy prod

We recommend you keep your common spiders well-tested and use Spider Contracts extensively to test your final spiders. Otherwise, experience tells us that base spiders end up being copied (instead of reused) out of fear of breaking old spiders that depend on them, thus turning their maintenance into a nightmare.

Source:http://blog.scrapinghub.com/2013/03/06/git-workflow-scrapy-projects/

Saturday 13 December 2014

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here are some ways that an exception might be raised.
[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.
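
As a small illustration of that trade-off, here is a sketch that reads a nested field both ways. The shape of the record and the function names are made up for this example, and the code runs on Python 3.

```python
def price_by_checking(rec):
    # Prevent the exceptions: test every step before taking it.
    if 'offers' in rec and len(rec['offers']) > 0 and 'price' in rec['offers'][0]:
        return rec['offers'][0]['price']
    return None

def price_by_catching(rec):
    # Catch the exceptions: just try, and handle whatever goes missing.
    try:
        return rec['offers'][0]['price']
    except (KeyError, IndexError):
        return None
```

Both functions return None for a record with no offers, but the second states the happy path once and lists the expected failures in one place.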

Example 1: Inconsistent date formats

Let’s say we’re parsing dates.
import datetime
This doesn’t raise an error.
datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')
But this does.
datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime
def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text
    try:
         cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
         cleandate = None
    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):

    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:

        try:
             cleandate = datetime.datetime.strptime(rawdate, date_format)
             return cleandate
        except ValueError:
             pass
    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import time
import urllib2
def load(url):
    retries = 3
    for i in range(retries):
        try:
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
                raise
            else:
                time.sleep(42)
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page three times. After each of the first two failures, it waits 42 seconds and tries again. On the third failure, it raises the error. On success, it returns the content of the page.
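The same retry pattern can be separated from the download itself, which makes it testable without a network connection. This is a sketch in Python 3, not the original author's code: the retry helper, its parameters and the flaky_fetch stand-in are all hypothetical names, and the sleep function is passed in so the demonstration does not actually wait.

```python
import time

def retry(fetch, retries=3, wait_seconds=42, sleep=time.sleep,
          retry_on=(OSError,)):
    """Call fetch() up to `retries` times, sleeping between failed attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except retry_on:
            if attempt + 1 == retries:
                raise  # out of attempts: let the caller see the error
            sleep(wait_seconds)

# Simulated unreliable download: fails twice, then succeeds.
attempts = []

def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError('simulated connection failure')
    return 'page content'

waits = []  # records the requested waits instead of really sleeping
content = retry(flaky_fetch, sleep=waits.append)
print(content, len(attempts), waits)
```

In Python 3, urllib.error.URLError is a subclass of OSError, so wrapping a call such as urllib.request.urlopen(url).read() in a lambda and passing it to this helper should behave like the load function above.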

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki
for document_name in document_names:
    try:
        parse_document(document_name)
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.
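Outside ScraperWiki, the same idea can be sketched with the sqlite3 module from the Python 3 standard library. Everything here is illustrative: parse_all, summarize_errors and fake_parse are hypothetical names, and the table layout just mirrors the three fields saved above. The summary query is one way to spot the trends mentioned in the previous paragraph.

```python
import sqlite3

def parse_all(document_names, parse_document, connection):
    """Parse every document, logging failures instead of stopping."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS errors '
        '(documentName TEXT, exceptionType TEXT, exceptionMessage TEXT)')
    for document_name in document_names:
        try:
            parse_document(document_name)
        except Exception as e:
            connection.execute(
                'INSERT INTO errors VALUES (?, ?, ?)',
                (document_name, type(e).__name__, str(e)))
    connection.commit()

def summarize_errors(connection):
    """Count logged errors per exception type, most frequent first."""
    return connection.execute(
        'SELECT exceptionType, COUNT(*) FROM errors '
        'GROUP BY exceptionType '
        'ORDER BY COUNT(*) DESC, exceptionType').fetchall()

# Stand-in parser that fails on some made-up document names.
def fake_parse(name):
    if name.endswith('.pdf'):
        raise ValueError('unsupported format: ' + name)
    if name == 'missing.html':
        raise KeyError('date')

connection = sqlite3.connect(':memory:')
parse_all(['a.html', 'b.pdf', 'c.pdf', 'missing.html'], fake_parse, connection)
summary = summarize_errors(connection)
print(summary)
```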

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.


for bar in bars:
    try:
        foo(bar)
    except:
        print('Failure at bar = "%s"' % bar)
        raise

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
# Start counting at resume_index so that i is a position in the full list
for i, bar in enumerate(bars[resume_index:], resume_index):
    try:
        foo(bar)
    except:
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
# Start counting at resume_index so that i is a position in the full list
for i, bar in enumerate(bars[resume_index:], resume_index):
    try:
        foo(bar)
    except scraperwiki.CPUTimeExceededError:
        # Out of CPU time: remember where we were and stop this run
        scraperwiki.sqlite.save_var('resume_index', i)
        raise
    except Exception as e:
        # Any other error: log it and carry on with the next bar
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)
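
The resume logic does not depend on ScraperWiki. As a rough sketch in Python 3, the index can be kept in a small JSON file instead; the file name, the run function and the simulated failure below are assumptions made for the example. Note that the saved index is passed to enumerate as its start value, so the stored number is a position in the full list rather than in the slice.

```python
import json
import os
import tempfile

def load_resume_index(path):
    """Return the saved index, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)['resume_index']

def save_resume_index(path, index):
    with open(path, 'w') as f:
        json.dump({'resume_index': index}, f)

def run(items, process, checkpoint_path):
    start = load_resume_index(checkpoint_path)
    for i, item in enumerate(items[start:], start):
        try:
            process(item)
        except Exception:
            save_resume_index(checkpoint_path, i)  # retry this item next run
            raise
    save_resume_index(checkpoint_path, 0)  # finished: start over next time

# Demonstration with a simulated failure on the third item.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')
processed = []
fail_on = {'c'}

def process(item):
    if item in fail_on:
        raise RuntimeError('simulated failure on ' + item)
    processed.append(item)

items = ['a', 'b', 'c', 'd']
try:
    run(items, process, checkpoint_path)
except RuntimeError:
    pass
saved = load_resume_index(checkpoint_path)

fail_on.clear()  # pretend the problem was fixed before the second run
run(items, process, checkpoint_path)
print(saved, processed)
```

The second run picks up at the item that failed, so nothing is processed twice and nothing is skipped.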

tl;dr

Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.

Source: https://blog.scraperwiki.com/2012/05/handling-exceptions-in-scrapers/

Thursday 11 December 2014

Scraping Webmaster Tools with FMiner

The biggest problem (after the problem with their data quality) I am having with Google Webmaster Tools is that you can’t export all the data for external analysis. Luckily the guys from the FMiner.com web scraping tool contacted me a few weeks ago to test their tool. The difficulty with Webmaster Tools is that you can’t use web-based scrapers, and the other screen scraping tools were not that good at the steps you need to take to get to the data within Webmaster Tools. The software is available for Windows and Mac OS X users.

FMiner is a classical screen scraping app, installed on your desktop. Since you need to emulate real browser behaviour, you need to install it on your desktop. There is no coding required and their interface is visual based which makes it possible to start scraping within minutes. Another possibility I like is to upload a set of keywords, to scrape internal search engine result pages for example, something that is missing in a lot of other tools. If you need to scrape a lot of accounts, this tool provides multi-browser crawling which decreases the time needed.

This tool can be used for a lot of scraping jobs, including Google SERPs, Facebook Graph search, downloading files & images and collecting e-mail addresses. And for the real heavy scrapers, they also have built in a captcha solving API system so if you want to pass captchas while scraping, no problem.

Below you can find an introduction to the tool, with one of their tutorial videos about scraping IMDB.com:

More basic and advanced tutorials can be found on their website: Fminer tutorials. Their tutorials show you a range of simple and complex tasks and how to use their software to get the data you need.

Guide for Scraping Webmaster Tools data

The software is capable of dealing with JavaScript and AJAX, one of the main requirements to scrape data from within Google Webmaster Tools.

Step 1: The first challenge is to log in to Webmaster Tools. After opening a new project, first browse to https://www.google.com/webmasters/ and select the Recording button in the upper left corner.

After browsing to this page, a goto action appears in the left panel. Click on this button and look for the “Action Options” button at the bottom of that panel. Tick the option “Clear cookies before do it” to avoid problems if, for example, you are already logged in.

Step 2: Click the “Sign in Webmaster Tools” button. You will notice the Macro designer overview on the left registered a click as the first step.

Step 3: Fill in your Google username and password. In the designer panel you will see the two Fill actions emerging.

Step 4: After this step you should add some waiting time to be sure everything is fully loaded. Use the second button on the right side above the Macro Designer panel to add an action. 2000 milliseconds (2 seconds :)) will do the job.

Step 5: Browse to the account from which you want to export the data.

Step 6: Browse to the specific pages whose data you want to scrape.

Step 7: Scrape the data from the tables as shown in the video.

Congratulations, now you are able to scrape data from Google Webmaster Tools :)

Step 8: One of the things I use it for is pulling the search query data per keyword, which you normally can’t export. To do that, you have to use a right mouse click on the keyword, which opens a menu with options. Go to open links recursively and select normal. This will loop through all the keywords.

Step 9: This video will show you how to make use of the pagination elements to loop through all the pages:

You can also download the following file, which has a predefined set of actions to log in to WMT and download the keywords, impressions and clicks: google_webmaster_tools_login.fmpx. Open the file and update the login details by clicking on those action buttons and inserting your own Google account details.

Automating and scheduling scrapers

For people who want to automate and regularly download the data, you can set up a Scheduler config, and within the project settings you can set up the program to send an e-mail after completion of the crawl.

Source: http://www.notprovided.eu/scraping-webmaster-tools-fminer/

Thursday 4 December 2014

Multiple Listing Service Gets Favorable Appellate Ruling in Scraping Lawsuit

This is a follow-up to our massive post on anti-scraping lawsuits in the real estate industry from New Year’s Eve 2012 (Note: the portion on MRIS is about halfway through the post, labeled “Same Writ, Different Plaintiff”).

AHRN is a California real estate broker that owns and operates NeighborCity.com. The site gets its data in part by scraping from MLS databases–in this case, MRIS. As part of the scraping, however, AHRN had collected and displayed copyrighted photographs among the bits and pieces of general textual information about the properties. MRIS sent a cease and desist letter to AHRN, and filed suit alleging various copyright claims after the parties failed to agree on a license to use the photographs. Ultimately, a district court in Maryland granted a motion made by MRIS for a preliminary injunction.

When we last left off, the district court had revised its preliminary injunction order to enjoin only AHRN’s use of MRIS’s photographs–not the compilation itself or any textual elements that may be considered a part of it. Since then, AHRN appealed the injunction. On July 18th, the Fourth Circuit Court of Appeals affirmed.

Background

AHRN argued that MRIS failed to show a likelihood of success on its copyright infringement claim because MRIS: (1) failed to register its copyright in the individual photographs when it registered the database, and (2) did not have a copyright interest in the photographs because the subscribers’ electronic agreement to MRIS’s terms of use failed to transfer those rights.

MRIS Did Not Fail to Register Its Interest in the Photographs

This first question revolved around the scope of MRIS’s registrations. AHRN argued that MRIS’s collective work registrations did not cover the individual photographs because MRIS did not identify the names of the authors and titles of those works. MRIS argued that 17 U.S.C. §409 did not require any such identification when applied to collective works, and that its general description of the pre-existing photographs’ inclusion sufficed.

The court began its discussion by noting the “ambiguous” nature of §409’s language and its varying judicial interpretations. Some courts have barred infringement suits because the collective work registrant failed to list the authors, while others have allowed infringement suits where the registrant owns the rights to the component works as well as the collective work.

In this case, the court agreed with MRIS and found that the latter approach was more consistent with the relevant statutes and regulations:

    Adding impediments to automated database authors’ attempts to register their own component works conflicts with the general purpose of Section 409 to encourage prompt registration . . . and thwarts the specific goal embodied in Section 408 of easing the burden on group registrations[.]

As part of its decision, the court looked favorably upon the 3Taps case, in which Craigslist sued 3Taps and Padmapper for scraping and repackaging its online classified ads. In that case, the court reasoned that it would be “inefficient” to require registrants to list each author of an extremely large number of component works to which the registrant already had obtained an exclusive license.

Having found that MRIS’s general description satisfied § 409’s pre-suit registration requirement, the court moved on to the merits of MRIS’s infringement claim–more specifically, the question of whether MRIS’s Terms of Use actually transferred a copyright interest in its subscribers’ photographs.

E-SIGN Applies to Assignments of Copyrights and Overrides § 204

AHRN challenged MRIS’s ownership of the photographs by arguing that an MLS subscriber’s electronic agreement to MRIS’s Terms of Use does not operate as an assignment of rights under § 204, which requires a signed “writing.”

In a bad sign for AHRN, the court began its discussion by volunteering an argument that MRIS did not even bring up:

    [I]n situations where “the copyright [author] appears to have no dispute with its [assignee] on this matter, it would be anomalous to permit a third party infringer to invoke [Section 204(a)’s signed writing requirement] against the [assignee].”

With that in mind, the court went on to discuss the E-SIGN act’s impact on the conveyance of copyrights. After establishing the meaning of “e-signature,” the court focused on whether the act was limited from covering this type of situation.

    The Act provides that it “does not . . . limit, alter, or otherwise affect any requirement imposed by a statute, regulation, or rule of law . . . other than a requirement that contracts or other records be written, signed, or in nonelectronic form[.]”

The court emphasized the phrase “other than,” reasoning that a plain reading of the E-SIGN language showed that Congress intended the provisions to limit § 204. It also noted that Congress did not list copyright assignments among the various agreements to which E-SIGN did not apply–nor was there a catchall that included such assignments.

The court then turned to the Hermosilla case, in which a district court in Florida upheld the validity of a copyright conveyance via e-mail. It emphasized the Hermosilla court’s reliance on the purpose of § 204–“to resolve disputes between copyright owners and transferees and to protect copyright holders from persons mistakenly or fraudulently claiming oral licenses or copyright ownership.” The appellate court agreed with the Hermosilla court that allowing assignment via e-mail actually helped cut down on these types of disputes.

    To invalidate copyright transfer agreements solely because they were made electronically would thwart the clear congressional intent embodied in the E-Sign Act.

All in all, the court basically said “we don’t see why E-SIGN shouldn’t apply.” Note that it did not pass judgment specifically on whether MRIS’s Terms of Use constituted a valid contract. It simply mentioned that AHRN waived that argument by not bringing it up sooner.

Source: http://blog.ericgoldman.org/archives/2013/07/multiple_listin_1.htm