Extract Data From Website: May 2015

Friday, 29 May 2015

Data Scraping Services - Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information with details like address, postal code, contact information; website addresses etc. that other site like Google Maps just does not. Yelp also provides reviews about the particular business. The yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for yelp business details database? Are you looking for scraping data from yelp website/business directory? Are you looking for yelp screen scraping software? Are you looking for scraping the business contact information from the online Yelp? Then you are at the right place.

Here I am going to discuss how to scrape yelp data for lead generation and email marketing. I have made a simple and straight forward yelp data scraping script in python that can scrape data from yelp website. You can use this yelp scraper script absolutely free.

I have used urllib, BeautifulSoup packages. Urllib package to make http request and parsed the HTML using BeautifulSoup, used Threads to make the scraping faster.

Yelp Scraping Python Script

import urllib from bs4 import BeautifulSoup import re from threading import Thread #List of yelp urls to scrape url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg'] i=0 #function that will do actual scraping job def scrape(ur): html = urllib.urlopen(ur).read() soup = BeautifulSoup(html) title = soup.find('h1',itemprop="name") saddress = soup.find('span',itemprop="streetAddress") postalcode = soup.find('span',itemprop="postalCode") print title.text print saddress.text print postalcode.text print "-------------------" threadlist = [] #making threads while i<len(url): t = Thread(target=scrape,args=(url[i],)) t.start() threadlist.append(t) i=i+1 for b in
threadlist: b.join()

import urllib

from bs4 import BeautifulSoup

import re

from threading import Thread

#List of yelp urls to scrape

url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg']

i=0

#function that will do actual scraping job

def scrape(ur):

           html = urllib.urlopen(ur).read()

          soup = BeautifulSoup(html)

       title = soup.find('h1',itemprop="name")

          saddress = soup.find('span',itemprop="streetAddress")

          postalcode = soup.find('span',itemprop="postalCode")

          print title.text

          print saddress.text

          print postalcode.text

          print "-------------------"

threadlist = []

#making threads

while i<len(url):

          t = Thread(target=scrape,args=(url[i],))

          t.start()

          threadlist.append(t)

          i=i+1

for b in threadlist:

          b.join()

Recently I had worked for one German company and did yelp scraping project for them and delivered data as per their requirement. If you looking for scraping data from business directories like yelp then send me your requirement and I will get back to you with sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday, 26 May 2015

Data Extraction Services

Are you finding it tedious to perform your routine tasks as well as finding time to research for some information? Don't worry; all you have to do is outsource data extraction requirements to reliable service providers such as Hi-Tech BPO Services.

We can assist you in finding, extracting, gathering, processing and validating all the required data through our effective data extraction services. We can extract data from any given source such as websites, databases, printed documents, directories, etc.

With a whole plethora of data extraction services solutions; we are definitely a one stop solution to all your data extraction services requirements.

For utilizing our data extraction services, all you have to do is outsource data extraction requirements to us, and we will create effective strategies and extract the required data from all preferred sources. Then we will arrange all the extracted data in a systematic order.

Types of data extraction services provided by our data extraction India unit:

The data extraction India unit of Hi-Tech BPO Services can attend to all types of outsource data extraction requirements. Following are just some of the data extraction services we have delivered:

•    Data extraction from websites
•    Data extraction from databases
•    Extraction of data from directories
•    Extracting data from books
•    Data extraction from forms
•    Extracting data from printed materials

Features of Our Data Extraction Services:

•    Reliable collection of resources for data extraction
•    Extensive range of data extraction services
•    Data can be extracted from any available source be it a digital source or a hard copy source
•    Proper researching, extraction, gathering, processing and validation of data
•    Reasonably priced data extraction services
•    Quality and confidentiality ensured through various strict measures

Our data extraction India unit has the competency to handle any of your data extraction services requirements. Just provide us with your specific requirements and we will extract data accordingly from your preferred resources, if particularly specified. Otherwise we will completely rely on our collection of resources for extracting data for you.

Source: http://www.hitechbposervices.com/data-extraction.php

Monday, 25 May 2015

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (xml, html, csv). Over the last three years Ive built 20 or so c# console applications that go out, download the data and re-format it into a database. But Im curious what other people are doing for this type of task. Are people building one tool that has a lot of variables and inputs or are people designing 20+ programs to scrape and parse this data. Everything is hard-coded into each console and run through Windows Task Manager.

Added a couple additional thoughts/details:

    Of the 30 sources, they all have unique properties, all are uploaded into individual MySQL tables and all have varying frequencies. For example, one data source is hit once a minute, another on 5 minute intervals. Majority are once an hour and once a day.

At current I download the formats (xml, csv, html), parse them into a formatted csv and put them into staging folders. Within that folder, I run an application that reads a config file specific to the folder. When a new csv is added to the folder, the application then uploads the data into the specific MySQL tables designated in the config file.

Im wondering if it is worth re-building all this into a larger complex program that is more capable of dynamically adding content+scrapes and adjusting to format changes.

Looking for outside thoughts.

5 Answers

What you are working on is basically ETL. So at a high level you need an export component (get stuff) a transform component (map to known format) and a load (take known format and put stuff somewhere). If you are comfortable being tied to a RDBMS you could use something like SQL Server SSIS packages. What I would do is create a host application that managed common aspects of the overall process (errors, and pipeline processing). Then make the specifics of the E, T, and L pluggable. A low ceremony way to get this would be to host the powershell runtime and create each seesion with common context objects that the scripts will use to communicate. You get a built in pipe and filter model for scripts and easy, safe extensibility. This design has worked extremely for my team with a similar situation.

Resist the temptation to rewrite.

However, for new code, you could plan for what you know has already happened. Write a retrieval mechanism that you can reuse through configuration. Write a translation mechanism that you can reuse (maybe in a library that you can call with very little code). Write a saving mechanism that can be called or configured.

At this point, you've written #21(+). Now, the following ones can be handled with a tiny bit of code and configuration. Yay!

(You may want to implement this in a service that handles multiple conversions, but weight the benefits of it versus the ability to separate errors in one module from the rest.)

1

It depends - if you need the scrapers to feed into a single application/database and have a uniform data format, it makes sense to have them all in a single program (possibly inheriting from a common base scraper).

If not and they are completely unrelated to each other, might as well keep them separate so changes in one have no effect on another.

Update, following edits to question:

Don't change things just for the sake of change. You have something that works, don't mess with it too much.

Since your data sources and data sinks are all separate from each other, combining them into one application will simply create a very complicated application that will be very difficult to change when needed.

Since the scrapers are separate, keep the separation as you have it now.

As sbrenton said, this most falls in with ETL. You should check out Talend Open Studio. It specializes in handling data flows like I imagine yours are as well as other things like duplicate removal, normalization of fields; tens/hundreds of drag and drop ETL components, you can also write custom code as Talend is a code generator as well, either Java or Perl are options. You can also use Talend to execute system commands. I use it for my ETL work, although not in production, in production we will use SSIS, mostly due to lots of other Microsoft products in house.

You may want to use some good scheduling library, like Quartz.NET.

In a few words, here's what you can expect:

    Your tasks are represented by classes and not processes

    You can set and forget tasks and scale across multiple servers

    You have an out-of-the-box system to actually take care of what is needed to be run when, what failed and needs to be re-run, etc. etc.

Source: http://programmers.stackexchange.com/questions/118077/data-scraping-one-application-or-multiple/118098#118098

Friday, 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

scrapy-big-logoIf you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use a HTTP GET or PUT request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from process 1
Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Lets look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers who make http requests to a url or a group of urls, and fetch the response objects as html contents and pass this data to the next module. If you use Python for performing request/response url-opening process libraries such as the following are most commonly used

1) urllib(20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) -Basic python library yet high-level interface for fetching data across the World Wide Web.

2) urllib2(20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – extensible library of urllib, which would handle basic http requests, digest authentication, redirections, cookies and more.

3) requests(Requests: HTTP for Humans) – Much advanced request library

which is built on top of basic request handling libraries.

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured during this processing. Usually a set of Regular Expressions (regexes) which perform pattern matching and text processing tasks on the html data are used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You must have a thorough knowledge of regular expressions and so that you could design the regex patterns.

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) – This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo),mysqldb ( on python.org), sqlite3(sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the number of requests in a second and build enough delays in the spiders so that you don’t adversely affect the site.

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Wednesday, 20 May 2015

How Web Data Extraction Services Impact Startups

Starting a business has its fair share of ebbs and flows – it can be extremely challenging to get a new business off the blocks, and extremely rewarding when everything goes according to plan and yields desired results. For startups, it is important to get the nuances of running a business right from day one. To succeed in an immensely competitive space, startups need to perform above and beyond expectation right from the start, and one of the factors that can be of great help during the growing years of a startup is web data extraction.

Web data extraction through crawling and scraping, a highly efficient information gathering process, can be used in many creative ways to bring about major change in the performance graph of a startup. With effective web data extraction services acquired by outsourcing to a reputed company, the business intelligence gathered and the numerous possibilities associated with it, web crawling and extraction services can indeed become the difference maker for a startup, propelling it to the heights of success.

What drives the success of web data extraction?

When it comes to figuring out the perfect, balanced web data collection methodology for startups, there are a lot of crucial factors that come into play. Some of these are associated with the technical aspects of data collection, the approach used, the time invested, and the tools involved. Others have more to do with the processing and analysis of collected information and its judicious use in formulating strategies to take things forward.

Web Crawling Services & Web Scraping Services

With the advent of highly professional web data extraction services providers, massive amounts of structured, relevant data can be gathered and stored in real time, and in time, productively used to further the business interests of a startup. As a new business owner, it is important to have a high-level knowledge of the modern and highly functional web scraping tools available for use. This will help to utilize the prowess of competent data extraction services. This in turn can assist both in the immediate and long-term revenue generation context.

Web Data Extraction for Startups

From the very beginning, the dynamics of startups is different from that of older, well-established businesses. The time taken by the new business entity in proving its capabilities and market position needs to be used completely and effectively. Every day of growth and learning needs to add up to make a substantial difference. In this period, every plan and strategy, every execution effort, and every move needs to be properly thought out.

In such a trying situation where there is little margin for error, it pays to have accurate, reliable, relevant and actionable business intelligence. This can put you in firm control of things by allowing you to make informed business decisions and formulate targeted, relevant and growth oriented business strategies. With powerful web crawling, the volume of data gathered is varied, accurate and relevant. This data can then be studied minutely, analyzed in detail and arranged into meaningful clusters. With this weapon in your arsenal, you can take your startup a long way with smart decisions and clever implementations.

Web data extraction is a task best handled by professionals who have had rich experience in the field. Often, in-house web scraping teams are difficult to assemble and not economically viable to maintain, especially for startups. For a better solution, you can outsource your web scraping needs to a reliable web data extraction service for data collection. This way, you can get all the relevant intelligence you need without overstraining your workforce or having to employ additional personnel to handle web scraping. The company you outsource your work to can easily scrape data from multiple sources as per your requirements, and furnish you with actionable business intelligence that can help you take a lead in a competitive market.

Different Ways for Startups to use Web Data Extraction

Web scraping can be employed for many different purposes to yield different kinds of relevant data that generate actionable insights. For a startup, the important decision is how to use this powerful technique to provide valuable information that can make a difference for the future prospects of the company. Here are some interesting possibilities when it comes to impactful web data extraction for startups –

Fishing for Social Rankings and Backlinks

One of the most important business processes for a startup is competition analysis. This is one area where web data extraction can come across as an invaluable enabler. In the past, many startups have effectively used web scraping to fish for backlinks and social rankings related to competing companies.

Backlinks are important to reach a greater mass of better-targeted audiences, which can go on to increase customer base with minimal efforts. Social ranking is also an immensely important factor, as social actions on the internet are building blocks of opinion and reputation generation in this day and age. Keeping this in mind, you can use web data extraction to scrape for social rankings and backlinks related to content generated by your competing companies. After careful analysis, it is possible to arrive at concrete conclusions regarding what your competitors are doing well, and what sells the best.

This information is gold for marketers and sales personnel, and can be used to discern exactly what needs to be done to increase social buzz, generate favorable opinion, and win over customers from your competitors. You can also use this technique to develop high authority backlinks that help with SEO, targeted reach and organic traffic for your business website. For competition analysis, web scraping is a formidable tool.

Sourcing Contact Information

Another important aspect of business that startups can never ignore is good networking. Whether it is with customers, prospective customers, industry peers, partners, or competitors, excellent networking and open, transparent communication is essential for the success of your startup. For effective communication and networking, you need a large, solid list of contact information pertaining to your exact requirements.

Scraping data from multiple web sources gives you the perfect method of achieving this. With automated, fast web scraping, you can in a short time collect a wealth of important contact information that can be leveraged in many different ways. Whether it is the formation of lasting business relationships or making potential customers aware of what you have on offer, this information has the power to propel your startup to new levels of recognition.

For Ecommerce

If you sell your products and services online and want to stay on top of the competition when it comes to variety, pricing analysis, and special deals and offers, web scraping is the way to go. For many e-commerce startups, the problem of high CTR and low conversion is a stumbling block to higher bottom lines. To remedy problems like these and to ensure better sales, it is always a good idea to have a clear insight about your competition.

Future of Retail Industry

With web data extraction, you can be always aware of what competing companies are doing in terms of pricing strategies, product diversity and special customer offers. By considering that information while evaluating and cementing your own strategies, you can always ensure that you provide better value and range of products and services than your competitors, and therefore stay ahead of the competition.

For Marketing, Brand Promotion and Advertisement

For startups, the first wave of promotion and marketing is the one that holds the key to your long-term business success. It is during this phase that the first and most important public perception of your company is formed, and the rudiments of public opinion start taking shape. For this reason, it is crucial to be on point with your marketing and promotion during the early, formative years of your business.

To achieve this, you need a clear, in-depth understanding of your target audience. You need to categorize your target audience on the basis of many factors like age, gender, demographics, income groups and tastes and preferences. Such detailed understanding can only be possible when you have a large wealth of social data pertaining to your target audience. There is no better way of achieving this than by web data extraction.

Love your brand

With the help of data extraction services, you can gather large chunks of relevant data regarding your target audience which can help you accurately evaluate the potential of each prospective customers as a possible addition to your business family. To ensure that you have a steady, early wave of customers to take your business off the blocks at a rapid pace, you need to devise marketing campaigns, promotional strategies and advertisements in accordance with the customer knowledge you drive through your web scraping efforts. This is a foolproof strategy to have marketing and promotional plans in place that achieve goals, bring in new business and provide your company with enough initial momentum to carry it through the later years of success.

To conclude, web data extraction can be a veritable tool in the hands of a startup. With the proper use and leveraging of this technique, your startup can gather the required business intelligence to shine in a competitive market and become a favorite with the customer base. Working with the right web data extraction company can be one of the most important business decisions you make as a startup owner.

Source: https://www.promptcloud.com/blog/web-data-extraction-services-for-startups/

Sunday, 17 May 2015

Scraping Twitter Lists To Boost Social Outreach (+ Free Tool!)

I published a post a few weeks ago describing how to build your own twitter custom audience list, outlining a variety of techniques to build up your list.

This post outlines another method (hat tip to Ade Lewis for the idea) which requires you to scrape Twitter directly.

If you want to skip all the explanations and just want to download the Twitter List Scraper tool, here you go…

Download the Twitter Scraper Tool for Windows or Mac (completely free)

Disclaimer: Scraping Twitter is against their Terms of Service, so if you decide to do this you do it at your own risk.

Some Benchmarks

Building custom audiences on Twitter requires you to identify Twitter usernames that might be interested in your service or product.

In my previous posts, one of the methods I employed was to pull a competitor’s link profile and scrape social accounts from the linking domains.

Once you upload a custom list, Twitter goes through a process of ‘matching’ against profiles in their system, to make sure the user exists and hasn’t opted out of tailored ads.

As our data was scraped from a list of unqualified websites, the data matching wasn’t likely to be perfect.

Experiments

Since I published that post, I have been experimenting a fair bit with list building, and have built up around 10 custom audience lists. I‘ve uploaded a total of 48,857 Twitter usernames using this method, but only 29,260 were matched by Twitter (just less than 60% match rate).

From some other experiments where I have had better control over the input data, this match rate was between 70-80%.

Since we’ll be scraping Twitter directly, I expect our match rate to be much higher – 90%+

Finding Relevant Twitter Lists

So, we’re going to scrape Twitter, and the first step is to find Twitter lists that will contain users potentially interested in what we have to offer.

As an example, we’ll pretend we’re marketing a music website, and we’ve produced a survey we want to collect responses for.

An advanced Google query can give us lists of music bloggers: site:twitter.com inurl:lists inurl:members inurl:music “music blogger”

Source: http://urlprofiler.com/blog/scraping-twitter/

Wednesday, 6 May 2015

Web Scraping: Startups, Services & Market

I got recently interested in startups using web scraping in a way or another and since I find the topic very interesting I wanted to share with you some thoughts. [Note that I’m not an expert. To correct me / share your knowledge please use the comment section]

Web scraping is everything but a new technique. However with more and more data shared on internet (from user generated content like social networks & review websites to public/government data and the growing number of online services) the amount of data collected and the use cases possible are increasing at an incredible pace.

We’ve entered the age of “Big Data” and web scraping is one of the sources to feed big data engines with fresh new data, let it be for predictive analytics, competition monitoring or simply to steal data.

From what I could see the startups and services which are using “web scraping” at their core can be divided into three categories:

•    the shovel sellers (a.k.a we sell you the technology to do web scraping)

•    the shovel users (a.k.a we use web scraping to extract gold and sell it to our users)

•    the shovel police (a.k.a the security services which are here to protect website owners from these bots)

The shovel sellers

From a technology point of view efficient web scraping is quite complicated. It exists a number of open source projects (like Beautiful Soup) which enable anyone to get up and running a web scraper by himself. However it’s a whole different story when it has to be the core of your business and that you need not only to maintain your scrapers but also to scale them and to extract smartly the data you need.

This is the reason why more and more services are selling “web scraping” as a service. Their job is to take care about the technical aspects so you can get the data you need without any technical knowledge. Here some examples of such services:

    Grepsr
    Krakio
    import.io
    promptcloud
    80legs
    Proxymesh (funny service: it provides a proxy rotator for web scraping. A shovel seller for shovel seller in a way)
    scrapingHub
    mozanda

The shovel users

It’s the layer above. Web scraping is the technical layer. What is interesting is to make sense of the data you collect. The number of business applications for web scraping is only increasing and some startups are really using it in a truly innovative way to provide a lot of value to their customers.

Basically these startups take care of collecting data then extract the value out of it to sell it to their customers. Here some examples:

Sales intelligence. The scrapers screen marketplaces, competitors, data from public markets, online directories (and more) to find leads. Datanyze, for example, track websites which add or drop javascript tags from your competitors so you can contact them as qualified leads.

Marketing. Web scraping can be used to monitor how your competitors are performing. From reviews they get on marketplaces to press coverage and financial published data you can learn a lot. Concerning marketing there is even a growth hacking class on udemy that teaches you how to leverage scraping for marketing purposes.

Price Intelligence. A very common use case is price monitoring. Whether it’s in the travel, e-commerce or real-estate industry monitoring your competitors’ prices and adjusting yours accordingly is often key. These services not only monitor prices but with their predictive algorithms they can give you advice on where the puck will be. Ex: WisePricer, Pricing Assistant.

Economic intelligence, Finance intelligence etc. with more and more economical, financial and political data available online a new breed of services, which collect and make sense of it, are rising. Ex: connotate.

The shovel police

Web scraping lies in a gray area. Depending on the country or the terms of service of each website, automatically collecting data via robots can be illegal. Whatever the laws say it becomes crucial for some services to try to block these crawlers to protect themselves. The IT security industry has understood it and some startups are starting to tackle this problem. Here are 3 services which claim to provide solutions to stop bots from crawling your website:

•    Distil
•    ScrapeSentry
•    Fireblade

From a market point of view

A couple of points on the market to conclude:

•    It’s hard to assess how big the “web scraping economy” is since it is at the intersection of several big industries (billion dollars): IT security, sales, marketing & finance intelligence. This technique is of course a small component of these industries but is likely to grow in the years to come.

•    A whole underground economy also exists since a lot of web scraping is done through “botnets” (networks of infected computers)

•    It’s a safe bet to say that more and more SaaS (like Datanyze pr Pricing Assistant) will find innovative applications for web scraping. And more and more startups will tackle web scraping from the security point of view.

•    Since these startups are often entering big markets through a niche product / approach (web scraping is a not the solution to everything, there are more a feature) they are likely to be acquired by bigger players (in the security, marketing or sales tools industries). The technological barrier are there.

Source: http://clementvouillon.com/article/web-scraping-startups-services-market/