Extract Data From Website: April 2013

Google has a new, embarrassing scandal on its hands this morning. Kenyan startup Mocality conducted a relatively elaborate sting to catch Google scraping results from its database of local businesses. Mocality CEO Stefan Magdalinski wrote up the full results on its blog.

Mocality is a crowdsourced platform that lists over 170,000 small Kenyan businesses. Over the last two years it paid out over $100,000 to Kenyans who contributed listings to its database. For 100,000 of those 170,000 businesses, the Mocality listing is the first time they've been listed online.

In September, Google decided to replicate some of what Mocality had already done. It launched a program called Getting Kenyan Businesses Online (GKBO). After Google launched GKBO, Mocality started getting what it calls "odd calls."

Small businesses were calling Mocality about websites, but Mocality doesn't offer websites. Google does.

Mocality then traced its sources of inbound traffic and found that an unusually high amount of traffic from one source was targeting the contact information of the businesses it listed. The traffic wasn't from an automated system; it was from a "team of humans."

So Mocality changed its code to catch the people coming from that source of traffic. Instead of listing the businesses' phone numbers, it listed its own number and told employees to act as if they were working at the small business.

Sure enough, calls came in to its offices. The people calling were from Google. These Google employees said they were working with Mocality, which isn't true, and then offered to get the business a website (for a fee).

What's worse, on one call a Google employee said that Mocality charges businesses, which is not true.

Mocality sat on the evidence for a little while during the Christmas break. When it came back from holiday, it found more shady business. This time Google was having people in India make the cold calls.

Magdalinski says that 30% of the businesses in Mocality's database have been contacted by Google. He's rightfully annoyed by Google's behavior. He wants to know why Google didn't simply ask Mocality if it wanted some of the data. He also wants to know why Google is lying to Kenyan businesses about its relationship with Mocality.

A Google rep provided this statement to us: "We're aware that a company in Kenya has accused us of using some of their publicly available customer data without permission. We are investigating the matter and will have more information as soon as possible."

Source: http://articles.businessinsider.com/2012-01-13/tech/30622472_1_google-employees-businesses-kenyans

Note:

Delta Ray is an experienced web scraping consultant and writes articles on Web Screen Scraping, Scraping A Website, Extract Data From Website, Website Screen Scraping and Scrape A Website.

Web scraping, also called web harvesting or web data extraction, is a software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox.
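To make the "low-level HTTP" option concrete, here is a minimal Python sketch that builds a GET request by hand and parses a raw response. The function names, the User-Agent string and the sample response are assumptions made for illustration, not part of the original article. The sketch ignores chunked transfer encoding, redirects and HTTPS, all of which a real scraper would need to handle.

```python
import socket


def build_get_request(host, path="/"):
    """Return the raw bytes of a simple HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: example-scraper/0.1",  # hypothetical identifier
        "Accept: text/html",
        "Connection: close",  # ask the server to close when done
        "",
        "",  # the two empty items produce the blank line ending the headers
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def fetch(host, path="/", port=80, timeout=10):
    """Send the request over a plain TCP socket and read until close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))


# Offline demonstration with a hand-written sample response.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
status, headers, body = parse_response(SAMPLE_RESPONSE)
```

In practice most scrapers use a higher-level HTTP client instead of raw sockets; the sketch only shows what such a client does underneath.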

Web scraping is closely related to web indexing, which indexes information on the web using a bot and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, weather data monitoring, website change detection, research, web mashup and web data integration.
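As a small illustration of turning HTML into structured data, the following Python sketch reads a table from a sample page and converts it to a list of records and then to CSV text. The sample page, the class name TableExtractor and the helper functions are hypothetical; only the standard library modules html.parser, csv and io are used. It assumes a simple, well-formed table whose first row is the header.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page, standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
<table id="prices">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def records_to_csv(records):
    """Serialize the records as CSV text suitable for a spreadsheet."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = html_table_to_records(SAMPLE_HTML)
csv_text = records_to_csv(records)
```

Here records holds two dictionaries with the keys Product and Price, and csv_text is the same data as three CSV lines, ready to be written to a file or loaded into a database.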

Techniques:

Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interactions. Web scraping, instead, favors practical solutions based on existing technologies that are often entirely ad hoc. Therefore, there are different levels of automation that existing web-scraping technologies can provide:

§ Human copy-and-paste: Sometimes even the best web-scraping technology cannot replace a human’s manual examination and copy-and-paste, and sometimes this may be the only workable solution when the target websites explicitly set up barriers to prevent machine automation.

§ Text grepping and regular expression matching: A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression matching facilities of programming languages (for instance Perl or Python).

§ HTTP programming: Static and dynamic web pages can be retrieved by sending HTTP requests to the remote web server using socket programming.

§ Data mining algorithms: Many websites have large collections of pages generated dynamically from an underlying structured source such as a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content and translates it into a relational form is called a wrapper. Wrapper generation algorithms assume that the input pages of a wrapper induction system conform to a common template and that they can be easily identified by a common URL scheme.

§ DOM parsing: By embedding a full-fledged web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.

§ HTML parsers: Some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content.

§ Web-scraping software: There are many software tools available that can be used to customize web-scraping solutions. Such software may attempt to automatically recognize the data structure of a page, provide a recording interface that removes the need to write web-scraping code by hand, offer scripting functions that can be used to extract and transform content, or supply database interfaces that store the scraped data in local databases.

§ Vertical aggregation platforms: Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of “bots” for specific verticals with no man-in-the-loop and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical, after which the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the Long Tail of sites that common aggregators find complicated or too labor-intensive to harvest content from.

§ Semantic annotation recognition: The pages being scraped may include metadata or semantic markup and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as with microformats, this technique can be viewed as a special case of DOM parsing. In other cases, the annotations are organized into a semantic layer that is stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
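The regular expression approach from the list above can be sketched in a few lines of Python. The sample markup, the pattern LINK_RE and the function extract_links are assumptions made for illustration. The pattern only handles double-quoted href attributes and anchors that contain plain text, which shows both how simple and how brittle this technique is compared with a real HTML parser.

```python
import re

# Hypothetical fragment of a downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/items/1">First item</a></li>
  <li><a class="x" href="/items/2">Second item</a></li>
</ul>
"""

# Matches an <a> tag with a double-quoted href and plain-text content.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="(?P<url>[^"]+)"[^>]*>(?P<text>[^<]*)</a>',
    re.IGNORECASE,
)


def extract_links(html):
    """Return (url, text) pairs for every anchor the pattern recognizes."""
    return [(m.group("url"), m.group("text").strip())
            for m in LINK_RE.finditer(html)]


links = extract_links(SAMPLE_HTML)
```

With the sample above, links contains the pairs ("/items/1", "First item") and ("/items/2", "Second item"). Anchors with single-quoted attributes or nested tags would be missed, which is why pages with irregular markup are usually better served by DOM or HTML parsing.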

Source: http://www.thehackingarticles.com/2012/08/web-scraping-techniques-of-web-data.html#.UXpfGdhW_VQ

Friday, 26 April 2013

Google Was Caught Scraping Data From A Kenyan Startup And Telling Lies About The Startup To Its Customers

Web scraping: Techniques of web data extraction