[S]crape Overview

Context

[S]crape operates within the following environment:

[figure: [S]crape Context]

Modes of Operation

You operate [S]crape in one of two ways:

  • interactively;
  • in batch.

[S]crape starts and opens targets through a browser [1], and gets its data from that browser.

You can interactively highlight sections in the browser, and [S]crape will give you unambiguous code to select that data. As with a Google search, skill helps in making the search term (an XPath or CSS selector) general, yet specific enough to return the result you want.

Inspect results as you work interactively until you are satisfied and then save a range of commands from your history to develop your script.

At any time you can load and run scripts against an opened target. Thus you build up a complete script incrementally. In batch mode, you can automatically run it over a pattern or list of targets.

You can test your scripts interactively in headless mode (that is, without a browser). You can also run batch mode either with a browser or headless.
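To see what headless means at the layer [S]crape is built on, here is a minimal standalone Selenium sketch (see footnote [1]); [S]crape manages all of this for you, and the target URL is only a placeholder:

    # Minimal standalone Selenium sketch, not [S]crape syntax.
    # [S]crape drives the browser this way internally (footnote [1]).
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")   # no visible browser window

    driver = webdriver.Firefox(options=options)
    driver.get("https://example.com")   # placeholder target
    print(driver.page_source[:200])     # the HTML a scraper works from
    driver.quit()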

Knowledge You Should Have

You should have a general understanding of HTML and CSS structure and form. You don't need to know much, but you should be able to recognize what you are looking at in small portions of web page source, and understand what type of thing you are trying to extract, i.e. a path, an attribute, or text.
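For example, in a made-up fragment of page source, the three kinds of targets look like this (a standalone lxml sketch, not [S]crape syntax):

    # Standalone lxml sketch (not [S]crape syntax) of the three kinds
    # of things you typically extract: a path, an attribute, and text.
    from lxml import html

    fragment = html.fromstring(
        '<div id="offer"><a class="buy" href="/cart">Add to cart</a></div>'
    )
    link = fragment.xpath('//div[@id="offer"]/a')[0]  # path: locates the node
    print(link.get("href"))                           # attribute: "/cart"
    print(link.text_content())                        # text: "Add to cart"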

You will need some basic understanding of XPath syntax and CSS selectors, as you will be using these to describe what you are looking for. When you manually highlight something in your browser, [S]crape returns an XPath. Often a CSS selector is both shorter and more precisely selective. [S]crape lets you view the context near your selection, which makes it easy to pick a different form of selector and test it before saving it to your script.
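As a small illustration of that trade-off, here is the same element selected three ways in a standalone lxml sketch (again, not [S]crape syntax; the page fragment is made up):

    # The absolute XPath a browser highlight tends to produce is long
    # and brittle; a general XPath or CSS selector is shorter and
    # survives page changes better.
    from lxml import html

    page = html.fromstring(
        "<html><body><div><ul>"
        '<li class="price">$9.99</li>'
        "</ul></div></body></html>"
    )
    print(page.xpath("/html/body/div/ul/li[1]")[0].text)  # browser-style
    print(page.xpath('//li[@class="price"]')[0].text)     # general XPath
    print(page.cssselect("li.price")[0].text)             # CSS (needs cssselect)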

[S]crape Shell

In interactive use, [S]crape is similar to a typical command shell, such as sh or bash, or cmd on Windows. In command interpreters, there are typically built-in commands and a way to execute external commands. Shells also provide variables, and some sort of program control.

[S]crape has a rich set of built-in commands, and allows calling external commands through your system’s shell. You can also add built-in commands by writing extensions to [S]crape in Python (plugins).

Since [S]crape outputs tables [2], variable names are like table column names. This means every variable in [S]crape is a list (you can think of them as arrays), and every table is an associative array of variables. In fact, you can save the result of a [S]crape run as CSV, JSON, or YAML. There are other important kinds of variables in [S]crape:

vars: Output variables are the normal variables, and are used to specify output table column names.
local: Local variables are similar to output variables, except that they are omitted from tables. They are used for intermediate results. Local variables are scoped per output table.
global: Global variables persist across output table changes.
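As an illustration of the column-oriented model described above, here is a minimal standalone Python sketch; the column names are made up, and this is not [S]crape syntax (YAML output works the same way through a YAML library such as PyYAML):

    # Each variable is a list (a column); a table is an associative
    # array of those lists. Column names here are illustrative only.
    import csv, json

    table = {
        "title": ["Widget", "Gadget"],
        "price": ["$9.99", "$4.99"],
    }

    with open("out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(table.keys())           # column names
        writer.writerows(zip(*table.values()))  # one row per index

    with open("out.json", "w") as f:
        json.dump(table, f, indent=2)           # keeps the column structure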

[S]crape is least like other shells in that there is no familiar loop control. This simplifies traversing an HTML tree and extracting data. Instead of looping, you traverse to locations in the XPath tree of the input document. We refer to the selected (current) XPath locations as nodes. Typical [S]crape operation involves traversing a document's tree, extracting selected content from those nodes, and repeating. In place of program control, you control which nodes you search from. Multiple nodes can be active at once (for example, all the list items of some part of the document), so scripts tend to be rather short. Some general control mechanisms [S]crape provides are:

root: Normally, navigation through the document is incremental. This command resets the root of the tree to the starting <html> tag. When the root of the document tree is set, its children become the active children, so normally the <head> and <body> tags will be the current starting nodes.
body: This resets the root node to the <body> tag.
grab: [S]crape opens a browser when it starts and communicates with it. grab gets a highlighted region from your browser, giving you an XPath to it.
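The effect of multiple active nodes can be sketched outside [S]crape as well: a single XPath selection yields every matching node at once, so one extraction step covers them all (standalone lxml; the list comprehension stands in for [S]crape applying extraction to all current nodes):

    # Standalone lxml sketch, not [S]crape syntax.
    from lxml import html

    page = html.fromstring(
        "<body><ul><li>alpha</li><li>beta</li><li>gamma</li></ul></body>"
    )
    items = page.xpath("//ul/li")     # all three <li> nodes are "active"
    print([li.text for li in items])  # ['alpha', 'beta', 'gamma']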

A majority of [S]crape commands involve selecting a node using an XPath selector, a CSS selector, or a combination of path and text search. The remaining commands deal with interactive use (history, viewing variables, running scripts, saving or loading scripts) and with outputting results (tables).
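For the combined path-and-text style of selection, a standalone XPath sketch (lxml again, not [S]crape syntax) might look like this:

    # Find a cell by its text, then take the cell that follows it.
    from lxml import html

    page = html.fromstring(
        "<body><table><tr><td>Name</td><td>Ada</td></tr>"
        "<tr><td>Born</td><td>1815</td></tr></table></body>"
    )
    cell = page.xpath('//td[contains(text(), "Born")]/following-sibling::td')
    print(cell[0].text)  # "1815"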


Footnotes

[1] [S]crape uses the Selenium Client Driver to run your browser. At this time, [S]crape supports only Firefox.
[2] [S]crape was initially designed to output CSV, but that is a bit too restricting. For one thing, to change the view of the data (the order in which the data is populated into columns, or the number and contents of tables) you would need to re-scrape the source. This is why you can also save variables as JSON or YAML: you can then rebuild and re-shape your tables from the saved data.
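A minimal sketch of that re-shaping idea, assuming the columns were saved as JSON (the file and column names are made up):

    # Hypothetical re-shaping: rebuild a CSV with a different column
    # order from a saved JSON table, without re-scraping the source.
    import csv, json

    with open("out.json") as f:
        table = json.load(f)            # {"title": [...], "price": [...]}

    order = ["price", "title"]          # a new view: reordered columns
    with open("reshaped.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(order)
        writer.writerows(zip(*(table[c] for c in order)))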