.. _pycon:

.. include:: ../__scrape_logo.rst

=====================
Developing a Project
=====================

This is the second tutorial in a series. In the introductory tutorial, we saw how to select a destination (file or URL). Initially, it's also helpful to view the destination's source alongside the browser window. You can either search for what interests you in the source window, or use ``inspect element`` to get to the item that interests you. Once you've found the item of interest, you can start trying various tree traversal commands to get to related items in |s|, view the nodes found, and save some part of their content in |s| variables for output. You can also save your activity as a script, which you can edit and run later in |s|.

In this tutorial we'll see how to backtrack and make corrections. We'll also see how the various |s| commands behave when applied to multiple nodes.

.. Note:: *A word about* |s| ing *public sites*: Be a *Good Citizen*!

   - avoid repeatedly hitting a site, and loading its servers;
   - always check for copyright, and observe fair use doctrines.

PyCon Volunteer Reporting
==========================

Here's our project: the US PyCon 2013 Conference is coming up. PyCon is a community conference and depends heavily on volunteers. We want to track how many volunteers we still need for session staff [#sessstaff]_. The conference site lists the sessions and staff on http://us.pycon.org/2013/schedule/sessions. Since this page will likely change dynamically, we'll use a snapshot version we saved, just as you would when first developing a script (to avoid repeatedly hitting a site's servers). Having a static copy will also make it easier to follow along with the tutorial (also, after the conference, there will be no unfulfilled needs, so the web data won't be as interesting):

- download :download:`tutorial2.zip `.

Getting Oriented with |S| Commands
===================================

..
   scrape =
     navigating -
       body cssselect find find_by_text (findtext) find_class (findclass) findall get_element_by_id (getbyid) getchildren getnext getparent getpath getprevious search
     capturing -
       attrib content (text_content) tail text text_content (content)
     interaction -
       current (show) doc grab history (hi) list (also looks into history stack) nodes run (r) show tags
     settings / behavior
       glob / noglob headless / notheadless overwrite / roll (notoverwrite) populous / sparse
     vars -
       clear global local root table var
     other -
       EOF base help see help show
   browser =
     close open / scrape
   shell =
     shell (sh, !; inline: $(...))
   scripts =
     load save
   plugins =
     set
   output =
     json table yaml

Let's review what we've learned so far. When you open |s| with a ``URL``, |s| opens the URL in a browser and parses it into a tree of nodes held in scrape. These nodes are what you navigate. Using xpath and cssselect, you select nodes and extract data. Being able to inspect the state at each step is useful, as is being able to run scripts in batch.

In this tutorial we'll introduce some of the rhyme and reason behind |s|. Since |s| has over 60 commands, let's start by describing some structure around the commands (we will only introduce some of them in this tutorial).

.. graphviz:: ../context.dot
   :alt: [S]crape Context

|S| commands affect each of these areas. Most of the action happens in the hub - in |s| itself. The types of commands in |s| are:

- navigation
- content extraction (capturing)
- interaction
- settings
- variables

A Starting Strategy
====================

The first time you open a target ``URL`` it can be useful to open the page's source from the browser (I have them side-by-side at first).

.. sidebar:: To open the page source

   right-click in the browser page:

   .. image:: img2/page_source.png
      :width: 339px
      :height: 331px
      :scale: 65%

For smaller pages, it can be useful to search in the source for what interested you in the browser.
For larger pages, it can sometimes be easier to simply highlight what interests you in the web page, and use the |s| ``grab`` command to give you a small context. From there, it can be easier to search for the larger context in the source window, so you can get a good view of the context around your interest.

Let's do that now. Unzip the tutorial file (I've replaced the >1M in images with a single pixel gif to keep things manageable). You should have a file ``sessions.html`` and a directory ``sessions_files``. Assuming you've unzipped in the current directory, run ``scrape``::

   $ scrape sessions.html
   ..
   [S]crape >>>

To orient ourselves, use a few of the interaction commands from the *Introductory Tutorial*::

   [S]crape >>> nodes
   2
   [S]crape >>> tags
   ['head', 'body']
   [S]crape >>>

In this case, we are not concerned with any of the meta-data which might be in the ``<head>``::

   [S]crape >>> body
   [S]crape >>> nodes
   7
   [S]crape >>> tags
   ['header', 'div', 'script', 'script', 'script', 'script', 'div']
   [S]crape >>>

Looking at our browser window, the sessions are named and listed as visual blocks. Here are the parts of interest for our task:

.. image:: img2/session_view1.*
   :width: 710px
   :height: 793px
   :scale: 80%

Scrolling to the bottom of the browser page, we see there are 42 sessions. We can see that each session has a ``Session Chair`` and a ``Session Runner``. If no one has signed up, the page shows: ``No volunteers signed up``. With two roles per session, we need a total of 84 volunteers.

We'll need to gather information after the session name (e.g. ``Session #1``). Unfortunately, there's a lot of ``HTML`` code for headers, sponsors, and so forth - but let's go to our browser's source window and search for ``Sessions``. It looks like our info is all contained in an ``HTML`` list.

.. image:: img2/list_view1.*
   :width: 786px
   :height: 733px
   :scale: 85%

Let's just start by seeing what happens when we try to get the list of sessions.
If we try ``findclass``::

   [S]crape >>> findclass unstyled
   [S]crape >>> nodes
   43
   [S]crape >>>

It looks like we might have gotten the 42 sessions (their content looks to be held in ``