This page lists the weekly homework assignments. We will post each one approximately one week before it is due.

For each homework that involves code, you must follow these submission guidelines:

  • Submit an archive (tar, zip) containing your solution to the course CMS before the homework deadline. The archive filename must follow this pattern: info4302_hw{homework_number}_{firstname}_{lastname}.(tar.gz|zip)
  • We accept code contributions in Java and Python. Please make sure that you use recent language versions.
  • Your code must compile and run from the command line. You can use IDEs such as Eclipse for development, but please don't send us your IDE project files.
  • Your code should follow directory layout best practices, e.g., source code in a folder src or lib, executables in a folder bin, etc. This is language-specific and not standardized, but you will find guidelines on the Web.
  • The root directory of your archive file must include a README file, which explains precisely how we can run your code and what we are expected to see. There you must also acknowledge every piece of code you borrow. It is fine to use others' code in this class, but it is not fine to represent it as your own, even accidentally. That is plagiarism and a serious offense in both the work and academic worlds.
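
The archive filename pattern from the guidelines above can be checked before uploading. The following is a minimal sketch, not an official checker. It assumes that first and last names consist of ASCII letters only and that the homework number is written as plain digits; neither is specified above. The helper name is_valid_archive_name is hypothetical.

```python
import re

# Assumed reading of the pattern
# info4302_hw{homework_number}_{firstname}_{lastname}.(tar.gz|zip)
# Letters-only names and a digits-only homework number are assumptions.
ARCHIVE_NAME = re.compile(
    r"info4302_hw(\d+)_([A-Za-z]+)_([A-Za-z]+)\.(tar\.gz|zip)"
)


def is_valid_archive_name(filename):
    """Return True if the whole filename matches the submission pattern."""
    return ARCHIVE_NAME.fullmatch(filename) is not None


if __name__ == "__main__":
    for candidate in ("info4302_hw6_jane_doe.zip", "hw6_jane_doe.zip"):
        print(candidate, is_valid_archive_name(candidate))
```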

HW10: Recap (due 11/29 10:00am)

The purpose of this homework is to reflect on some of the main topics taught in this course. Your answers should be 800-1000 words in total.

Questions

  1. What historical and technical relationship exists between the Architecture of the Internet and the World Wide Web?
  2. What is content negotiation, and what additional dimension of content negotiation does the Memento project introduce? Through which extensions of the HTTP protocol?
  3. How do XML and JSON compare? What do they have in common, and how are they different? Based on your own experience so far, when would you choose one or the other as a content author or provider?
  4. What is the purpose of RESTful APIs? Name and summarize three key design principles.
  5. What is the idea behind Linked Data? What do RESTful APIs and Linked Data have in common, and what are the technical differences?
  6. Why do people and businesses mark up their Web pages with terms from the schema.org vocabulary?

What you should turn in

  • A TXT or PDF document containing your responses.

HW9: Google (Open) Refine and Cross Cutting Issues (due 11/15 11:59pm)

The purpose of this homework is to gain hands-on experience with the idea of linking "things" in your local dataset with those in other datasets. By following established links, you can then augment your dataset with additional data. This homework also focuses on the readings we posted for the upcoming cross-cutting issues sessions.

Task 1: Data Reconciliation with Google Refine / Open Refine

Use Google Refine (now Open Refine) to augment either the movie or the actor dataset from the previous homeworks with at least one other property (e.g., initial release date). You can use the built-in Freebase reconciliation service, build a custom URI fetching routine, or use RDF Refine for this task.

Task 2: Time Travel on the Web

Read the paper Memento: Time Travel on the Web and describe the goals Memento tries to achieve and its technical architecture in your own words. Please make sure your description (300-400 words) also answers the following questions:

  • How does Memento add the temporal dimension to HTTP?
  • What is a TimeGate, and how is it implemented in servers with and servers without archival capabilities?

Task 3: Human Computation

Read (at least) chapters 1, 2, 5, 6, and 7 of the Human Computation textbook and summarize, in your own (300-400) words, the basic idea of human computation. Also discuss the design decisions that need to be made to ensure that human workers compute the needed outputs.

Task 4: Citizen Science

Read the paper eBird: A citizen-based bird observation network in the biological sciences and summarize in your own (200-300) words what eBird is about and discuss the incentives eBird provides for participants to submit their data.

What you should turn in

  1. Your movie.csv or actor.csv file augmented by at least one more column
  2. A TXT or PDF document containing answers to Tasks 2, 3, and 4

HW8: SPARQL Querying and Structured Data Publishing (Due: 11/11 11:59pm)

The purpose of this homework is to gain hands-on experience with SPARQL and current data publishing techniques. You will formulate SPARQL queries against DBpedia and extend the movie data service from your previous homework to publish structured data about movies and actors.

Task 1: SPARQL queries

Please formulate SPARQL queries that answer the following questions, when executed against http://dbpedia.org/sparql:

  1. All films directed by Stanley Kubrick
  2. The English and, if available, Spanish abstracts of ten 1980s horror movies
  3. The human-readable names (in English) and the birth dates of all actors who starred in movies directed by Stanley Kubrick. Note that the result should also contain actors with unknown birth dates.

Task 2: Publish Movie and Actor Data as Linked Data

Extend the movie service you built for the previous homeworks so that it supports HTML representations for GET operations on actors and movies (UC3 + UC4). No HTML pages are required for the list-retrieval use cases or for create/update/delete/search operations.

Implement 303-style content negotiation and distinguish between information and non-information resources. Inspect each incoming HTTP GET request against a non-information resource (e.g., http://localhost:8888/actor/1, http://localhost:8888/movie/2) and, depending on the value of the HTTP Accept header field (e.g., "text/html", "application/rdf+xml"), redirect the client to the corresponding information resource (e.g., http://localhost:8888/actor/1.html, http://localhost:8888/actor/1.rdf) via a 303 See Other response.
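
The redirect decision described above can be sketched as a framework-independent function. This is only an illustration under stated assumptions, not the required implementation. The names SUFFIX_BY_MEDIA_TYPE, parse_accept, and negotiate_redirect are hypothetical, and the fallback to the HTML representation for unknown or wildcard media types is a design assumption. The Accept parsing is simplified: it handles q-values but ignores wildcards and other parameters.

```python
# Hypothetical mapping from media types to the URI suffixes used in the
# examples above (.html and .rdf).
SUFFIX_BY_MEDIA_TYPE = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
}


def parse_accept(accept_header):
    """Return the media types of an Accept header, best q-value first."""
    entries = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            entries.append((-quality, position, media_type))
    # Sort by descending quality, then by position in the header.
    return [media_type for _, _, media_type in sorted(entries)]


def negotiate_redirect(resource_uri, accept_header, default=".html"):
    """Return (status, location) for a GET on a non-information resource."""
    for media_type in parse_accept(accept_header):
        suffix = SUFFIX_BY_MEDIA_TYPE.get(media_type)
        if suffix is not None:
            return 303, resource_uri + suffix
    # Assumption: fall back to HTML for */* or unsupported media types.
    return 303, resource_uri + default


if __name__ == "__main__":
    print(negotiate_redirect("http://localhost:8888/actor/1",
                             "application/rdf+xml"))
```

In a Web framework, the returned status and location would be used to set the response status and the Location header.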

Test your Webservice using cURL and/or raptor and provide sample commands to verify that your service operates correctly.

Task 3: schema.org Markup

Mark up the actor and movie HTML representations produced by your movie service with structured data using the schema.org vocabulary.

Test your serializations using the Google Structured Data Testing Tool.

What you should turn in

  1. A file README.TXT containing the SPARQL queries from task 1
  2. Your Web Service implementation (e.g., movie_service.py) with detailed instructions in README.TXT
  3. At least two example cURL/rapper commands in README.TXT that allow us to test and verify that your service publishes Linked Data
  4. One actor and one movie HTML page, each showing schema.org structured data markup

HW7: Basic RDF Construction and Programming (Due: 11/01 11:59pm)

The purpose of this homework is to give you experience with using the Resource Description Framework (RDF) and existing RDF libraries (APIs). You will also extend the RESTful Movie Service from the previous assignment and bring it one step closer to a Linked Data service.

You can download and use the provided code skeleton for this homework.

Task 1: RDF APIs

Model the movie data set from the previous assignments as RDF statements. In as many cases as possible use elements (classes and properties) from existing vocabularies. Hint: although most facts can be expressed in these vocabularies, there are a few that can't, in which case you should use a single new namespace of your choosing.

Write a program (e.g., convert.py) that takes the three movie data CSV files as input, converts them into an in-memory RDF graph, and serializes that RDF graph into at least two different RDF formats (e.g., RDF/XML, Turtle) on completion. Use an existing RDF library (e.g., RDFLib) for handling RDF models in your program.

Make sure you provide instructions on how to run your program in README.TXT.

Task 2: Enhance movie service to support RDF

Extend your Movie Service from the previous assignment so that it also supports at least two RDF serialization formats (e.g., RDF/XML, Turtle) for use cases UC3 (Retrieve a specific actor) and UC4 (Retrieve a specific movie). You don't need to implement resource creation / update functions based on RDF.

You can either use an RDF API or template mechanisms such as those provided by Tornado.

Test your implementation using cURL and/or raptor and provide example commands in README.TXT.

What you should turn in for this homework

  1. a file convert.py that performs the conversion to RDF
  2. two example serializations in different formats (e.g., movies.rdf, movies.ttl)
  3. your Web Service implementation (e.g., movie_service.py)

HW6: RESTful Webservice (Due: 10/25 11:59pm)

Create a RESTful Webservice for retrieval and manipulation of actor and movie data. It should use the movie/actor dataset and implement the following use cases:

  • UC1: Retrieve a list of all actors
  • UC2: Retrieve a list of all movies
  • UC3: Retrieve a specific actor
  • UC4: Retrieve a specific movie
  • UC5: Create a new movie
  • UC6: Delete a movie
  • UC7: Update a movie
  • UC8: Retrieve a list of all actors playing in a certain movie
  • UC9: Search (full-text) over actors and movies

Think about a mapping of the "things of interest" (movies, actors) and design URI templates before you implement your Webservice. For instance: /actors/{:id} -> identifies a specific actor

The Webservice must serve XML and JSON data representations for all resources and also for search results. Distinguish between these representations by assigning separate URIs, e.g., /actors/1234.json identifies the JSON representation of a certain actor, whereas /actors/1234.xml identifies the XML representation.
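
One way to separate the resource path from the requested representation is to inspect the URI suffix. The sketch below is only an illustration; the helper split_representation and the choice of application/xml as the XML media type are assumptions, not part of the assignment.

```python
import posixpath

# Assumed media types for the two required representations.
CONTENT_TYPES = {
    ".json": "application/json",
    ".xml": "application/xml",
}


def split_representation(path):
    """Split '/actors/1234.json' into ('/actors/1234', 'application/json')."""
    base, extension = posixpath.splitext(path)
    content_type = CONTENT_TYPES.get(extension.lower())
    if content_type is None:
        raise ValueError("unsupported representation: %r" % path)
    return base, content_type


if __name__ == "__main__":
    print(split_representation("/actors/1234.json"))
    print(split_representation("/actors/1234.xml"))
```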

Your Webservice should also implement the principle of "connectedness", meaning that returned resource representations should include links to other related resources.
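
To illustrate connectedness, the following sketch builds a JSON representation of an actor that links to the movies the actor played in. The field names (id, name, links, rel, href), the /movies/{id} URI template, and the function actor_representation are hypothetical choices made for this example; your own design may differ.

```python
import json


def actor_representation(actor_id, name, movie_ids):
    """Build a JSON document for one actor with links to related resources."""
    links = [{"rel": "self", "href": "/actors/%d.json" % actor_id}]
    # One link per related movie, so clients can follow them without
    # having to construct URIs themselves.
    links += [
        {"rel": "movie", "href": "/movies/%d.json" % movie_id}
        for movie_id in movie_ids
    ]
    document = {"id": actor_id, "name": name, "links": links}
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(actor_representation(1234, "Jane Doe", [7, 9]))
```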

To implement your Webservice, you can reuse existing code from previous lectures. To get started, you can use our Python code skeleton, which uses Facebook's Tornado Webserver and already implements UC1: INFO/CS 4302 - HW6 - code skeleton

Create sample request URIs for your Webservice and test them using cURL. Make sure you document these cURL requests in your homework submission because we will use them for testing your service and grading your homework.

Please don't forget to implement basic error handling as discussed in our lecture meetings.

What you should turn in for this homework

  1. Your Web service implementation (e.g., movie_service.py)
  2. A file README.txt with detailed setup instructions and example cURL requests for testing your Webservice

HW5: XSLT, JSON (Due: 10/04 11:59pm)

Task 1

Create an XSL stylesheet that transforms the XML document that you created for HW3 or, alternatively, the catalogue.xml file included in the zip file below.

The result of this transformation should be an HTML document that renders in a browser similarly to the startpage.pdf template (provided in the zip file below); however, it should contain the complete list of all movies. The style of the formatting is up to you, but the exact content should be reproduced.

Note that the transformation relies on conditional processing of the date field, in particular the year value within the date string. Be forewarned that not all XSLT processors support the functions introduced in XPath 2.0. If you encounter that issue, you need to either switch to a different XSLT processor or modify the source XML to make parsing the birth date information easier.

Data files & reference implementations: The zip archive release-hw5.zip includes the following material for you to use in this task:

  • data/catalogue.xml (you may use this as your XML source document)
  • docs/startpage.pdf (providing a template for your HTML result document)

What you should turn in for task 1

  • The XML source document that you used, named catalogue.xml or if it is your own file, named {your_lastname}_catalogue.xml
  • An XSL stylesheet {your_lastname}_catalogue.xsl transforming the XML source document to an HTML document conforming to the template provided by startpage.pdf
  • The HTML file {your_lastname}_catalogue.html that was generated using your stylesheet
  • A screenshot demonstrating how the first few entries of your HTML file display in your browser

Task 2

Write two programs:

  • One that produces a JSON representation of the data in the three CSV files with movie and actor data and writes it to a .json file. Make sure it produces a syntactically correct JSON representation of the data, e.g., by checking it with a validator such as jsonlint or oXygen.
  • One that reads this newly created JSON file, converts the data to Python objects, extracts the following information from those objects, and prints it to the screen:
    • for each actor born in September (in any year): the actor's name, birth date, and the titles of the movies he or she has performed in
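
The extraction step of the second program might look like the sketch below. It makes strong assumptions that the task does not specify: the JSON layout (top-level "movies" and "actors" lists with the keys id, title, name, birth_date, and movie_ids) is hypothetical, and birth dates are assumed to be in YYYY-MM-DD format. Adapt both to the structure of your own catalogue and the provided CSV data.

```python
import json
from datetime import datetime


def september_actors(catalogue_json):
    """Return (name, birth_date, movie_titles) for actors born in September.

    Assumes a hypothetical layout with 'movies' and 'actors' lists and
    ISO-formatted (YYYY-MM-DD) birth dates.
    """
    catalogue = json.loads(catalogue_json)
    titles = {movie["id"]: movie["title"] for movie in catalogue["movies"]}
    results = []
    for actor in catalogue["actors"]:
        born = datetime.strptime(actor["birth_date"], "%Y-%m-%d")
        if born.month == 9:  # September, regardless of the year
            movie_titles = [titles[movie_id] for movie_id in actor["movie_ids"]]
            results.append((actor["name"], actor["birth_date"], movie_titles))
    return results


if __name__ == "__main__":
    sample = json.dumps({
        "movies": [{"id": 1, "title": "Movie A"}],
        "actors": [{"name": "Jane Doe", "birth_date": "1950-09-21",
                    "movie_ids": [1]}],
    })
    for name, birth_date, movie_titles in september_actors(sample):
        print(name, birth_date, ", ".join(movie_titles))
```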

Data files & reference implementations: The zip archive release-hw5.zip includes the following data files and reference implementations that you may build on and extend:

  • data/movies.csv (data on movies)
  • data/actors.csv (data on actors)
  • data/movie_actors.csv (data linking movies to actors)
  • src/create_simple-catalogue_JSON.py (to extract the data from the csv files and build a simple JSON based catalogue)
  • src/parse_simple-catalogue_JSON.py (to parse the simple-catalogue file and output to the screen)

What you should turn in for task 2

  • A program {your_lastname}_create_catalogue_JSON.py that produces a JSON representation of the (complete) movie data
  • A JSON file {your_lastname}_catalogue.json containing the (complete) movie data
  • A program {your_lastname}_parse_catalogue_JSON.py that parses the JSON data and outputs the required information to the screen
  • A screenshot of the information as output to the screen.

HW4: Internet Surveillance (Due: 9/27 10:00am, before class!)

Read the three readings for the cross-cutting issue session on Internet Surveillance. Each article provides an analytic framework for discussing and thinking about Internet surveillance:

  • Cooper: privacy risks associated with specific use cases of DPI and their mitigation
  • Fuchs: neutral and negative concepts of surveillance
  • Roberts & Palfrey: distinction between network, server side, and client side surveillance

Write a reading response as follows: for each of the three articles summarize in your own words the gist of the conceptual framework offered by the author(s) of the respective article and apply it to discuss a specific form of internet surveillance that you care about. How is the analytic framework useful in this discussion? Feel free to criticize the analytic framework offered.

Submit your reading response in a single (pdf or txt) document using in total between 800 and 1100 words. Please note your name and netid at the beginning of your reading response.

HW3: XML, XML Schema, XML Path (Due: 9/20 11:59pm (new date))

The purpose of this homework is to become familiar with the third pillar of the Web Architecture: Data Formats. It focuses on XML-based formats and gives you some practice with XML and XML schema languages. For working with XML, you can download trial versions of oXygen XML Editor or Altova XML Spy or simply use your text editor or libxml from your console.

Task 1

Create an XML schema file catalogue.xsd and an XML document catalogue.xml that make use of all types of data provided in the three CSV files movies.csv, actors.csv, and movie_actors.csv. In particular, use an XML attribute at least once. To decide when to use an element and when to use an attribute in the design of your XML document, consult Ogbuji's essay 'Principles of XML design: When to use elements versus attributes', online at http://www.ibm.com/developerworks/xml/library/x-eleatt/index.html

Ensure that both your XML document and your XML schema document are well-formed and that your XML document validates without errors against the XML schema, using an online schema validator such as http://www.freeformatter.com/xml-validator-xsd.html. For improved readability, please use indentation in all XML-based documents.

Data & reference implementations: The zip archive release-hw3.zip includes the data files and reference implementations listed below. We encourage you to make use of the reference implementations that we offer and simply revise and extend the respective files.

  • create_simple-catalogue.py (reference implementation: produces simple-catalogue.xml)
  • simple-catalogue.xsd (reference implementation: an XML Schema file that simple-catalogue.xml validates against)
  • movies.csv (data on movies)
  • actors.csv (data on actors)
  • movie_actors.csv (data linking movies to actors)

What you should turn in

  • A Python program {your_lastname}_create-catalogue.py (or Java program) that creates an XML document from the data in the three CSV files provided
  • The XML document {your_lastname}_catalogue.xml created by your program using the data in the three CSV files provided
  • An XML schema file {your_lastname}_catalogue.xsd that your XML document validates against

Task 2

Create a RELAX NG schema to validate the XML document that you created in task 1. Make sure the RELAX NG schema is a well-formed XML document. For improved readability, please use indentation in the document.

What you should turn in

  • A RELAX NG schema document {your_lastname}_catalogue.rng that is well-formed and that your XML document validates against

Task 3

Write three XPath expressions that:

  • Deliver the movie titles of all movie nodes
  • Count the number of actors listed for each movie
  • Return the birthday information of all actors

What you should turn in

  • A text document {your_lastname}_xpath.txt listing the XPath expressions

HW2: Identification, Interaction, Representation (Due: 9/9 11:59pm)

The purpose of this assignment is to understand and practice two of the three architectural bases of the Web: Identification and Interaction. You will also learn how to use browser add-ons and cURL to debug your Web Information System. The HTTP specification included in the readings for this week is an important reference for this assignment. You will also find the Web architecture document that is part of the readings useful. Of course, you can also search the web via Google, etc. for help with many of the answers. Please submit your answers as a single text (pdf, txt) document.

Task 1: Identifiers

Consider the following identifier schemes (each nicely described in Wikipedia):

  • Digital Object Identifier (DOI)
  • Uniform Resource Identifier (URI)
  • Domain Name System (DNS)

Briefly describe each of them in terms of the following characteristics:

  • Persistence: what are the mechanisms and inherent capabilities to last forever?
  • Scope: what type of entity do they identify (documents, persons, abstract concepts...)?
  • Uniqueness: what are the mechanisms to ensure global uniqueness?
  • Governance: who, if anyone, manages them?
  • Actionability: how are they, or not, tied to an access mechanism?

Task 2: HTTP in your Web browser

Most Web browsers provide some tools to monitor their HTTP interactions with Web resources. You can install the Firefox Web Developer add-on, enable Safari's Develop Menu, use Chrome's Developer Tools, or use any other browser's development tools. If you haven't done so in the past, either install those tools or enable them in your browser of choice. Also, find a way to clear or empty the cache in your browser. For example, in Safari this is done via a menu command; in Chrome it is done via the preferences. Once you discover how to do this, clear or empty the cache in your browser (make sure that you clear the cache rather than your entire web history, which you probably don't want to do). Now dereference http://www.infosci.cornell.edu/Courses/info4302/2012fa/ and answer the following questions.

  • How many web resources were requested and returned by this single HTTP request?
  • Describe the sequence of events triggered by this request, how many resources were eventually requested, and what is the nature (content-type) of each resource representation?
  • What is the meaning of the status code returned for each resource?
  • When you hit your browser's back button and reload the page, what has changed in the HTTP transactions and why? How does this relate to the cache that you cleared at the beginning of this exercise?

Task 3: HTTP with cURL

In this task you will use curl, a command-line HTTP utility, to examine HTTP transactions. If you are running Windows, Mac OS, or Linux, curl should already be available from a terminal window. The curl web page provides versions for all common operating systems. There are some tutorials available on the Web that help you quickly learn to use it. Take note of some useful command-line options, such as -H, which allows you to add arbitrary request headers, and -v, which verbosely displays your request headers and the corresponding response headers.

Use curl to experiment with the following HTTP GET scenarios:

  • Scenario 1: access http://www.google.com to retrieve its versions in French and Spanish.
  • Scenario 2: access http://dbpedia.org/resource/Berlin to retrieve its versions in text/html and application/rdf+xml. Describe what the resource identified as http://dbpedia.org/resource/Berlin denotes. What is the "object of interest" (using the terminology of the web architecture document) that it stands for?
  • Scenario 3: access the content/representation for the URI doi:10.1021/ci050378m through the proxy URI http://dx.doi.org/10.1021/ci050378m (note that this will only work at Cornell due to licensing restrictions). Think carefully when you answer the following question: what does each of the resources (and their respective URIs) involved in accessing a representation denote (make sure to consider the DOI, the proxy, and the final URI)?

For each scenario report the following characteristics:

  • the number of resources involved in the HTTP transaction.
  • the number of representations and their associations with the resource.
  • the role of content negotiation in the relationship between resources and representations.
  • the role of redirection in the relationship between resources and representations.

HW1: Introductory Reading (Due: 09/02, 11:59pm)

Read the items listed in the course introduction and answer the following questions.

  • Describe three features of V. Bush's vision and how you see them realized by today's World Wide Web.
  • The reading "Creating a Science of the Web" mentions three postulates of an ethos of web science: how are they motivated and what is your take on them (provide concise but thoughtful reactions).
  • Based on reading chapters 1 and 2 of Linked Data by Heath & Bizer, what problem does Linked Data set out to solve and what are the main ingredients of a solution?

Submit your answers in a single (pdf or txt) document using in total no more than 600 words.