The Web Observatory aims to become a global data, analytics, and visualisation environment for the advancement of research and, ultimately, the improvement of economic and social prosperity. Given the ever-increasing streams of Web data, the Web Observatory positions itself as a suitable environment in which to study the evolution and impact of the Web's ecosystem, which operates at massive scale and is dominated by unexpected, emergent phenomena and radical user-led innovations in technology and society. These documents provide additional background information and insights into the concepts and technologies in use.
Beyond the SOCIAM project, the Web Observatory project is part of a bigger venture coordinated by the Web Science Trust, which leverages the research resources of 15 international academic labs (WSTNet), together with partners such as the Open Data Institute (ODI), the World Wide Web Consortium (W3C), the Web Foundation, the Fraunhofer Institute, and a growing list of industry collaborators.
The socio-technical principles of the Web Observatory are to:
Provide catalogues which locate and describe existing datasets about activity on the Web
Identify systems for aggregating datasets and analytics tools for their analysis and visualization
Establish open standards, protocols and practices to reduce the cost of gathering, sharing and analyzing data in the Observatory
Assist in the identification of new business models and appropriate licensing schemes to encourage industry adoption and collaborative opportunities
To develop a platform that is capable of harvesting, processing, storing, streaming, and retrieving data at Web scale is a complex engineering, HCI, and social challenge. The Southampton Web Observatory (SWO) is being designed and engineered using the latest data storage and processing technologies, and makes use of open standards and formats to store, access, and retrieve the data. As shown below, the architecture of the SWO consists of various layers, which use a mix of different technologies to provide the numerous features required.
The core of the data storage infrastructure is driven by a large purpose-built cluster which houses various distributed storage and processing solutions, including Apache Hadoop and, soon, Apache Spark. Given the scale of the data we are processing (thousands of records a second), we are working with distributed solutions in order to store and process incoming Web data streams. At present, we have over 30TB of Web data from services such as Flickr, Twitter, Tumblr, Reddit, SlashDot, Wikipedia, and Zooniverse. In order to harvest these data sources we have developed dedicated pipelines which either harvest data in real time (see the Wikipedia project for more information) or collect data from a dedicated Web crawl.
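The harvest pipelines described above share a common shape: a source-specific fetch step followed by a step that wraps each raw record with provenance before it enters the pipeline. A minimal sketch of that structure in Python is shown below; the class names, field names, and the canned Reddit record are all illustrative, not the SWO's actual code.

```python
import json
import time
from abc import ABC, abstractmethod

class Harvester(ABC):
    """Base class for a source-specific harvest pipeline (illustrative)."""

    source = "unknown"

    @abstractmethod
    def fetch(self):
        """Return an iterable of raw records from the service."""

    def emit(self, raw):
        """Wrap a raw record with provenance before it enters the pipeline."""
        return json.dumps({
            "source": self.source,
            "harvested_at": int(time.time()),
            "payload": raw,
        })

class RedditHarvester(Harvester):
    source = "reddit"

    def fetch(self):
        # A real harvester would call the Reddit API or read crawl output;
        # a canned record keeps this sketch self-contained.
        return [{"id": "t3_abc", "title": "example post"}]

for raw in RedditHarvester().fetch():
    print(RedditHarvester().emit(raw))
```

A real-time harvester would run `fetch` in a loop against a streaming endpoint, while a crawl-based one would read batch output, but both funnel into the same `emit` step.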
As part of the pre-processing stage of the pipeline, we transform our data using common data formats such as JSON in order to unify our data streams and allow for simpler processing further down the analytics pipeline. In order to deal with high-volume messaging feeds (often parsing thousands of messages a second), we take advantage of AMQP and use RabbitMQ to stream our data into Hadoop whilst also making it available for Web Observatory clients to connect to. We also make use of NoSQL storage solutions such as MongoDB in order to provide a rolling cache of the last 24 hours' worth of data. Again, this is made available via the Web Observatory portal's API.
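The two ideas above, publishing JSON messages over AMQP and keeping a rolling 24-hour cache in MongoDB, can be sketched as follows. This is a hedged illustration, not the SWO's actual code: `publish` assumes a pika `BlockingChannel` and `ensure_24h_cache` assumes a pymongo collection with a `created_at` timestamp field; exchange, routing key, and field names are placeholders.

```python
import json

def to_amqp_body(record):
    """Serialise a unified record to the UTF-8 JSON body published on AMQP."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")

def publish(channel, exchange, routing_key, record):
    """Publish one record to a RabbitMQ exchange.

    `channel` is assumed to be a pika BlockingChannel; Hadoop ingest and
    Web Observatory clients would consume from queues bound to `exchange`.
    """
    channel.basic_publish(exchange=exchange,
                          routing_key=routing_key,
                          body=to_amqp_body(record))

def ensure_24h_cache(collection):
    """Make a MongoDB collection behave as a rolling 24-hour cache.

    A TTL index on the record timestamp lets MongoDB expire documents
    automatically; `collection` is assumed to be a pymongo collection.
    """
    collection.create_index("created_at", expireAfterSeconds=24 * 60 * 60)
```

The TTL-index approach is one natural way to implement a "last 24 hours" cache in MongoDB, since the database itself evicts stale documents without any application-side cleanup.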
Shown below are the processing pipelines that we have engineered for storing and processing data:
A core feature of the SWO is to provide streaming access to real-time data from various Web services. Real-time streams provide us with an insight into the current state of the Web and the state of interaction between different social machines. This is of particular interest for SOCIAM, as it informs how we develop new metrics for measuring the state and health of a social machine.
Whilst there are various technologies available to process such volumes of data, implementing a robust and reliable solution is not trivial. We are currently harvesting and processing a number of different real-time streams, including Twitter, Wikipedia, and a collection of Web trend services (Google Trends, Yahoo News, Social Trends). We also perform real-time unification of the streams, representing them using a common data schema, which is then made available for public consumption via the Web Observatory API. As shown above, AMQP via RabbitMQ is the underlying messaging technology we use for this, wrapped with various layers of pre- and post-processing.
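Real-time unification amounts to mapping each source-specific record onto a shared set of fields. A minimal sketch of such a mapping is shown below; the common schema fields (`source`, `id`, `actor`, `text`, `timestamp`) are illustrative rather than the WO's actual schema, and the input field names follow the shape of Twitter's v1.1 tweet objects and Wikipedia's recent-changes records as assumptions.

```python
def unify(source, raw):
    """Map a source-specific record onto a common schema (illustrative)."""
    if source == "twitter":
        return {"source": "twitter",
                "id": raw["id_str"],
                "actor": raw["user"]["screen_name"],
                "text": raw["text"],
                "timestamp": raw["created_at"]}
    if source == "wikipedia":
        # Wikipedia recent-changes records carry an edit id, editor and title.
        return {"source": "wikipedia",
                "id": str(raw["rcid"]),
                "actor": raw["user"],
                "text": raw["title"],
                "timestamp": raw["timestamp"]}
    raise ValueError("unknown source: %s" % source)
```

Downstream consumers of the API then only need to understand one record shape, regardless of which social machine produced the activity.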
In SOCIAM, we consume these streams internally to answer various research questions regarding the health of social machines (see the Wikipedia project example), or to understand the flow of information within and across social machines (more details can be found here).
The Web Observatory is also capable of providing access to large amounts of historic data, stored from the real-time streams and various Web crawls. There are several challenges here, from providing adequate layers of pre-processing to make the data accessible in a timely way, to building user interfaces for interacting with and querying the data in a browser, to securing the data and adding layers of access control.
A critical feature of the Web Observatory is to enable a larger set of researchers to interact with and query data without the need for specific training in big data solutions such as Hadoop. As many of you who have experience with such systems will know, writing and executing queries requires some, if not extensive, knowledge of querying frameworks such as MapReduce.
In order to make querying more accessible, we are currently developing various interfaces which act as a middle layer, enabling those without domain-specific expertise to access and query data stored in the SWO. As shown below, we have designed Web interfaces which enable access to the data stored in our HDFS filesystem, making use of Apache Hive to query the data with a SQL-inspired language.
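To give a flavour of what the middle layer shields users from, the sketch below builds the kind of SQL-like statement (HiveQL) that such an interface might issue to Hive on a user's behalf. The table name, column names, and query shape are all hypothetical; the point is only that a researcher clicks through a form while Hive compiles the resulting SQL into MapReduce jobs.

```python
def hourly_volume_query(table, source, day):
    """Build an illustrative HiveQL query: records per hour for one
    source on one day. Table and column names are placeholders."""
    return (
        "SELECT hour(ts) AS h, count(*) AS n "
        "FROM {table} "
        "WHERE source = '{source}' AND to_date(ts) = '{day}' "
        "GROUP BY hour(ts) "
        "ORDER BY h"
    ).format(table=table, source=source, day=day)

print(hourly_volume_query("wo_streams", "twitter", "2014-06-01"))
```

A Web form with a dataset picker and a date picker can generate exactly this statement, so the user never needs to know HiveQL or MapReduce exists.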
Web Observatory Portal and API
In order to make the data visible and accessible, we have been developing various APIs to provide access to WO data, independent of the underlying data storage technology. By using a lightweight platform which incorporates NodeJS, AngularJS, and Socket.IO, we are able to provide a simple API to query various stores of historic data and real-time data streams. As controlling who can view and query the data is a core component of the WO, we use OAuth as a means to ensure data owners remain in full control of their data. The SWO portal has two layers of access control: WO users are required to register and have an account in order to interact with datasets, and in order to query them via the API, they are required to authenticate using OAuth procedures.
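From a client's point of view, the two layers above translate into obtaining a token and presenting it on every API call. A minimal client-side sketch follows, assuming the API accepts OAuth 2.0 bearer tokens (the RFC 6750 `Authorization` header); the endpoint path, parameter names, and the use of a `requests`-style session are all assumptions for illustration.

```python
def bearer_headers(access_token):
    """HTTP headers for an authenticated API call, assuming OAuth 2.0
    bearer tokens are accepted (header form per RFC 6750)."""
    return {"Authorization": "Bearer " + access_token,
            "Accept": "application/json"}

def fetch_dataset(session, base_url, dataset_id, access_token):
    """Query a dataset through the portal API.

    `session` is assumed to behave like a requests.Session; the
    /datasets/<id> endpoint path is hypothetical.
    """
    return session.get(base_url + "/datasets/" + dataset_id,
                       headers=bearer_headers(access_token))
```

Because the token is issued per user and per dataset grant, the data owner can revoke it at any time without touching the underlying store.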