Data transport discussion
This wiki page is a working document in preparation of a post to the Taverna hackers list. The authors working on this page are:
- Morris Swertz
- Scott Marshall
- Adam Belloum
- Machiel Jansen
- Spiros Koulouzis
- Marco Roos
Members of the 'e-Science support team for NBIC' and myGrid are invited to co-author. Others are invited to comment.
Discussion on data transport solutions for BioAssist, 20/03/2008
One of the issues discussed in the BioAssist programme is the data transport limitations of Taverna and SOAP-based web services. This issue will be addressed in a collaboration between myGrid developers and core developers for the BioAssist programme. Taverna 2, which is scheduled for a first release halfway through this year (2008), will be extended. This wiki page results from a discussion that was organised in preparation of a visit by Morris, Machiel and Marco to Manchester in April, and of a master's project by student Spiros on data streaming between distributed web services. The overall question is: how can we enable and speed up the transport of big data sets in workflows developed in Taverna?
Summary of the work on data streaming by Spiros
The ability to stream data between web services (WS) could be an alternative way of transferring data for e-Science applications. A common workflow execution scenario involves a WS accessing or generating a large data set, which needs to be passed to a next WS for further processing. With a file-oriented approach, the entire data set needs to reach the next WS before that WS can start working on the data. With streaming, the next WS can start working on chunks of data immediately, as they are generated. To investigate the possibility of streaming data between WS, we have developed a streaming library that uses various protocols for streaming, taking into consideration issues such as security, reliability and speed. The implementation of this library follows a server/client paradigm and provides a simple API through which WS can stream data. A paper with more details can be found here and here
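The difference between the file-oriented and the streaming approach can be sketched in a few lines of Python. This is only an illustration and not the API of the streaming library described above: the names `produce_chunks`, `consume_file_oriented` and `consume_streaming` are hypothetical, and plain in-process generators stand in for the network transport between two services.

```python
from typing import Iterable, Iterator


def produce_chunks(n_chunks: int, chunk_size: int) -> Iterator[bytes]:
    """Hypothetical producer WS: yields the data set one chunk at a time."""
    for i in range(n_chunks):
        yield bytes([i % 256]) * chunk_size


def consume_file_oriented(chunks: Iterable[bytes]) -> int:
    """File-oriented: the whole data set must arrive before processing starts."""
    whole = b"".join(chunks)  # blocks until every chunk has been received
    return sum(whole)         # the 'processing' here is a simple byte sum


def consume_streaming(chunks: Iterable[bytes]) -> int:
    """Streaming: each chunk is processed as soon as it arrives."""
    total = 0
    for chunk in chunks:
        total += sum(chunk)   # only one chunk is held in memory at a time
    return total


file_result = consume_file_oriented(produce_chunks(4, 1024))
stream_result = consume_streaming(produce_chunks(4, 1024))
```

Both consumers compute the same result; the streaming variant never holds more than one chunk and can overlap its work with the producer, which is the property that matters for big data sets.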
Three scenarios (Marco):
1. Workflow contains legacy (SOAP-based) web services, and there is no support from the providers of the services.
  - In principle there is no proper solution.
2. Workflow contains legacy web services, while the providers of the services will help with a solution.
  - 2.1 (Longer-term solution) Extend axis with a library that enables data transport by reference (and streaming) that T2 can make use of. This may be the most elegant solution in the long run, but it introduces a dependency on (versions of) axis.
  - 2.2 Place a 'helper' service alongside the legacy services, deployed under tomcat on the provider's server. A T2 extension point should be implemented to make use of this service. This probably requires annotating the workflow to indicate where 'big data' is expected and the helper service should be used (see attached cartoon. -- not done yet). This manual step is not necessarily bad, as a workflow creator may be assumed to know what (s)he is doing. Possibly BioCatalogue or BioNanny can provide useful information automatically.
3. The web services for the BioAssist workflows still need to be made. Therefore the issue is not very urgent.
  - Add lines to the service's code that will have it make use of a new library for data transport. [Is this the same library of 2.1?] [I don't understand this - Machiel]
  - Do not return data streams but use URIs to refer to data.
  - Use REST rather than SOAP - Taverna will get a REST processor.
- A lot can be learned from Pegasus. [Adam: can you extend a bit or add a reference or two?]
- A requirement is backward compatibility: older clients (Taverna 1) should still be able to run the workflows. In theory the solutions above have this feature.
- For organisations such as NBIC and projects such as BioAssist, we may assume that support will include servers with the required functionality, provided by organisations such as SARA.
- 2.1 and 3: a new library to be installed under axis
- 2.2: a helper service to be deployed under tomcat.
- implementation of T2 extension points to make use of the axis library or the helper service of 2.2
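The idea of referring to data by URI instead of returning the data itself can be sketched as follows. This is a minimal illustration under stated assumptions, not a design for the axis library or the helper service: `producer_service` and `consumer_service` are hypothetical names, and a local temporary file with a `file://` URI stands in for storage that both services can reach. The point is that the workflow engine only passes the short URI string between the two services, while the bulk data travels directly from storage to the consumer.

```python
import tempfile
import urllib.request
from pathlib import Path


def producer_service(n_bytes: int, out_dir: Path) -> str:
    """Hypothetical service: stores its result and returns only a URI to it."""
    path = out_dir / "result.dat"
    path.write_bytes(b"A" * n_bytes)
    return path.as_uri()  # e.g. file:///tmp/.../result.dat


def consumer_service(data_uri: str, chunk_size: int = 64 * 1024) -> int:
    """Hypothetical next service: fetches the data itself, chunk by chunk."""
    count = 0
    with urllib.request.urlopen(data_uri) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            count += len(chunk)  # stand-in for real processing of the chunk
    return count


with tempfile.TemporaryDirectory() as tmp:
    # The 'workflow engine' only ever handles the URI string.
    uri = producer_service(1_000_000, Path(tmp))
    processed = consumer_service(uri)
```

In a real deployment the URI would more likely use a scheme such as HTTP and point at storage on the provider's server; an older client that cannot dereference the URI would still receive a valid (string) result, which is relevant to the backward compatibility requirement.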
Spiros on data streaming
Spiros will test the HTTP protocol for streaming, as well as some of the ideas coming from this discussion. [Spiros/Adam/Scott: can you change/extend this?]