Data transport meeting, Manchester, April 9, 2009
Disclaimer: As a user (albeit an unusual one) is writing most of this, 'data transport' is meant in its broadest sense. A solution is anything that gets big data sets through a workflow more effectively.
- A plan for tackling the data transport bottleneck in Taverna 2
- A list of requirements/consequences for client applications and users
- An implementation roadmap
- A roadmap aimed at Bioinformaticians, in particular those in the BioAssist project (can be drafted after the meeting)
- Gain some hands-on experience with T2
- Stuart Owen (myGrid)
- Morris Swertz (Groningen, member of e-Science support team for BioAssist/NBIC)
- Marco Roos (Amsterdam, member of e-Science support team for BioAssist/NBIC)
- Machiel Jansen by Skype (SARA, BioAssist)
- Bharathi Kattamuri (myGrid, security and myProxy)
- David Withers (myGrid, security and myProxy)
- Alex Nenadic (myGrid T2 security extension point)
As Machiel could not make it we will have a shorter meeting.
Start: 10.00hr GMT, 11.00hr CET
Points to discuss
- Machiel starts the discussion by explaining the SARA point of view (see below)
- Disentangling of solutions, what is needed for what?
- Collaborative plan of action
I can try to be present via Skype if you present me with a date and time. Also I can give you some information, but you have to tell me the direction of what you want to know.
Our strategy to enable a webservice to start jobs on the grid and be accessible by Taverna is as follows:
1. Write the service
2. Decide who may access the service
3. Let Taverna access the service by using the credentials of the user
The third step is important now. I have explained the strategy in the hackers mailing list. This is the procedure. It involves myproxy. Since there is a myproxy plugin for T1 this may already work. I haven't tested it yet.
The basic idea is here: http://grid.ncsa.uiuc.edu/myproxy/delegation/
At the Taverna side a proxy cert has to be made and stored at the SARA myproxy server. In return the user gets a username and password. This has to be passed to a grid-aware webservice. We suggested basic authentication on the mailing list. Then the Taverna story is over. The webservice now has to get a proxy from the myproxy server on the basis of the username and password.
This is very simple and I think (but am not sure) possible with T1. Creating the proxy and putting it in the myproxy server can be done in Java. I have the code. (Also we have to put a VOMS extension to the certificate - we can also do that in Java). Then the T part. T has to use https to get it to the other side. That's the part I would like to see.
When username and password arrive we can pick them up. We have Java code to get to the myproxy server and get a proxy. Jobs can then be submitted in principle. (You still have to write a webservice which does anything meaningful.)
Data transport is not a big issue right now. It is for legacy applications but I have not seen those on the Life Science Grid. For starters get something running first. If it is too slow because of large data flow we'll see.
There is also the time waiting for the jobs to get finished. Will Taverna wait for that? But that's taken care of in T2 as far as I understood.
By the way I don't think this mail thread should be posted on the BioAssist-core mailing list. It is out of place there.
Minutes by Stian Soiland-Reyes.
Stuart, working on data and provenance
Bharathi, works on the North West Grid, involved in myGrid doing grid stuff
Alex, doing security for myGrid
Morris, GENOSIS and involved with Marco, in particular how to deal with large data
Marco, which you don't know already
Stian, developer focussed on security and execution environments
David, developer focussed on security
Jits, working on myExperiment
Machiel, working at SARA, working on eScience support, looking at grid in the BioAssist program
Machiel: Explain SARA point of view
gLite as grid environment
Use a web service that uses the grid. The problem is getting the credentials across.
The basic idea has been described on the NBIC wiki, with myProxy, etc.
Creating a proxy certificate locally, but you can't just send it, you have to delegate it (hand-shake negotiation).
Send it to a web service that is using the grid, the service can contact the proxy server.
What Taverna should be able to do is sending the username/password to this service. Ideally all the myProxy communication could also be done in Taverna.
Marco asks about how the grid could solve the transporting-big-data-problem - they don't have a use-case where you have to send large data to the web service.
Taverna 2 has promising
Marco: There are workflows in BioAssist (that have not yet been designed) that could deal with large data. Taverna will not need access to the data or even to the grid. Marco says that users should not be required to put their data on the grid first, but that Taverna should do that for them. In the short term we can approach this by accessing files already on the grid, but in the long term Taverna should help with this.
Machiel: This should not be that hard for the user to do, it's just a matter of copying files to the grid.
Stuart: We'll focus now on the security, but for the data we should have communication open to be sure we're going in the same direction.
Machiel: Question: About authentication, and big chunks of data.
Bharathi: Separate web-service that talks to the grid, Taverna only passes myProxy username and password to the service.
Stuart: Is monitoring an issue?
Machiel: Have not tried integrating grid jobs with Taverna, but if they did then monitoring might be an issue.
Bharathi: Has mainly been doing this with gridsam, sending the JSDL including the username/password over HTTPS. It is a plugin for Taverna that pops up to do the delegation to the myProxy server.
Machiel: Would prefer to use HTTP Basic Authentication instead of embedding the password in cleartext in the JSDL.
Bharathi: Add VOMS proxy?
Machiel: Put a proxy into the myProxy server, then add the VOMS extensions to the proxy certificate. We can do this by email and have a look at the code.
Bharathi: Will commit the myProxy plugin this week.
Machiel: Basic authentication - how can we do that?
Bharathi: No UI support for that currently, but in t2
Alex: With the security agents it can have different username/passwords for various services. Not yet implemented.
Machiel: Should we use t1 still, or wait for t2?
Alex: We'll go for a simple one first that can do HTTP Basic auth, but later also funkier stuff like myProxy and certificate delegation.
Marco: Maybe Machiel should have more direct contact with Alex.
Alex: Is there a use-case of the user having more than one certificate?
Machiel / Marco: Everyone will get one certificate. There will only be one myProxy server. Just one username/password for that myProxy server.
Stuart: What is the t2 plan on this?
Alex: Tom said yesterday that he's writing something about this.
Machiel: Does Taverna have REST support already?
Stian: Well.. you can do simple GETs currently, but you have to construct the URIs manually.
Stian: The RESTful webservices you have, would they be "true" REST-services with links, etc?
Machiel: You can follow links, HTML, pretty strict REST. "Real REST".
Marco: Many of the bioinformatics sites have REST-ish interfaces with hackable URLs.
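Constructing such a "hackable" URL by hand, as Stian describes, can be sketched as follows. This is only an illustration: the endpoint `https://blast.example.org/search` and the parameter names `program` and `sequence` are hypothetical and do not describe any real service mentioned in the meeting.

```python
from urllib.parse import urlencode


def build_query_url(base_url, **params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameter names, for illustration only.
url = build_query_url(
    "https://blast.example.org/search",
    program="blastp",
    sequence="MKT AYI",
)
print(url)
```

A simple GET on the resulting URL is all that is needed for such REST-ish interfaces. Following links in the returned representation would take additional parsing.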
Stian: Would you need multiple requests, parsing of the resource representations to follow links etc.
Machiel: Have a BLAST service that we can use as an example.
Stuart: One of the problems is that there are not many RESTful services with WSDL 2.0 or WADL descriptions, so we can't
Alex: Hardcoding username/password in the workflow.. not very sharable or secure. The workflow should be inspected upfront before running to see what security credentials are needed. The user would click and select from their credential store. Username/password etc. Either the system would know which services are secure, or the user would have to select that themselves, or there could be a pop-up.
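The upfront inspection Alex describes could look roughly like the sketch below. It assumes a workflow is just a list of services carrying a manually set `secure` tag, and a credential store keyed by service name. Both structures are hypothetical and are not Taverna's actual data model.

```python
def credentials_needed(workflow):
    """Return the names of services in the workflow that are tagged as secure."""
    return [service["name"] for service in workflow if service.get("secure")]


def missing_credentials(workflow, credential_store):
    """Return secure services for which the store holds no credential yet."""
    return [
        name
        for name in credentials_needed(workflow)
        if name not in credential_store
    ]


# Hypothetical workflow and credential store, for illustration only.
workflow = [
    {"name": "blast", "secure": True},
    {"name": "format_output", "secure": False},
    {"name": "grid_submit", "secure": True},
]
credential_store = {"blast": ("alice", "secret")}

# Anything listed here would be asked for before the run starts,
# so that no pop-up is needed during execution.
to_ask = missing_credentials(workflow, credential_store)
```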
Machiel: Many ways to send the username/password, HTTP Basic Auth, SOAP header, etc, WS-Security.
Alex: It depends on what the service supports..
Machiel: Would you support HTTP Basic Authentication?
Alex: Yes, but over HTTPS.
Stian: If the service does not tell us what it requires, the workflow designer has to mark it manually.
Jits brings coffee and muffins, and Marco brings some Dutch waffle-like biscuit treat.
Alex: If the service expects normal Basic Auth over HTTPS, that shouldn't be a problem, but if it expects something special (like magic parameters) that would have to be specialised in the code. The best would be if they expected WS-Security.
Stuart: How would the security agent know what to do?
Alex: The processor (activity) would have to know how to send or use the credentials. Some services don't tell you until you try and they fail. It has to be discovered, or tagged manually by the user.
Marco: Will the user experience be kind of like in Firefox, popping up a big dialogue where the user clicks "Always OK" without reading anything? Likes the idea of inspecting everything first so no interaction is needed during execution of the services.
Marco: How would this work when running a workflow through myExperiment? Could you in principle have the same scenario to pre-configure the workflow?
Stuart: Before you run the workflow you have to make a "peer group" where you add your security agent. Stuart could join the peer group because he has access to service 1, and you have access to service 2, and together we could join a peer group and run a workflow using both services.
Marco: myExperiment could help in having several people executing a workflow.
Jits: myXP can do everything! But we're not trying to replace the workbench, just enough data and functionality to share data. Creating e-Experiments. Not sure if these heavy duty security agents would fit into our plan. Also looking into portal integrations, which could potentially be responsible for this. Propagate information from portal to myExperiment.
Marco: Find out about responsibility about NBIC / BioAssist on getting functionality out of myExperiment and Taverna, what we have to do ourselves.
Machiel: It's just a matter of using extension points?
Stian: Everything is an extension point, Manchester doesn't have to do anything, the users just need to fill in the blanks.
Marco: Come home with extension points in the rucksack!
Bharathi: Are you planning to do webservices using basic authentication then? Would you need to configure with host certificates?
Machiel: BasicAuth should fit the scenario, the service accesses myProxy service to get the proxy, and it will use that to submit jobs to the grid. Hidden from Taverna.
Alex: Taverna puts proxy in myProxy with myProxy username/password, and then
Typically the proxy certificates are valid for 12 hours.
Issue there, if you store it in myProxy the default is 1 week. The grid middleware can automatically lengthen the certificate. Need to put it into myProxy without password. Set some flags..
Alex: Go through it again.. the proxy certificate expires on myProxy, how is it renewed?
Machiel: It does not expire on the myProxy server; the proxy put onto myProxy lasts for a week, but the ones issued by myProxy last for 12 hours. The difference is you get a username/password to get a proxy certificate from the server - but in this case you would not have a password to get a renewal.
Alex: Why not just use the same username/password as the first time?
Machiel: The grid jobs
What if the 1 week proxy expires? Then the user has to put a new one, right?
Marco: Can you put a process on hold until the proxy has been renewed by the user? Will it always fail if it runs more than a week?
Machiel: The job will abort if the proxy expires, or wait in a queue. If the proxy certificate expires while it's in a queue it will not be run. But not very likely with a one week proxy.
Grid has a mechanism to check if the proxy is expiring, and if the 1 week one would expire, one would have to ask the user - and if the user agent is alive on the p2p network (now called "overlay network") - it can renew the proxy certificate for the myProxy service.
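The lifetimes mentioned in this discussion (one week for the proxy stored on myProxy, 12 hours for each proxy issued from it) give some simple arithmetic, sketched below. The function names are hypothetical, and the sketch ignores queueing time and any renewal margin.

```python
import math
from datetime import timedelta

# Lifetimes as quoted in the discussion above.
STORED_PROXY_LIFETIME = timedelta(days=7)
ISSUED_PROXY_LIFETIME = timedelta(hours=12)


def proxies_needed(job_duration):
    """Number of 12-hour proxies needed to cover a job of the given duration."""
    return math.ceil(job_duration / ISSUED_PROXY_LIFETIME)


def outlives_stored_proxy(job_duration):
    """True if the job runs longer than the one-week proxy stored on myProxy."""
    return job_duration > STORED_PROXY_LIFETIME
```

For example, a 30-hour job needs three short-lived proxies, while a job longer than a week would need the user to store a fresh proxy.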
Marco: If anything goes wrong - will that error message come back out to my Taverna?
Machiel: That depends on the web service.. now it's the web service that is responsible for what happens on the grid, and that would have to communicate it back to Taverna. Asynchronous web services needed? If it could take a week for a job to complete.
Stuart: But then you can't just increase the timeout to a week; it might work, but it's more reliable to do an asynchronous web service.
Marco: They are starting to create web services now, and there would be a need for
Stuart: Web service hackathon, there was a group looking at asynchronous web services. Their conclusion was that WSRF was over-complicated, and they tried to come up with a standard document with just a session ID. Martin Senger was there and said that they had already done this with Soaplab.
Stian: For Soaplab this is already hidden from the user because all Soaplab services do this in the same way. Such a standard document would mean we could wrap such services generally.
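The session-ID pattern discussed here (submit a job, get an ID back, poll until it is done) can be sketched as below. `FakeAsyncService` is an in-memory stand-in invented for this illustration. It is not Soaplab's interface, nor the document from the hackathon.

```python
import itertools


class FakeAsyncService:
    """In-memory stand-in for an asynchronous web service (hypothetical)."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls_until_done = polls_until_done

    def submit(self, job_input):
        """Start a job and return immediately with a session ID."""
        session_id = f"session-{next(self._ids)}"
        self._remaining[session_id] = self._polls_until_done
        return session_id

    def status(self, session_id):
        """Report RUNNING for the first few polls, then COMPLETED."""
        if self._remaining[session_id] > 0:
            self._remaining[session_id] -= 1
            return "RUNNING"
        return "COMPLETED"

    def result(self, session_id):
        return f"result for {session_id}"


def run_and_wait(service, job_input, max_polls=10, sleep=lambda seconds: None, interval=0):
    """Submit a job, then poll its status instead of holding one long connection."""
    session_id = service.submit(job_input)
    for _ in range(max_polls):
        if service.status(session_id) == "COMPLETED":
            return service.result(session_id)
        sleep(interval)  # pass time.sleep and a real interval in practice
    raise TimeoutError(f"{session_id} did not complete after {max_polls} polls")
```

Because every call returns quickly, no single request has to stay open for as long as the grid job runs.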
Stuart: You might see a pattern in Taverna already with a nested workflow with retries and
Marco: Should talk to Manchester about how to do this, and ensure that the workflow user is told if and how something went wrong.
Stuart: Dig out the link to the hackathon report.
Machiel: Will start tomorrow to put up a web service to put stuff on the grid. We do everything for users, not just for fun, so the error messages should be there.
Stuart: We've had issues with Axis not giving us the actual error message, the problem is that this is not really defined in the SOAP specs what the error message is.
Stian: Also sometimes Axis translates errors like "NullPointerException" in the error message by raising an actual exception.
Stuart: Rewriting the WSDL processor by ripping out Axis. Say to add wss4j.
Stian: So to conclude security, what's needed on a short-term for Machiel is HTTP Basic Authentication and the myProxy plugin.
Machiel: Yes, that would be enough.
Stian: And I guess you don't want to supply the myProxy username password again, so there should be a link between the myProxy plugin and the HTTP Basic Authentication. Either the myProxy plugin provides a little Security Agent, or it puts stuff in the other Security Agent.
David: Technical discussion - where and how would we put in the HTTP Basic Auth header?
Machiel: We're okay to wait for t2, because there's no urgent need for the solution now.
Marco: We have a workflow in another system that runs on the grid, and they wrapped it up as a web service - which requires grid authentication. Would the plugin allow me to run that workflow?
Bharathi: Need to do grid certificates -
Marco: Would require more work then
Stian: Yes, supporting web services that do authentication by certificates is a different piece of work, potentially bigger to avoid magic files etc. This is what is needed for the caGrid use-case.
Stuart: Plan of action: Keep in touch and double-checking and work with prototypes with
Machiel: Look into the myProxy code and look at the headers.
David: As soon as you have a web service up that we can test against, that would be great.
Marco: Contact Adam to see if it's worthwhile for this use-case.
Stuart: Would test both the myProxy plugin and make stuff working for BioAssist.
Alex: Is it normal HTTP Basic Authentication like in a browser? Request without auth, 401 Auth required returned, and then request again with basic auth.
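The browser-style exchange Alex describes (request without credentials, 401 returned, request repeated with the Basic header) can be sketched as follows. The `fake_service` function and the `alice`/`secret` credentials are stand-ins invented for this illustration. A real client would send the header over HTTPS, as agreed earlier in the meeting.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header: base64 of 'user:password'."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def fake_service(headers):
    """Stand-in for a grid-aware web service (hypothetical, no network involved)."""
    expected = basic_auth_header("alice", "secret")["Authorization"]
    if headers.get("Authorization") != expected:
        return 401, {"WWW-Authenticate": 'Basic realm="grid"'}
    return 200, {}


def call_with_challenge(service, username, password):
    """First try without credentials; on a Basic challenge, retry with them."""
    status, response_headers = service({})
    if status == 401 and response_headers.get("WWW-Authenticate", "").startswith("Basic"):
        status, response_headers = service(basic_auth_header(username, password))
    return status
```

Note that the password is only base64-encoded, not encrypted, which is why the header should only ever travel over HTTPS.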
Stuart: Have not seen any SOAP services yet that do Basic Auth, only REST ones.
Alex: There are also SOAP headers for WS-Security etc. Would have to say where it is to be put! It's a problem for the activity.
Marco: I like insecurity!
Alex: It makes life easier..
Stuart: Why can't we all just trust each other?
Marco: Will talk to Stuart and Stian about the other issues with data for the long-term plan. The bioinformaticians are not always going to use the grid correctly.
Machiel: Not an urgent issue, but it will come up.
Stuart: Not urgent, but we need to know that when we start development now that we don't go down the wrong path. Part of this meeting was initiated because Carole came back with problems about references for data
Marco: ..and a provenance use-case. Talked to Morris about large data: with legacy web services where the provider is not willing to change, you are stuck - but if they are willing to change then there could be a solution. If the web services are not yet written then you can do it "properly" - and this is where BioAssist is now.
Stuart: For legacy web services there's not much we can do, we still have to send the data, but when services are to be written, we need to come up with guidelines about how the services can be written to support t2's reference schemes - annotations in WSDL etc.
Marco: Ask two things - ask the service provider to add an extra set of services, so that Taverna could use the extra service that does dereferencing.
Stuart: It should be so that if you call a service outside Taverna it should pass the data by value, but with Taverna by reference. Using "mustAccept" fields in the WSDL? There must be some way of Taverna to know that it supports referencing, most likely in the WSDL.
Morris: Can you write such a service now?
Stian: Well, there is no standard for this yet, but we would have to write something ourselves, it wouldn't be that hard to get something up and running. We would basically be writing the standard as we go along.
Marco: Wrapping the legacy service behind a proxy that hides the big data. (Basically the data proxy) - would this be a way to make legacy services
Stuart: The service itself would need to get hold of the big data.
Stian/Stuart: If the legacy service has not been designed for large data it will probably fall over no matter which way you proxy the big data in the front - if it does, say, a string operation internally, that would pull in gigabytes of data straight to memory no matter what. So in this case you would always need to rewrite the service anyway.
Marco: So we need to change the culture of the service provider. Could we use vBrowser?
Machiel: Integrating vBrowser with Taverna? It is a general structural 'file system' with all kinds of 'files' (resources), GridFTP, SRB, etc, all kind of storage systems can be handled within a simple GUI, where you can copy files around. It has an API.. vBrowser is an open source project, but with an unclear license currently.
Marco: Would it help with what Bharathi asked about?
Machiel: It would be easy with your mouse to say where your data should go.
Marco: I'm building a workflow, and a service is producing data. How would I use vBrowser for this?
Machiel: Let's say you drag a location from vBrowser to a Taverna input.
Marco: If it works by reference it would be vBrowser that would take care of the up/downloading.
Stian: This sounds very relevant for t2's data references, say you drag a location in to Taverna, it would occur as a data reference in t2. And t2 should be able to do translations/transports between incompatible reference schemes, for instance dragging a GridFTP reference to a service that expects a SRB reference, t2 could do the dereferencing and uploading.
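The translation between reference schemes that Stian describes could be organised as a small registry of translators, sketched below. Everything here is hypothetical: the `gsiftp`/`srb` scheme names are used only as labels, the host `srb.example.org` is made up, and the translator merely rewrites the reference where a real one would download and re-upload the data. It does not reflect t2's actual reference API.

```python
from urllib.parse import urlsplit

# Registry of translators keyed by (source scheme, target scheme).
TRANSLATORS = {}


def translator(source, target):
    """Decorator that registers a function translating source -> target references."""
    def register(func):
        TRANSLATORS[(source, target)] = func
        return func
    return register


@translator("gsiftp", "srb")
def gridftp_to_srb(reference):
    # Placeholder: a real translator would fetch the data over GridFTP and
    # upload it to SRB; here only the reference itself is rewritten.
    return "srb://srb.example.org" + urlsplit(reference).path


def deliver(reference, accepted_schemes):
    """Pass the reference through if accepted, else translate it if possible."""
    source = urlsplit(reference).scheme
    if source in accepted_schemes:
        return reference
    for target in accepted_schemes:
        func = TRANSLATORS.get((source, target))
        if func is not None:
            return func(reference)
    raise LookupError(f"no translator from {source!r} to any of {accepted_schemes!r}")
```

With such a registry, dragging a GridFTP location onto an input that only accepts SRB references would trigger the registered translator instead of failing.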
Machiel leaves, Scott will join in tomorrow.