XML Flattener tutorial: how to create a flat file from a PSI XML file
This tutorial is a step-by-step guide for users who want to extract information from a PSI XML file and write it out as a flat file. In the output file, each line describes one protein interaction.
The first step is to load the PSI schema by clicking the “open a schema” button. The file is called MIF.xsd and is available on the PSI web page. You can also find it in the directory data. Once the schema is loaded, the root node, named entrySet, should be displayed in the main frame. Some nodes are colored red, indicating that, according to the XML schema, the corresponding element in the PSI document should never be empty.
You are now ready to load the PSI document. This step can take some time as the application is checking the file.
Since we want each line of the flat file to describe an interaction, we select the node interaction (entrySet, entry, interactionList, interaction) and click on “node describing a line”. This node is now colored blue.
If we had not made this selection, the application would have looked by itself for the node most representative of a line once we had begun to select some nodes (it would look for the last duplicable node that groups all selected nodes).
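The automatic choice described above can be sketched as follows. This is only one possible reading of "the last duplicable node that groups all selected nodes", not the application's actual code: the function names `common_prefix` and `guess_line_node` are hypothetical, and we assume that each selected node is given as its path from the root and that the set of duplicable (repeatable) element names is already known from the schema.

```python
def common_prefix(paths):
    """Return the longest path prefix shared by all selected node paths."""
    prefix = list(paths[0])
    for path in paths[1:]:
        n = 0
        while n < min(len(prefix), len(path)) and prefix[n] == path[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def guess_line_node(selected, repeatable):
    """Pick the deepest repeatable node that is an ancestor of every selection.

    `selected` is a list of paths (lists of element names from the root);
    `repeatable` is a set of element names assumed to be duplicable.
    """
    prefix = common_prefix(selected)
    for depth in range(len(prefix), 0, -1):
        if prefix[depth - 1] in repeatable:
            return prefix[:depth]
    return None


# Paths of the two shortLabel nodes used as examples in this tutorial.
selected = [
    ["entrySet", "entry", "interactionList", "interaction",
     "names", "shortLabel"],
    ["entrySet", "entry", "interactionList", "interaction",
     "participantList", "proteinParticipant", "proteinInteractor",
     "names", "shortLabel"],
]
# Assumed set of duplicable elements, for illustration only.
repeatable = {"entry", "interaction", "proteinParticipant"}

line_node = guess_line_node(selected, repeatable)
print(line_node)
```

With these two selections the sketch settles on the interaction node, which matches the manual choice made in this tutorial.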
Now we can choose the elements we want to have in the flat file. For example, if we want the short label of the interaction, we select the node shortLabel (interaction, names, shortLabel) and press the button “select this node”. The node is now colored black. [shortLabel] should now appear in the frame titled preview, indicating that the flat file will contain one column for the shortLabel. Next, if we also want the shortLabel of each interactor in the flat file, we click on participantList, proteinParticipant, (proteinInteractorRef|proteinInteractor), which is automatically expanded, then proteinInteractor, names and shortLabel, and press “select this node”.
The flattener scans the document for the maximum number of interactors participating in any single interaction in that specific XML file. If the largest interaction involves ten proteins, it adds ten columns for the short labels. References are also taken into account: even if the PSI document is normalized, the flattener follows the references to find the short label of each referenced interactor. For that reason we should never select a reference node, which contains no real information itself, but let the application resolve it as if the file were not normalized.
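To make the column padding and the reference resolution concrete, here is a minimal sketch in Python. It is not the flattener's implementation: the function `flatten` and the `SAMPLE` document are hypothetical, the sample is a heavily simplified PSI-like fragment without namespaces, and we assume that a reference points to its interactor through matching `ref` and `id` attributes.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample: int1 uses references (normalized form),
# int2 embeds its interactor directly.
SAMPLE = """
<entrySet>
  <entry>
    <interactorList>
      <proteinInteractor id="p1"><names><shortLabel>protA</shortLabel></names></proteinInteractor>
      <proteinInteractor id="p2"><names><shortLabel>protB</shortLabel></names></proteinInteractor>
    </interactorList>
    <interactionList>
      <interaction>
        <names><shortLabel>int1</shortLabel></names>
        <participantList>
          <proteinParticipant><proteinInteractorRef ref="p1"/></proteinParticipant>
          <proteinParticipant><proteinInteractorRef ref="p2"/></proteinParticipant>
        </participantList>
      </interaction>
      <interaction>
        <names><shortLabel>int2</shortLabel></names>
        <participantList>
          <proteinParticipant>
            <proteinInteractor id="p3"><names><shortLabel>protC</shortLabel></names></proteinInteractor>
          </proteinParticipant>
        </participantList>
      </interaction>
    </interactionList>
  </entry>
</entrySet>
"""


def flatten(xml_text):
    """Return one row per interaction: its label, then one cell per interactor."""
    root = ET.fromstring(xml_text)
    # Index every interactor by id so that references can be resolved.
    by_id = {el.get("id"): el
             for el in root.iter("proteinInteractor") if el.get("id")}
    rows = []
    for interaction in root.iter("interaction"):
        label = interaction.findtext("names/shortLabel", default="")
        labels = []
        for participant in interaction.findall("participantList/proteinParticipant"):
            interactor = participant.find("proteinInteractor")
            ref = participant.find("proteinInteractorRef")
            if interactor is None and ref is not None:
                # Follow the reference as if the file were not normalized.
                interactor = by_id.get(ref.get("ref"))
            labels.append("" if interactor is None
                          else interactor.findtext("names/shortLabel", default=""))
        rows.append((label, labels))
    # Pad every row to the size of the largest interaction in the file.
    width = max((len(labels) for _, labels in rows), default=0)
    return [[label] + labels + [""] * (width - len(labels))
            for label, labels in rows]


rows = flatten(SAMPLE)
for row in rows:
    print(row)
```

In this sample the largest interaction has two participants, so every row gets two interactor columns and the row for int2 ends with an empty cell.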
Before printing the flat file we can choose the field separator (for example |, ; or ,).
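The effect of the separator can be illustrated with Python's standard csv module. This is only a sketch of the idea, not what the application does internally: the helper `write_flat` is hypothetical, and the quoting of values that contain the separator is csv-module behaviour that the flattener may or may not share.

```python
import csv
import io


def write_flat(rows, separator="|"):
    """Serialize rows as flat-file text using the chosen field separator."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=separator, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical rows: interaction label followed by interactor labels.
rows = [["int1", "protA", "protB"], ["int2", "protC", ""]]

for sep in ("|", ";", ","):
    print(write_flat(rows, sep))
```

Whichever separator we choose, it is worth picking a character that does not occur in the selected values, so that the columns stay unambiguous for whatever program reads the file next.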
Finally, we can print the flat file by pressing the “print” button.
This software was created at the University of Roma "Tor Vergata" by Arnaud Ceol with the help of the Mint Group. For any information you can contact me at arnaud@cbm.bio.uniroma2.it.
PSI: the Proteomics Standards Initiative