Sequence analysis


BIOINFORMATICS APPLICATIONS NOTE

Vol. 21 no. 24 2005, pages 4416–4419

doi:10.1093/bioinformatics/bti715

Improving disulfide connectivity prediction with sequential distance between oxidized cysteines

Chi-Hung Tsai¹, Bo-Juen Chen¹, Chen-hsiung Chan¹, Hsuan-Liang Liu² and Cheng-Yan Kao¹,³,*

¹Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 106, ²Department of Chemical Engineering and Graduate Institute of Biotechnology, National Taipei University of Technology, Taipei, Taiwan 10608 and ³Institute for Information Industry, Taipei, Taiwan 106

Received on August 19, 2005; revised and accepted on October 11, 2005
Advance Access publication October 13, 2005

ABSTRACT

Summary: Predicting disulfide connectivity precisely helps towards the solution of protein structure prediction. In this study, a descriptor derived from the sequential distance between oxidized cysteines (denoted as DOC) is proposed. An approach using the support vector machine (SVM) method based on weighted graph matching was further developed to predict the disulfide connectivity pattern in proteins. When DOC was applied, a prediction accuracy of 63% could be achieved for our SVM models, which is significantly higher than those obtained from previous approaches. The results show that using the non-local descriptor DOC coupled with local sequence profiles significantly improves the prediction accuracy. These improvements demonstrate that DOC, with a proper scaling scheme, is an effective feature for the prediction of disulfide connectivity. The method developed in this work is available at the web server PreCys (prediction of cys–cys linkages of proteins).

Availability: http://bioinfo.csie.ntu.edu.tw:33/Disulfide/

Contact: cykao@csie.ntu.edu.tw

Supplementary information: Supplementary data, detailed results, tables and information are available at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/

1 INTRODUCTION

Disulfide bonds, commonly found in extracellular proteins, stabilize folded conformations as they contribute to the stability of the three-dimensional structures with respect to thermodynamics (Wedemeyer et al., 2000). Since disulfide bonds impose length and angle constraints on the backbone of a protein, correct prediction of disulfide connectivity can be employed to dramatically reduce the search in conformational space and greatly raise the accuracy of protein structure prediction (Huang et al., 1999). Different methods (Fariselli and Casadio, 2001; Fariselli et al., 2002; Vullo and Frasconi, 2004) have been developed to predict disulfide connectivity with prior knowledge of the oxidization states of cysteine residues. These methods can be classified into two categories: (1) patternwise or (2) pairwise. The major difference between them is whether the methodology is developed to deal with alternative disulfide connectivity patterns (Vullo and Frasconi, 2004; Zhao et al., 2005) or the relationships between cysteine pairs (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005). This difference decides how the information is encoded. However, the prediction accuracies of these methods are still limited so far (~50%).

*To whom correspondence should be addressed.

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Besides the methodology used, another critical factor determining the predicting performance is the descriptor employed. Fariselli and Casadio (2001) computed residue contact potentials according to the nearest-neighbor residues of bonded cysteines. Secondary structure (Baldi et al., 2005; Ferrè and Clote, 2005) and solvent accessibility (Baldi et al., 2005) were also used as descriptors to represent input information. All these descriptors only describe the local environments of bonded cysteines. However, a disulfide bridge is a long-range interaction between two linearly distant cysteines. Descriptors containing local information only are insufficient for predicting disulfide connectivity accurately. Therefore, information regarding relationships between cysteines is highly desired.

Harrison and Sternberg (1994) have suggested that sequence separation between bonded cysteines correlates strongly with specific connectivity patterns. Zhao et al. (2005) also observed that the disulfide connectivity pattern is highly conserved among proteins with the same cysteine-separation pattern of oxidized cysteines. Although there have been some attempts (Vullo and Frasconi, 2004; Baldi et al., 2005) to take advantage of such information by using descriptors such as positions of cysteines or relative sequence length, the effects of these features have not been emphasized so far.

In this paper, a descriptor derived from the linear sequence distance between oxidized cysteines (denoted as DOC) was used to demonstrate its power in predicting disulfide connectivity. A pairwise method using support vector machine (SVM) to generate bonding potentials of cysteine pairs was developed. This method was further validated with a dataset derived from Swiss-Prot release 39 (SP39), and significant improvements were obtained when the non-local descriptor DOC coupled with local sequence profiles was applied. These results reveal that DOC is an effective feature in disulfide connectivity prediction. The web interface service of the method proposed in this study for disulfide connectivity prediction is available at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

2 METHODOLOGY

2.1 Prediction of the connectivity pattern of disulfide bridges

With prior knowledge of the oxidation states of cysteine residues, a prediction strategy similar to previous studies (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005) was applied. The whole problem was mapped to an undirected complete graph, where oxidized cysteines were considered as vertices and the probabilities of connectivity between cysteine pairs were assigned as the weights of the edges between corresponding vertices. The disulfide connectivity pattern can then be inferred by solving the maximum weight matching of this graph, which implies maximum probabilities for the bonding pairs of the resulting pattern.

2.1.1 SVM  In this work, SVM was employed to predict the potential of connectivity between cysteines. SVM has been applied broadly within the field of computational biology to pattern-recognition problems and is a promising technique for data classification (Vapnik, 1998). Given data $x_1, \ldots, x_l$, we set their labels, $y_i$, as +1 if $x_i$ is in class 1 and as −1 if $x_i$ belongs to class 2. Then with these training data, SVM solves an optimization problem for binary classification:

$$\min_{\omega, b, \xi} \; \frac{1}{2}\omega^{T}\omega + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i\left(\omega^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \tag{1}$$

where $x_i$ is mapped to a higher dimensional space by the function $\phi$; $\xi_i$ is the training error allowed and $C$ is the cost of error. Moreover, the SVM output can further be used to approximate the posterior class probability $P(y_i = 1 \mid x_i)$ with a sigmoid function (Platt, 2000):

$$P(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}, \tag{2}$$

where $A$ and $B$ are parameters and $f_i = \omega^{T}\phi(x_i) + b$. Using (2), we can infer the bonding probability for each pair of cysteines. The software LIBSVM (Chang and Lin, 2000), a library for SVMs, was adopted in our experiments.

2.1.2 Data encoding  Two descriptors were mainly considered to encode input data for the SVM: (1) local sequence profiles (evolutionary information) around target cysteines from multiple sequence alignments and (2) the linear DOC.

We generated sequence profiles by performing multiple sequence alignments with the widely used program PSI-BLAST (Altschul et al., 1997). For each cysteine pair Cys(i, j), profiles were extracted using a window centered at cysteines i and j. The window size indicates the scope of vicinity of the target cysteine and determines how much information is provided for our models. In our experiments, the window size was set to 13, and the values of elements in the profiles were scaled to [0, 1].

For a cysteine pair with sequence indexes i and j, the corresponding DOC is defined as follows:

$$\mathrm{DOC}(i, j) = \lVert i - j \rVert. \tag{3}$$

Since scaling approaches may affect the performance of SVM, three scaling schemes for DOC were tested:

(1) DOC_L, DOC normalized with the protein sequence length L.
(2) DOC_max, DOC normalized with the maximum value of the whole dataset.
(3) DOC_log, DOC values normalized with the logarithm function.

2.1.3 Maximum weight matching  Features were encoded with respect to each pair of cysteines, and SVM models were trained with these data to generate posterior probabilities that indicate the potential of connectivity between cysteine pairs. After the bonding probability of each cysteine pair was produced by SVM models, an implementation of Gabow's algorithm (Gabow, 1973), wmatch (Rothberg, http://elib.zib.de/pub/Packages/mathprog/matching/weighted/), was used to find the maximum weight matching. Finally, the matching with maximum weight was transformed to the corresponding disulfide connectivity pattern.

2.2 Evaluation criteria

Our models were evaluated by $Q_p$ and $Q_c$, which are defined as follows:

$$Q_p = \frac{C_p}{T_p}, \qquad Q_c = \frac{C_c}{T_c}, \tag{4}$$

where $C_p$ is the number of proteins whose connectivity patterns are correctly predicted; $T_p$ is the total number of proteins in the test set; $C_c$ is the number of disulfide bridges correctly predicted and $T_c$ is the total number of disulfide bridges in test proteins.

3 IMPLEMENTATION AND RESULTS

3.1 Dataset

In order to compare our method with the approaches reported previously (Vullo and Frasconi, 2004; Baldi et al., 2005), the same dataset extracted from SP39 (Bairoch and Apweiler, 2000) was employed. The same filtering procedure (Fariselli and Casadio, 2001) was applied to ensure that only high quality and experimentally verified intra-chain disulfide bridge annotations were included. For cross-validation, this dataset was further divided into four subsets so that any two of them shared sequence homology ≤30%.
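The encoding, matching and evaluation steps of Section 2 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: all function names are hypothetical, the natural logarithm used for DOC_log is an assumption (the text does not give its exact form), and exhaustive enumeration of pairings stands in for Gabow's algorithm and wmatch, which is only practical for a handful of bonds. The pair probabilities are made-up numbers used in place of SVM outputs.

```python
import math


def doc(i, j):
    """Sequential distance between cysteines at sequence indexes i and j (Eq. 3)."""
    return abs(i - j)


def scale_doc(d, scheme, seq_len=None, dataset_max=None):
    """Scale a DOC value with one of the three schemes named in the text."""
    if scheme == "L":      # DOC_L: divide by the protein sequence length
        return d / seq_len
    if scheme == "max":    # DOC_max: divide by the largest DOC in the dataset
        return d / dataset_max
    if scheme == "log":    # DOC_log: natural log is an assumption
        return math.log(d)
    raise ValueError(f"unknown scheme: {scheme}")


def perfect_matchings(vertices):
    """Yield every way of pairing up an even-sized list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_pattern(cysteines, prob):
    """Return the pairing with the largest total bonding probability.

    Brute force over all patterns; a stand-in for a real maximum weight
    matching solver, not the algorithm used in the paper.
    """
    return max(perfect_matchings(sorted(cysteines)),
               key=lambda pattern: sum(prob[pair] for pair in pattern))


def qp_qc(predicted, actual):
    """Q_p and Q_c (Eq. 4) for parallel lists of predicted and true patterns."""
    cp = sum(set(p) == set(a) for p, a in zip(predicted, actual))
    cc = sum(len(set(p) & set(a)) for p, a in zip(predicted, actual))
    tc = sum(len(a) for a in actual)
    return cp / len(actual), cc / tc


# Toy protein: four oxidized cysteines with made-up pair probabilities.
cysteines = [10, 25, 40, 60]
prob = {(10, 25): 0.20, (10, 40): 0.90, (10, 60): 0.10,
        (25, 40): 0.15, (25, 60): 0.80, (40, 60): 0.30}
pattern = best_pattern(cysteines, prob)
```

In this toy case `pattern` pairs the cysteines at positions 10 and 40 and at positions 25 and 60, since that pairing has the largest summed probability (1.7, against 0.5 and 0.25 for the two alternatives).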

3.2 Cross-validation of SP39

Table 1 lists the accuracies of 4-fold cross-validation performed with the dataset SP39 for our model along with the results reported previously. Using sequence profiles only, our SVM models obtained a Qp of 59%, which is better than those obtained in previous works. This may result from the generalization ability of SVM, which avoids over-fitting during the training process. Another reason for the improvement is the enlarged window size used when extracting sequence profiles. We built SVM models with different window sizes, and the accuracy of the predictions is shown in Figure 1. The overall Qp increases with window size and peaks at 13, which was adopted in this work. With the window size of 5 used by Vullo and Frasconi (2004) and Baldi et al. (2005), a similar accuracy of 52% was obtained with our method. Moreover, when DOC was used, the prediction accuracy was further improved.

Table 1. Results of cross-validation on the data extracted from SP39

Methods                   B = 2         B = 3         B = 4         B = 5         B = 2–5
                          Qp(%) Qc(%)   Qp(%) Qc(%)   Qp(%) Qc(%)   Qp(%) Qc(%)   Qp(%) Qc(%)
MC graph-matching^a       56    56      21    36      17    37      2     21      29    38
NN graph-matching^b       68    68      22    37      20    37      2     26      34    42
BiRnn-2 profile^c         73    73      41    51      24    37      13    30      44    49
2D-Rnn profile^d          74    74      51    61      27    44      11    41      49    56
dNN2^e                    62    n/a     40    n/a     55    n/a     26    n/a     49    n/a
CSP                       72    72      ?     66      33    50      18    36      52    58
SVM profile               76    76      ?     62      48    62      44    60      59    65
SVM profile + DOC_log     79    79      ?     62      55    70      58    71      63    70

a Fariselli and Casadio (2001). b Fariselli et al. (2002). c Vullo and Frasconi (2004). d Baldi et al. (2005). e Ferrè and Clote (2005), only results of Qp are available.
? Only seven of the eight Qp values for B = 3 survive in this copy (21, 22, 41, 51, 40, 53, 53); the first five are assigned in order, and the remaining two values of 53 cannot be assigned unambiguously to the last three methods.

Fig. 1. The accuracy (Qp) of predictions using different window sizes to extract sequence profiles on the dataset SP39.

To explore the effects of scaling schemes on DOC, three scaling functions were considered: DOC_L, DOC_max and DOC_log. The trend of DOC between cysteine bonding pairs in dataset SP39 is shown in Figure 2a, and the distributions of DOC_L, DOC_max and DOC_log are shown in Figure 2b–d, respectively. As can be seen, DOC_max retains the distribution of the DOC, since the scaling is simply performed by dividing the distance by a fixed value. On the other hand, the originally skewed distribution of DOC becomes close to a normal distribution after the logarithm function is applied, and the distribution of DOC_L becomes blurred due to the variation of sequence lengths.

Fig. 2. Histogram of the fraction of chains versus (a) the original distribution of DOC without normalization, (b) DOC_L, (c) DOC_max and (d) DOC_log in the dataset SP39.

Prediction accuracies of 59% and 61% were obtained with the scaling functions DOC_L and DOC_max, respectively. The highest prediction accuracy of 63% was obtained with the scaling function DOC_log, which was selected to build our SVM models for disulfide connectivity prediction. These results suggest that the scaling of DOC can affect its contribution to our models. With a proper scaling function, DOC can enhance the performance of SVM models.

3.3 PreCys (prediction of cys–cys linkages in proteins) web server

The PreCys server (at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/) provides the service of disulfide connectivity prediction by the method developed in this work. In addition, a simple CSP search can also be accessed on the website. This server provides two SVM models built from Swiss-Prot releases 39 and 47. With the sequence and, optionally, the positions of oxidized cysteines as input, the bonding probabilities of cysteine pairs and the final connectivity pattern can be generated. Additional experimental results and the chain lists used can be found at this website.

4 DISCUSSION AND CONCLUSION

There are two major categories for the methods of disulfide connectivity prediction. The 'patternwise' approaches take the whole protein as a unit directly and rank alternative connectivity patterns (Vullo and Frasconi, 2004). They can easily include global information, such as the sequence length, amino acid contents or the positions of all cysteines. On the other hand, the 'pairwise' methods (Baldi et al., 2005; Ferrè and Clote, 2005) lack the overview of the whole protein and are usually limited to the scope of local environments of cysteines.
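To make the two encodings concrete, the short sketch below counts the candidates each one has to consider for a chain with B disulfide bonds, that is, 2B oxidized cysteines: the number of cysteine pairs for a pairwise method and the number of complete connectivity patterns, (2B − 1)!!, for a patternwise one. The function names are illustrative and are not taken from the paper.

```python
from math import comb


def count_pairs(b):
    """Number of cysteine pairs among the 2*b oxidized cysteines."""
    return comb(2 * b, 2)


def count_patterns(b):
    """Number of complete connectivity patterns, (2b - 1)!!."""
    total = 1
    for odd in range(1, 2 * b, 2):
        total *= odd
    return total


def negatives_per_positive(n_candidates, n_correct):
    """Negative examples per positive example among the candidates."""
    return (n_candidates - n_correct) / n_correct


for b in range(2, 6):
    pairs, patterns = count_pairs(b), count_patterns(b)
    # b of the pairs are true bonds; exactly one pattern is the true pattern.
    print(b, pairs, negatives_per_positive(pairs, b),
          patterns, negatives_per_positive(patterns, 1))
```

For B = 2, for instance, there are 6 candidate pairs (2 of them bonded) but only 3 candidate patterns; the pattern count grows much faster than the pair count as B increases.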

However, the patternwise methods often suffer from the problem of insufficient data, especially when the number of disulfide bonds increases. For proteins with five disulfide bonds, there are some patterns that only have one instance in the dataset. These patterns are not likely to be predicted correctly by patternwise methods because there is not enough information for model training. For example, the connectivity patterns of the protein chains CTRA_BOVIN (PDB: 1HJA, pattern: [1–4, 2–3, 5–9, 6–7, 8–10], Fig. 3) and UROK_HUMAN (PDB: 1LMW, pattern: [1–3, 2–4, 5–9, 6–7, 8–10]) only appear once in the dataset SP39. The patternwise method CSP fails to predict the disulfide connectivity of these chains, because no template is available for the patterns to be predicted. On the other hand, our pairwise SVM models can still predict their connectivity correctly, since the pattern can be assembled from the bonding pairs predicted.

Fig. 3. (a) The structure and the connectivity pattern of disulfide bridges and (b) the bonding potential P(i, j) for each cysteine pair cys(i, j) generated by the SVM model for chymotrypsinogen A (PDB id 1HJA). Selected bonding pairs are boxed.

In addition, the imbalance between the positive and negative data differs for pairwise and patternwise methods. For a protein with B disulfide bonds, the positive/negative ratio is 1:(2B − 2) for pairwise encoding. However, for the patternwise encoding, the imbalance is more severe, since there is only one correct pattern among the (2B − 1)!! generated entries. Taking B = 5 as an example, the positive/negative ratio is only 1:8 in pairwise encoding. With the same bond number B in patternwise encoding, there are 945 entries and the positive/negative ratio is 1:944. Such severe imbalance can bias the learning process and result in poor models. Due to the insufficiency of data and the severe imbalance issue of patternwise encoding, we adopted the pairwise approach in our method.

In this paper, we developed a method to predict disulfide connectivity based on SVMs. The non-local descriptor DOC describing the distance between oxidized cysteines was proposed to encode additional information for our input. For the dataset SP39, the prediction accuracy can be improved significantly with the combination of local sequence profiles and the non-local descriptor DOC. The significant improvement in prediction accuracy over previous approaches can be attributed to the following reasons. First, SVMs can avoid over-fitting problems commonly seen in neural networks and other machine learning methods. Second, we explored the local environments of oxidized cysteines and found the optimum window size with the best Qp values. Third, the non-local descriptor DOC_log also contributes to the prediction accuracies. Our method achieved an accuracy of 63% on dataset SP39 when DOC was used, which outperforms previous approaches. Consistent improvements were also obtained on other datasets; detailed results can be found in the Supplementary data. These results imply that the formation of disulfide linkages between cysteines is determined not only by the local information of cysteines but also by the relationships between them. The descriptor DOC contains important information about the relationships between oxidized cysteines and is an effective feature for predicting disulfide connectivity accurately. This descriptor can additionally be applied to other problems where knowledge of disulfide bridges is required. The web interface of our program is provided on the PreCys website. The results from our method may be useful for advanced studies in protein structure prediction, protein structure modeling and protein engineering.

ACKNOWLEDGEMENTS

We would like to thank Jianlin Cheng for generously sharing datasets and useful comments and Shih-Chieh Chen for enlightening discussion. Funding to pay the Open Access publication charges for this article was provided by the Institute for Information Industry.

Conflict of Interest: none declared.

REFERENCES

Altschul, S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Bairoch, A. and Apweiler, R. (2000) The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

Baldi, P., Cheng, J. and Vullo, A. (2005) Large-scale prediction of disulphide bond connectivity. In Saul, L.K., Weiss, Y. and Bottou, L. (eds), Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, pp. 97–104.

Chang, C.-C. and Lin, C.-J. (2000) LIBSVM: introduction and benchmarks. Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Fariselli, P. and Casadio, R. (2001) Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957–964.

Fariselli, P., Riccobelli, P. and Casadio, R. (2002) A neural network based method for predicting the disulfide connectivity in proteins. In Damiani, E., Jain, L.C., Howlett, R.J. and Ichalkaranje, N. (eds), Knowledge based intelligent information engineering systems and allied technologies (KES 2002). IOS Press, Amsterdam, 1, pp. 464–468.

Ferrè, F. and Clote, P. (2005) Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21, 2336–2346.

Gabow, H.N. (1973) Implementation of algorithms for maximum matching on non-bipartite graphs. PhD Thesis, Stanford University, CA.

Harrison, P.M. and Sternberg, M.J.E. (1994) Analysis and classification of disulphide connectivity in proteins. J. Mol. Biol., 244, 448–463.

Huang, E.S. et al. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.

Platt, J. (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Smola, A.J., Bartlett, P.L., Schölkopf, B. and Schuurmans, D. (eds), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, pp. 61–74.

Rothberg, E. (1985) wmatch: a C Program to solve maximum weight matching.

Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York, NY.

Vullo, A. and Frasconi, P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653–659.

Wedemeyer, W.J. et al. (2000) Disulfide bonds and protein folding. Biochemistry, 39, 4207–4216.

Zhao, E. et al. (2005) Cysteine separations profiles on protein sequences infer disulfide connectivity. Bioinformatics, 21, 1415–1420.
