Sequence analysis
Vol. 21 no. 24 2005, pages 4416–4419
doi:10.1093/bioinformatics/bti715
Improving disulfide connectivity prediction with sequential distance between oxidized cysteines
Chi-Hung Tsai1, Bo-Juen Chen1, Chen-hsiung Chan1, Hsuan-Liang Liu2 and Cheng-Yan Kao1,3,*
1Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 106, 2Department of Chemical Engineering and Graduate Institute of Biotechnology, National Taipei University of Technology, Taipei, Taiwan 10608 and 3Institute for Information Industry, Taipei, Taiwan 106
Received on August 19, 2005; revised and accepted on October 11, 2005
Advance Access publication October 13, 2005
ABSTRACT
Summary: Predicting disulfide connectivity precisely helps towards the solution of protein structure prediction. In this study, a descriptor derived from the sequential distance between oxidized cysteines (denoted as DOC) is proposed. An approach using support vector machine (SVM) method based on weighted graph matching was further developed to predict the disulfide connectivity pattern in proteins. When DOC was applied, prediction accuracy of 63% for our SVM models could be achieved, which is significantly higher than those obtained from previous approaches. The results show that using the non-local descriptor DOC coupled with local sequence profiles significantly improves the prediction accuracy. These improvements demonstrate that DOC, with a proper scaling scheme, is an effective feature for the prediction of disulfide connectivity. The method developed in this work is available at the web server PreCys (prediction of cys–cys linkages of proteins).
Availability: http://bioinfo.csie.ntu.edu.tw:33/Disulfide/
Contact: cykao@csie.ntu.edu.tw
Supplementary information: Supplementary data, detailed results, tables and information are available at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/
1 INTRODUCTION
Disulfide bonds, commonly found in extracellular proteins, stabilize folded conformations as they contribute to the stability of the three-dimensional structures with respect to thermodynamics (Wedemeyer et al., 2000). Since disulfide bonds impose length and angle constraints on the backbone of a protein, correct prediction of disulfide connectivity can be employed to dramatically reduce the search in conformational space and greatly raise the accuracy for protein structure prediction (Huang et al., 1999). Different methods (Fariselli and Casadio, 2001; Fariselli et al., 2002; Vullo and Frasconi, 2004) have been developed to predict disulfide connectivity with the prior knowledge of the oxidization states of cysteine residues. These methods can be classified into two categories: (1) patternwise or (2) pairwise. The major difference between them is whether the methodology is developed to deal with alternative disulfide connectivity patterns (Vullo and Frasconi, 2004; Zhao et al., 2005) or the relationships between cysteine pairs (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005). This difference decides how the information is encoded. However, the prediction accuracies of these methods are still limited so far (~50%).
*To whom correspondence should be addressed.
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Besides the methodology used, another critical factor determining the predicting performance is the descriptor employed. Fariselli and Casadio (2001) computed residue contact potentials according to the nearest-neighbor residues of bonded cysteines. Secondary structure (Baldi et al., 2005; Ferrè and Clote, 2005) and solvent accessibility (Baldi et al., 2005) were also used as descriptors to represent input information. All these descriptors only describe the local environments of bonded cysteines. However, a disulfide bridge is a long-range interaction between two linearly distant cysteines. Descriptors containing local information only are insufficient for predicting disulfide connectivity accurately. Therefore, information regarding relationships between cysteines is highly desired.
Harrison and Sternberg (1994) have suggested that sequence separation between bonded cysteines correlates strongly with specific connectivity patterns. Zhao et al. (2005) also observed that the disulfide connectivity pattern is highly conserved among proteins with the same cysteine-separation pattern of oxidized cysteines. Although there have been some attempts (Vullo, 2004; Baldi et al., 2005) to take advantage of such information by using descriptors such as positions of cysteines or relative sequence length, the effects of these features have not been examined closely so far.
In this paper, a descriptor derived from the linear sequence distance between oxidized cysteines (denoted as DOC) was used to demonstrate its power on predicting disulfide connectivity. A pairwise method using support vector machine (SVM) to generate bonding potentials of cysteine pairs was developed. This method was further validated with a dataset derived from Swiss-Prot 39 (SP39), and significant improvements were obtained when the non-local descriptor DOC coupled with local sequence profiles was applied. These results reveal that DOC is an effective feature in disulfide connectivity prediction. The web interface service of the method proposed in this study for disulfide connectivity prediction is available at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
Since scaling approaches may affect the performance of SVM, three scaling schemes for DOC were tested:
(1) DOC_L, DOC normalized with the protein sequence length L.
(2) DOC_max, DOC normalized with the maximum value of the whole dataset.
(3) DOC_log, DOC values normalized with the logarithm function.
2.1.3 Maximum weight matching
Features were encoded with respect to each pair of cysteines, and SVM models were trained with these data to generate posterior probabilities that indicate the potential of connectivity between cysteine pairs. After the bonding probability of each cysteine pair was produced by SVM models, an implementation of Gabow's algorithm (Gabow, 1973), wmatch (Rothberg, http://elib.zib.de/pub/Packages/mathprog/matching/weighted/), was used to find the maximum weight matching. Finally, the matching with maximum weight was transformed to the corresponding disulfide connectivity pattern.
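The matching step described above can be illustrated with a short Python sketch. It is not the wmatch program or Gabow's algorithm: for clarity it swaps in a brute-force enumeration of all perfect matchings, which is only practical for a handful of cysteines. The function name `best_connectivity`, the dictionary layout and the probability values are hypothetical, and an even number of oxidized cysteines is assumed.

```python
from typing import Dict, FrozenSet, List, Tuple

Pair = Tuple[int, int]


def best_connectivity(
    cysteines: List[int], prob: Dict[FrozenSet[int], float]
) -> Tuple[float, List[Pair]]:
    """Return (total weight, pairs) of the maximum-weight perfect matching.

    Brute-force stand-in for Gabow's algorithm: the first cysteine is paired
    with every possible partner and the rest is solved recursively.
    Assumes an even number of cysteines and one weight per unordered pair.
    """
    if not cysteines:
        return 0.0, []
    first, rest = cysteines[0], cysteines[1:]
    best_weight, best_pairs = float("-inf"), []
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        sub_weight, sub_pairs = best_connectivity(remaining, prob)
        weight = prob[frozenset((first, partner))] + sub_weight
        if weight > best_weight:
            best_weight, best_pairs = weight, [(first, partner)] + sub_pairs
    return best_weight, best_pairs


# Hypothetical bonding probabilities for four oxidized cysteines.
example_prob = {
    frozenset(pair): p
    for pair, p in {
        (1, 2): 0.2, (1, 3): 0.9, (1, 4): 0.1,
        (2, 3): 0.3, (2, 4): 0.8, (3, 4): 0.4,
    }.items()
}
example_weight, example_pairs = best_connectivity([1, 2, 3, 4], example_prob)
# example_pairs is [(1, 3), (2, 4)], the pattern 1-3, 2-4
```

Enumeration grows as (2B − 1)!! with the number of bonds B, which is why a polynomial-time matching algorithm is preferable for larger inputs.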
2 METHODOLOGY
2.1 Prediction of the connectivity pattern of disulfide bridges
With prior knowledge of the oxidation states of cysteine residues, a prediction strategy similar to previous studies (Fariselli and Casadio, 2001; Baldi et al., 2005; Ferrè and Clote, 2005) was applied. The whole problem was mapped to an undirected complete graph, where oxidized cysteines were considered as vertices and the probabilities of connectivity between cysteine pairs were assigned as the weights of the edges between corresponding vertices. Then the disulfide connectivity pattern can be inferred by solving the maximum weight matching of this graph, which implies maximum probabilities for bonding pairs of this resulting pattern.
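To give a sense of the size of this graph problem, the following Python sketch counts the vertices, edges and candidate connectivity patterns for a protein with B disulfide bonds. The function names are hypothetical. The counts follow from the complete-graph formulation: 2B vertices, one edge per cysteine pair, and (2B − 1)!! perfect matchings, the quantity the paper uses in its discussion of patternwise encoding.

```python
from math import comb


def double_factorial_odd(n: int) -> int:
    """Return n!! = n * (n - 2) * ... * 1 for an odd positive n."""
    result = 1
    for k in range(n, 0, -2):
        result *= k
    return result


def matching_problem_size(bonds: int) -> dict:
    """Size of the complete graph for a chain with `bonds` disulfide bonds."""
    vertices = 2 * bonds                 # one vertex per oxidized cysteine
    edges = comb(vertices, 2)            # one edge per candidate cysteine pair
    patterns = double_factorial_odd(2 * bonds - 1)  # perfect matchings
    return {
        "vertices": vertices,
        "edges": edges,
        "patterns": patterns,
        # non-bonded pairs per bonded pair, i.e. 2B - 2
        "negative_pairs_per_positive": (edges - bonds) // bonds,
    }


size_b5 = matching_problem_size(5)
# 10 vertices, 45 edges, 945 candidate patterns, 8 negative pairs per positive
```

For B = 5 this reproduces the 945 patterns and the 1:8 pairwise ratio quoted in the paper's discussion.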
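2.1.1 SVM
In this work, SVM was employed to predict the potential of connectivity between cysteines. SVM has been applied broadly within the field of computational biology to pattern-recognition problems and is a promising technique for data classification (Vapnik, 1998). Given data $x_1, \ldots, x_l$, we set their labels, $y_i$, as $+1$ if $x_i$ is in class 1 and as $-1$ if $x_i$ belongs to class 2. Then with these training data, SVM solves an optimization problem for binary classification:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{l}\xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \tag{1}$$

where $x_i$ is mapped to a higher dimensional space by the function $\phi$; $\xi_i$ is the training error allowed and $C$ is the cost of error. Moreover, SVM can further be solved to approximate posterior class probability $P(y_i = 1 \mid x_i)$ with a sigmoid function (Platt, 2000):

$$P(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}, \tag{2}$$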
2.2 Evaluation criteria
Our models were evaluated by $Q_p$ and $Q_c$, which are defined as follows:

$$Q_p = \frac{C_p}{T_p}, \qquad Q_c = \frac{C_c}{T_c}, \tag{4}$$

where $C_p$ is the number of proteins whose connectivity patterns are correctly predicted; $T_p$ is the total number of proteins in the test set; $C_c$ is the number of disulfide bridges correctly predicted and $T_c$ is the total number of disulfide bridges in test proteins.
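The two criteria can be computed as in the following Python sketch. The function name `evaluate` and the data layout are assumptions made for illustration: each protein's connectivity pattern is taken to be a set of cysteine index pairs (i, j) stored with i < j, and the example patterns are invented.

```python
from typing import List, Set, Tuple

Pattern = Set[Tuple[int, int]]


def evaluate(predicted: List[Pattern], actual: List[Pattern]) -> Tuple[float, float]:
    """Return (Q_p, Q_c) for parallel lists of predicted and true patterns."""
    c_p = sum(1 for p, a in zip(predicted, actual) if p == a)  # whole patterns correct
    t_p = len(actual)                                          # proteins in the test set
    c_c = sum(len(p & a) for p, a in zip(predicted, actual))   # bridges correct
    t_c = sum(len(a) for a in actual)                          # bridges in the test set
    return c_p / t_p, c_c / t_c


# Invented three-protein test set.
actual_patterns = [
    {(1, 2), (3, 4)},
    {(1, 4), (2, 3)},
    {(1, 2), (3, 4), (5, 6)},
]
predicted_patterns = [
    {(1, 2), (3, 4)},          # fully correct
    {(1, 3), (2, 4)},          # no bridge correct
    {(1, 2), (3, 5), (4, 6)},  # one of three bridges correct
]
q_p, q_c = evaluate(predicted_patterns, actual_patterns)
# q_p is 1/3 (one protein of three), q_c is 3/7 (three bridges of seven)
```

Because a pattern counts towards Q_p only when every bridge is right, Q_c is never lower than Q_p on proteins with a single wrong pair among several correct ones.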
3 IMPLEMENTATION AND RESULTS
3.1 Dataset
In Equation (2), $A$ and $B$ are parameters and $f_i = \mathbf{w}^{T}\phi(x_i) + b$. Using (2), we can infer the bonding probability for each pair of cysteines. The software LIBSVM (Chang and Lin, 2000), a library for SVMs, was adopted in our experiments.
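Equation (2) can be evaluated as in the following Python sketch. This is not LIBSVM code; the function name and the parameter values in the example are hypothetical. The two branches compute the same sigmoid and are there only to avoid floating-point overflow for large arguments.

```python
import math


def bonding_probability(f: float, a: float, b: float) -> float:
    """Sigmoid of Equation (2): P(y = 1 | f) = 1 / (1 + exp(a * f + b)).

    `f` is the SVM decision value; `a` and `b` are the fitted sigmoid
    parameters (hypothetical values are used in the example below).
    """
    z = a * f + b
    if z >= 0:
        # Rewrite as exp(-z) / (1 + exp(-z)) so exp never overflows.
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


# With a negative `a`, larger decision values map to higher probabilities.
p_low = bonding_probability(-1.0, a=-2.0, b=0.0)
p_high = bonding_probability(1.0, a=-2.0, b=0.0)
```

The resulting probabilities are what the matching step uses as edge weights between cysteine pairs.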
2.1.2 Data encoding
Two descriptors were mainly considered to encode input data for the SVM: (1) local sequence profiles (evolutionary information) around target cysteines from multiple sequence alignments and (2) the linear DOC.
We generated sequence profiles by performing multiple sequence alignments with the widely used program PSI-BLAST (Altschul et al., 1997). For each cysteine pair Cys(i, j), profiles were extracted using a window centered at cysteines i and j. The window size indicates the scope of vicinity of the target cysteine and determines how much information is provided for our models. In our experiments, the window size was set to 13, and the values of elements in the profiles were scaled to [0, 1].
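The window extraction described above might look like the following Python sketch. Several details are assumptions not stated in the text: the profile is a list of per-residue rows already scaled to [0, 1], indices are 0-based, positions beyond either terminus are padded with zeros, and the two windows are simply concatenated. The function names are hypothetical.

```python
from typing import List

Profile = List[List[float]]


def extract_window(profile: Profile, center: int, window: int = 13) -> Profile:
    """Return `window` profile rows centered at `center` (0-based).

    Positions that fall outside the sequence are zero-padded (an assumption).
    """
    half = window // 2
    width = len(profile[0])
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            rows.append(list(profile[pos]))
        else:
            rows.append([0.0] * width)
    return rows


def encode_pair(profile: Profile, i: int, j: int, window: int = 13) -> List[float]:
    """Flatten the windows around cysteines i and j into one feature vector."""
    features: List[float] = []
    for center in (i, j):
        for row in extract_window(profile, center, window):
            features.extend(row)
    return features


# Toy profile: 5 residues, 2 columns per residue (real profiles have 20).
toy_profile = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
toy_features = encode_pair(toy_profile, 0, 4, window=3)
```

With a 20-column profile and the window size of 13 used in the paper, this layout would give 2 × 13 × 20 = 520 profile features per cysteine pair.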
For a cysteine pair with sequence indexes i and j, the corresponding DOC is defined as follows:

$$\mathrm{DOC}(i, j) = \lVert i - j \rVert. \tag{3}$$
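Equation (3) and the three scaling schemes named in the paper (DOC_L, DOC_max and DOC_log) can be sketched in Python as follows. The exact form of the logarithmic scaling is not given in the text, so the natural logarithm of the raw distance is used here as an assumption; the function names and example values are hypothetical.

```python
import math


def doc(i: int, j: int) -> int:
    """Equation (3): sequence distance between cysteines at positions i and j."""
    return abs(i - j)


def doc_length(i: int, j: int, seq_len: int) -> float:
    """DOC_L: distance divided by the protein sequence length L."""
    return doc(i, j) / seq_len


def doc_max(i: int, j: int, dataset_max: int) -> float:
    """DOC_max: distance divided by the largest DOC in the whole dataset."""
    return doc(i, j) / dataset_max


def doc_log(i: int, j: int) -> float:
    """DOC_log: natural log of the distance (assumed form of the log scaling).

    Two distinct cysteines are at least one residue apart, so the
    argument of the logarithm is never zero.
    """
    return math.log(doc(i, j))


# Hypothetical pair at positions 10 and 42 in a 128-residue chain.
raw = doc(10, 42)
by_length = doc_length(10, 42, seq_len=128)
by_max = doc_max(10, 42, dataset_max=64)
by_log = doc_log(10, 42)
```

Dividing by a dataset-wide constant keeps the shape of the DOC distribution, whereas dividing by each protein's own length mixes in the variation of sequence lengths.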
In order to compare our method with the approaches reported previously (Vullo and Frasconi, 2004; Baldi et al., 2005), the same dataset extracted from SP39 (Bairoch and Apweiler, 2000) was employed. The same filtering procedure (Fariselli and Casadio, 2001) was applied to ensure only high quality and experimentally verified intra-chain disulfide bridge annotations were included. For cross-validation, this dataset was further divided into four subsets so that sequences in any two subsets shared at most 30% sequence homology.
3.2 Cross-validation of SP39
Table 1 lists the accuracies of 4-fold cross-validation performed with the dataset SP39 for our model along with the results reported previously. Using sequence profiles only, our SVM models obtained a Q_p of 59%, which is better than those obtained in previous works. This may benefit from the generality of SVM, which avoids over-fitting during the training process. Another reason for the improvement is the enlarging of window size when extracting sequence profiles. We tried to use different window sizes to build SVM models, and the accuracy of the predictions is shown in Figure 1. The overall Q_p increases with enlarging window size and peaks at 13, which was adopted in this work. Using the same window size of 5 as used by Vullo and Frasconi (2004) and Baldi et al. (2005), similar accuracy of 52% was also obtained using our method. Moreover, when DOC was used, the prediction accuracy was further improved. To explore the effects of scaling schemes on DOC, three scaling functions were considered: DOC_L, DOC_max and DOC_log. The trend of DOC between cysteine bonding pairs in dataset SP39 is shown in Figure 2a, and the distributions of DOC_L, DOC_max and DOC_log are also shown in Figure 2b–d, respectively. As can be seen, DOC_max retains the distribution of DOC since the scaling is simply performed by dividing the distance by a fixed value. On the other hand, the originally skewed distribution of DOC becomes close to a normal distribution after the logarithm function is applied, and the distribution of DOC_L becomes blurred due to the variation of sequence lengths.
Prediction accuracies of 59 and 61% were obtained by using the scaling function DOC_L or DOC_max. On the other hand, the highest prediction accuracy of 63% was obtained by using the scaling function DOC_log, which was selected to build our SVM models for disulfide connectivity prediction. These results suggest that the scaling of DOC can affect its contribution to our models. With a proper scaling function, DOC can enhance the performance of SVM models.

Table 1. Results of cross-validation on the data extracted from SP39 (all values in %)

Methods                  B = 2      B = 3      B = 4      B = 5      B = 2–5
                         Q_p  Q_c   Q_p  Q_c   Q_p  Q_c   Q_p  Q_c   Q_p  Q_c
MC graph-matching (a)    56   56    21   36    17   37    2    21    29   38
NN graph-matching (b)    68   68    22   37    20   37    2    26    34   42
BiRnn-2 profile (c)      73   73    41   51    24   37    13   30    44   49
2D-Rnn profile (d)       74   74    51   61    27   44    11   41    49   56
dNN2 (e)                 62   -     40   -     55   -     26   -     49   -
CSP                      72   72    53   66    33   50    18   36    52   58
SVM profile              76   76    53   62    48   62    44   60    59   65
SVM profile + DOC_log    79   79    ?    62    55   70    58   71    63   70

(a) Fariselli and Casadio (2001). (b) Fariselli et al. (2002). (c) Vullo and Frasconi (2004). (d) Baldi et al. (2005). (e) Ferrè and Clote (2005), only results of Q_p are available.
? The source preserves only seven of the eight B = 3 Q_p values (21, 22, 41, 51, 40, 53, 53); they are entered top to bottom in that order, so the alignment of the last rows in this column is uncertain.

Fig. 1. The accuracy (Q_p) of predictions using different window sizes to extract sequence profiles on the dataset SP39.
Fig. 2. Histogram of the fraction of chains versus (a) the original distribution of DOC without normalization, (b) DOC_L, (c) DOC_max and (d) DOC_log in the dataset SP39.

3.3 PreCys (prediction of cys–cys linkages in proteins) web server
The PreCys server (at http://bioinfo.csie.ntu.edu.tw:33/Disulfide/) provides the service of disulfide connectivity prediction by the method developed in this work. In addition, a simple CSP search can also be accessed on the website. This server provides two SVM models built from Swiss-Prot releases 39 and 47. With the sequence and the positions of oxidized cysteines (optional) as input, the bonding probabilities of cysteine pairs and the final connectivity pattern can be generated. Additional experimental results and the chain lists used can be found at this website.

4 DISCUSSION AND CONCLUSION
There are two major categories for the methods of disulfide connectivity prediction. The 'patternwise' approaches take the whole protein as a unit directly and rank alternative connectivity patterns (Vullo and Frasconi, 2004). They can easily include global information, such as the sequence length, amino acid contents or the positions of all cysteines. On the other hand, the 'pairwise' methods (Baldi et al., 2005; Ferrè and Clote, 2005) lack the overview of the whole protein and are usually limited to the scope of local environments of cysteines.
However, the patternwise methods often suffer from the problem of insufficient data, especially when the number of disulfide bonds increases. For proteins with five disulfide bonds, there are some patterns that only have one instance in the dataset. These patterns are not likely to be predicted correctly by patternwise methods because there is not enough information for model training. For example, the connectivity patterns of the protein chains CTRA_BOVIN (PDB: 1HJA, pattern: [1–4, 2–3, 5–9, 6–7, 8–10], Fig. 3) and UROK_HUMAN (PDB: 1LMW, pattern: [1–3, 2–4, 5–9, 6–7, 8–10]) only appear once in the dataset SP39. The patternwise method CSP fails to predict the disulfide connectivity of these chains, because no template is available for the patterns to be predicted. On the other hand, our pairwise SVM models can still predict their connectivity correctly, since the pattern can be assembled from the bonding pairs predicted.
In addition, the imbalance between the positive and negative data differs for pairwise and patternwise methods. For a protein with B disulfide bonds, the positive/negative ratio is 1:(2B − 2) for pairwise encoding. However, for the patternwise encoding, the imbalance is more severe, since there is only one correct pattern among the (2B − 1)!! generated entries. Taking B = 5 as an example, the positive/negative ratio is only 1:8 in pairwise encoding. With the same bond number B in patternwise encoding, there are 945 entries where the positive/negative ratio is 1:944. Such severe imbalance can bias the learning process and result in poor models. Due to the insufficiency of data and the severe imbalance issue of patternwise encoding, we adopted the pairwise approach in our method.
In this paper, we developed a method to predict disulfide connectivity based on SVMs. The non-local descriptor DOC describing the distance between oxidized cysteines was proposed to encode additional information for our input. For the dataset SP39, the prediction accuracy can be improved significantly with the combination of local sequence profiles and the non-local descriptor DOC. The significant improvement in prediction accuracy over previous approaches has the following reasons. First, SVMs can avoid over-fitting problems commonly seen in neural networks and other machine learning methods. Second, we explored the local environments of oxidized cysteines and found the optimum window size with best Q_p values. Third, the non-local descriptor DOC_log also contributes to the prediction accuracies. Our method achieved an accuracy of 63% in dataset SP39 when DOC was used, which outperforms other previous approaches. Consistent improvements were also obtained on other datasets; detailed results can be found in the Supplementary data. These results imply that the formation of disulfide linkages between cysteines is determined not only by the local information of cysteines but also by the relationships between them. The descriptor DOC contains important information about the relationships between oxidized cysteines and is an effective feature for predicting disulfide connectivity accurately. This descriptor can be additionally applied to other problems where the knowledge of disulfide bridges is required. The web interface of our program is provided on the PreCys website. The results from our method may be useful for advanced studies in protein structure prediction, protein structure modeling and protein engineering.

Fig. 3. (a) The structure and the connectivity pattern of disulfide bridges and (b) the bonding potential P(i, j) for each cysteine pair cys(i, j) generated by SVM model for chymotrypsinogen A (PDB id 1HJA). Selected bonding pairs are boxed.

ACKNOWLEDGEMENTS
We would like to thank Jianlin Cheng for generously sharing datasets and useful comments and Shih-Chieh Chen for enlightening discussion. Funding to pay the Open Access publication charges for this article was provided by the Institute for Information Industry.
Conflict of Interest: none declared.
REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bairoch,A. and Apweiler,R. (2000) The Swiss-Prot protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Baldi,P., Cheng,J. and Vullo,A. (2005) Large-scale prediction of disulphide bond connectivity. In Saul,L.K., Weiss,Y. and Bottou,L. (eds), Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, pp. 97–104.
Chang,C.-C. and Lin,C.-J. (2000) LIBSVM: introduction and benchmarks. Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
Fariselli,P. and Casadio,R. (2001) Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957–964.
Fariselli,P., Riccobelli,P. and Casadio,R. (2002) A neural network based method for predicting the disulfide connectivity in proteins. In Damiani,E., Jain,L.C., Howlett,R.J. and Ichalkaranje,N. (eds), Knowledge based intelligent information engineering systems and allied technologies (KES 2002). IOS Press, Amsterdam, 1, pp. 464–468.
Ferrè,F. and Clote,P. (2005) Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21, 2336–2346.
Gabow,H.N. (1973) Implementation of algorithms for maximum matching on non-bipartite graphs. PhD Thesis, Stanford University, CA.
Harrison,P.M. and Sternberg,M.J.E. (1994) Analysis and classification of disulphide connectivity in proteins. J. Mol. Biol., 244, 448–463.
Huang,E.S. et al. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.
Platt,J. (2000) Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Smola,A.J., Bartlett,P.L., Schölkopf,B. and Schuurmans,D. (eds), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, pp. 61–74.
Rothberg,E. (1985) wmatch: a C Program to solve maximum weight matching.
Vapnik,V. (1998) Statistical Learning Theory. Wiley, New York, NY.
Vullo,A. and Frasconi,P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653–659.
Wedemeyer,W.J. et al. (2000) Disulfide bonds and protein folding. Biochemistry, 39, 4207–4216.
Zhao,E. et al. (2005) Cysteine separations profiles on protein sequences infer disulfide connectivity. Bioinformatics, 21, 1415–1420.