Exploration de données - Colonne du cercle informatique

introduction

Ces dernières années, l'exploration de données a suscité une grande attention dans l'industrie de l'information.

Dataminingisahotissueinthefieldofartificialintelligenceanddatabaseresearch.Theso-calleddataminingreferstothenon-trivialprocessofrevealinghidden, previouslyunknownandpotentiallyvaluableinformationfromalargeamountofdatainthedatabase..Dataminingisadecisionsupportprocess, whichismainlybasedonartificialintelligence, apprentissage automatique, reconnaissance de formes, statistiques, bases de données, visualizationtechnology, etc., highlyautomatedanalysisofenterprisedata, makinginductivereasoning, anddiggingoutpotentialpatternsfromit, Tohelpdecision-makersadjustmarketstrategies, reducerisks, andmakecorrectdecisions.Theknowledgediscoveryprocessconsistsofthefollowingthreestages: ①datapreparation; ②datamining; ③resultexpressionandinterpretation. Le datamining peut interagir avec les utilisateurs ou les bases de connaissances.

Dataminingisatechniquetofindthelawfromalargeamountofdatabyanalyzingeachdata.Therearethreemainsteps: datapreparation, lawsearch, andlawexpression.Datapreparationistoselecttherequireddatafromrelateddatasourcesandintegratethemintoadatasetfordatamining; rulesearchistouseacertainmethodtofindoutthelawscontainedinthedataset; therulerepresentationisasfaraspossibletotheuser.Thewayofunderstanding (suchasvisualization) expressesthefoundrules.Dataminingtasksincludeassociationanalysis, clusteranalysis, classificationanalysis, anomalyanalysis, specificgroupanalysisandevolutionanalysis.

Inrecentyears, datamininghasattractedgreatattentionintheinformationindustry.Themainreasonisthatthereisalargeamountofdatathatcanbewidelyused, andthereisanurgentneedtoconvertthesedataintousefulinformationandknowledge.Theacquiredinformationandknowledgecanbewidelyusedinvariousapplications, includingbusinessmanagement, productioncontrol, marketanalysis, engineeringdesign, andscientificexploration.Dataminingusesideasfromthefollowingfields: ①sampling, estimationandhypothesistestingfromstatistics, ②searchalgorithms, modelingtechniquesandlearningtheoriesinartificialintelligence, patternrecognitionandmachinelearning.Datamininghasalsoquicklyadoptedideasfromotherfields, includingoptimization, evolutionarycomputing, informationtheory, traitement du signal, la visualisation, andinformationretrieval.Someotherareasalsoplayanimportantsupportingrole.Inparticular, lessystèmesdebasededonnéessontnécessairespourfournirunepriseenchargeefficacedustockage,del'indexationetdutraitementdesrequêtes. La technologie distribuée peut également aider à traiter d'énormes quantités de données, et est encore plus importante lorsque les données ne peuvent pas être traitées ensemble.

Fond

Inthe1990s, withthewidespreadapplicationofdatabasesystemsandtherapiddevelopmentofnetworktechnology, databasetechnologyhasalsoenteredabrandnewstage, thatis, fromthepast, onlysomemanagementSimpledatahasdevelopedtomanagevarioustypesofcomplexdatasuchasgraphics, images, audio, vidéo, electronicfiles, andWebpagesgeneratedbyvariouscomputers, andtheamountofdataisalsoincreasing.Whilethedatabaseprovidesuswithawealthofinformation, italsoreflectstheobviouscharacteristicsofmassiveinformation.Intheeraofinformationexplosion, massiveamountsofinformationhavebroughtmanynegativeeffectstopeople.Themostimportantthingisthateffectiveinformationisdifficulttoextract.Toomuchuselessinformationwillinevitablyproduceinformationdistance (informationstatetransferdistance), whichisanobstacletothetransferofinformationstateofathing.Themeasurementofthedisease, referredtoasDISTorDIT) andthelossofusefulknowledge.ThisiswhatJohnNalsbertcalledthe "informations dilemme des "riches mais pauvres en connaissances". Cependant, ce n'est qu'avec les fonctions d'entrée, de requête et de statistiques du système de base de données que les relations et les règles dans les données sont introuvables, que la tendance de développement futur ne peut pas être prédite sur la base des données existantes et que les moyens d'exploiter les connaissances cachées derrière les données font défaut.

Objets de datamining

Letypededonnéespeutêtrestructurée,semi-structurée,oumêmehétérogène.Laméthodededécouvertedesconnaissancespeutêtremathématique,non-mathématiqueouinductive.

L'objet de l'exploration de données peut être n'importe quel type de source de données.

La méthode de découverte des connaissances peut être numérique, non numérique ou généralisée.

Étapes d'exploration de données

Avant de mettre en œuvre l'exploration de données, formulez d'abord les étapes à suivre, ce qu'il faut faire à chaque étape et les objectifs nécessaires pour obtenir de bons résultats. Seuls des plans peuvent garantir la mise en œuvre ordonnée et le succès de l'exploration de données.

Thestepsofthedataminingprocessmodelmainlyincludedefiningproblems, buildingdatamininglibraries, analyzingdata, preparingdata, buildingmodels, evaluatingmodels, andimplementingthem.Let'stakealookatthespecificcontentofeachstepindetail: (1) Definetheproblem.Thefirstandmostimportantrequirementbeforestartingknowledgediscoveryistounderstanddataandbusinessissues.Theremustbeacleardefinitionofthegoal, whichistodecidewhatyouwanttodo.Forexample, whenyouwanttoincreasetheutilizationrateofe-mail, whatyouwanttodomaybe "increaseuserutilization" ou "increasethevalueofone-timeuseruse" .Themodelsestablishedtosolvethesetwoproblemsarealmostcompletelydifferent, Adecisionmustbemade. .

(2)Établirunebibliothèqued'explorationdedonnées.

(3)Analyserlesdonnées.L'objectifdel'analyseestdetrouverleschampsdedonnéesquiontleplusgrandimpactsurlesrésultatsdesprévisionsetdedéciderdedéfinirleschampsd'exportation.

(4)Préparerlesdonnées.Il s'agitdeladernièreétapedepréparationdesdonnéesavantlaconstructiondumodèle.Cetteétapepeutêtrediviséeenquatreparties : sélectionner des variables, sélectionner des enregistrements, créer de nouvelles variables et convertir des variables.

(5) Buildamodel.Modelbuildingisaniterativeprocess.Youneedtocarefullyexaminethedifferentmodelstodeterminewhichmodelismostusefulforthebusinessproblemyouface.Firstusepartofthedatatobuildamodel, andthenusetheremainingdatatotestandverifytheresultingmodel.Sometimesthereisathirddataset, calledthevalidationset, becausethetestsetmaybeaffectedbythecharacteristicsofthemodel.Atthistime, anindependentdatasetisneededtoverifytheaccuracyofthemodel.Trainingandtestingdataminingmodelsrequiresdividingthedataintoatleasttwoparts, oneformodeltrainingandtheotherformodeltesting.

(6) Evaluationmodel.Afterthemodelisestablished, itisnecessarytoevaluatetheresultsobtainedandexplainthevalueofthemodel.Theaccuracyrateobtainedfromthetestsetisonlymeaningfulforthedatausedtobuildthemodel.Inpracticalapplications, itisnecessarytofurtherunderstandthetypesoferrorsandtherelatedcosts.Experiencehasprovedthataneffectivemodelisnotnecessarilythecorrectmodel.Thedirectreasonforthisisthevariousassumptionsimplicitinthemodelestablishment.Therefore, itisimportanttotestthemodeldirectlyintherealworld.Firstapplyitinasmallarea, obtaintestdata, andthenpromoteittoalargeareawhenyoufeelsatisfied.

(7)Mise en œuvre.Une fois quelemodèleestétablietvérifié,ilexistedeuxmanièresprincipalesdel'utiliser.

Méthodes d'analyse de datamining

L'exploration de données est divisée en exploration de données guidée et exploration de données non guidée. L'exploration de données guidée consiste à utiliser les données disponibles pour construire un modèle, qui est la description d'un attribut spécifique.

Data mining

1.Classification.Ilsélectionned'abordl'ensembled'apprentissageclassifiéàpartirdesdonnées,utiliselatechnologied'explorationdedonnéessurl'ensembled'apprentissagepourétablirunmodèledeclassification,puisutiliselemodèlepourclasserlesdonnéesnonclassifiées.

2.Évaluation.L'évaluationestsimilaireàlaclassification,maislerésultatfinaldel'évaluationestunevaleurcontinueetlemontantdel'évaluationn'estpasprédéterminé.L'évaluationpeutêtreutiliséecommetravailpréparatoireàlaclassification.

3.prédire.Elleestréaliséeparlaclassificationoul'estimation,etunmodèleestobtenuparlaclassificationoul'entraînementàl'estimation.

4.Pertinencerèglesdegroupementoud'association.

5.Clustering.Il s'agit d'une méthode permettant de rechercher et d'établir automatiquement des règles de regroupement.

Réussites

1. L'exploration de données aide Credilogros Cía Financiera SA à améliorer les notes de crédit des clients

CredilogrosCíaFinancièreaSAestlapremièreenArgentineLescinqgrandessociétésdecrédit,avecunevaleurd'actifestiméeà95,7millionsdedollars américains,pourCredilogros,ilestimportantd'identifierlesrisquespotentielsassociésauxclientspotentielsdepaiementpréalableafindeminimiserlerisqueencouru.

Le premier objectif de l'entreprise est de créer un moteur de décision qui interagit avec le système de base de l'entreprise et les deux systèmes d'évaluation du crédit pour traiter les demandes de crédit.

Intheend, CredilogroschoseSPSSInc.'sdataminingsoftwarePASWModelerbecauseitcanbeflexiblyandeasilyintegratedintoCredilogros'coreinformationsystem.ByimplementingPASWModeler, Credirogrosreducedthetimeneededtoprocesscreditdataandprovidethefinalcreditscoretolessthan8seconds.Thisallowstheorganizationtoquicklyapproveorrejectcreditrequests.ThedecisionenginealsoenablesCredilogrostominimizetheidentificationdocumentsthateachcustomermustprovide.Insomespecialcases, onlyoneidentificationisrequiredtoapprovecredit.Inaddition, thesystemalsoprovidesmonitoringfunctions.CredilogroscurrentlyusesPASWModelertoprocessanaverageof35,000applicationspermonth.Only3monthsafteritwasrealized, ithelpedCredirogrostoreduceloandisbursementderelictionby20%.

2. L'exploration de données permet à DHL de suivre la température des conteneurs de fret en temps réel

DHLisaglobalmarketleaderintheinternationalexpressandlogisticsindustry.Itprovidesexpress, landandwaterAirthree-waytransportation, contractlogisticssolutions, andinternationalmailservices.DHL'sinternationalnetworkconnectsmorethan220countriesandregions, withatotalofmorethan285,000employees.UnderthepressureoftheUSFDAtoensurethatthetemperatureofdrugshipmentsduringthetransportationprocessmeetsthepressure, DHL'spharmaceuticalcustomersstronglydemandmorereliableandmoreaffordableoptions.ThisrequiresDHLtotrackthetemperatureofthecontainerinrealtimeatallstagesofdelivery.

Althoughtheinformationgeneratedbytheloggermethodisaccurate, thedatacannotbetransmittedinrealtime, andneitherthecustomernorDHLcantakeanypreventiveandcorrectivemeasureswhentemperaturedeviationsoccur.Therefore, DHL'sparentcompanyDeutschePostWorldNetwork (DPWN), throughtheTechnologyandInnovationManagement (TIM) Group, hasclearlydrawnupaplantouseRFIDtechnologytotrackthetemperatureoftheshipmentatdifferentpointsintime.DrawtheprocessframeworkfordeterminingthekeyfunctionparametersoftheservicethroughtheIBMGlobalBusinessConsultingServicesDepartment.DHLhasgainedtwobenefits: Fortheendcustomer, itenablesmedicalcustomerstorespondinadvancetoshippingproblemsduringthedeliveryprocess, andcomprehensivelyandeffectivelyenhancesdeliveryreliabilityatacompellinglowcost.ForDHL, ithasimprovedcustomersatisfactionandloyalty; laidasolidfoundationformaintainingcompetitivedifferences; andhasbecomeanimportantnewsourceofrevenuegrowth.

3.Applicationsdansl'industriedestélécommunications

Pricecompetitionisunprecedentedlyfierce, voicebusinessgrowthhassloweddown, andthefast-growingChinamobilecommunicationsmarketisfacingunprecedentedsurvivalpressure.TheacceleratedreformofChina'stelecommunicationsindustryhascreatedanewcompetitivesituation, andthebreadthandintensityofcompetitioninthemobileoperatingmarketwillfurtherincrease, especiallyinthefieldofgroupcustomers.Mobileinformatizationandgroupcustomershavebecomeanewengineforoperatorstocopewithcompetitionandobtainsustainedgrowthinthefuture.

Withthedomesticthree-leggedfull-servicecompetitionandtheissuanceof3Glicenses, itwillbeageneraltrendforoperatorstoprovidegroupcustomerswithintegratedinformatizationsolutions, andmobileinformatizationwillbecomeacomprehensiveentryintothefieldofinformatizationservices.Leadingforce.Therefore, traditionalmobileoperatorsarefacingthechallengeofshiftingfromtraditionalpersonalbusinesstosimultaneouslyexpandingthefieldofgroupcustomerinformatizationbusiness.Howtodealwithinternalandexternalchallengesandquicklyusemobileinformatizationservicesasoneofthecompetitivetoolsforintegratedservicestoexpandthegroup'scustomermarketandremaininvincibleinemergingmarketsisanurgentproblemthattraditionalmobileoperatorsneedtosolve.

Algorithme classique

Actuellement, les algorithmes d'exploration de données incluent principalement la méthode du réseau neuronal, la méthode de l'arbre de décision, le gorithme génétique, la méthode de l'ensemble approximatif, la méthode de l'ensemble flou, la méthode de la règle d'association, etc.

Méthode de réseau de neurones

Theneuralnetworkmethodsimulatesthestructureandfunctionofthebiologicalnervoussystem.Itisanon-linearpredictivemodellearnedthroughtraining.IttreatseveryconnectionasAprocessingunittriestosimulatethefunctionsofhumanbrainneurons, andcancompleteavarietyofdataminingtaskssuchasclassification, regroupement, andfeaturemining.Thelearningmethodofneuralnetworkismainlymanifestedinthemodificationofweights.Itsadvantagesareanti-ingérence, non-linearlearning, associativememoryfunctions, andaccuratepredictionresultsforcomplexsituations, thefirstdisadvantageisthatitisnotsuitableforprocessinghigh-dimensionalvariablesandcannotobservethelearningprocessinthemiddle.Ithas "blackbox" characteristicsandoutputsresults.Itisalsodifficulttoexplain, d'autre part, ittakeslongerlearningtime.Neuralnetworkmethodismainlyusedindataminingclusteringtechnology.

Méthode de l'arbre de décision

Decisiontreeisaprocessofconstructingclassificationrulesaccordingtotheutilityofthetargetvariable.Theprocessofclassifyingdatathroughaseriesofrules, anditsmanifestationisSimilartotheflowchartofthetreestructure.ThemosttypicalalgorithmisJ.R.QuinlanproposedtheID3algorithmin1986, andthenproposedtheextremelypopularC4.5algorithmbasedontheID3algorithm.Theadvantageofusingthedecisiontreemethodisthatthedecision-makingprocessisvisible, doesnotrequirealongtimetoconstructtheprocess, thedescriptionissimple, easytounderstand, andtheclassificationspeedisfast; thedisadvantageisthatitisdifficulttofindrulesbasedonacombinationofmultiplevariables.Decisiontreemethodisgoodatprocessingnon-numericaldata, anditisespeciallysuitableforlarge-scaledataprocessing.Decisiontreesprovideawaytoshowruleslikewhatvaluewillbeobtainedunderwhatconditions.Forexample, inaloanapplication, itisnecessarytomakeajudgmentonthedegreeofriskoftheapplication.

Algorithme génétique

Geneticalgorithmsimulatesthephenomenaofreproduction, matingandgenemutationthatoccurinnaturalselectionandheredity.Itisakindofoperationthatusesgeneticcombination, geneticcross mutation, andnaturalselection.Togeneraterules-basedmachinelearningmethodsbasedonevolutionarytheory.Itsbasicviewpointistheprincipleof "survivalofthefittest", whichhasthepropertiesofimplicitparallelismandeasyintegrationwithothermodels.Themainadvantageisthatitcanhandlemanydatatypes, andatthesametime, itcanprocessvariousdatainparallel, thedisadvantageisthatitrequirestoomanyparameters, codingisdifficult, andgenerallytheamountofcalculationisrelativelylarge.Geneticalgorithmsareoftenusedtooptimizeneuralnetworksandcansolveproblemsthataredifficulttosolvebyothertechnologies.

Méthode de dégrossissage

Theroughsetmethod, alsoknownasroughsettheory, wasproposedbyPolishmathematicianZPawlakintheearly1980s.Itisanewwaytodealwithambiguity, Mathematicstoolsforinaccurateandincompleteproblemscanhandledatareduction, datacorrelationdiscovery, anddatameaningevaluation.Theadvantageisthatthealgorithmissimple, andthepriorknowledgeaboutthedataisnotneededintheprocessingprocess, andtheinherentlawoftheproblemcanbeautomaticallyfound, thedisadvantageisthatitisdifficulttodirectlyprocesscontinuousattributes, andthediscretizationofattributesmustbeperformedfirst.Therefore, thediscretizationofcontinuousattributesisadifficultpointthatrestrictsthepracticalapplicationofroughsettheory.Roughsettheoryismainlyappliedtoproblemssuchasapproximatereasoning, digitallogicanalysisandsimplification, andtheestablishmentofpredictivemodels.

Méthode d'ensemble flou

La méthode des ensembles flous utilise la théorie des ensembles flous pour effectuer une évaluation floue, une prise de décision floue, une reconnaissance de modèles flous et une analyse de grappes floues sur les problèmes. La théorie des ensembles flous utilise le degré d'appartenance pour décrire les attributs des éléments flous.

AssociationRèglesLoi

Associationrulesreflecttheinterdependenceorcorrelationbetweenthings.ItsmostfamousalgorithmisR.TheApriorialgorithmproposedbyAgrawaletal.Theideaof thealgorithmis: firstfindoutallfrequencysetswhosefrequencyisatleastthesameastheminimumsupportofthepredeterminedmeaning, andthengeneratestrongassociationrulesfromthefrequencysets.Theminimumsupportandtheminimumcredibilityaretwothresholdsgivenfordiscoveringmeaningfulassociationrules.Inthissense, thepurposeofdataminingistominetheassociationrulesthatmeettheminimumsupportandminimumcredibilityfromthesourcedatabase.

Il y a des problemes

L'exploration de données implique également des problèmes de confidentialité. Par exemple, un employeur peut accéder aux dossiers médicaux pour éliminer les personnes atteintes de diabète ou de maladies cardiaques graves.

L'exploration de données gouvernementales et commerciales peut impliquer des problèmes tels que la sécurité nationale ou des secrets commerciaux.

L'exploration de données a de nombreuses utilisations légitimes. Par exemple, elle peut découvrir la relation entre un médicament et ses effets secondaires dans la base de données d'un groupe de patients.

L'exploration de données met en œuvre des méthodes pour découvrir des informations impossibles avec d'autres méthodes, mais elle doit être réglementée et doit être utilisée avec des instructions appropriées.

Si les données sont collectées auprès d'un individu spécifique, il y aura des problèmes de confidentialité, juridiques et éthiques.