The work titled Mining distributed and heterogeneous data sources in the medical domain by the creator(s) Moustakis Vasilis, Hristofis K., Potamias G., Orphanoudakis S. is made available under the Creative Commons Attribution 4.0 International license
Bibliographic Citation
K. Hristofis, G. Potamias, M. Tsiknakis, V. Moustakis, S. Orphanoudakis,
"Mining Distributed and Heterogeneous Data Sources in the Medical Domain," presented at European Conference of Machine Learning. Barcelona, Spain, 2000.
With the current explosion of data, the problem of how to combine distributed and heterogeneous (D&H) information sources becomes more and more critical. Besides collecting enormous amounts of data, it is very important to consider the general need for semantic integration and knowledge discovery from these sources, an important and necessary challenge for machine learning (ML) and data mining/knowledge discovery (DM/KDD) researchers. The main differences here, and consequently the grand challenges with respect to single, static and homogeneous information sources, are: (a) the scale of the problem is much larger than anything attempted before in ML and DM/KDD, and (b) the rising need to integrate multiple knowledge representations (e.g., domain ontologies and data models) becomes more important and vital (Wah et al., 1993).

While the distributed nature of data has a more-or-less clear definition (even if it is hard, and most of the time tedious, to achieve), heterogeneity is a more complex concept. Consider, for example, the situation where the same or different database applications are installed and run at different remote locations. In such a set-up users may enter and record data in a non-pre-specified and non-homogeneous format. This is a common situation in an Integrated Electronic Health Care Record (I-EHCR) environment (Forslund and Kilman, 1996; InterCare, 1999, pp. 7-13; Grimson et al., 1997). A physician who accesses a patient’s healthcare record needs an overview of the patient’s EHCR segments, since in most cases only a small fraction of the complete record will be selected and presented in detail. That also means that when accessing a particular clinical information system there is a need to extract only a subset of the information stored in it. The real issue here is not only how to access specific information systems that maintain EHCR segments, but also how to identify and index the essential information in them.
A promising approach to this integration problem is to gain control of the organization's information resources at a meta-data level, while allowing autonomy of individual systems at the data instance level. The objective of the meta-database model is to achieve enterprise information integration over distributed and potentially heterogeneous systems, while allowing these systems to operate independently and concurrently (Hsu, 1992). However, achieving integration at the semantic level is a challenging problem, mainly because the logic, knowledge, and data structures used in various systems are complex and often incompatible (Sciore, 1994). In addition, the further one wishes to hide heterogeneity, the more one has to deal with semantic integration issues. Thus, a realistic solution should hide heterogeneity at the top level, while making the individual sources of information appear to end users as a large collection of objects that behave uniformly (Baldonado, 1996).

This paper presents the problem of discovering and acquiring knowledge from D&H clinical data sources. In particular, we tackle the problem of inducing interesting associations between data items stored in remote clinical information systems. The test-bed environment of our approach is HYGEIAnet, the Integrated Health Care Network of Crete (Tsiknakis, 1997; HYGEIAnet Web site). One of the basic healthcare services offered within the HYGEIAnet network is access to patients' clinical information stored in autonomous (legacy) clinical information systems. Even though the focus is on the medical domain, the proposed methodology and solutions could be readily extended to other application domains.

In the next section, we present the architecture of an integrated environment for mining D&H data sources.
Section 3 presents the basic technology for accessing distributed and structured data sources, as well as the processes for the semantic homogenization and integration of heterogeneous data sources and items. In Section 4, we present the information and data representation framework, based on the XML framework and technology. Section 5 presents the machine learning and data mining processes, which are being adapted to flexible data representation structures. In Section 6, some preliminary experimental results are presented. In the last section we conclude and discuss the future research and development agenda.
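
To make the problem outlined in this introduction more concrete, the following is a minimal Python sketch, not the system described in this paper. It assumes two hypothetical sources (`clinic_a`, `clinic_b`) that record the same kind of information under different field names. It first maps their records onto a shared vocabulary, then counts how often pairs of items co-occur, as a crude stand-in for association discovery. The source names, field names, mapping table, and sample records are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-source field mappings onto a shared vocabulary.
# These names are assumptions for illustration, not taken from the paper.
FIELD_MAPS = {
    "clinic_a": {"dx": "diagnosis", "rx": "medication"},
    "clinic_b": {"diagnosis_code": "diagnosis", "drug": "medication"},
}


def homogenize(source, record):
    """Rename source-specific fields to shared names and normalize the values."""
    mapping = FIELD_MAPS[source]
    return {
        mapping[field]: str(value).strip().lower()
        for field, value in record.items()
        if field in mapping
    }


def pair_support(records):
    """Return the fraction of records in which each pair of items co-occurs."""
    counts = Counter()
    for rec in records:
        items = sorted(rec.items())  # stable ordering so pairs are comparable
        counts.update(combinations(items, 2))
    total = len(records)
    return {pair: count / total for pair, count in counts.items()}


# Invented sample records held by the two hypothetical sources.
raw = [
    ("clinic_a", {"dx": "Diabetes", "rx": "Metformin"}),
    ("clinic_a", {"dx": "Asthma", "rx": "Salbutamol"}),
    ("clinic_b", {"diagnosis_code": "diabetes ", "drug": "metformin"}),
    ("clinic_b", {"diagnosis_code": "diabetes", "drug": "insulin"}),
]

unified = [homogenize(source, record) for source, record in raw]
support = pair_support(unified)

for pair, value in sorted(support.items()):
    print(pair, value)
```

The sketch separates the two concerns discussed above: `homogenize` hides source-level heterogeneity behind a uniform record shape, and `pair_support` works only on that uniform shape, so it needs no knowledge of where a record came from.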