| Content Summary | With the current explosion of data, the problem of how to combine distributed and
heterogeneous (D&H) information sources is becoming increasingly critical. Besides
collecting enormous amounts of data, it is very important to consider the general need
for semantic integration and knowledge discovery from these sources, an important
and necessary challenge for machine learning (ML) and data mining/knowledge
discovery (DM/KDD) researchers. The main differences here, and consequently the
grand challenges with respect to single, static and homogeneous information sources,
are: (a) the scale of the problem is much larger than anything attempted before in ML
and DM/KDD, and (b) the need to integrate multiple knowledge
representations (e.g., domain ontologies and data models) becomes more important and
vital (Wah et al., 1993).
While the distributed nature of data has a more-or-less clear definition (even if it is hard, and
most of the time tedious, to achieve), heterogeneity is a more complex concept.
Consider, for example, the situation where the same or different database applications
are installed and run at different remote locations. In such a set-up, users may enter
and record data in a format that is neither pre-specified nor homogeneous. This is a
common situation in an Integrated Electronic Health Care Record (I-EHCR)
environment (Forslund and Kilman, 1996; InterCare, 1999, pp. 7-13; Grimson et al.,
1997). A physician who accesses a patient’s healthcare record needs an overview of
the patient’s EHCR segments, since in most cases only a small fraction of the
complete record will be selected and presented in detail. That also means that when
accessing a particular clinical information system there is a need for extracting only a
subset of the information stored in it. The real issue here is not only how to access
specific information systems that maintain EHCR segments, but also how to identify
and index the essential information in them. A promising approach to this integration
problem is to gain control of the organization's information resources at a meta-data
level, while allowing autonomy of individual systems at the data instance level. The
objective of the meta-database model is to achieve enterprise information integration
over distributed and potentially heterogeneous systems, while allowing these systems
to operate independently and concurrently (Hsu, 1992). However, achieving
integration at the semantic level is a challenging problem mainly because the logic,
knowledge, and data structures used in various systems are complex and often
incompatible (Sciore, 1994). In addition, the more one wishes to hide
heterogeneity, the more one has to deal with semantic integration issues. Thus, a
realistic solution should hide heterogeneity at the top level, while making the
individual sources of information appear to end users as a large collection of objects
that behave uniformly (Baldonado, 1996).
This paper presents the problem of discovering and acquiring knowledge from
D&H clinical data sources. In particular, we tackle the problem of inducing
interesting associations between data items stored in remote clinical information
systems. The test-bed environment for our approach is HYGEIAnet, the Integrated
Health Care Network of Crete (Tsiknakis, 1997; HYGEIAnet Web site). One of the
basic healthcare services offered within the HYGEIAnet network is access to
patients' clinical information stored in autonomous (legacy) clinical information
systems. Although the focus is on the medical domain, the proposed methodology and
solutions could be extended to cover other application
domains.
In the next section, we present the architecture of an integrated environment for
mining D&H data sources. Section 3 presents the basic technology for accessing
distributed and structured data sources, as well as the processes for the semantic
homogenization and integration of heterogeneous data sources and items. In Section 4,
we present the information and data representation framework, which is based on XML
technology. Section 5 presents the machine learning and data mining
processes, which are adapted to flexible data representation structures. In
Section 6, some preliminary experimental results are presented. In the last section, we
conclude and discuss the future research and development agenda. | en |
| Bibliographic Citation | K. Hristofis, G. Potamias, M. Tsiknakis, V. Moustakis, S. Orphanoudakis,
"Mining Distributed and Heterogeneous Data Sources in the Medical Domain," presented at European Conference on Machine Learning. Barcelona, Spain, 2000. | en |