Department of Computer Science
Institute of Mathematics and Statistics - University of São Paulo
Rua do Matão, 1010
Zip 05508-090 São Paulo - SP - Brazil
{jmena | cesar}@vision.ime.usp.br
Knowledge extraction and visualization from large datasets is an important research topic in computer science with strong potential impact in all scientific fields [12]. Research on this topic typically involves large datasets that cannot be processed and understood by human experts because of their volume, diversity and complexity. Techniques from data mining, summarization, visualization, network modeling and high-performance computing are often brought together to solve problems of this nature. The present work focuses on the summarization and visualization of scientific reports obtained from the Brazilian Lattes platform.
The Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazilian National Council for Scientific and Technological Development, CNPq) works to integrate the curricula of people associated with Brazilian scientific communities in a curricular information system called Lattes. For this reason, the so-called "Lattes Curriculum" is considered a national standard of information about the scientific and academic accomplishments of students, professors, researchers and professionals involved in science and technology in general.
The Lattes curriculum is used for academic evaluation because it represents the history of scientific, academic and professional activities [1]. It is hence a rich and powerful database with numerous potential applications (scientific, technological, economic, etc.). The Lattes curriculum, in the HTML format available on the CNPq site, displays information only on a personal basis, i.e., the registered information is individually associated with each person. This characteristic makes it hard to obtain the bibliographical, technical or artistic productions of a given group, such as a research group, the professors of an academic department or the members of a Brazilian institution.
Currently, most Brazilian academic institutions use the Lattes curricula to prepare reports about the scientific productions, supervisions and projects of research groups related to these institutions, as well as to evaluate the graduate programs in Brazil. The reports are typically created by manually assisted analysis of the Lattes curriculum data of each member of the group, in order to obtain a complete digest of all scientific productions, supervisions and projects of the group. It is important to note that, although the information is structured, this procedure is very cumbersome and time consuming, and is highly susceptible to errors caused by the manual treatment.
There are some interesting bibliometric questions that may be answered about a group just based on the respective Lattes curricula:
scriptLattes, an open-source system, was designed to answer the above questions through automatically created reports. Given a group of researchers registered in the Lattes platform, scriptLattes downloads their Lattes curricula from the CNPq site, extracts the information of interest, eliminates redundant scientific productions, and creates reports on productions and academic supervisions, as well as the collaboration graph and the research map of the members of the group. We believe that the introduced system is a useful tool to easily extract knowledge from the Lattes curricula of a group. This knowledge may be used to explore, identify or validate patterns of academic activities, thus providing bibliometric information about a group of interest [22].
This open-source system runs on a PC with GNU/Linux and is implemented with Perl modules and basic programming-language constructs. scriptLattes is a project registered in the Free Software Competence Center at the University of São Paulo and is hosted at SourceForge [29]. To the best of our knowledge, it is the first such system to be widely used by Brazilian academic groups, including the University of São Paulo (USP), the State of São Paulo Research Foundation (FAPESP), and the Agency for Agro-business Technology in the State of São Paulo (APTA). The system has been successfully tested with at least 300 research groups of Brazilian institutions.
Therefore, the present paper describes a system that allows scientific data summarization from a structured database of curricula vitae, i.e. the Lattes platform. The system extracts useful summarized knowledge from large sets of data, a task that would be too difficult (in many cases, impossible) to perform manually. To produce such a system, different solutions and algorithms have been adopted or proposed and implemented; they are described in the paper. Possible applications of the system are also discussed. The paper's contributions show how Computer Science tools (data structures, algorithms, visualization techniques and a software system) solve an important problem of knowledge extraction from large datasets. The relevance of these contributions lies in the context of knowledge extraction and visualization systems that deal with possibly large datasets, an important Computer Science research topic [10,11,12,15].
The remainder of the paper is organized as follows: Section 2 discusses some important background references, and Section 3 describes the modules of the proposed system. Some results illustrating the use of scriptLattes, as well as a way to explore the obtained information, are described in Section 4. Finally, conclusions and future directions are summarized in Section 5.
It is worth noting that co-authorship is not the only approach to creating scientific networks. For instance, text mining helps generate paper networks, allowing clustering and hierarchical analysis of large datasets of papers based on subject. In all such cases, visualization is of utmost importance for producing good interfaces that allow the user to understand the summarized data [26]. Visualization of networks plays a central role in helping the user to understand and interact with data, mainly because large volumes are typically involved [3].
The input of the system is an ASCII list of Lattes curriculum IDs together with the time period of each member of the group to be analyzed, i.e., the years during which each member has been associated with the group (research group, institute, department, university, etc.). The ID of a Lattes curriculum is a 16-digit number associated with each person registered in the Lattes platform, and it is easily obtained from the Lattes curriculum. The IDs are therefore commonly used to request a given curriculum. Figure 1 shows an example of the input file format.
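As an illustration of this kind of input, the following Python sketch parses a member list. The actual file format is the one shown in Figure 1, which is not reproduced here, so the line layout (`<16-digit ID>, <first year>-<last year>`, with the period optional), the `Member` type and the function name are all hypothetical choices made only for this example.

```python
import re
from typing import List, NamedTuple, Optional


class Member(NamedTuple):
    lattes_id: str              # 16-digit Lattes curriculum ID
    first_year: Optional[int]   # None means "no lower bound"
    last_year: Optional[int]    # None means "no upper bound"


# Hypothetical line format (an assumption, not the real scriptLattes format):
#   <16-digit ID> [, <first year>-<last year>]
LINE_RE = re.compile(r"^(\d{16})\s*(?:,\s*(\d{4})?\s*-\s*(\d{4})?)?$")


def parse_member_list(text: str) -> List[Member]:
    """Parse one member per line, skipping blank lines and '#' comments."""
    members = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"malformed line: {raw!r}")
        lattes_id, first, last = match.groups()
        members.append(Member(lattes_id,
                              int(first) if first else None,
                              int(last) if last else None))
    return members
```

Rejecting any line whose ID is not exactly 16 digits catches copy-and-paste errors before any curriculum is requested.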
The system is divided into six modules. Figure 2 shows the schematic data flow diagram of the whole system, where each module is responsible for processing a given type of information from the Lattes curricula. The system output consists of several reports, in HTML format, showing the summarized information in terms of bibliographical, technical and artistic production, as well as academic supervisions, collaborations and the research map of the members. The HTML format was chosen for all reports because it is a standard format for viewing on the internet.
The curricula are downloaded in HTML format because they are publicly available only as plain HTML. Public users of the Lattes platform have access neither to the Lattes database nor to data in XML format. As a result, special attention is devoted to extracting the information about the scientific productions, as explained in the data preprocessing module.
It is important to note that the types of scientific productions considered by scriptLattes are those registered in the Lattes platform. All available information from the Lattes curricula is considered to be validated by the CNPq. Table 1 lists the types of productions and supervisions considered by the system.
| Category | Type |
|---|---|
| Personal information | Name |
| | Professional address |
| Bibliographical production | Articles in scientific journals |
| | Book published/organized |
| | Book chapter published |
| | Articles in newspapers/magazines |
| | Complete works published in proceedings of conferences |
| | Expanded summary published in proceedings of conferences |
| | Summary published in proceedings of conferences |
| | Articles accepted for publication |
| | Presentations of work |
| | Other kinds of bibliographical production |
| Technical production | Patented or registered software |
| | Not patented or registered software |
| | Technological products |
| | Techniques or process |
| | Technical works |
| | Other kinds of technical production |
| Artistic productions | Artistic/cultural production |
| Ongoing/concluded supervisions | Postdoctorate supervision |
| | Ph.D. thesis |
| | Master's thesis |
| | Monograph of completion for improvement/specialization |
| | Works of completion for graduation |
| | Scientific initiation |
| | Other academic advisory |
The production title is used as the standard characteristic for comparing the scientific productions of the analyzed group. This characteristic is adopted because it is always displayed for every production presented by the Lattes curricula in HTML format. We also considered other characteristics, such as the authors' names or the DOI number. Nonetheless, these values have not yet been standardized in the Lattes platform, which makes them more difficult to parse and compare by automatic means in the current version.
The following algorithm summarizes the process applied by this module on a list of scientific productions:
The FILTER-PRODUCTIONS-LIST procedure takes a list of productions and a threshold value as parameters. The threshold is the similarity level (as a percentage) at or above which two productions are considered similar. The COMPARE procedure uses the longest common subsequence algorithm [5] to calculate the percentage of similarity between the titles of two scientific productions (for comparison purposes, only the length of the longest common subsequence is used as a measure of distance between productions). The publications list is thus filtered by removing equal or similar publications from the list. Currently, in this module, productions whose titles have a similarity of 92% are identified and merged. This percentage was adopted after preliminary experiments and can be easily modified.
All redundant scientific productions are recorded in a two-dimensional matrix that stores the number of bibliographical, technical and artistic collaborations among the members. Let $n$ be the number of elements in the list of productions. Since the $i$-th production is compared with the remaining productions, $O(n^2)$ comparisons (at most $n(n-1)/2$) are required to filter the complete list.
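The redundancy treatment described above can be sketched in Python as follows. This is a minimal illustration and not the scriptLattes code (which is written in Perl): the function names mirror the procedures named in the text, but the case-insensitive comparison, the normalization of the LCS length by the longer title, and the `(member_index, title)` input layout are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def compare(title_a: str, title_b: str) -> float:
    """Similarity in percent; dividing by the longer title is an assumption."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * lcs_length(a, b) / longest


def filter_productions_list(
    productions: List[Tuple[int, str]], threshold: float = 92.0
) -> Tuple[List[Tuple[str, Set[int]]], Dict[Tuple[int, int], int]]:
    """Merge similar titles and count collaborations between member pairs."""
    merged: List[Tuple[str, Set[int]]] = []
    for member, title in productions:
        for kept_title, members in merged:
            if compare(kept_title, title) >= threshold:
                members.add(member)      # redundant entry: same production
                break
        else:
            merged.append((title, {member}))
    collaborations: Dict[Tuple[int, int], int] = {}
    for _, members in merged:
        ordered = sorted(members)
        for i, m1 in enumerate(ordered):
            for m2 in ordered[i + 1:]:
                collaborations[(m1, m2)] = collaborations.get((m1, m2), 0) + 1
    return merged, collaborations
```

Each incoming title is compared only with the titles already kept, so the number of comparisons stays within the quadratic bound, and every merge records which members share the production.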
Note that the process of this module is applied separately to each type of production. Therefore, there may be publications with similar titles that belong to different types. The system does not identify those as being the same.
The process in this module is simplified by using the collaboration matrix, a symmetric matrix that contains the total number of co-authored productions between members and is computed by the redundancy treatment module. The collaboration matrix is taken as an adjacency matrix that represents the graph of collaborations. In the generated graph, it is possible to observe the collaboration among members and the clusters of cooperation. This graph is an instrument that helps to identify the researchers with the most co-authoring activity within the group, and it could be used in detailed analyses of co-authoring such as those in [17,20,28].
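To make the use of the collaboration matrix as an adjacency matrix concrete, here is a small Python sketch. It assumes the pair counts are available as a dictionary keyed by member-index pairs (a hypothetical layout chosen for this example); measuring co-authoring activity as a row sum and finding clusters as connected components are likewise illustrative choices rather than the system's documented method.

```python
from typing import Dict, List, Tuple


def collaboration_matrix(
    n_members: int, pair_counts: Dict[Tuple[int, int], int]
) -> List[List[int]]:
    """Build the symmetric matrix of co-authored productions between members."""
    matrix = [[0] * n_members for _ in range(n_members)]
    for (i, j), count in pair_counts.items():
        matrix[i][j] += count
        matrix[j][i] += count
    return matrix


def coauthoring_activity(matrix: List[List[int]]) -> List[int]:
    """Total number of co-authored productions per member (weighted degree)."""
    return [sum(row) for row in matrix]


def clusters(matrix: List[List[int]]) -> List[List[int]]:
    """Connected components of the collaboration graph (depth-first search)."""
    n = len(matrix)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if matrix[node][other] > 0 and not seen[other]:
                    seen[other] = True
                    stack.append(other)
        components.append(sorted(component))
    return components
```

A member with no collaborations appears as a singleton component, which makes isolated researchers easy to spot in the same pass.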
The country name, city name and zip code of each member, available in the Lattes curricula, are used to query the location: the longitude and latitude values are obtained if the query string is valid. Furthermore, the geographic location of each PhD graduated by the group is also plotted on the map. Thus, the research map shows where the former students are working, giving an idea of the influence exerted by the research group. An interface with Google Maps is used to retrieve the latitude and longitude of the geographic position of each member.
Reports of the productions, as well as of ongoing and concluded supervisions, are created in this module. These reports are separated by type and show quantitative information broken down by year, in reverse chronological order.
Bar charts accompany the reports, with bar lengths proportional to the number of scientific productions of the group. Hence, it is possible to see whether the production volume of a particular group has fallen back to a baseline level.
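The per-year summary with proportional bars can be sketched in a few lines of Python. The text rendering, the `width` parameter and the function name are assumptions for illustration only; the actual reports are HTML pages.

```python
from collections import Counter
from typing import Iterable, List


def yearly_report(years: Iterable[int], width: int = 40) -> List[str]:
    """One line per year, newest first, with a bar proportional to the count."""
    counts = Counter(years)
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    for year in sorted(counts, reverse=True):
        # Scale against the busiest year; every listed year gets at least one mark.
        bar = "#" * max(1, round(width * counts[year] / peak))
        lines.append(f"{year} {counts[year]:4d} {bar}")
    return lines
```

Years without any production are simply absent from this sketch; a fuller report would probably list them with an empty bar so that gaps remain visible.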
Additionally, the generated reports include web links to search engines (including Google and Google Scholar, among others) to help find possible citations or similar works. Currently, reports are generated in HTML or JSP format; these formats were chosen because they are suitable for the internet. The collaboration graph can be viewed as a PNG image, as PostScript, or through an interactive Java applet [18].
We describe two experiments that illustrate potential applications of the software.
In this section we show an example of the reports generated by the system. We use the Lattes curricula of professors associated with the Department of Computer Science of the Institute of Mathematics and Statistics at the University of São Paulo (DCC-IME-USP). The period during which each member belonged to the group was given as input data whenever that information was available.
Figure 3 shows some of the reports generated for the group. The complete list is available in [23]. Figure 3(c) shows the generated collaboration graph, and Figure 3(d) presents an example of the generated research map. Each member of the group is represented in green and each PhD graduate in blue. The advisor-advisee relation is represented by a thin straight line. The name and complete address of a researcher are shown by clicking on the location points.
The matrix of collaborations can be used in different approaches to explore and
to extract knowledge. For instance, in Figure 4 it is
possible to observe the temporal collaborations of a group of professors in the
last five triennia (years 1994-2008). The temporal academic collaboration
graphs show how the collaborations were made during each period, i.e. the
evolution of collaborations among the members through the years.
Note that some members have been collaborating in a stable way throughout the last five triennia. One member collaborated with other members only in the triennium 2003-2005. On the other hand, another member has been collaborating in the group since the triennium 2000-2002.
With the temporal graph of collaborations it is possible to automatically
characterize the collaboration among members. In particular, it is possible to
determine the degree of temporal collaboration of the whole group using
measurements such as those described in [6].
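The temporal view can be illustrated with a short Python sketch that splits co-authored productions into consecutive periods and builds one collaboration matrix per period. The input layout (a list of `(year, set of member indices)` pairs for the merged productions) and the function names are assumptions for this example; the default bounds correspond to the five triennia between 1994 and 2008 mentioned above.

```python
from typing import Iterable, List, Set, Tuple


def temporal_matrices(
    n_members: int,
    coauthored: Iterable[Tuple[int, Set[int]]],
    first_year: int = 1994,
    last_year: int = 2008,
    length: int = 3,
) -> List[List[List[int]]]:
    """One symmetric collaboration matrix per period of `length` years."""
    span = last_year - first_year + 1
    periods = -(-span // length)  # ceiling division
    matrices = [[[0] * n_members for _ in range(n_members)]
                for _ in range(periods)]
    for year, members in coauthored:
        if not first_year <= year <= last_year:
            continue  # outside the analyzed interval
        period = (year - first_year) // length
        for a in members:
            for b in members:
                if a != b:
                    matrices[period][a][b] += 1
    return matrices


def active_periods(matrices: List[List[List[int]]], member: int) -> List[int]:
    """Indices of the periods in which `member` collaborated with anyone."""
    return [p for p, matrix in enumerate(matrices) if any(matrix[member])]
```

With the default bounds, period 0 covers 1994-1996 and period 4 covers 2006-2008, so a member whose `active_periods` result is `[3]` collaborated only in 2003-2005.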
The second experiment briefly investigates the profile (proportion of publication) of bibliographical productions from several groups with distinct characteristics.
To define the research groups, we used the Brazilian research groups database maintained by the CNPq. In May 2009, using the keyword computação as a query in the database, 56 research groups in Computer Science belonging to São Paulo State were extracted. For every group, the list of researchers registered at the CNPq was used to create the list of Lattes curricula. Altogether, 56 lists were created as input to scriptLattes. In order to use the same time period for all scientific productions, the remainder of this experiment considers the productions of the last 10 years, i.e., scientific productions between 1999 and 2008.
Figure 5(a) shows the total amount of bibliographical productions extracted from 306 members belonging to the research groups using scriptLattes. Note that these global amounts indicate a preference to publish as follows: complete works in conferences, summary in conferences and articles in scientific journals. See the radar chart in Figure 5(b). The data visualization through radar charts helps to display both the dominant type of bibliographical production for a given group, and which groups are most similar, as described below.
The reports of bibliographical productions of each group were used to explore and classify the profiles of the research groups. The scientific productions of the last 10 years of each bibliographical type were concatenated to form a vector of 90 elements. Then, the vector was normalized by dividing its elements by the maximum value. Thus, each element stores a relative amount of annual scientific productions of the group. Principal component analysis [13] (PCA) was carried out to examine the 56 vectors of temporal scientific productions. This analysis transforms correlated data into a smaller number of uncorrelated principal components. In particular, the first two components were used to define four classes of research groups. The groups were clustered into classes using the centroid distance method, which was adopted because it is a natural way to cluster a small number of groups.
Please refer to [8] for more detailed descriptions. The first two
components take into account more than of the total variation of the
analyzed vectors.
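The vector construction and the projection onto the first two principal components can be sketched with the Python standard library alone. This is an illustrative sketch, not the analysis code used in the experiment: the function names are hypothetical, the eigenvectors are approximated by power iteration with deflation (a statistics library would normally be used instead), and the clustering step by the centroid distance method is not included.

```python
import math
from typing import List, Sequence


def profile_vector(counts_by_type: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the yearly counts of every type and divide by the maximum."""
    flat = [float(v) for series in counts_by_type for v in series]
    peak = max(flat, default=0.0)
    return [v / peak for v in flat] if peak > 0 else flat


def leading_eigenvector(matrix: List[List[float]], iterations: int = 500) -> List[float]:
    """Power iteration from a fixed, non-uniform start vector (an assumption)."""
    d = len(matrix)
    vec = [float(k + 1) for k in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(row[k] * vec[k] for k in range(d)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance left in the deflated matrix
        vec = [x / norm for x in nxt]
    return vec


def first_two_components(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Project each row (at least two rows needed) on the first two components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    axes = []
    for _ in range(2):
        axis = leading_eigenvector(cov)
        eigenvalue = sum(axis[i] * cov[i][j] * axis[j]
                         for i in range(d) for j in range(d))
        axes.append(axis)
        # Deflation: remove the variance explained by the axis just found.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[k] * axis[k] for k in range(d)) for axis in axes]
            for c in centered]
```

The sign of each component is arbitrary, so only relative positions of the groups in the resulting plane are meaningful.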
Figure 6(a) shows the results of clustering the 56 Computer Science research groups using the PCA data extracted by the scriptLattes reports. Each research group is represented by a regular shape (e.g. circle, square, triangle or diamond) and corresponds to one of four identified classes.
Figure 6(b) shows the production profiles of four groups: 2, 5, 8 and 13. These groups were selected because they are the most representative of the identified classes. Observe that the first principal component represents variation in the profile from complete works in conferences to summaries published in conferences. The second is rather flat, probably representing the variation in the profile from complete works in conferences to articles in scientific journals. This semantic interpretation depends on the area of the analyzed research groups.
Similar approaches can be explored to analyze the overall progress of the productions of Brazilian scientific and technological groups, i.e. to analyze the patterns of increase or decrease in bibliographical productions through the years. In this context, the quantitative values obtained with scriptLattes can be used to automatically estimate the different patterns of productivity and to show whether any pattern is significantly influenced by the area of knowledge (e.g. Social, Applied, Medical or Biological Sciences).
In this paper we have presented an open-source system for the extraction and visualization of knowledge from Lattes curricula. The system makes it possible to analyze scientific productions based on curricula registered in the Lattes platform. It is a simple way to obtain a survey of significant performance indicators of research groups, and to analyze the progress of scientific productions and the relevant information about their activities (e.g. see in reference [30] an analysis of the performance of research groups associated with Agronomy, Genetics and Sociology).
scriptLattes deals only with structured information from Lattes curricula. A natural improvement to the system is the use of information from other, more general or semi-structured sources. In that sense, new methods for automatic information extraction can be explored [7,27].
An important issue regarding the generation of the collaboration graph is the correct disambiguation of titles in author citations. Alternative methods, such as the one in [9], may be explored with the aim of using an unsupervised learning approach. scriptLattes is currently being improved to make the system more general and efficient. Future improvements aim at exploring the analysis of co-authoring networks [2,17,20,28], measuring and extracting proximity in networks [14], finding and evaluating community structure in networks [21], and applying ontologies [4,25]. We believe that these new approaches will allow further contributions to the extraction and management of knowledge from the Lattes platform.
While there is great potential in exploring the Lattes curricula vitae, an important effort is still needed on the part of the institutions that host the data to supply software interfaces that facilitate data extraction in pursuit of knowledge discovery on the Lattes platform. Until then, there is an important opportunity to combine efforts to provide bibliographical production data that can be integrated with the information already available on the internet.
ACKNOWLEDGEMENTS
The authors would like to thank Fabio Kon, Fabrício Martins Lopes
and Yossi Zana for discussions and suggestions on this work and the anonymous
reviewers for the critical comments and valuable suggestions. This work was
supported by CAPES, CNPq and FAPESP.