Data Processing

Shuyi Chen, Ruilin Wu

Task 1 Data Exploration and Discovery

acm_fellows.csv contains 1493 rows and 2 columns; the column names are name and year. The dataset records awards from 1994 to 2023. There are no duplicate rows or missing values. However, the following researchers received the award more than once.

conference_ranking.csv contains 1493 rows and 3 columns; the column names are Antonym, Name and Rank. Conferences are ranked into four levels: A*, A, B, and C. There are no duplicate rows or missing values. However, among the A*-ranked conferences, three different conferences share the same abbreviation: ICIS appears three times. As a result, relying solely on abbreviations may lead to ambiguity and potential errors.

country-info.csv contains 468 rows and 3 columns, with the column names: institution, region, and countryabbrv. The file has no duplicate rows or missing values. Since U.S. institutions are excluded, the majority of institutions are from Europe, totaling 239 institutions. In contrast, Africa has the fewest institutions, with only 3 entries.

csrankings.csv contains 30400 rows and 4 columns, with the column names: name, affiliation, homepage and scholarid. The dataset has no duplicate rows or missing values. However, when considering only the scholarid column, 10401 scholar IDs appear more than once.

data.csv contains 1827 rows and 26 columns. There are 8727 missing values in total, distributed across ten columns in differing amounts. These missing values may pose challenges in obtaining a complete understanding of the CS schools represented.

dblp-aliases.csv contains 100987 rows and 2 columns, with column names: alias and name. While there are no missing values, the dataset includes 3 duplicate rows.

field_conference.csv contains 77 rows and 3 columns with the column names: major, field and conference. The major is consistently Computer Science, but the dataset includes 27 distinct fields spanning various conferences.

generated-author-info.csv contains 239169 rows and 6 columns, with column names: name, dept, area, count, adjustedcount and year. Despite its size, the dataset has no duplicate rows or missing values. The trend of adjustedcount over the years shows a significant increase in research publications after 2000.

geolocation.csv contains 570 rows and 3 columns, with column names: institution, latitude and longitude. The dataset has no duplicate rows or missing values. The latitude and longitude values span both negative and positive ranges, indicating global coverage: the ranked CS institutions are distributed across countries worldwide.

turing.csv contains 89 rows and 2 columns, with column names: name and year. The dataset has no duplicate rows or missing values. The award years range from 1966 to 2023. Notably, six researchers received the award in 2002, making it one of the most awarded years.
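The per-file checks above (row/column counts, duplicate rows, missing values) follow one pattern. Our pipeline is written in R; the following Python sketch illustrates the same checks on a hypothetical miniature CSV, using only the standard library.

```python
import csv
import io

def summarize(csv_text):
    """Report row count, column count, duplicate rows, and missing values
    for a CSV string -- a minimal stand-in for the exploration checks."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    n_dups = len(body) - len({tuple(r) for r in body})
    n_missing = sum(1 for r in body for v in r if v == "")
    return {"rows": len(body), "cols": len(header),
            "duplicate_rows": n_dups, "missing_values": n_missing}

# Hypothetical miniature of turing.csv, for illustration only.
sample = "name,year\nAlan Kay,2003\nAlan Kay,2003\nBarbara Liskov,\n"
print(summarize(sample))
# {'rows': 3, 'cols': 2, 'duplicate_rows': 1, 'missing_values': 1}
```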

Task 2 Data Loading and Transformation

We processed all the datasets by reading CSV files, transforming specific columns, and storing them in an SQLite database. One key transformation we performed was splitting names into first_name, middle_name, and last_name for datasets like csrankings.csv and generated-author-info.csv. We achieved this using strsplit(df$name, "\\s+"), which separated names based on spaces. The first element was assigned as first_name, while the middle portion, if present, was combined into middle_name. The last element was usually assigned as last_name, but if the last token was numeric, we used the second-to-last word instead to ensure proper name extraction.
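The splitting logic above (the actual pipeline uses R's strsplit) can be sketched in Python as follows; the numeric-suffix example is hypothetical.

```python
def split_name(full_name):
    """Split a full name into (first, middle, last) on whitespace,
    mirroring the strsplit(df$name, "\\s+") transformation. If the final
    token is numeric, the second-to-last token becomes the last name."""
    tokens = full_name.split()
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]          # drop numeric disambiguation suffix
    first = tokens[0]
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])   # empty string when no middle name
    return first, middle, last

print(split_name("John Paul Smith"))  # ('John', 'Paul', 'Smith')
print(split_name("Wei Wang 0002"))    # ('Wei', '', 'Wang')
```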

Additionally, we enhanced the conference_ranking.csv dataset by categorizing conferences based on their affiliations with ACM, IEEE, or both. Once we completed the transformations, we wrote all datasets into an SQLite database using dbWriteTable(), with queries verifying successful data storage. Lastly, for this task, we used dbListTables(con) to confirm that all tables had been created, ensuring a structured and organized database for further analysis.
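The database-loading step uses R's DBI (dbWriteTable, dbListTables); a minimal Python analogue with sqlite3 is sketched below. The added society column and the sample rows are illustrative assumptions, not the real data.

```python
import sqlite3

# Write a transformed table into SQLite and confirm it was created,
# mirroring dbWriteTable() followed by dbListTables().
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE conference_ranking "
            "(Antonym TEXT, Name TEXT, Rank TEXT, society TEXT)")
con.executemany(
    "INSERT INTO conference_ranking VALUES (?, ?, ?, ?)",
    [("SIGMOD", "ACM SIGMOD Conference", "A*", "ACM"),          # sample row
     ("ICSE", "Intl. Conf. on Software Engineering", "A*", "Both")],
)
con.commit()

# Equivalent of dbListTables(con): enumerate tables in the database.
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['conference_ranking']
```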

conference_ranking.csv (added academic society)

csrankings.csv (the name column was split into its component first_name, middle_name and last_name values)


generated-author-info.csv (the name column was split into its component first_name, middle_name and last_name values)

acm_fellows.csv

country-info.csv

data.csv



(columns 16-26 of 26 are not shown for space reasons)

dblp-aliases.csv

field_conference.csv

geolocation.csv

turing.csv

Task 3 Data Cleaning

Task 3.1

  1. The first 10 names from the missing authors list, sorted alphabetically.

generated_author_info

acm_fellows

turing

We combined the three missing-author lists into a single data frame of distinct names (824 missing names in total). We then split the name column into first name, middle name, and last name.

Finally, we inserted name, first_name, middle_name, and last_name from the missing-authors data frame into the csrankings table; all other fields default to NULL.
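The insertion step can be sketched in Python with sqlite3 (our pipeline is in R); the sample author is hypothetical, and the point is that unspecified csrankings columns are left as NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE csrankings (
    name TEXT, first_name TEXT, middle_name TEXT, last_name TEXT,
    affiliation TEXT, homepage TEXT, scholarid TEXT)""")

# A hypothetical missing author with pre-split name parts.
missing = [("Ada Lovelace", "Ada", "", "Lovelace")]
con.executemany(
    "INSERT INTO csrankings (name, first_name, middle_name, last_name) "
    "VALUES (?, ?, ?, ?)", missing)

# Columns not named in the INSERT default to NULL.
row = con.execute("SELECT * FROM csrankings").fetchone()
print(row)  # ('Ada Lovelace', 'Ada', '', 'Lovelace', None, None, None)
```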

Task 3.2

We use a two-step approach to determine whether two (or more) records from the csrankings table represent the same author. First, we define a function that checks if two records have the same last and first name and compatible middle names—either an exact match or one name being an initial matching the first letter of the other. For instance, “John P. Smith” and “John Paul Smith” match if they share the same last name, have the same first name, and either matching or initial-matching middle names; we also consider two records duplicates if one or both are missing a middle name but the first and last names match (e.g., “John Smith” and “John Paul Smith”). Second, once we identify groups of authors who are duplicates, we pick one “most complete” name—typically the one with the longest middle name—to serve as the canonical reference (the “synonym”) for everyone in that group. This ensures, for example, that “John Smith,” “John P Smith,” and “John Paul Smith” all unify under “John Paul Smith.”
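The two-step matching rule can be sketched as follows (a Python illustration of the logic described above, not the R implementation itself):

```python
def middle_compatible(m1, m2):
    """Middle names are compatible when they match exactly, one is an
    initial of the other, or at least one is missing."""
    m1, m2 = m1.strip("."), m2.strip(".")
    if not m1 or not m2 or m1 == m2:
        return True
    # initial vs. full middle name, e.g. "P" vs "Paul"
    return (len(m1) == 1 and m2.startswith(m1)) or \
           (len(m2) == 1 and m1.startswith(m2))

def same_author(a, b):
    """a, b are (first, middle, last) tuples, as in Task 3.2."""
    return a[0] == b[0] and a[2] == b[2] and middle_compatible(a[1], b[1])

def canonical(group):
    """Pick the most complete name (longest middle name) as the synonym."""
    return max(group, key=lambda t: len(t[1]))

print(same_author(("John", "P.", "Smith"), ("John", "Paul", "Smith")))     # True
print(same_author(("John", "", "Smith"), ("John", "Paul", "Smith")))       # True
print(same_author(("John", "Peter", "Smith"), ("John", "Paul", "Smith")))  # False
```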

Task 3.3

The following are representative subsets of the duplicate_author table.


A total of 5705 rows of duplicate names were found by our implementation.

Task 3.4

We retrieved the csrankings and duplicate_author tables from the database, then joined them to replace any duplicate names with their canonical synonyms, standardizing the name column. We then grouped the updated records by canonical name, merged duplicate rows by selecting, for each attribute, the longest (most complete) value, and wrote the resulting clean table back to the database. The cleaned csrankings table has 28115 rows, compared with 31224 before cleaning, so we found and merged 3109 duplicate rows.
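The merge-by-longest-value step can be sketched as below; the field names and sample rows are illustrative, not taken from the real tables.

```python
def merge_rows(rows):
    """Collapse rows that share a canonical name by keeping, for each
    attribute, the longest (most complete) non-missing value.
    rows: list of dicts with identical keys and the same 'name'."""
    merged = {}
    for key in rows[0]:
        values = [r[key] or "" for r in rows]  # treat None as empty
        merged[key] = max(values, key=len)
    return merged

# Hypothetical duplicate group after synonym replacement.
group = [
    {"name": "John Paul Smith", "affiliation": "",
     "homepage": "http://jps.example.org"},
    {"name": "John Paul Smith", "affiliation": "Example University",
     "homepage": ""},
]
print(merge_rows(group))
```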

Task 3.5

We update the generated_author_info table by replacing duplicate author names with their canonical synonyms as identified in the duplicate_author table.

We deliberately avoid merging rows because we want to preserve the original structure, where each row represents a single count entry (e.g., one paper). Merging records would aggregate counts and potentially lose the granularity of the data. By simply replacing the names, we standardize author identifiers across tables while retaining the original row-level detail for further analysis. Therefore, generated_author_info_clean retains all 239169 rows.
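The replace-without-merge step can be sketched with an SQL UPDATE run from Python; the table shapes follow the report, and the sample rows are made up. Note the row count is unchanged afterward.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE generated_author_info (name TEXT, area TEXT, count REAL)")
con.execute("CREATE TABLE duplicate_author (name TEXT, synonym TEXT)")
con.executemany("INSERT INTO generated_author_info VALUES (?, ?, ?)",
                [("John P. Smith", "sigmod", 1.0),
                 ("John Paul Smith", "icse", 2.0)])
con.execute("INSERT INTO duplicate_author VALUES ('John P. Smith', 'John Paul Smith')")

# Replace duplicate names with canonical synonyms in place; no rows
# are aggregated, so per-paper count entries survive intact.
con.execute("""
    UPDATE generated_author_info
    SET name = (SELECT d.synonym FROM duplicate_author d
                WHERE d.name = generated_author_info.name)
    WHERE name IN (SELECT name FROM duplicate_author)
""")
rows = con.execute(
    "SELECT name, count FROM generated_author_info ORDER BY count").fetchall()
print(rows)  # both rows kept, names standardized
```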

Task 4 Analysis

Task 4.1

SQL Query:

SELECT
    g.name AS author,
    g.dept AS institution,
    SUM(g.count) AS a_star_publication
FROM generated_author_info_clean AS g
JOIN conference_ranking AS c
    ON LOWER(g.area) = LOWER(c.Antonym)
WHERE c.Rank = 'A*'
    AND g.year BETWEEN 2010 AND 2024
GROUP BY g.name, g.dept
ORDER BY a_star_publication DESC, author ASC, institution ASC
LIMIT 100;

Output:
From the above SQL statement, we generated the top 100 authors. The following screenshot displays the top 10 of them.

Task 4.2

SQL Query:

WITH RankedInstitutions AS (
    SELECT
        COALESCE(fc.field, 'Unknown Field') AS cs_field,
        g.dept AS institution_name,
        COUNT(*) AS a_star_publications,
        RANK() OVER (
            PARTITION BY COALESCE(fc.field, 'Unknown Field')
            ORDER BY COUNT(*) DESC, g.dept ASC
        ) AS rank_position
    FROM generated_author_info_clean g
    JOIN conference_ranking c
        ON LOWER(g.area) = LOWER(c.Antonym)
    LEFT JOIN field_conference fc
        ON LOWER(g.area) LIKE '%' || LOWER(fc.conference) || '%' -- substring matching
    WHERE c.Rank = 'A*'
    GROUP BY COALESCE(fc.field, 'Unknown Field'), g.dept
)
SELECT
    cs_field,
    institution_name,
    a_star_publications
FROM RankedInstitutions
WHERE rank_position <= 10
ORDER BY cs_field ASC, a_star_publications DESC, institution_name ASC;

Output: From the above SQL statement, we generated the top 10 institutions in each field. The total number of fields is 17.