Introduction

It would be really useful to be able to retrieve people with high expertise (a strong publishing record) in an area of research. My goal in this notebook is to get the most probable author at Columbia University in NYC given a phrase or topic. I’m going to use R to do this, mostly because there’s an R package that retrieves PubMed information directly from NCBI, without needing to download a file and save it in a local directory before processing.

This tutorial will also help me find committee members for my qualifying exam and thesis. So, like any analysis, I’m going to tailor it towards the question at hand. However, I’m going to leave notes throughout on how you can use the code for your own purposes.

There are a few packages I’m going to use in this tutorial:

  • For the PubMed scraping portion, I’m going to use the package RISmed and draw from here. This package is great for fetching from the PubMed database, and really from any NCBI database you want to extract from.

  • For the preprocessing section, I’m going to use the tidyr and dplyr packages. These are great packages for making a tidy dataframe for reliable analysis.

  • For the Bayes classification portion, I’m going to draw heavily from here. Bayes classification, in brief, estimates how probable each class (here, an author) is given the observed features (here, words), based on what’s already been seen. I’m going to explain this concept a bit more in that section, but the link is a great explanation as well, probably a better one.

The outcome of this notebook will be:

1. Extract abstracts by Columbia biomedical sciences authors from PubMed using functions in the RISmed package.

2. Preprocess the Abstract data into a usable form to classify topics and associate to authors using the tidyr and dplyr packages.

3. Generate a Naive Bayes Classifier to calculate the probability of an author given a topic or a phrase.

Disclaimer

I am not claiming this to be an original analysis, or that this is all my original code. I am truly standing on the shoulders of giants, and I give credit where credit is due. If I’ve mistakenly failed to, I apologize; please let me know so I can give due credit!

Extracting Pubmed articles from Columbia University

I’m retrieving (hopefully) all records with at least one author with a Columbia University in New York City affiliation.

suppressMessages( library(RISmed) )

search_term <- "columbia[ad] AND york[ad] NOT missouri[ad]"

#may have to change retmax to account for the result count. If it's too high EUtilsSummary will complain
search_query <- EUtilsSummary(search_term, 
                              type="esearch", 
                              mindate=2012, maxdate=2018, 
                              retmax = 30000)

summary(search_query)

records<- EUtilsGet(search_query)

Preprocessing pubmed records for reliable downstream analysis

Now I have to parse the records to get the authors with the right affiliation.

Basically, the records object is an instance of the Medline class, from which the Author, Affiliation, Abstract and other fields can be extracted. Since I’m only interested in authors from Columbia, I want to get the Columbia authors along with the article abstract. So it looks like I’ll have to:

1. Figure out which authors in an article are from Columbia.

2. Then put those authors and the article’s abstract in a dataframe.

# Pull each field out once instead of calling the accessors on every iteration
affiliations <- Affiliation(records)
author_list  <- Author(records)
abstracts    <- AbstractText(records)

count <- length(author_list)

pubmed <- NULL

for(i in seq_len(count)){

  x <- affiliations[[i]]

  # TRUE for every affiliation string of article i that mentions Columbia
  z <- grepl("columbia", x, ignore.case = TRUE)

  if( length(z) == 0 ) next

  authors <- author_list[[i]]

  # one output row per Columbia-affiliated author of article i
  for(j in which(z)){

    row <- c( paste0(authors[j,"LastName"],", ",authors[j,"Initials"]),
              authors[j,"order"],
              nrow(authors),
              x[j],
              abstracts[[i]]
    )

    pubmed <- rbind( pubmed, row )

  }
}

colnames(pubmed) <- c( "Author", "Author_Order" , "Tot_Authors", "Affiliation", "Abstract" )

pubmed_raw <- as.data.frame( pubmed )

rownames(pubmed_raw) <- 1:nrow(pubmed_raw)

head(pubmed_raw)

And now I’ll process this and save for use later.

First I’ll load some helper functions:

# functions for cleaning and tokenizing
clean_text = function(text) gsub('[^A-Za-z ]', '', tolower(text))
tokenize = function(text) strsplit(gsub(' {1,}', ' ', text), ' ')
#for the cleaned dataframe
pubmed_df <- pubmed_raw

# tokens are joined back together with this delimiter so each field stays a single string
delimiter <- ","

# need to tokenize in order to save
pubmed_df$Affiliation = sapply(pubmed_df$Affiliation, function(x){
                            paste0(
                              unlist(tokenize(clean_text(x))),
                              collapse = delimiter
                              )
                            
                            }
                          )

# need to tokenize for downstream analysis as well as saving
pubmed_df$Abstract = sapply(pubmed_df$Abstract, function(x){
                            paste0(
                              unlist(tokenize(clean_text(x))),
                              collapse = delimiter
                              )
                            
                            }
                          )
head(pubmed_df)
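
To make the two helpers concrete, here is a small sketch of what they do to a couple of made-up strings (the sample text is my own, not from the dataset). It shows two side effects worth knowing about: hyphenated words get fused because the hyphen is deleted rather than replaced, and a leading space produces an empty first token, which is probably where the empty entries in the tokenized fields come from.

```r
# Same helper definitions as above, repeated so this snippet runs on its own
clean_text <- function(text) gsub("[^A-Za-z ]", "", tolower(text))
tokenize   <- function(text) strsplit(gsub(" {1,}", " ", text), " ")

# Hypothetical sample text, not taken from the PubMed records
sample_text <- "Drug-drug interactions, 2018!"

# Punctuation and digits are deleted, so "drug-drug" becomes "drugdrug"
tokens <- tokenize(clean_text(sample_text))[[1]]
print(tokens)

# A leading space survives cleaning and yields an empty first token
leading <- tokenize(clean_text(" Side effects"))[[1]]
print(leading)
```

If the fused words or empty tokens matter for your use case, one option is to replace punctuation with a space and trim whitespace before splitting.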

Now I’m separating out the words in the abstract so each word will be treated as an observation and associated to an author.

suppressMessages( library(tidyr) )

pubmed_df_unnest <- pubmed_df %>% unnest(Abstract = strsplit(Abstract,","))
head(pubmed_df_unnest, 1)
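
If you want to see what this step produces without loading tidyr, here is a minimal base R sketch of the same one-row-per-word reshaping. The toy data frame and its values are hypothetical; only the column names mirror the ones used above.

```r
# Hypothetical two-author example with comma-delimited abstracts
toy <- data.frame(
  Author   = c("Doe, J", "Roe, A"),
  Abstract = c("drug,side,effects", "gene,network"),
  stringsAsFactors = FALSE
)

# Split each abstract into its words
token_list <- strsplit(toy$Abstract, ",", fixed = TRUE)

# Repeat each author once per word so every word becomes its own row
toy_long <- data.frame(
  Author   = rep(toy$Author, times = lengths(token_list)),
  Abstract = unlist(token_list),
  stringsAsFactors = FALSE
)

print(toy_long)
```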

I’m going to save this cleaned file, since it’s a lot to process each time I compile the notebook. But the code works, and can be modified to your needs. For example, you may want other fields like the article’s date of publication: you can retrieve that and add it as a feature to the dataframe. Or you can rework the part of the for loop where I sequentially add rows to the pubmed dataframe.

suppressMessages( library(readr) )

write_tsv(pubmed_df_unnest,"~/test/columbia_authors_tokenizedabstracxts.txt")
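
On the point about the for loop adding rows one at a time: a common alternative is to collect the per-article pieces in a list and bind them once at the end. The sketch below shows the pattern on made-up data; `make_rows()` and `n_per_article` are hypothetical stand-ins for the per-article extraction, not part of the original code.

```r
# Hypothetical stand-in for the per-article work: returns a character matrix
# with one row per affiliated author, or NULL when there are none
make_rows <- function(article_id, n_authors) {
  if (n_authors == 0) return(NULL)
  cbind(Author  = paste0("Author", seq_len(n_authors), ", A"),
        Article = rep(as.character(article_id), n_authors))
}

# Made-up number of affiliated authors for three articles
n_per_article <- c(2, 0, 1)

# Collect the pieces in a pre-sized list instead of calling rbind() in the loop
pieces <- vector("list", length(n_per_article))
for (i in seq_along(n_per_article)) {
  rows <- make_rows(i, n_per_article[i])
  if (!is.null(rows)) pieces[[i]] <- rows
}

# Bind everything once at the end; NULL entries are ignored by rbind()
toy_pubmed <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
print(toy_pubmed)
```

Growing a matrix with `rbind()` inside a loop copies it on every iteration, so binding once tends to scale better when there are tens of thousands of records.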

Here I’m just loading the processed dataframe for use in the notebook, which removes the need to redo all the downloading and processing steps every time I compile it :)

It’s a huge file (4.8 GB)! I’m going to specify the column names, which might actually help it load faster. Still, I think loading a roughly 5 GB file in around a minute is fast! But then again, I don’t have very good benchmarks to compare against :P

suppressMessages( library(readr) )

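# Note: write_tsv() above already wrote a header row, so passing col_names here
# reads that header back in as a data row (which is why every column, including
# the numeric ones, parses as character below). Adding skip = 1 would drop it.
pubmed_df_unnest <- read_tsv(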
  file = "~/test/columbia_authors_tokenizedabstracxts.txt",
  col_names = c( "Author", "Author_Order" , "Tot_Authors", "Affiliation", "Abstract" )
  )
## Parsed with column specification:
## cols(
##   Author = col_character(),
##   Author_Order = col_character(),
##   Tot_Authors = col_character(),
##   Affiliation = col_character(),
##   Abstract = col_character()
## )
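
One thing to watch when supplying column names for a file that already has a header: the header line is then treated as data. Here is a base R analogue of that behaviour on a throwaway temp file (the toy table is hypothetical, and `read.delim()` is used instead of `read_tsv()` so the sketch has no package dependencies). With readr, my understanding is that the equivalent fix is passing `skip = 1` alongside `col_names`.

```r
# Hypothetical two-row table written with a header row
toy  <- data.frame(Author = c("Doe, J", "Roe, A"), Author_Order = c(1, 3))
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Supplying names while treating every line as data keeps the header as row 1,
# which forces the numeric column to be read as character
with_header_row <- read.delim(path, header = FALSE,
                              col.names = c("Author", "Author_Order"),
                              stringsAsFactors = FALSE)

# Skipping the first line drops the old header, so Author_Order parses as numeric
without_header_row <- read.delim(path, header = FALSE, skip = 1,
                                 col.names = c("Author", "Author_Order"),
                                 stringsAsFactors = FALSE)

str(with_header_row)
str(without_header_row)
```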

Naive Bayes Classifier

Now that we’ve preprocessed the PubMed data, we can calculate the probabilities. Much of the inspiration and code for this post came from here.

In brief, each row in our dataframe, pubmed_df_unnest, is an observation. Each column is a feature associated to the observations. Those features include:

  • The Author placement in the list of all authors of an article,

  • A relative distance of their position from the last author (the objective is to later use that information to filter the results towards last authors; the assumption is that the people I’m looking for to serve on my committees will most likely be last or close to last authors on articles listed in PubMed),

  • The affiliation of the author. This will tell me what department they were or are from, and I can use this feature for filtering later on.

  • A word within the abstract (the assumption is the collection of words in an article’s abstract will give evidence for what the associated author’s expertise is in)

suppressMessages( library(dplyr) )

# relabel the Author and feature columns
train_data <- pubmed_df_unnest %>% 
  mutate(Author = Author, feature = Abstract) %>% 
  select(Author, feature)

# compute P(author), P(term|author) and finally log(P(author) * P(term|author)),
# the log joint probability, which is proportional to P(author|term) for a given term
get_prob_data = function(train_data){
    total_feats = dim(train_data)[1]
    train_data %>% group_by(Author) %>%
        mutate(total_class_feats = n(),
               p_class = total_class_feats/total_feats) %>%
        group_by(Author, feature) %>%
        summarize(log_prob = log(mean(p_class*(n()/total_class_feats))))
}

prob_data <- get_prob_data(train_data)
head(prob_data, 3)
## # A tibble: 3 x 3
## # Groups:   Author [1]
##       Author feature  log_prob
##        <chr>   <chr>     <dbl>
## 1 A Bacha, E address -16.44415
## 2 A Bacha, E   after -15.75100
## 3 A Bacha, E against -16.44415
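
To see what these numbers mean, here is a small base R sketch of the same calculation on a hypothetical four-word corpus (the authors and words are made up). It also shows that the product P(author) * P(term|author) simplifies to the number of times the author used the term divided by the total number of words in the corpus.

```r
# Hypothetical training data: one row per (author, word) observation
toy <- data.frame(
  Author  = c("Doe, J", "Doe, J", "Doe, J", "Roe, A"),
  feature = c("drug", "drug", "gene", "gene"),
  stringsAsFactors = FALSE
)

total_feats <- nrow(toy)

# Number of times each author used each word
pair_counts <- aggregate(count ~ Author + feature,
                         data = cbind(toy, count = 1), FUN = sum)

# Total number of words per author
author_totals  <- tapply(toy$feature, toy$Author, length)
totals_for_row <- as.numeric(author_totals[as.character(pair_counts$Author)])

# P(author), P(term | author), and the log of their product
pair_counts$p_author            <- totals_for_row / total_feats
pair_counts$p_term_given_author <- pair_counts$count / totals_for_row
pair_counts$log_prob <- log(pair_counts$p_author * pair_counts$p_term_given_author)

print(pair_counts)
```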

To get the joint probability of each author for the features, we have to multiply the probabilities together. Unfortunately, if we do that we will almost certainly suffer from underflow, meaning R will have trouble storing the very, very small numbers. That’s why we take the log: summing log probabilities is equivalent to multiplying probabilities, and it avoids underflow. Additionally, we don’t just want to score terms already associated with an author; we also need a probability for query terms an author has never used. In the function below, each such term is given a small fixed pseudo-probability (pseudo_prob, 1e-10 by default).

# function for returning the ranked classes for a text
naive_bayes = function(text, data, k=10, pseudo_prob = 1e-10){
    pseudo_prob = log(pseudo_prob)
    tokens = tokenize(clean_text(text))[[1]]
    n = length(tokens)
    filter(data, feature %in% tokens) %>%
        group_by(Author) %>%
        summarize(score = sum(log_prob) + pseudo_prob*(n-n())) %>%
        arrange(desc(score)) %>%
        head(k)
}
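
Here is a minimal sketch of the two ideas behind that function, using base R only. The first part shows the underflow problem directly; the second re-implements the scoring rule on a made-up log-probability table. `score_authors()` and `toy_probs` are hypothetical names for illustration, not part of the original code.

```r
# Helper definitions repeated from above so the snippet is self-contained
clean_text <- function(text) gsub("[^A-Za-z ]", "", tolower(text))
tokenize   <- function(text) strsplit(gsub(" {1,}", " ", text), " ")

# 1) Underflow: the product of 40 probabilities of 1e-10 is too small for a double
tiny_probs  <- rep(1e-10, 40)
raw_product <- prod(tiny_probs)      # collapses to 0
log_sum     <- sum(log(tiny_probs))  # stays finite

# 2) Scoring on a hypothetical log-probability table (made-up authors and values)
toy_probs <- data.frame(
  Author   = c("Doe, J", "Doe, J", "Roe, A"),
  feature  = c("drug", "gene", "gene"),
  log_prob = log(c(0.5, 0.25, 0.25)),
  stringsAsFactors = FALSE
)

# Hypothetical base R stand-in for naive_bayes()
score_authors <- function(text, data, pseudo_prob = 1e-10) {
  tokens  <- tokenize(clean_text(text))[[1]]
  matched <- data[data$feature %in% tokens, ]
  matched_sum <- tapply(matched$log_prob, matched$Author, sum)
  matched_n   <- tapply(matched$log_prob, matched$Author, length)
  # every query token the author never used costs log(pseudo_prob)
  score <- as.numeric(matched_sum) +
    log(pseudo_prob) * (length(tokens) - as.numeric(matched_n))
  sort(setNames(score, names(matched_sum)), decreasing = TRUE)
}

scores <- score_authors("drug gene", toy_probs)
print(scores)
```

In this toy case the author who used both query words outranks the one who used only one, because the missing word is penalized by log(1e-10). Authors who match none of the query words do not appear in the ranking at all.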

Now we have a function that will return a ranked list of authors given a text. Ok, let’s predict potential committee members!

Give me a ranking of most probable authors for a Topic or Phrase

First, a sanity check: I want to see if I can pick up authors that I know.

I generally know what my current PI and the members of my lab publish on, so let me see if they rank highly.

terms <- 'systems drugs side effects pipeline mechanisms'

naive_bayes(terms, prob_data, k=10)
## # A tibble: 10 x 2
##                 Author     score
##                  <chr>     <dbl>
##  1              NA, NA -67.37325
##  2       Tatonetti, NP -90.07075
##  3       Lorberbaum, T -91.47951
##  4            Vilar, S -92.08565
##  5         Hripcsak, G -96.18941
##  6             Shen, Y -96.79043
##  7           Stone, GW -98.19735
##  8            Leon, MB -98.26189
##  9              Liu, C -98.33885
## 10 Vunjak-Novakovic, G -98.56199

Cool! I know most of these people so this is pretty accurate in terms of what they publish on.

So let’s say I want people who have expertise in genomics and bioinformatics and things like that:

terms <- 'genomics bioinformatics mechanisms network computational statistical'

naive_bayes(terms, prob_data, k=10)
## # A tibble: 10 x 2
##            Author      score
##             <chr>      <dbl>
##  1         NA, NA  -72.36186
##  2    Califano, A  -89.83436
##  3          Yu, J  -96.87314
##  4        Wang, X  -98.51320
##  5    Hripcsak, G  -98.81045
##  6       Reitz, C -100.81578
##  7     Rabadan, R -101.10347
##  8  Tatonetti, NP -101.95076
##  9 Bussemaker, HJ -102.76169
## 10        Wang, Y -103.87112

Great! Now I can look into asking Andrea Califano, Raul Rabadan or Xiadong Wang to see if they’d be a good fit for my committees. One more point: this is a great way to learn who publishes well in an area. For example, I didn’t know who Xiadong Wang was, but now I may take a look at his publications and research in more depth!

There are a lot of caveats to this, and I try to address them in the next section. But this gives me a great starting point for finding people with relevant expertise who would be good to have on my committees.

This is a great tool for finding people who published a lot in a certain area. This can be useful for finding domain experts in general.

Caveats and Future Work

I welcome feedback! In this section I try to address the many caveats in the above analysis:

  • I am biasing towards those that publish a lot compared to new PIs or PIs who transferred from other universities.

  • A positive from that is I can more probably pick up PIs because they have more publications than lab members, generally. A negative is I’m not specifically picking up people that are presently at Columbia, or necessarily publishing more or less exclusively in the field I’m looking for.

  • You can see I have some nonsensical authors e.g. “NA, NA”.

  • Some authors may have slight deviations in their names, e.g. they use both their first and middle initials in some articles and only the first in others. We then end up with two author labels for what is actually the same person, which is a confounder.

  • Some authors use slightly different affiliations in different articles. This is another confounder for attributing an author with a unique affiliation.
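
For the name-variant caveat, one crude option is to collapse each label to the last name plus the first initial before grouping. The sketch below is only an illustration: `name_key()` is a hypothetical helper, the names are examples, and the approach will wrongly merge different people who share a last name and first initial.

```r
# Hypothetical author labels in the "LastName, Initials" format used above
authors <- c("Tatonetti, NP", "Tatonetti, N", "Wang, X", "Wang, Y")

# Hypothetical helper: keep the last name and only the first initial
name_key <- function(author) {
  parts <- strsplit(author, ", ", fixed = TRUE)
  vapply(parts,
         function(p) paste0(p[1], ", ", substr(p[2], 1, 1)),
         character(1))
}

keys <- name_key(authors)
print(keys)

# The two Tatonetti variants now share one key
print(length(unique(keys)))
```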

Refining my model for the question at hand

For removing bad instances, and for obtaining authors that are at Columbia presently and are actual PIs of labs, I need to do some filtering.

The best way to start is modifying the analysis so that:

  • I get to see their Affiliation. This will help me see the distribution of departments represented and select authors from particular departments.

  • I get to see their order in the contributing authors. If they’re close to last in the author list, they’re more probably the PI of a lab.

  • It’s still going to be hard to figure out which authors are still at Columbia. There will be cases where authors have moved on to industry or to other universities. However, those who publish a lot, and so rank highly, are more likely to still be at Columbia than students or post-docs, who have probably moved on.

First, let’s remove “NA, NA” authors after classification (there’s only one):

terms <- 'systems drugs side effects pipeline mechanisms'

naive_bayes(terms, prob_data, k=10) %>% 
  filter( !(Author == "NA, NA") )
## # A tibble: 9 x 2
##                Author     score
##                 <chr>     <dbl>
## 1       Tatonetti, NP -90.07075
## 2       Lorberbaum, T -91.47951
## 3            Vilar, S -92.08565
## 4         Hripcsak, G -96.18941
## 5             Shen, Y -96.79043
## 6           Stone, GW -98.19735
## 7            Leon, MB -98.26189
## 8              Liu, C -98.33885
## 9 Vunjak-Novakovic, G -98.56199
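
Notice that filtering after taking the top k leaves only 9 rows instead of 10. If you want exactly k real authors back, filter before truncating. A base R sketch on a made-up ranking (the authors and scores are hypothetical):

```r
# Hypothetical ranking, already sorted by descending score
ranked <- data.frame(
  Author = c("NA, NA", "Doe, J", "Roe, A", "Poe, E"),
  score  = c(-60, -90, -95, -99),
  stringsAsFactors = FALSE
)
k <- 3

# Truncate first, then filter: one of the k slots is wasted on the bad label
filter_after <- head(ranked, k)
filter_after <- filter_after[filter_after$Author != "NA, NA", ]

# Filter first, then truncate: all k slots go to real authors
filter_before <- head(ranked[ranked$Author != "NA, NA", ], k)

print(nrow(filter_after))
print(nrow(filter_before))
```

With the functions in this notebook, the same effect could presumably be had by dropping the “NA, NA” rows from prob_data before passing it to naive_bayes().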

Now I can add Affiliation and the other Author features. I want to do this because I can, hopefully, pick out PIs and filter out students or post-docs. In the example above, “Tatonetti, NP” is actually the PI of my lab and “Lorberbaum, T” was a lab member. I’m not sure how prevalent this is, but maybe extra filtering will help me get closer to answering my question.

We’ll use some handy-dandy join functions from dplyr. Because we’ll get a lot of duplicates from the join, we want to do some filtering:

  • We don’t want the Abstract field

  • Different articles will have an author in a different position in the author list. Maybe by filtering for cases where the author is close to the end, we can be more confident we’re getting PIs.

  • Because of slight deviations in the affiliation, maybe we can group affiliations by author and then get those words that are unique.

Hopefully the filtering will help in giving attributes to authors that we calculated probabilities for.
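
As a sketch of the author-position idea, here is how a per-author median distance from the last author could be computed in base R on made-up data (the authors and positions are hypothetical; the column names mirror the ones used above). The important detail is that the median is taken within each author; a single median over all rows gives every author the same value.

```r
# Hypothetical author positions across several articles
toy <- data.frame(
  Author       = c("Doe, J", "Doe, J", "Doe, J", "Roe, A", "Roe, A"),
  Author_Order = c(5, 8, 3, 1, 2),
  Tot_Authors  = c(5, 8, 4, 6, 4),
  stringsAsFactors = FALSE
)

# 0 means last author; larger values mean further from the end of the list
toy$dist_from_last <- toy$Tot_Authors - toy$Author_Order

# Median within each author
per_author <- aggregate(dist_from_last ~ Author, data = toy, FUN = median)
print(per_author)

# For contrast: one pooled median over all rows, identical for every author
pooled <- median(toy$dist_from_last)
print(pooled)
```

In a dplyr pipeline, the per-author version corresponds to adding `group_by(Author)` before the step that computes the median.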

terms <- 'systems drugs side effects pipeline mechanisms'


tmp1 <- naive_bayes(terms, prob_data, k=100) %>% 
  filter( !(Author == "NA, NA") ) %>% 
  left_join(pubmed_df_unnest , by = c("Author" = "Author") ) %>% 
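  # NOTE: there is no group_by(Author) before this mutate(), so the median is
  # taken over all joined rows and every author gets the same Author_Median
  # (see the output below); add group_by(Author) here for a per-author value
  mutate(Author_Median = median(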
                            as.numeric(Tot_Authors) - as.numeric(Author_Order),
                            na.rm=T) 
        ) %>% 
  select(Author, score, Author_Median , Affiliation) %>% 
  distinct()

tmp2 <- aggregate(. ~ Author, tmp1, paste0)

tmp3 <- tmp2

tmp3$score <- sapply(tmp2$score,
                     function(x){
                                sum( as.numeric( unique(x) ) )
                                })

tmp3$Author_Median <- sapply(tmp2$Author_Median,
                     function(x){
                                median( as.numeric( unique(x) ) )
                                })

tmp3$Affiliation <- sapply(tmp2$Affiliation,function(x){
                                unique(
                                  strsplit( paste0(x , collapse=","),",")[[1]]
                                )
                                })

tmp3 %>%
  arrange(desc(score)) %>% 
  head()
##          Author     score Author_Median
## 1 Tatonetti, NP -90.07075             2
## 2 Lorberbaum, T -91.47951             2
## 3      Vilar, S -92.08565             2
## 4   Hripcsak, G -96.18941             2
## 5       Shen, Y -96.79043             2
## 6     Stone, GW -98.19735             2
##   Affiliation
## 1 observational, health, data, sciences, and, informatics, columbia, university, new, york, ny, usa, maryreginabolandgmailcom, department, of, biomedical, electronic, address, nicktatonetticolumbiaedu, dept, systems, biology, medical, center, west, th, street, ph, departments, medicine, united, states, america, ohdsi, st, vc,
## 2 department, of, physiology, and, cellular, biophysics, columbia, university, new, york, biomedical, informatics, ny, usa, medicine, united, states, america, systems, biology, departments, , medical, center
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ciqupdepartment, of, chemistry, and, biochemistry, faculty, sciences, university, porto, portugal, department, biomedical, informatics, columbia, medical, center, new, york, ny, usa, united, states, america, systems, biology, observational, health, data, ohdsi, departamento, de, qumica, orgnica, facultad, farmacia, universidad, santiago, compostela, spain, departments, medicine, west, th, st, vc, electronic, address, savdbmicolumbiaedu, nicktatonetticolumbiaedu, , organic, pharmacy, qosantiyahooes
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               department, of, biomedical, informatics, columbia, university, new, york, ny, usa, observational, health, data, sciences, and, united, states, america, w, th, street, electronic, address, hripcsakcolumbiaedu, medical, center, services, newyorkpresbyterian, hospital, epidemiology, population, albert, einstein, college, medicine, yeshiva, bronx, ph, ohdsi, west, jersey, , bioinformatics, usahripcsakcolumbiaedu, vc
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                              department, of, systems, biology, columbia, university, medical, center, new, york, ny, usa, and, biomedical, informatics, earth, environmental, engineering, west, th, street, mudd, building, for, translational, immunology, biochemistry, molecular, biophysics, united, states, america, departments, chemistry, pediatrics, medicine, jp, sulzberger, genome, yscolumbiaedu, pascolumbiaedu, jnbcolumbiaedu, pathology, cell, yorkny, hercumccolumbiaedu, irving, cancer, research, st, nicholas, avenue, xuewen, liu, mark, b, stoopler, yufeng, shen, jinli, chen, mahesh, mansukhani, sanjay, koul, balazs, halmos, alain, c, borczuk, haiying, cheng, montefiore, centeralbert, einstein, college, sun, yatsen, guangzhou, peoples, republic, china, yuxia, jia, penn, state, milton, s, hershey, pa, , system, wkccumccolumbiaedu, megansykescolumbiaedu, computational, bioinformatics, the, newyork
## 6 columbia, university, medical, centernew, yorkpresbyterian, hospital, new, york, cardiovascular, research, foundation, presbyterian, center, the, east, th, street, floor, ny, usa, and, newyorkpresbyterian, city, from, of, north, carolina, chapel, hill, mac, brigham, womens, heart, vascular, dlb, beth, israel, deaconess, cmg, harvard, school, boston, ma, gws, universit, parisdiderot, sorbonne, paris, cit, inserm, u, dhu, fire, hpital, bichat, assistance, publiquehpitaux, de, france, gs, institute, medicine, science, national, lung, imperial, college, royal, brompton, london, united, kingdom, kerckhoff, clinic, thoraxcenter, bad, nauheim, germany, cwh, scripps, la, jolla, ca, mjp, medicines, company, parsippany, nj, jp, se, end, stanford, kwm, rah, auckland, zealand, hdw, clinical, trials, for, interventional, therapy, division, cardiology, electronic, address, gscolumbiaedu, hospitalcolumbia, centernewyorkpresbyterian, br, pg, tm, xh, am, gw, rm, ajk, jd, gagnon, morristown, du, sacrcoeur, montral, quebec, canada, department, pneumology, helios, amperklinikum, dachau, bw, els, amp, charles, bendheim, shaare, zedek, jerusalem, zena, michael, a, wiener, icahn, at, mount, sinai, toledo, oh, rg, moo, oby, sanger, institutecarolinas, healthcare, system, charlotte, nc, mjr, wellmont, cva, kingsport, tn, dcm, lebauerbrodie, educationcone, health, greensboro, tds, brb, jy, ub, prince, wales, nsw, australia, syo, kx, freiburg, krozingen, fjn, minneapolis, abbott, northwestern, mn, tdh, cedarssinai, los, angeles, lehigh, valley, network, allentown, pa, dac, reid, firsthealth, carolinas, pinehurst, pld, ohio, state, wexner, columbus, elm, lebauer, foundationcone, states, nns, fact, french, alliance, unit, assistancepubliquehpitaux, nhli, pgs, thorax, translational, green, lane, service, od, ik, j, puskas, cleveland, jfs, international, centre, circulatory, pws, hygiene, tropical, sjp, oxford, hospitals, dpt, banning, leicester, nhs, trust, mh, ag, all, in, santa, clara, 
cas, es, pp, hospitalier, luniversit, hteldieu, sm, nn, montreal, piedmont, atlanta, dek, nl, wmb, ramsay, gnrale, sant, hopital, priv, jacques, cartier, massy, mcm, semmelweis, budapest, bm, fh, szeged, iu, gb, both, hungary, medisch, centrum, leeuwarden, pwb, ajb, erasmus, rotterdam, apk, netherlands, barcelona, ms, pomar, silesia, katowice, american, poland, ustron, pb, bochenek, physicians, surgeons, broadway, uoc, fondazione, irccs, policlinico, san, matteo, pavia, italy, sl, palo, alto, herbert, sandi, feinberg, valve, dd, mbl, dk, map, jwm, washington, st, louis, mo, jml, program, advanced, coronary, disease, duke, durham, emo, henry, ford, detroit, mi, wwo, illinois, chicago, as, miami, miller, fl, mgc, massachusetts, general, ifp, nb, il, nu, tufts, nkk, seattle, wl, gdd, nk, gsm, rp, christ, lindner, cincinnati, djk, mc, yo, uk, aa, hh, jn, british, vancouver, jl, emory, ga, ljs, walter, reed, military, bethesda, md, tcv, california, irvine, mjk, lund, sweden, diego, pravia, ea, ps, leesburg, regional, hmgg, newyork, presbyteriancolumbia, mv, mms, rigshospitalet, copenhagen, denmark, pc, dpartement, hospitalouniversitaire, fibrosis, inflammation, remodelling, nykoebing, falster, isala, klinieken, zwolle, avh, db, gstonecrforg, avenue, e, mlo, pariscit, lvts, hupnvs, aphp, tl, gg, os, ss, jyc, bern, switzerland, sw, erasmusmc, surgery, institut, cardiovasculaire, sud, aalst, onzelievevrouwziekenhuis, ziekenhuis, belgium, ww, deparment, lorrain, coeur, et, des, vaisseaux, ilcv, nancybrabois, vandoeuvrelsnancy, ec, hospitalo, universitaire, fibrose, remodelage, diderot, maasstad, pcs, thoraxcentrum, twente, enschede, cvb, gentofte, hellerup, sg, basel, rvj, kyoto, graduate, japan, tk, gwm, hoag, memorial, newport, beach, di, lm, society, angiography, interventions, dc, ro, seoul, main, korea, hsk, ferrara, departm, universitynewyorkpresbyterian, td, cl, sjb, angiology, ii, universittsherzzentrum, freibrug, methodist, brooklyn, kpa, kp, dbm, kja, ldr, 
sciences, uppsala, sj, warsaw, aw, holy, name, hackensack, ajm, gilead, inc, foster, ao, rff, nanjing, first, china, slc, lz, stone, cardiac, hz, djx, jz, psychology, arts, mxc, sheffield, amkr, chmengxcom, yorkpresbyteriancolumbia, nyu, langone, nrs, internal, teaching, dg, jw, sb, mvm, rabin, petach, tikva, rk, qubec, ct, radiology, bc, women, rb, gc, hp, newyorkpresbyteriancolumbia, baltimore, pag, cardiothoracic, tel, aviv, sourasky, ybg, r, mohr, ff, ak, mehran, drr, jjm, ks, istituto, cardiologia, bologna, tp, escola, paulista, medicina, universidade, federal, so, paulo, israelita, albert, einstein, paolo, brazil, ac, ek, vincent, indiana, indianapolis, jbh, mwk, rambam, care, campus, technionisrael, technology, haifa, en, ospedale, papa, giovanni, xxiii, bergamo, petachtikva, washingtonharborview, prehospital, emergency, wa, gn, jae, huntsville, al, ws, southern, county, ds, vanderbilt, nashville, jm, laval, saintefoy, saint, velomedix, menlo, park, gwt, ri, jagiellonian, krakow, ad, tr, dante, pazzanese, sao, alexandre, abizaid, rac, andrea, inspiremd, eb, tiqva, isar, munich, sjk, ts, asan, sgs, gd, serrao, mit, amper, kliniken, lodz, jzp, moses, cone, charitecvk, berlin, mm, terrence, donnelly, michaels, toronto, ontario, ab, cbg, hadassahhebrew, dp, jdn, america, education

The affiliation tokens are kind of messy, but they still work for what we want to do: query which Columbia department each author is in.
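Here’s a minimal sketch of what that kind of query could look like. It assumes Affiliation is a list column holding each author’s unique tokens, which is how the printout above looks. Because the tokens are single words, a department phrase can’t be matched as one string, so the sketch checks that every word of interest is present. The data frame `toy`, the helper `has_tokens`, and the vector `wanted` are hypothetical names made up for illustration, and the rows are invented stand-ins rather than real results.

```r
library(dplyr)

# Toy stand-in for the scored author table (hypothetical values).
# Assumption: Affiliation is a list column of unique affiliation tokens.
toy <- data.frame(
  Author = c("Author A", "Author B", "Author C"),
  score  = c(-90.1, -98.2, -92.1),
  stringsAsFactors = FALSE
)
toy$Affiliation <- list(
  c("department", "of", "biomedical", "informatics", "columbia"),
  c("cardiovascular", "research", "foundation", "columbia"),
  c("department", "of", "systems", "biology", "biomedical", "informatics")
)

# Hypothetical helper: TRUE when every wanted token is in the token vector.
has_tokens <- function(tokens, wanted) all(wanted %in% tokens)

# Tokens that together stand in for "biomedical informatics".
wanted <- c("biomedical", "informatics")

# Keep only the rows whose tokens include all wanted words, best score first.
dbmi <- toy %>%
  filter(vapply(Affiliation, has_tokens, logical(1), wanted = wanted)) %>%
  arrange(desc(score))

dbmi$Author
# "Author A" "Author C"
```

The key point is that `filter()` works on rows, so the department test has to go there. Requiring every token is a rough proxy: an author who only collaborates with a department will also pass if those words appear anywhere in their pooled affiliations.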

Also, I don’t really know what to do with the Author_Median feature. I thought it would help distinguish junior versus senior authors, but it’s not very informative right now.

Now maybe I can filter for specific departments:

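# Show the top-scoring authors alongside their affiliation tokens.
# Note: contains() is a tidyselect helper that matches column *names*, so it
# does not filter rows by department; no column matches here, so it adds nothing.
tmp3 %>%
  select(Author, score, Affiliation, contains("department of biomedical informatics")) %>%
  arrange(desc(score)) %>%
  head()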
##          Author     score
## 1 Tatonetti, NP -90.07075
## 2 Lorberbaum, T -91.47951
## 3      Vilar, S -92.08565
## 4   Hripcsak, G -96.18941
## 5       Shen, Y -96.79043
## 6     Stone, GW -98.19735
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Affiliation
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        observational, health, data, sciences, and, informatics, columbia, university, new, york, ny, usa, maryreginabolandgmailcom, department, of, biomedical, electronic, address, nicktatonetticolumbiaedu, dept, systems, biology, medical, center, west, th, street, ph, departments, medicine, united, states, america, ohdsi, st, vc, 
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 department, of, physiology, and, cellular, biophysics, columbia, university, new, york, biomedical, informatics, ny, usa, medicine, united, states, america, systems, biology, departments, , medical, center
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ciqupdepartment, of, chemistry, and, biochemistry, faculty, sciences, university, porto, portugal, department, biomedical, informatics, columbia, medical, center, new, york, ny, usa, united, states, america, systems, biology, observational, health, data, ohdsi, departamento, de, qumica, orgnica, facultad, farmacia, universidad, santiago, compostela, spain, departments, medicine, west, th, st, vc, electronic, address, savdbmicolumbiaedu, nicktatonetticolumbiaedu, , organic, pharmacy, qosantiyahooes
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               department, of, biomedical, informatics, columbia, university, new, york, ny, usa, observational, health, data, sciences, and, united, states, america, w, th, street, electronic, address, hripcsakcolumbiaedu, medical, center, services, newyorkpresbyterian, hospital, epidemiology, population, albert, einstein, college, medicine, yeshiva, bronx, ph, ohdsi, west, jersey, , bioinformatics, usahripcsakcolumbiaedu, vc
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                              department, of, systems, biology, columbia, university, medical, center, new, york, ny, usa, and, biomedical, informatics, earth, environmental, engineering, west, th, street, mudd, building, for, translational, immunology, biochemistry, molecular, biophysics, united, states, america, departments, chemistry, pediatrics, medicine, jp, sulzberger, genome, yscolumbiaedu, pascolumbiaedu, jnbcolumbiaedu, pathology, cell, yorkny, hercumccolumbiaedu, irving, cancer, research, st, nicholas, avenue, xuewen, liu, mark, b, stoopler, yufeng, shen, jinli, chen, mahesh, mansukhani, sanjay, koul, balazs, halmos, alain, c, borczuk, haiying, cheng, montefiore, centeralbert, einstein, college, sun, yatsen, guangzhou, peoples, republic, china, yuxia, jia, penn, state, milton, s, hershey, pa, , system, wkccumccolumbiaedu, megansykescolumbiaedu, computational, bioinformatics, the, newyork
## 6 columbia, university, medical, centernew, yorkpresbyterian, hospital, new, york, cardiovascular, research, foundation, presbyterian, center, the, east, th, street, floor, ny, usa, and, newyorkpresbyterian, city, from, of, north, carolina, chapel, hill, mac, brigham, womens, heart, vascular, dlb, beth, israel, deaconess, cmg, harvard, school, boston, ma, gws, universit, parisdiderot, sorbonne, paris, cit, inserm, u, dhu, fire, hpital, bichat, assistance, publiquehpitaux, de, france, gs, institute, medicine, science, national, lung, imperial, college, royal, brompton, london, united, kingdom, kerckhoff, clinic, thoraxcenter, bad, nauheim, germany, cwh, scripps, la, jolla, ca, mjp, medicines, company, parsippany, nj, jp, se, end, stanford, kwm, rah, auckland, zealand, hdw, clinical, trials, for, interventional, therapy, division, cardiology, electronic, address, gscolumbiaedu, hospitalcolumbia, centernewyorkpresbyterian, br, pg, tm, xh, am, gw, rm, ajk, jd, gagnon, morristown, du, sacrcoeur, montral, quebec, canada, department, pneumology, helios, amperklinikum, dachau, bw, els, amp, charles, bendheim, shaare, zedek, jerusalem, zena, michael, a, wiener, icahn, at, mount, sinai, toledo, oh, rg, moo, oby, sanger, institutecarolinas, healthcare, system, charlotte, nc, mjr, wellmont, cva, kingsport, tn, dcm, lebauerbrodie, educationcone, health, greensboro, tds, brb, jy, ub, prince, wales, nsw, australia, syo, kx, freiburg, krozingen, fjn, minneapolis, abbott, northwestern, mn, tdh, cedarssinai, los, angeles, lehigh, valley, network, allentown, pa, dac, reid, firsthealth, carolinas, pinehurst, pld, ohio, state, wexner, columbus, elm, lebauer, foundationcone, states, nns, fact, french, alliance, unit, assistancepubliquehpitaux, nhli, pgs, thorax, translational, green, lane, service, od, ik, j, puskas, cleveland, jfs, international, centre, circulatory, pws, hygiene, tropical, sjp, oxford, hospitals, dpt, banning, leicester, nhs, trust, mh, ag, all, in, santa, clara, 
cas, es, pp, hospitalier, luniversit, hteldieu, sm, nn, montreal, piedmont, atlanta, dek, nl, wmb, ramsay, gnrale, sant, hopital, priv, jacques, cartier, massy, mcm, semmelweis, budapest, bm, fh, szeged, iu, gb, both, hungary, medisch, centrum, leeuwarden, pwb, ajb, erasmus, rotterdam, apk, netherlands, barcelona, ms, pomar, silesia, katowice, american, poland, ustron, pb, bochenek, physicians, surgeons, broadway, uoc, fondazione, irccs, policlinico, san, matteo, pavia, italy, sl, palo, alto, herbert, sandi, feinberg, valve, dd, mbl, dk, map, jwm, washington, st, louis, mo, jml, program, advanced, coronary, disease, duke, durham, emo, henry, ford, detroit, mi, wwo, illinois, chicago, as, miami, miller, fl, mgc, massachusetts, general, ifp, nb, il, nu, tufts, nkk, seattle, wl, gdd, nk, gsm, rp, christ, lindner, cincinnati, djk, mc, yo, uk, aa, hh, jn, british, vancouver, jl, emory, ga, ljs, walter, reed, military, bethesda, md, tcv, california, irvine, mjk, lund, sweden, diego, pravia, ea, ps, leesburg, regional, hmgg, newyork, presbyteriancolumbia, mv, mms, rigshospitalet, copenhagen, denmark, pc, dpartement, hospitalouniversitaire, fibrosis, inflammation, remodelling, nykoebing, falster, isala, klinieken, zwolle, avh, db, gstonecrforg, avenue, e, mlo, pariscit, lvts, hupnvs, aphp, tl, gg, os, ss, jyc, bern, switzerland, sw, erasmusmc, surgery, institut, cardiovasculaire, sud, aalst, onzelievevrouwziekenhuis, ziekenhuis, belgium, ww, deparment, lorrain, coeur, et, des, vaisseaux, ilcv, nancybrabois, vandoeuvrelsnancy, ec, hospitalo, universitaire, fibrose, remodelage, diderot, maasstad, pcs, thoraxcentrum, twente, enschede, cvb, gentofte, hellerup, sg, basel, rvj, kyoto, graduate, japan, tk, gwm, hoag, memorial, newport, beach, di, lm, society, angiography, interventions, dc, ro, seoul, main, korea, hsk, ferrara, departm, universitynewyorkpresbyterian, td, cl, sjb, angiology, ii, universittsherzzentrum, freibrug, methodist, brooklyn, kpa, kp, dbm, kja, ldr, 
sciences, uppsala, sj, warsaw, aw, holy, name, hackensack, ajm, gilead, inc, foster, ao, rff, nanjing, first, china, slc, lz, stone, cardiac, hz, djx, jz, psychology, arts, mxc, sheffield, amkr, chmengxcom, yorkpresbyteriancolumbia, nyu, langone, nrs, internal, teaching, dg, jw, sb, mvm, rabin, petach, tikva, rk, qubec, ct, radiology, bc, women, rb, gc, hp, newyorkpresbyteriancolumbia, baltimore, pag, cardiothoracic, tel, aviv, sourasky, ybg, r, mohr, ff, ak, mehran, drr, jjm, ks, istituto, cardiologia, bologna, tp, escola, paulista, medicina, universidade, federal, so, paulo, israelita, albert, einstein, paolo, brazil, ac, ek, vincent, indiana, indianapolis, jbh, mwk, rambam, care, campus, technionisrael, technology, haifa, en, ospedale, papa, giovanni, xxiii, bergamo, petachtikva, washingtonharborview, prehospital, emergency, wa, gn, jae, huntsville, al, ws, southern, county, ds, vanderbilt, nashville, jm, laval, saintefoy, saint, velomedix, menlo, park, gwt, ri, jagiellonian, krakow, ad, tr, dante, pazzanese, sao, alexandre, abizaid, rac, andrea, inspiremd, eb, tiqva, isar, munich, sjk, ts, asan, sgs, gd, serrao, mit, amper, kliniken, lodz, jzp, moses, cone, charitecvk, berlin, mm, terrence, donnelly, michaels, toronto, ontario, ab, cbg, hadassahhebrew, dp, jdn, america, education

Getting authors out by department turns out to be tough. Each affiliation contains so many tokens (centers, institutes, hospitals, even email addresses) that it’s impossible to single out authors by department name alone.
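
To make the problem concrete, here is a minimal sketch in base R. The token vectors are abbreviated copies of two of the affiliation rows printed above; the `departments` list and the matching rule (every word of the department name must appear among the tokens) are my own assumptions for illustration, not part of the analysis so far.

```r
# Abbreviated token sets copied from two affiliation rows in the output above.
affiliation_tokens <- list(
  author_a = c("department", "of", "physiology", "and", "cellular", "biophysics",
               "columbia", "university", "biomedical", "informatics",
               "systems", "biology", "medical", "center"),
  author_b = c("department", "of", "biomedical", "informatics", "columbia",
               "university", "epidemiology", "population", "bioinformatics")
)

# Hypothetical department definitions: each is the set of words in its name.
departments <- list(
  biomedical_informatics = c("biomedical", "informatics"),
  systems_biology        = c("systems", "biology"),
  physiology             = c("physiology")
)

# TRUE when every word of a department name appears in an author's tokens.
# Rows are authors, columns are departments.
matches <- sapply(departments, function(dept) {
  sapply(affiliation_tokens, function(tokens) all(dept %in% tokens))
})

print(matches)

# Number of departments each author "belongs to" under this rule.
print(rowSums(matches))
```

The first author matches all three departments at once, because the tokens pool every affiliation that appears on that author's papers (co-authors included). That is why a simple token match can't assign a single department.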

For this tutorial, the most useful approach seems to be taking the top n authors, filtering out the “NA, NA” authors, and providing enough terms for specificity. That extracts authors with high publishing rates associated with the terms. Unfortunately, it also means a strong bias towards those who have published a lot and against new faculty.
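
As a rough sketch of that filtering step with dplyr: the data frame `author_abstracts`, its columns `Author` and `PMID`, and the helper `top_k_authors()` are all hypothetical names made up for this example, so swap in whatever your own preprocessed table uses.

```r
library(dplyr)

# Hypothetical toy data: one row per (author, abstract) pair.
author_abstracts <- data.frame(
  Author = c("Smith, J", "Smith, J", "NA, NA", "Lee, K",
             "Smith, J", "Lee, K", "NA, NA", "Doe, A"),
  PMID   = 1:8,
  stringsAsFactors = FALSE
)

# Drop the placeholder "NA, NA" authors, count abstracts per author,
# and keep the k most prolific ones.
top_k_authors <- function(df, k = 2) {
  df %>%
    filter(Author != "NA, NA") %>%
    count(Author, name = "n_abstracts", sort = TRUE) %>%
    head(k)
}

top_authors <- top_k_authors(author_abstracts, k = 2)
print(top_authors)
```

Ranking on raw abstract counts is exactly where the seniority bias comes from: an author with a handful of highly relevant papers will always sit below one with dozens.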

Another point of future work is to refine the pubmed search query to include only certain departments instead of all of Columbia, which might make the analysis more precise.
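
One way this could look, as a sketch only: PubMed supports an `[Affiliation]` field tag, so a department-restricted query can be built as a plain string. The helper `build_affiliation_query()` and the department name are assumptions for illustration, and I haven't checked how well this query performs. The RISmed calls are left commented out because they need network access.

```r
# Hypothetical helper: build a PubMed query restricted to one department
# at one institution using the [Affiliation] field tag.
build_affiliation_query <- function(department,
                                    institution = "Columbia University") {
  sprintf('"%s"[Affiliation] AND "%s"[Affiliation]', department, institution)
}

query <- build_affiliation_query("Department of Biomedical Informatics")
print(query)

# With RISmed installed and network access, the query could be passed along:
# library(RISmed)
# res     <- EUtilsSummary(query, type = "esearch", db = "pubmed", retmax = 100)
# records <- EUtilsGet(res)
```

Keep in mind that the affiliation field matches any author on a paper, so collaborators from other departments will probably still show up in the results.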

In the end, it seems most prudent to use this classifier for finding senior authors with high expertise on a topic. Overall, this was a fun weekend project (and my first blog post)!