Worksheet 4: NLP and Classifying

This week, we studied some examples of research that used machine-learning and natural language processing techniques to classify and understand large amounts of text data.

In this worksheet, we walk through the use of an API that we can connect to in order to classify tweet content for toxicity.

The API was developed on top of a model built by Google, the purpose of which is to identify content that may need online moderation.

The API is called the “Perspective” API and details can be found here.

There is guidance on how to get started with this API here, and there is also a useful rundown of how the API works on the GitHub page of the R package we will be using to connect to it.
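Requests to the API need an API key. The lines below are a minimal sketch of making a key available to the current R session. They assume that the peRspective package reads the key from an environment variable named perspective_api_key, as we understand its README to describe; please confirm this against the package documentation. "YOUR_API_KEY" is a placeholder, not a real key.

```r
# Assumption: peRspective looks for the key in the environment variable
# `perspective_api_key`; check the package README before relying on this.
# "YOUR_API_KEY" is a placeholder and must be replaced with your own key.
Sys.setenv(perspective_api_key = "YOUR_API_KEY")

# Check that the variable is visible to the current R session.
key_is_set <- nzchar(Sys.getenv("perspective_api_key"))
key_is_set
```

Setting the key with Sys.setenv() only lasts for the current session. Storing it in your .Renviron file instead keeps it out of scripts that you might share.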

We will be classifying the same tweets as discussed in Barrie (2023).

You can load these data directly from GitHub with:

tweets_sample <- readRDS(gzcon(url("https://github.com/cjbarrie/CS-ED/blob/main/data/tweets-ranked.rds?raw=true")))

We can then use the following code to classify this content:

library(peRspective)
library(dplyr)
library(ggplot2)

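# Character vector of the model names shipped with the package
models <- c(peRspective::prsp_models)
# Keep a subset of these models for scoring
models_subset <- models[c(1:5, 7, 9:10, 12, 14)]
models_subset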

# Score each tweet with every model in models_subset
toxtwts <- tweets_sample %>%
  prsp_stream(text = text,
              text_id = tweet_id,
              score_model = models_subset,
              verbose = TRUE,
              safe_output = TRUE)

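# Rename the output columns; this assumes the scores are returned
# in the same order as models_subset
colnames(toxtwts) <- c("tweet_id", "error", models_subset)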

tweets_sample_tox_r <- tweets_sample %>%
  left_join(toxtwts, by = "tweet_id")
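To see what the join does and how we might pull out tweets for inspection, here is a minimal sketch on toy data. The objects toy_tweets and toy_scores are made-up stand-ins for tweets_sample and toxtwts, and the values in them are invented for illustration only.

```r
library(dplyr)

# Toy stand-ins for tweets_sample and toxtwts (made-up values)
toy_tweets <- tibble(
  tweet_id = c("1", "2", "3"),
  text     = c("first tweet", "second tweet", "third tweet")
)
toy_scores <- tibble(
  tweet_id = c("1", "2"),
  error    = c("No Error", "No Error"),
  TOXICITY = c(0.82, 0.05)
)

# left_join keeps every tweet, including those without a score
toy_joined <- toy_tweets %>%
  left_join(toy_scores, by = "tweet_id")

# Tweets with no returned score keep their row but have NA scores
unscored <- toy_joined %>%
  filter(is.na(TOXICITY))

# Highest-scoring tweets first, for manual inspection
top_toxic <- toy_joined %>%
  filter(!is.na(TOXICITY)) %>%
  arrange(desc(TOXICITY)) %>%
  select(tweet_id, text, TOXICITY)
```

The same filter, arrange and select steps could be applied to tweets_sample_tox_r once the real scores have been joined.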

And then we can see some examples of tweets along with TOXICITY scores:

| tweet_id | user_username | text | source | possibly_sensitive | author_id | lang | conversation_id | created_at | user_name | user_profile_image_url | user_location | user_verified | user_description | user_url | user_created_at | user_protected | user_pinned_tweet_id | retweet_count | like_count | quote_count | user_tweet_count | user_list_count | user_followers_count | user_following_count | sourcetweet_type | sourcetweet_id | sourcetweet_text | sourcetweet_lang | sourcetweet_author_id | in_reply_to_user_id | date | error | TOXICITY | SEVERE_TOXICITY | IDENTITY_ATTACK | INSULT | PROFANITY | THREAT | ATTACK_ON_AUTHOR | ATTACK_ON_COMMENTER | INFLAMMATORY | OBSCENE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1558572382947876865 | crabcrawler1 | RT @Fire_fux: @crabcrawler1 https://t.co/HcKNbVEQIP | Twitter Web App | FALSE | 1324742403492892673 | qme | 1558572382947876865 | 2022-08-13T21:53:07.000Z | Crab Man | https://pbs.twimg.com/profile_images/1587559247637843970/CNfOGD_7_normal.jpg | Ohio, USA | FALSE | DONATE PLEASE https://t.co/BF7MOhA7d0 https://t.co/qzu1mWI2EM NEWs and politics. By night time I normally watch TV and movies. Also, go to church on Sunday. | NA | 2020-11-06T15:56:36.000Z | FALSE | 1590851209111670784 | 1 | 0 | 0 | 83259 | 152 | 30253 | 1806 | retweeted | 1558570726965350401 | @crabcrawler1 https://t.co/HcKNbVEQIP | qme | 1531709708783984642 | NA | 2022-08-13 | No Error | 0.1004571 | 0.0169647 | 0.0029228 | 0.0218039 | 0.2472755 | 0.0064921 | 0.0207121 | 0.1875430 | 0.0881871 | 0.3237084 |
| 1579236823527608321 | HankVenture5 | Blocked and reported https://t.co/18aKrUQy92 | Twitter for iPhone | FALSE | 1285388676751437824 | en | 1579236823527608321 | 2022-10-09T22:26:14.000Z | Hank Venture | https://pbs.twimg.com/profile_images/1389707714826182656/EFOyaLz2_normal.jpg | NA | FALSE | Slightly right-of-center so evil probably. Gen-X. Stamp Collector. Team Venture Stan account. Chief Executive Officer of #HankCo | NA | 2020-07-21T01:39:30.000Z | FALSE | 1491907548827381761 | 0 | 12 | 0 | 31043 | 10 | 3887 | 930 | quoted | 1579236650244517889 | No matter which city you’re in, all pizza is pretty much the same | en | 11203972 | NA | 2022-10-09 | No Error | 0.0673801 | 0.0027275 | 0.0057347 | 0.0175498 | 0.0272739 | 0.0126478 | 0.0914053 | 0.1940324 | 0.1848972 | 0.1367777 |
| 1591122529694818304 | krus_chiki | @eDonut_ Look i’ve said it before and i’ll say it again, Kruschiki supply co almost became a retro/vintage candy store | Twitter for iPhone | FALSE | 922267018526707712 | en | 1591120235104727040 | 2022-11-11T17:35:47.000Z | krus🪖 | https://pbs.twimg.com/profile_images/1227595520773885952/8MWK0ssD_normal.jpg | NA | FALSE | cossack. rare rug dealer. sells military surplus. super villains, inc. | https://t.co/52TOsDw8YZ | 2017-10-23T01:02:46.000Z | FALSE | 1378899646630801416 | 0 | 8 | 0 | 41302 | 94 | 39752 | 1646 | NA | NA | NA | NA | NA | 1563662952892227584 | 2022-11-11 | No Error | 0.0570059 | 0.0016975 | 0.0096194 | 0.0233509 | 0.0215351 | 0.0073271 | 0.0090324 | 0.0460657 | 0.0358919 | 0.0665830 |
| 1578034288842264582 | entelechiada | gm https://t.co/5maKDeYAHV | Twitter for iPhone | FALSE | 1415419642693111809 | und | 1578034288842264582 | 2022-10-06T14:47:47.000Z | Entelechiada | https://pbs.twimg.com/profile_images/1501024250114678787/NFVa-9Xe_normal.jpg | United States | FALSE | entelechy + enchiladas: here for the good stuff | https://t.co/iG5j2CQlzw | 2021-07-14T21:15:41.000Z | FALSE | 1565024760718798848 | 0 | 8 | 0 | 3504 | 7 | 594 | 747 | NA | NA | NA | NA | NA | NA | 2022-10-06 | No Error | 0.0138227 | 0.0017548 | 0.0015724 | 0.0084249 | 0.0213302 | 0.0065569 | 0.0279729 | 0.0983622 | 0.1409975 | 0.0694246 |
| 1573056421914382336 | EITC_Official | RT @EITC_Official: This 8th grade science teacher at @RVKPanthers explains that one of her boy students wearing fingernail polish was “one… | Twitter for iPhone | FALSE | 1520028377457078275 | en | 1573056421914382336 | 2022-09-22T21:07:31.000Z | 👁 Inside The Classroom | https://pbs.twimg.com/profile_images/1522240813442342912/6oXnbTIV_normal.jpg | San Antonio, TX | FALSE | Providing receipts that refute, “it’s not happening.” Videos belong to their respective owners. | https://t.co/qI9VxrbMp2 | 2022-04-29T13:13:46.000Z | FALSE | 1592899675530866689 | 53 | 0 | 0 | 7250 | 73 | 18285 | 947 | retweeted | 1571349132090179589 | This 8th grade science teacher at @RVKPanthers explains that one of her boy students wearing fingernail polish was “one of the best experiences [she has] had so far as an educator.” https://t.co/8DvkicyG5O | en | 1520028377457078275 | NA | 2022-09-22 | No Error | 0.0426573 | 0.0023270 | 0.0108034 | 0.0187100 | 0.0209886 | 0.0089453 | 0.4159997 | 0.5920156 | 0.4637360 | 0.5854430 |
| 1583998206547226627 | MsAvaArmstrong | RT @RSBNetwork: President Donald J. Trump: “They’re coming after me because I am fighting for you, and that is true.” Join us for the full… | Twitter for iPhone | FALSE | 2449913803 | en | 1583998206547226627 | 2022-10-23T01:46:16.000Z | AvaArmstrong, 🇺🇸Author | https://pbs.twimg.com/profile_images/1377022534235918341/r_zFt102_normal.jpg | Last Outpost | FALSE | Thriller-Romance author on AMAZON. MY OPINION STATED HERE. Expert at triggering Leftists. Trump won. CONTENT OF CHARACTER. Taken by @EagleEyeFlyer ❤️ NO DM🚫 | NA | 2014-04-17T15:02:47.000Z | FALSE | 1591443575249768448 | 518 | 0 | 0 | 785719 | 374 | 157472 | 86624 | retweeted | 1583992381535186944 | President Donald J. Trump: “They’re coming after me because I am fighting for you, and that is true.” Join us for the full speech on Rumble: https://t.co/XJnh6tcghu https://t.co/xx9A404nF0 | en | 4041824789 | NA | 2022-10-23 | No Error | 0.0451312 | 0.0022221 | 0.0096934 | 0.0140109 | 0.0182558 | 0.0132174 | 0.0162480 | 0.6661568 | 0.2061624 | 0.0820840 |

Questions

  1. Inspect the data to see if you agree with the labels (CW: hateful content)

  2. Plot the distribution of the scores

  3. Estimate what % of total tweets are “toxic”
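One possible starting point for questions 2 and 3 is sketched below on toy data. The object toy_scored and its values are made up for illustration, and the 0.5 cut-off is an arbitrary assumption rather than a recommended threshold; part of the exercise is deciding what cut-off is defensible.

```r
library(dplyr)
library(ggplot2)

# Toy scores standing in for the TOXICITY column (made-up values)
toy_scored <- tibble(TOXICITY = c(0.02, 0.10, 0.45, 0.60, 0.91, NA))

# Histogram of scores; rows with missing scores are dropped first
score_plot <- toy_scored %>%
  filter(!is.na(TOXICITY)) %>%
  ggplot(aes(x = TOXICITY)) +
  geom_histogram(binwidth = 0.05) +
  labs(x = "TOXICITY score", y = "Number of tweets")

# Assumption: 0.5 is an arbitrary cut-off chosen for illustration
threshold <- 0.5

# Percentage of scored tweets at or above the cut-off
pct_toxic <- toy_scored %>%
  filter(!is.na(TOXICITY)) %>%
  summarise(pct = 100 * mean(TOXICITY >= threshold)) %>%
  pull(pct)
```

Printing score_plot draws the histogram. Changing threshold and re-running the last step shows how sensitive the estimated percentage is to the chosen cut-off.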