{"id":1973,"date":"2020-12-08T07:00:00","date_gmt":"2020-12-08T07:00:00","guid":{"rendered":"https:\/\/underunderstood.com\/podcast\/?post_type=episode&#038;p=1973"},"modified":"2020-12-09T20:24:15","modified_gmt":"2020-12-09T20:24:15","slug":"reuters-data-set-blah-blah-blah","status":"publish","type":"episode","link":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/","title":{"rendered":"The Case of the Blah Blah Blahs"},"content":{"rendered":"<div id=\"modal-ready\">\n<h3 class=\"episode-descrip wp-block-heading\">A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. Why?<\/h3>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h4 class=\"wp-block-heading\">Show Notes<\/h4>\n\n\n\n<ul>\n\n<li><a class=\"jump-point button underline\" href=\"#00:31\">00:31<\/a> &#8211; <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Reuters-21578+Text+Categorization+Collection\">The link Jess sent<\/a><\/li>\n\n<li><a class=\"jump-point button underline\" href=\"#08:31\">8:31<\/a> &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Standard_Generalized_Markup_Language\">SGML<\/a><\/li>\n\n<li><a class=\"jump-point button underline\" href=\"#08:46\">8:46<\/a> &#8211; <a href=\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blah_results.pdf\">This<\/a> is what the blahs look like and <a href=\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/reut2-000.pdf\">this<\/a> is what all the entries look like.<\/li>\n\n<li><a class=\"jump-point button underline\" href=\"#24:00\">24:00<\/a> &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/File_Transfer_Protocol\">FTP<\/a><\/li>\n\n<li><a class=\"jump-point button underline\" href=\"#24:34\">24:34<\/a> &#8211; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Linguistic_Data_Consortium\">Linguistic Data Consortium<\/a><\/li>\n<li><a class=\"jump-point button underline\" href=\"#29:00\">29:00<\/a> &#8211; <a 
href=\"https:\/\/trec.nist.gov\/data\/reuters\/reuters.html\">RCV1 at NIST<\/a> and David D. Lewis\u2019s <a href=\"http:\/\/www.daviddlewis.com\/resources\/testcollections\/reuters21578\/readme.txt\">README<\/a><\/li>\n\n\n<li><a class=\"jump-point button underline\" href=\"#30:22\">30:22<\/a> &#8211; <a href=\"https:\/\/www.aaai.org\/Papers\/IAAI\/1990\/IAAI90-006.pdf\">Construe-TIS: A System for Content-Based Indexing of a Database of News Stories<\/a> (Phil Hayes and Steven Weinstein)\n<\/li>\n\n<\/ul>\n\n\n<span style='display:inline;'><input type='hidden' bg_collapse_expand='6a19e7f2cb0fb6035175408' value='6a19e7f2cb0fb6035175408'><input type='hidden' id='bg-show-more-text-6a19e7f2cb0fb6035175408' value='Transcript'><input type='hidden' id='bg-show-less-text-6a19e7f2cb0fb6035175408' value='Transcript'><a id='bg-showmore-action-6a19e7f2cb0fb6035175408' class='bg-showmore-plg-link bg-arrow '  style=\" color:inherit;\" href='#'>Transcript<\/a><span id='bg-showmore-hidden-6a19e7f2cb0fb6035175408' ><b><\/b><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Hey everyone.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Hey Adrianne.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Hey.\u00a0<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Hey.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">We got an email from a listener and I called dibs on it, but I think everyone read it anyway.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Sorry!<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">\u00a0Does someone want to read this email?<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Here it is, a website contact form message from Jess.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">That&#8217;s the one.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Why does Thomson Reuters newswire say 
\u201cblah-blah-blah\u201d? Reuters-21578, with a link, is a dataset containing Reuters and newswire items, short businessy headlines and descriptions from 1987. It&#8217;s very popular for machine learning research because it&#8217;s extensive and well labeled. For some reason, some articles in the dataset have article bodies containing only the words, blah blah blah. How did this happen? Was it in the Reuters database? Or did the academics they worked with introduce it? Why blah blah blah instead of just leaving the article body blank?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I am very into this because I love machine learning training datasets. It&#8217;s like this essential distillation of humans and computers trying to communicate with each other and I just think it&#8217;s really lovely. So I wrote about this back at the Outline and I&#8217;m just going to read from this story, which is from 2017 so, I&#8217;m going to quote myself here.\u00a0<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Nice.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">\u201cAs machine learning research accelerates, scientists have started pooling their resources. ImageNet is a popular data set produced by researchers at Stanford and Princeton that contains 14 million images grouped by nouns in synonym sets such as \u201ckid, child,\u201d \u201cwoman, adult female,\u201d \u201coffice, business office.\u201d<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-weight: 400\">So ImageNet is one of these many publicly available data sets made by corporations and researchers and released for free online for others to use in training algorithms.\u00a0<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Training for what?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">To train machine learning algorithms. So this is like what would end up in an app like a face tuning app or language translation. 
Anything that involves using a lot of data to try to emulate some kind of more humanlike function with an algorithm.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">It&#8217;s like someone who says kid might actually mean child, or it might also mean child?\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Exactly.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Okay.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And also associate that with the image. So you&#8217;re just trying to teach a computer like basic s**t that people learn by the time they&#8217;re five.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Dumb computer.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yeah, what idiots. This can be done for basically any type of data, as long as you can get a whole lot of them and label it somehow consistently.<\/span><\/p>\n<p><span style=\"font-weight: 400\">So these datasets are all called corpuses and there are tons of them. The dataset that Jess emailed about is a relatively small one by today&#8217;s standards. It is a text corpus and it&#8217;s called Reuters-21578.\u00a0<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Catchy name.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yeah, I know. It&#8217;s 21,578 Reuters articles.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">So it&#8217;s also a very creative title.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yes, and these articles are labeled with topics. So those topics might be financial or economic, like mergers and acquisitions or interest rates, or they might be labeled with a proper noun, like a person or a country or region. 
And this data set is available for free online.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">How do you access this?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">The place that Jess was looking at was UCI, University of California Irvine, has a machine learning repository that has a bunch of datasets.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Okay.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">She said that she was downloading this dataset not for her job but for a personal project.\u00a0<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">I&#8217;m actually a designer. I work with data, but I&#8217;m a designer and I wanted a cool news dataset that I could use. I liked the retro quality. I started going into it in a little bit more detail and found all these amazing instances where the entire article body of certain things was just the phrase, \u201cblah blah blah\u201d and knowing that Reuters is very\u2026<\/span><\/p>\n<p><b>Adrianne:<\/b><span style=\"font-weight: 400\"> Straight-laced?<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">Straight-laced with lots of journalistic integrity. I couldn&#8217;t see that as being intentional and in any sense in the journalist news sense.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Jess is actually pretty qualified to say what a Reuters employee might do or not do because she happens to be a Reuters employee. 
However, she&#8217;s in a different department and she was very clear that she has unfortunately no ability to help us get this answer institutionally.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">What do you mean?\u00a0<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Interesting.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Put up a bulletin in the cafeteria.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Jess looked through the dataset and found 1,605 articles with blah blah blah in the body.<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">It always seems to be the full \u201cBlah blah blah\u201d. It&#8217;s \u201cBlah blah blah.\u201d with the first B being capitalized and a period at the very end, so it&#8217;s punctuated, it\u2019s not just&#8230;for what it&#8217;s worth, it is like a proper \u201cblah blah blah\u201d it&#8217;s a statement in and of itself.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Very intentional looking.<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">Yes. Yes.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">This is a pretty solid little dataset. It is in 7,600 papers on Google Scholar.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">And did the Google Scholar papers mention the Blah blah blahs?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yeah, a couple of them do. If you search Reuters-21578 and \u201cblah\u201d in Google Scholar, you get 35 results.<\/span><\/p>\n<p><span style=\"font-weight: 400\">One paper is talking about the limits of the data set and says \u201cThis collection is also disputed in reason of the famous blah blah blah.\u201d 
Another says, \u201cOf course we have omitted the body text having only blah blah blah like sentences\u201d; another paper refers to \u201cDubious documents containing just the words blah blah blah in the body\u201d. And then one paper speculates that blah blah blah was inserted deliberately by the dataset&#8217;s creators as \u201cnoise\u201d to \u201ctest the tolerance of classification algorithms\u201d.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">There&#8217;s no evidence for this theory or for any other theory at this point, but this thing, this quirk of this dataset is definitely out there. Anybody who looks closely at the data is aware of this phenomenon.<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">It\u2019s a mystery that\u2019s been bubbling in minds for 30 something years now.\u00a0<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">I would just like to note that there was an Iggy Pop album named Blah-Blah-Blah that came out in October, 1986 and was released on cassette in 1987.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">But what was the capitalization and punctuation?<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Uh, it&#8217;s actually different. They&#8217;re all capitalized and there&#8217;s dashes between them.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Not the same.\u00a0<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Yeah. 
It kills the theory.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Yeah, no.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">These articles were labeled in the data as type equals brief, suggesting that they were news alerts or headlines that were sent out for a developing story where the body was filled in later, or maybe it&#8217;s just the headline and that&#8217;s the whole thing.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">If it&#8217;s the whole thing or just the headline, why would they type in Blah blah blah?\u00a0<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Right. Wouldn&#8217;t it be the story developing\u2026<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Right, exactly.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I don&#8217;t know. Okay. I&#8217;m going to send you all a sample of the data. So this is what the type equals brief articles look like. The longer articles will also have a dateline with the date and location and an actual body. I just put this into Slack.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Do any of these tags even close? It&#8217;s weird. 
It looks a little bit like HTML, but it&#8217;s not HTML.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">This is SGML.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Oh, what&#8217;s that?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Standard Generalized Markup Language.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Oh.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">It&#8217;s a document markup language originally designed to enable the sharing of machine-readable large project documents.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">I guess if you&#8217;re listening to this on a podcast player where you can see the show notes, look for a link to this cause it is illustrative and I don&#8217;t really know how to convey this.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Can you explain what you&#8217;re looking at?<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">So we see this title tag. Inside the title tag, you see the title of an article, and then that title tag gets closed like an HTML tag would be closed. So it&#8217;s structured like HTML, where there are tags and inside those tags is content, but the Blah blah blah, sits outside of those tags. It&#8217;s not enclosed by anything. Yeah I see what you mean about this not being formatted like the rest of it is.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">It\u2019s in a weird spot.\u00a0<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">I don\u2019t know. I think it has to have served a purpose. Like, I really just want to know what the purpose was.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yeah, well, a bunch of people worked on this dataset. 
A lot of them are on LinkedIn, so I&#8217;m going to see how many of them I can track down.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">I think you can do it.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Thank you, John, for your vote of confidence.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Coming up,<\/span> <span style=\"font-weight: 400\">Adrianne does some research, asks some questions, blah, blah, blah.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I&#8217;m back.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Hey!<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">Welcome back!<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I&#8217;m back from field reporting on LinkedIn.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Wow. Did you have a hard time getting people to respond to you when you reached out with the subject line, Blah blah blah?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I did, actually, funny.\u00a0<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">I would imagine.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">It was \u201cpodcast query, Blah blah blah.\u201d<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Oh, my god.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I did get a surprising number of people responding.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">People who worked on this in the 80s?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">People who worked on this in the 80s, yeah. 
Well, let&#8217;s get into the dataset sausage making.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Okay.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">My favorite kind of sausage making.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">The 80s, you can put some 80s music in here, maybe.<\/span><\/p>\n<p><span style=\"font-weight: 400\">[John makes weird laser sounds]<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">A what?<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">I&#8217;m sorry, what?<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Those are the sounds of synthesizers.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Like a little kid using a phaser.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Reuters is sending out a huge volume of news. Subscribers are wanting to get updated on specific topics or specific regions or companies, so Reuters editors would manually add topics to each story as it came across their desk. And Reuters decided to automate this.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">In the 80s? Wow!<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And a few years later, this dataset pops up on the internet. It turns out it took a lot of people to make Reuters-21578 happen and in the case of the Blah blah blahs, I have basically three different groups of suspects. 
So first up is Reuters, someone there could have put Blah blah blah into their actual feed of stories, maybe in some kind of invisible way on the backend.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Then there was another group that came in later to clean up the data and publish it for academic use and it could have been them.\u00a0<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-weight: 400\">But before it got to the internet, Reuters-21578 was in the hands of the Carnegie Group, an AI startup that Reuters contracted with to build this system of news article classification.\u00a0<\/span><\/p>\n<p><b>Monica: <\/b><span style=\"font-weight: 400\">At the time that this data set was collected, I was a programmer, I was fresh out of school, just a few years. This was actually my first company so I didn&#8217;t really have a sense of the ways of the world or anything like that.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">That&#8217;s Monica Cellio. She was a programmer at Carnegie Group and she worked on the system that relied on this dataset and the system was called Construe.<\/span><\/p>\n<p><b>Monica: <\/b><span style=\"font-weight: 400\">I actually saw these rooms where they had rooms full of people whose job was to receive a story from a wire and in just a few seconds, scan it, attach tags to it and send it back out. So this was what we were trying to automate, at least the 90% that could be automated.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">We were working on that, which meant we needed a pile of data to work with. And we needed to consult, we needed access to their experts, how do you make decisions about how you categorize this stuff? 
We actually had one of their categorizers working onsite with us as we were developing the rules that the software would use and figuring out the edge cases.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Unfortunately, Monica did not remember anything about the Blah blah blah\u2019s.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">The person who alerted me to this sent me some XML showing what these Blah blah blah records look like, and they look like a mistake.<\/span><\/p>\n<p><b>Monica: <\/b><span style=\"font-weight: 400\">They do, they are in the wrong place. So we&#8217;ve got a text block that contains a title block and the Blah blah blah shows up after the title, I guess. Let me find one that doesn&#8217;t have the blah blah blah. So, yeah, the ones that don&#8217;t have Blah blah blah after the title, you have a dateline block, and then a body block and the ones that have Blah blah blah are missing the dateline and the body and they just say Blah blah blah instead, which is weird.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So Monica didn&#8217;t remember anything, but at least she was excited about this mystery.<\/span><\/p>\n<p><b>Monica: <\/b><span style=\"font-weight: 400\">If you publish something, please, please let me know, send me a link and good luck! 
And if you get the answer to the blah blah blah I want to know now.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">She suggested I talk to one of the linguists who worked on the Construe project, so I called Peggy Andersen.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">I had to actually go look up what Reuters 2578 whatever that was cause I wasn&#8217;t really aware of what happened to it after I finished working on it.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">It&#8217;d be weird if she remembered that specific one, you know,<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">I mean, it was infamous.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">The whole mission of the company was to apply artificial intelligence such as it was back then in the 80s to commercial problems. So, Reuters contracted with us to automate tagging their news stories so that their users could find the story that was of interest to them. The problem was that reporters don&#8217;t always use consistent language, so we had to discover the language that they used and their natural reporting.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Today, you would do this with statistics. Today, you would have your program look at all of this data and then find patterns in it on its own, right? 
But back then they didn&#8217;t have the computing power to do it that way so they did what was called knowledge based categorization.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">What&#8217;s that mean?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">They were writing rules.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">So they have to come up with these rules on their own and then add them.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">We actually had humans, me and other linguists on the project studying the words that were used and creating rules that they would say. So grain, you know, grains are also traded and we took different grains, you could say grain but if you allowed every single story that had the word grain in it, you&#8217;d get some things that were not about grains that are traded, that could be whole grain alcohol, the fine grain of wood or something like that.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Did you realize that this data would be publicly released?<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">No. No. Our goal wasn&#8217;t to create data, it was to put tags in real time on news stories so that Reuters readers could find what they were looking for.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Some of the records have just a title and then no dateline, no body and it just says \u201cBlah blah blah.\u201d like capitalized \u201cBlah\u201d lower case \u201cblah\u201d lower case \u201cblah\u201c, period.\u00a0<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">I can&#8217;t explain that.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Do you remember seeing that?<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">No. I worked with software engineers. They&#8217;re a special breed of people. 
The people at Carnegie Group were really some of the smartest people I&#8217;ve ever known; most of them graduated from Carnegie Mellon. But also playful and they did some crazy things. It could have been introduced then, or it could have been introduced later on by these people who manage the Corpus once it was released for public use. I really don&#8217;t know.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Do you think it&#8217;s possible that it was in the original data from Reuters?<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">I don&#8217;t know. I really don&#8217;t know. That seems unlikely to me.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I explained that<\/span> <span style=\"font-weight: 400\">we had a listener who had requested this information.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">I mean, why does this person care at this point?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I think they were curious because they work at Reuters and they were like, Reuters would never put this in any of its own stuff because Reuters is so, you know, grown up.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">Yes, exactly. It probably was not initiated by Reuters.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Peggy\u2019s only theory was that it might&#8217;ve been an issue where the test program could not accept stories that didn&#8217;t have bodies.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">If you&#8217;d find out, I&#8217;d love to know the answer.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Okay. 
See, now you&#8217;re curious too.<\/span><\/p>\n<p><b>Peggy: <\/b><span style=\"font-weight: 400\">Yeah.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So Monica and Peggy both told me that they did not realize that this dataset was going to be published. Which makes sense because Reuters built it for a competitive advantage, to sell a product to customers. So I started to wonder how this even got into the world.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Dave Lewis is credited as the source of the data in the UCI repository so I figured I should talk to him.\u00a0<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">So, I was a graduate student in computer science, at University of Massachusetts working with Bruce Croft and Bruce called me into his office one day and said \u201cLook at this\u201d. And it was a\u00a0 newsletter from a company called Carnegie Group, which was an AI startup back during the second AI bubble.<\/span><\/p>\n<p><span style=\"font-weight: 400\">They had a graph on the front page of this newsletter which was purportedly comparing an expert system they&#8217;d built with a statistical text retrieval system, which is what Bruce and I worked on. We were pretty upset about this because it was comparing apples and oranges.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So Dave is saying that Construe which is Carnegie&#8217;s system was being compared with something that his group had done in the past and Carnegie Group was basically bragging about how well Construe had performed versus these other methods. So Dave and his advisor thought this wasn&#8217;t a fair comparison.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">There was a lot of debate going on between whether one should use knowledge-based systems for information retrieval or statistical machine learning for information retrieval. 
And Bruce and I were mostly on the statistical and machine learning side, but we both dabbled in the other.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Anyway, you know, we thought this was kind of unfair and it was in the moment it was a kind of debate that was going on.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Dave&#8217;s advisor reached out to the Carnegie Group and got in touch with this guy, Phil Hayes and Phil Hayes was extremely chill. He said, \u201cWhy don&#8217;t I give you this dataset? And you can work on it and do experiments using your different methods\u201d.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">And he sent it to us and so I ended up using that in my dissertation. It was actually the central data set that I used in doing experimentation on machine learning and natural language processing or text categorization.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">How did this dataset become public?<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">Well, yeah, that was sort of accidental. I would say I and many computer scientists were probably a little more careless around IP issues back in those days and intellectual property issues.<\/span><\/p>\n<p><span style=\"font-weight: 400\">I don&#8217;t think there was ever any formal document between Carnegie Group and UMass so I kind of carried the data set along with me. 
I did a research faculty position at the University of Chicago and then I was at Bell Labs, I was collaborating with a bunch of people and I just had the data up on an open FTP site.<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">Oh my god.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So Dave had the data up on an open FTP site, which is just a way to easily send large files and it didn&#8217;t even occur to him to put a password on it.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Oh, wow! So anybody on the internet could access this?<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">It was the Wild Wild West back then.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">People traded around FTP sites in those days, with datasets pretty casually. And I had talked to Carnegie Group and Carnegie Group was looking into releasing it. We talked about whether we&#8217;ll do some sort of public announcement or maybe put it at the linguistic data consortium, which was just starting up back then. But what happened basically, was it just sort of diffused out there and started showing up in other papers.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Dave wanted to be very clear that he would never do this today.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">I will say that it is sort of funny because over my career, I ended up later in life, working a lot with lawyers and doing expert witness work and building legal software and things and I&#8217;ve become much more fussy about intellectual property issues. 
I work for a cybersecurity company now, too, so I should say that I&#8217;m now very, very fussy about these things, if anybody&#8217;s listening here.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">You would put a password on it today.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">I would put a password, yeah right. Well today there&#8217;d be like a 17 page legal agreement that&#8217;s been signed off by general counsels and things.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">It&#8217;s the 80s, you know?<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Well, it also fits with the culture of open source software and people developing this stuff like wanting to be able to share things and have them work across different companies when they move or with other people they&#8217;re collaborating with. So it makes sense, it just seems like they didn&#8217;t have any formalized way to either say like, \u201cOh, one company owns this or yes, this is under an open license, Wild Wild West.\u201d<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Dave actually ended up collaborating with Reuters.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">Reuters began to notice that these computer scientists were all using this weird thing they were calling the Reuters dataset, which nobody at Reuters seemed to know where it came from or how.<\/span><\/p>\n<p><span style=\"font-weight: 400\">So anyway, they decided that, you know, to their credit, they wanted a good, if there was going to be a Reuters data set out there, they wanted, and they also saw themselves as benefiting from people working with their data. They decided they would put a good dataset out there.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Reuters published a new dataset called RCV1. This data set was much larger. 
It had about 800,000 documents and it&#8217;s held at the National Institute of Standards and Technology. But for this dataset, you can&#8217;t just download it. You have to submit a request to NIST, and then you have to agree to a bunch of terms and conditions for how you&#8217;re going to use the data.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Dave was able to negotiate with Reuters to release a version of the data, but it wasn&#8217;t quite like Reuters-21578. The public dataset did not include the article text, like in the way it would have appeared originally.<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">I couldn\u2019t release just the actual documents that somebody could read as if they were news. But Reuters was okay with releasing the set of words that occurred in the documents. If they&#8217;d been scrambled in order, which is fine for many machines. It&#8217;s not good for natural language processing, but it&#8217;s fine for many machine learning tasks.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">I don&#8217;t really understand the difference between those two things. Like why would it be okay for one and not for the other?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I think it&#8217;s not good for natural language processing because it&#8217;s not in the natural language, right? Like if you&#8217;re trying to teach a computer what a normal sentence sounds like, and maybe generate its own sentences, that sound normal. It\u2019s not going to be helpful for it to look at a word list. 
But if you were trying to teach a computer, just like what words are associated with other words in these articles, then it would be fine if they&#8217;re out of order.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Wait, the thing I&#8217;m still confused about is, so if they did this thing where they&#8217;re like, \u201cOkay, yes, we can make this publicly available, but you have to scramble things\u201d then why is the original one still available?<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Well, it\u2019s still useful and also they have no control over it at this point. I mean, they could go around DMCA-ing everybody.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">But the original one isn&#8217;t officially available from them? It&#8217;s just in all of these other places now.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Yeah.<\/span> <span style=\"font-weight: 400\">I asked Dave about the Blah blah blah\u2019s in Reuters-21578, big surprise, he didn&#8217;t know why they were there, but he was able to establish that he did not put them in. They were in the data before it got to him.<\/span><\/p>\n<p><span style=\"font-weight: 400\">And so why didn&#8217;t you take out the Blah blah blah\u2019s at this point?<\/span><\/p>\n<p><b>Dave: <\/b><span style=\"font-weight: 400\">We did our best not to mess with the raw data with the text, even when it seemed like there were errors or weird things in there because we viewed ourselves as cleaning up the formatting and the metadata not changing the original data. So, we weren&#8217;t sure where that came from my guess was it was kind of a filler.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So at this point, nobody from Carnegie Group remembers this. Dave remembers it being in there and says he wasn&#8217;t the one who added it. 
So, I felt like the next person I needed to talk to would be somebody from Reuters.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">It\u2019s 1986. What are you doing?<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">Well in 1986, I was working at Reuters and we were doing some experimental work in Artificial Intelligence.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">This is<\/span> <span style=\"font-weight: 400\">Steven Weinstein, he worked at Reuters on the editorial side and ended up working on this project which at that time was called the Construe Topic Identification System.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">How cutting edge was this at the time?<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">That had never been done before.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">So pretty cutting edge?<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">I would say bleeding cutting edge, yes.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Steven told me he gets messages every so often about this dataset. And he even gets messages about the Blah blah blah thing.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">It&#8217;s been coming up every handful of years for decades now, so it&#8217;s fun to think that these things live on in perpetuity.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">He told me that when people ask him about the Blah blah blah thing, he usually ignores them, but I was so persistent that he called Peggy Andersen, who I spoke with before to try to dislodge this from their collective memory.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">We think that the answer is that the way the system worked is it evaluated the text of the story, not the headline. 
And one of the things that we did back in that time in order to get news quickly out onto the Newswire was sometimes we would publish the headline first, we&#8217;d call it a flash or a bulletin, and we just put the headline out so there was a headline with no body to the text.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">That would mess up the system when the system went to try to evaluate what was going on in the body of the text in order to categorize it or later on, do some other things with it. Since the system couldn&#8217;t process no data and come up with a reliable response, we believe, Peggy and I agree on this, that a little program was written that for items that didn&#8217;t have any text, we would just have the dataset be updated to include Blah blah blah just as something there that we could key off of if we needed to identify those stories. So we think that&#8217;s Blah blah blah.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I see. So you don&#8217;t actually remember doing this, but you think after talking it over with Peggy that that&#8217;s what happened?<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">Well, Peggy and I had the same recollection about it. I don&#8217;t think we constructed it together. I think we both had the same memory of that&#8217;s how that came about. And it was in the dataset that we were using for testing. It was never in the feed that Reuters put out or the data that went into the database.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Steven&#8217;s story is that this was a hack, a temporary workaround. Reuters was sending out stories, separated by these control characters that would indicate where stories started and stopped. And these were things like ampersand, pound, two, semicolon. 
The other markup, the SGML, was added by the Carnegie Group.<\/span><\/p>\n<p><span style=\"font-weight: 400\">At first Steven&#8217;s group thought that the body of an article would be more important than the headline for categorization and that later changed. But the way he remembers it during this one period, the system was looking for a headline, skipping the headline, and then attempting to process whatever text came immediately after that closed title tag. And so stories that had nothing there after the title tag ended would cause the code to break.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">So the system could look at Blah blah blah and say, \u201cokay, there&#8217;s some text there that I&#8217;m deciding not to evaluate rather than no text and breaking.\u201d<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And then, so that was done while you were testing, but at some point you would have to like that couldn&#8217;t go into the final product because your system would still have to deal with these bodiless headlines?<\/span><\/p>\n<p><b>Steven:<\/b><span style=\"font-weight: 400\"> That\u2019s right. 
We changed the coding of the system to recognize what was the headline and what was the body of a story.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And why Blah blah blah?\u00a0<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">I think we used Blah blah blah because there was no chance that would ever be in a news story and it was something that we could catch like XXX or some string of characters that if we needed to swap it out or pull it out of the dataset, it was a unique and distinct set of characters that wouldn&#8217;t affect anything else that was in the dataset.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">But it&#8217;s also feasible that Reuters would quote somebody saying, \u201cBlah blah blah.\u201d<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I asked Steven about that and he said there were different phases of the project where they looked at other types of stories but at this point they were just looking at financial stories and this phrase would never occur in a finance story. Reuters had really strict rules for those reporters like you couldn&#8217;t even say that stocks had \u201cplummeted.\u201d That was too editorialized.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">Reuters didn&#8217;t actually ever publish anything that said \u201cBlah blah blah.\u201d I think reporters would have gotten fired if they said that.\u00a0<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And they couldn&#8217;t do a phrase like \u201cNo body text found\u201d because that could potentially trigger some of the rules that they were writing for categorization.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">\u201cNo\u201d is a word that causes a lot of things to happen. 
And when \u201cno\u201d is paired up with other things, it certainly could create something going in the wrong direction.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">Yeah. So you&#8217;d think if you wanted it to be unique, it would be like a random string of letters, it would be like, QWERTY, ASDF or something.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Steven said they also didn&#8217;t want to use anything that could potentially break something else. Like it&#8217;s possible that could have confused the system, which was relying on standard language dictionaries to parse the text.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">I wanted to be very careful about not causing a problem, not creating a problem by trying to work around another problem.<\/span><\/p>\n<p><b>Billy: <\/b><span style=\"font-weight: 400\">I think the thing also is now regardless of why they put it in there, it&#8217;s out there. It&#8217;s already been spread to all of these places as this kind of freely shared dataset. So it&#8217;s just like, it&#8217;s just in the mix now.<\/span><\/p>\n<p><b>Steven: <\/b><span style=\"font-weight: 400\">I would never think that it\u2019s a dataset for 35 years and we&#8217;d still be talking about it now.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">I decided it was time to call Jess.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">One problem I ran into on this story is that every time I contacted someone, they were like, \u201cWhy do you care about this?\u201d<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">I mean, it was in 1987. I feel like they&#8217;ve moved on with their lives. 
Not that that&#8217;s necessarily right, but I haven&#8217;t, so yeah, let&#8217;s hear it.<\/span><\/p>\n<p><b>Adrianne:<\/b><span style=\"font-weight: 400\"> I explained<\/span> <span style=\"font-weight: 400\">why the data set was created, who made it, how the Reuters feed relied on special control characters to separate stories and then SGML was added later.<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">I do love how 80s this whole story is. This is fantastic.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">And I told her about the Blah blah blah\u2019s. How they were a temporary fix that managed to stick around 30 years later.<\/span><\/p>\n<p><b>Jess: <\/b><span style=\"font-weight: 400\">Honestly, if it was like a movie, it would be a very boring movie, but in real life, it&#8217;s a quite exciting little venture of natural language processing development. That&#8217;s awesome! That&#8217;s more than I could have hoped for of the Blah blah blah\u2019s.\u00a0<\/span><\/p>\n<p><b>John: <\/b><span style=\"font-weight: 400\">That&#8217;s our show. Underunderstood is Adrianne Jeffries, Regina Dellea, Billy Disney and me, John Lagomarsino. We&#8217;ll be back with another episode next week. Until then you can follow us on all the social media except for TikTok.<\/span><\/p>\n<p><b>Regina: <\/b><span style=\"font-weight: 400\">We definitely will have a TikTok soon.<\/span><\/p>\n<p><b>Adrianne: <\/b><span style=\"font-weight: 400\">Or consider joining us over on Patreon and you&#8217;ll get a bonus episode on Thursday. You&#8217;ll also be helping us pay for stuff like editing software and music. This episode came to us from a listener. Thank you, Jess. If you have a burning question that the internet can&#8217;t answer, drop us a line at hello@underunderstood.com. 
Maybe we can find the answer.<\/span><\/p>\n<p><b>Billy:<\/b><span style=\"font-weight: 400\"> Thanks for listening.\u00a0<\/span><\/p>\n<p><!-- \/wp:post-content --><\/p>\n<p><\/span><\/span>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p><\/div>","protected":false},"excerpt":{"rendered":"<p>A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. Why?<\/p>\n","protected":false},"author":4,"featured_media":1987,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"categories":[51],"tags":[],"class_list":["post-1973","episode","type-episode","status-publish","has-post-thumbnail","hentry","category-season-3"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>The Case of the Blah Blah Blahs - Underunderstood<\/title>\n<meta name=\"description\" content=\"A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. Why?\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"28 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/\",\"url\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/\",\"name\":\"The Case of the Blah Blah Blahs - Underunderstood\",\"isPartOf\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg\",\"datePublished\":\"2020-12-08T07:00:00+00:00\",\"dateModified\":\"2020-12-09T20:24:15+00:00\",\"description\":\"A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. 
Why?\",\"breadcrumb\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage\",\"url\":\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg\",\"contentUrl\":\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg\",\"width\":1920,\"height\":1280},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/underunderstood.com\/podcast\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Episodes\",\"item\":\"https:\/\/underunderstood.com\/podcast\/episode\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"The Case of the Blah Blah Blahs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/#website\",\"url\":\"https:\/\/underunderstood.com\/podcast\/\",\"name\":\"Underunderstood\",\"description\":\"The internet doesn&#039;t have all the answers, but that doesn&#039;t mean we can&#039;t find 
them.\",\"publisher\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/underunderstood.com\/podcast\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/#organization\",\"name\":\"Underunderstood\",\"url\":\"https:\/\/underunderstood.com\/podcast\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/underunderstood.com\/podcast\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2019\/12\/uus.png\",\"contentUrl\":\"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2019\/12\/uus.png\",\"width\":1000,\"height\":1000,\"caption\":\"Underunderstood\"},\"image\":{\"@id\":\"https:\/\/underunderstood.com\/podcast\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/facebook.com\/underunderstood\",\"https:\/\/x.com\/underunderstood\",\"https:\/\/www.instagram.com\/underunderstood\/\",\"https:\/\/www.linkedin.com\/company\/underunderstood\",\"https:\/\/www.youtube.com\/channel\/UC0wlzPoBtEDrZ0meIzVMSag\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"The Case of the Blah Blah Blahs - Underunderstood","description":"A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. Why?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/","twitter_misc":{"Est. 
reading time":"28 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/","url":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/","name":"The Case of the Blah Blah Blahs - Underunderstood","isPartOf":{"@id":"https:\/\/underunderstood.com\/podcast\/#website"},"primaryImageOfPage":{"@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage"},"image":{"@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage"},"thumbnailUrl":"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg","datePublished":"2020-12-08T07:00:00+00:00","dateModified":"2020-12-09T20:24:15+00:00","description":"A famous Reuters dataset from the 1980s includes \u201cBlah blah blah.\u201d in place of some stories. Why?","breadcrumb":{"@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#primaryimage","url":"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg","contentUrl":"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2020\/12\/blahblahblah2.jpg","width":1920,"height":1280},{"@type":"BreadcrumbList","@id":"https:\/\/underunderstood.com\/podcast\/episode\/reuters-data-set-blah-blah-blah\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/underunderstood.com\/podcast\/"},{"@type":"ListItem","position":2,"name":"Episodes","item":"https:\/\/underunderstood.com\/podcast\/episode\/"},{"@type":"ListItem"
,"position":3,"name":"The Case of the Blah Blah Blahs"}]},{"@type":"WebSite","@id":"https:\/\/underunderstood.com\/podcast\/#website","url":"https:\/\/underunderstood.com\/podcast\/","name":"Underunderstood","description":"The internet doesn&#039;t have all the answers, but that doesn&#039;t mean we can&#039;t find them.","publisher":{"@id":"https:\/\/underunderstood.com\/podcast\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/underunderstood.com\/podcast\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/underunderstood.com\/podcast\/#organization","name":"Underunderstood","url":"https:\/\/underunderstood.com\/podcast\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/underunderstood.com\/podcast\/#\/schema\/logo\/image\/","url":"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2019\/12\/uus.png","contentUrl":"https:\/\/underunderstood.com\/podcast\/wp-content\/uploads\/2019\/12\/uus.png","width":1000,"height":1000,"caption":"Underunderstood"},"image":{"@id":"https:\/\/underunderstood.com\/podcast\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/facebook.com\/underunderstood","https:\/\/x.com\/underunderstood","https:\/\/www.instagram.com\/underunderstood\/","https:\/\/www.linkedin.com\/company\/underunderstood","https:\/\/www.youtube.com\/channel\/UC0wlzPoBtEDrZ0meIzVMSag"]}]}},"_links":{"self":[{"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/episode\/1973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/episode"}],"about":[{"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/types\/episode"}],"author":[{"embeddable":true,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":tru
e,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/comments?post=1973"}],"version-history":[{"count":10,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/episode\/1973\/revisions"}],"predecessor-version":[{"id":1995,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/episode\/1973\/revisions\/1995"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/media\/1987"}],"wp:attachment":[{"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/media?parent=1973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/categories?post=1973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/underunderstood.com\/podcast\/wp-json\/wp\/v2\/tags?post=1973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}