Old enough to know better/M/Boston/Married. I'm a Ph.D. in computer science and linguistics. I run on metal, wordplay, and snark. Most of the languages I speak are dead.
Almost like algorithms trained on biased data will exhibit bias, no?
Basically, the algorithm doesn’t suck. You suck.
I’ve changed my mind on this. You suck, but the algorithm sucks, too.
Think about it this way: take image classification, particularly facial recognition. All pixels have to be converted to numbers for the computer to use. Seems reasonable to convert color values to something like RGB, right? But the RGB for a white pixel is (255,255,255), while the RGB for a black pixel is (0,0,0). Lighter pixels get higher values than darker pixels. What’s a picture of a black face made of, mostly? Darker pixels, and hence lesser numerical value (take this instance of “lesser” purely in the numerical sense, not as a value judgment, though it does matter to the value judgment later on).
If your algorithm uses something like average pooling (basically, take the average color value of a small patch of the image, like a 3x3 square), those darker areas contribute less overall numerical value to the data that gets propagated through the algorithm.
Therefore, an algorithm that makes some typical default assumptions like “convert pixels to RGB” and “use average pooling” will take a darker-skinned face as input and see a bunch of vectors of lesser overall magnitude, which makes it hard to pick up the key differences that would let it place this instance into one region of a partitioned dataset. In other words, the image is harder to categorize because the algorithm may not think there’s much there: it’s seeing a lot of values close to 0, and 0 is, in many cases, basically the absence of information. So it may not be able to differentiate a black man from a black woman very easily. Or worse, it may not see a person there at all, and in that case, even if “not a person” isn’t one of its possible category labels, the output is going to be kind of random, because the algorithm as structured can’t make sense of the image it’s presented with.
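To make the arithmetic concrete, here’s a toy sketch in Python (the patch values are invented for illustration, not taken from any real pipeline or dataset):

```python
import numpy as np

# Toy illustration of the point above: the same 3x3 average pooling
# applied to a patch of mostly light pixels vs. mostly dark pixels.
light_patch = np.array([[230, 240, 235],
                        [228, 233, 241],
                        [236, 229, 238]])  # intensities near white (255)
dark_patch = np.array([[20, 31, 26],
                       [18, 24, 33],
                       [27, 22, 29]])      # intensities near black (0)

def avg_pool(patch):
    """Average pooling: collapse a patch to the mean of its values."""
    return patch.mean()

print(avg_pool(light_patch))  # ~234.4, a large value propagated downstream
print(avg_pool(dark_patch))   # ~25.6, close to 0, i.e., close to "no signal"
```

Same operation, same code path; the only difference is the magnitude of what comes out the other side.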
These assumptions and algorithmic choices don’t exist in a vacuum. “Numbers” themselves aren’t biased, of course, but the way you use them and put them together can cause systems to behave in a biased fashion.
I will be starting a natural language processing lab in a computer science department in the fall, but my approach to NLP has always been firmly grounded in linguistics, society and culture, and a strong interest in dialogue, narrative, and literature. I hope my lab can be notable in this regard and help keep the humanities “cool” in the eyes of the STEM cohorts.
I’ve been off Tumblr for over a year, and off most social media for most of that time. I feel like the break has been good for my mental health, and probably for my career.
Long story short, I was on the faculty job market last year, and the process was as grueling as they say. After dozens of applications, a few finalist interviews, and basically a wasted January and February, I wound up in April 2019 with nothing to show for it. I came close, but close doesn’t count when someone else gets the job.
Suffice it to say, I spent the latter two-thirds of 2019 in “Hello Darkness My Old Friend” Mode, that is, a fairly continuous low-key depression, racked with impostor syndrome and convinced that I could never go on the academic job market again. However, the prospect of working in AI for some giant corporation, building weapons systems, helping the Chinese government surveil its citizens, or improving some schlubby company’s ad click-through rate was even less appealing, so fortunately my PI was generous enough to extend my postdoc for another year while I sorted myself out.
In the end, I went back on the job market. It seemed twice as grueling this time, as I sent out 50% more applications, but got fewer responses, fewer interviews, and fewer finalist callbacks.
But maybe the intervening year gave me the practice I needed to talk about my research better, and it became clearer where I would really fit. In the end, it only takes one, and I’m finally able to announce that I’ll be moving to Fort Collins, Colorado in August to start as Assistant Professor of Computer Science at Colorado State University! The fact is, I really can’t see myself doing anything else but being a professor.
Fittingly, just as I seem to be getting my shit straight, the world appears to be falling apart, but at least I have a new website! Click here: https://www.nikhilkrishnaswamy.com
I’m always eager to hear from potential students and collaborators. Anyone out there interested in pursuing graduate study in computational linguistics, AI, or multimodality: I’m happy to talk.
I’ll keep this Tumblr open, though I’m not sure how active I’ll be. You can also follow me on…
STORIES THAT WERE TOLD BY PEOPLE SPEAKING LANGUAGES WE NO LONGER KNOW
STORIES TOLD BY PEOPLE LOST TO THE VOID OF TIME
STORIES
GUYS LOOK AT THIS
OH MY GOD YOU GUYS
GUYYYYYSSSS
“Here’s how it worked: Fairy tales are transmitted through language, and the shoots and branches of the Indo-European language tree are well-defined, so the scientists could trace a tale’s history back up the tree—and thus back in time. If both Slavic languages and Celtic languages had a version of Jack and the Beanstalk (and the analysis revealed they might), for example, chances are the story can be traced back to the “last common ancestor.” That would be the Proto-Western-Indo-Europeans from whom both lineages split at least 6800 years ago. The approach mirrors how an evolutionary biologist might conclude that two species came from a common ancestor if their genes both contain the same mutation not found in other modern animals.”
How do they control for stories that were borrowed, which almost certainly happened?
“Unlike genes, which are almost exclusively transmitted “vertically”—from parent to offspring—fairy tales can also spread horizontally when one culture intermingles with another. Accordingly, much of the authors’ study focuses on recognizing and removing tales that seem to have spread horizontally. When the pruning was done, the team was left with a total of 76 fairy tales.”
This article doesn’t say how, but I bet those methods are in the paper.
For this, they used a library of cultural traits for each culture a fairy tale occurred in, and then measured the likelihood that trait t occurs in culture c due to either phylogenetic proximity (inheritance) or spatial proximity (diffusion), using autologistic regression:
(Autologistic regression is basically logistic regression over binary variables that live on a graph: each node’s probability of being 1 depends not only on covariates, but also on the states of its neighbors, so connected nodes have dependencies on each other. In this case, the binary variables are the presence or absence of the cultural features.)
Cultural trait states are generated using Monte Carlo simulation, and phylogenetic and spatial influences are fitted as local dependencies between the nodes in the graph representing cultural traits. I can’t find this in the paper (though it may be mentioned in the citation for the method they used), but presumably, if the spatial influence exceeds the phylogenetic influence by a certain threshold, the trait is removed.
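For a sense of what that model looks like, here’s a minimal Python sketch of an autologistic model over binary traits. To be clear, this is my own toy version: the adjacency matrices, parameter names, and the Gibbs-style sampler are assumptions for illustration, not the authors’ actual code or fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n cultures, one binary trait per culture (1 = trait present).
# phylo_adj[i, j] = 1 if cultures i and j are close on the language tree;
# spatial_adj[i, j] = 1 if they are geographic neighbors. (Both invented here.)
n = 6
phylo_adj = np.triu(rng.integers(0, 2, (n, n)), 1)
phylo_adj = phylo_adj + phylo_adj.T
spatial_adj = np.triu(rng.integers(0, 2, (n, n)), 1)
spatial_adj = spatial_adj + spatial_adj.T

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(alpha, b_phylo, b_spatial, n_iters=1000):
    """Simulate trait states under an autologistic model, where
    logit P(x_i = 1 | rest) = alpha
                              + b_phylo * (# phylogenetic neighbors with trait)
                              + b_spatial * (# spatial neighbors with trait)."""
    x = rng.integers(0, 2, n)  # random initial trait assignment
    for _ in range(n_iters):
        for i in range(n):
            z = (alpha
                 + b_phylo * (phylo_adj[i] @ x)
                 + b_spatial * (spatial_adj[i] @ x))
            x[i] = rng.random() < sigmoid(z)
    return x

# Fitting runs in the other direction: estimate b_phylo and b_spatial from
# the observed trait matrix. If b_spatial dominates, the trait looks like
# diffusion (borrowing); if b_phylo dominates, it looks like inheritance.
print(gibbs_sample(alpha=-0.5, b_phylo=1.5, b_spatial=0.2))
```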