Brian Lu Nguyen A12940672
[email protected]
1. The Exhentai.org Dataset
The dataset being studied is 100,000 of the most recent doujinshi uploaded to the website exhentai.org as of June 10th, 2017. Doujinshi are Japanese self-published works, often taking the form of fanfiction comics, although original works do exist. Exhentai.org is a website dedicated to archiving doujinshi, and requires an account to access, as the vast majority of these works are 18+. To be more frank, this is a dataset of Japanese pornographic manga (comics). This dataset includes only the original Japanese-language doujinshi manga and does not contain alternate-language translations. While exhentai.org does host doujin translations in a multitude of languages (English, Chinese, Russian, Spanish, etc.), I decided to thin this dataset to just the Japanese entries. Keeping translations would mean a translated work could be double or triple counted (or more) based on the number of translations that exist, which would skew the results of any data analysis.
This dataset was scraped by first running spiders that crawled through every page of Japanese-only doujinshi (each page containing 25 gallery listings) and having them record the URLs of each of the galleries on every page. Once the URLs were compiled, I ran a program that extracted the metadata for each of the galleries using exhentai’s dedicated API. Each gallery has an associated ID and gallery token, which can be used in a JSON request that retrieves the gallery’s metadata. This process is incredibly time intensive (taking several hours), and despite following the site’s guidelines on load limiting, “25 entries per request, 4-5 sequential requests… before having to wait for ~5 seconds” (https://ehwiki.org/wiki/API), I was IP banned several times and needed to use multiple proxies to get around the bans. With all of the metadata eventually compiled into a series of JSON files, I could then begin my analysis.
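The load-limiting guideline quoted above (25 entries per request, a pause after every few sequential requests) can be sketched as a small batching helper. This is a minimal sketch, not the scraper actually used: `fetch_batch` is a hypothetical callable standing in for whatever performs the JSON request, and the burst size of 4 and the 5-second pause are assumptions taken from the quoted guideline.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple

Gallery = Tuple[int, str]  # (gallery ID, gallery token)


def batched(items: Sequence[Gallery], size: int = 25) -> Iterator[List[Gallery]]:
    """Yield consecutive batches of at most `size` galleries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_all(
    galleries: Sequence[Gallery],
    fetch_batch: Callable[[List[Gallery]], list],  # hypothetical request function
    burst: int = 4,               # assumption: sequential requests before pausing
    pause_seconds: float = 5.0,   # assumption: length of the pause
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call `fetch_batch` once per batch, pausing after every `burst` requests."""
    results = []
    for i, batch in enumerate(batched(galleries)):
        if i > 0 and i % burst == 0:
            sleep(pause_seconds)
        results.extend(fetch_batch(batch))
    return results
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting.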
Table 1: Basic Statistics of the Dataset (Exhentai.org Dataset Statistics)
Size of the Dataset: 100,000 galleries
Average Rating: 4.280 / 5.000
Average Page Count: 62 pages
Average File Size: 54 megabytes
Oldest Doujin Date: January 9th, 2010
Most Recent Doujin Date: June 10th, 2017
Figure 1a: Example Metadata Information
9%
tankoubon
9%
female: futanari
9%
male: males only
9%
female: double penetration
8%
female: nakadashi
8%
male: sole male
8%
An oddity visible in Table 2 is the disparity between male: yaoi (male gay content) and female: yuri (lesbian content). The latter tag doesn’t register among the top 20; it sits 28th on the list at 6.2%, about 7 percentage points below yaoi. This suggests that yaoi as a genre is much less niche than yuri and has wider appeal. This makes sense in context, considering that a large number of female consumers of doujinshi gravitate to yaoi content. This is so widespread that the term “fujoshi”, a self-deprecating term for female fans of “boys love”, is common vocabulary in Western anime and manga communities.
Figure 1b: Equivalent on exhentai.org
Table 2: Top 20 Most Popular Tags Tag Descriptor
female: ahegao
Frequency
female: big breasts
30%
female: lolicon
25%
group
21%
female: stockings
19%
female: schoolgirl uniform
17%
female: anal
14%
male: shotacon
13%
male: yaoi
13%
female: glasses
11%
female: bondage
11%
full color
10%
female: rape
10%
female: sole female
10%
Another interesting note from this table is the presence of “tankoubon” at 9% frequency within the dataset. Tankoubon are paperback volumes that often act as an omnibus for multiple different artists to have their works published in. While outside the scope of my intended model, one could build a network of artist communities from the collaborations within such anthologies. The more pressing matter for my model is that the presence of tankoubon to this degree means that file sizes and page counts could be skewed by the large size of these books. By calculating the average size of an individual page within each gallery, I can avoid this potential pitfall and rely on less biased data. For more explanation of some of the terms in this table, a full list of gallery tags can be found here: https://ehwiki.org/wiki/Category:Tag
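The per-page normalization described above can be sketched in a few lines. This is a minimal sketch assuming the metadata exposes the 'filecount' and 'filesize' properties, with 'filesize' in bytes; treating a megabyte as 1024 * 1024 bytes and returning 0.0 for an empty gallery are my own assumptions, and the two sample galleries are made up for illustration.

```python
BYTES_PER_MEGABYTE = 1024 * 1024  # assumption: binary megabytes


def filesize_per_image_mb(gallery: dict) -> float:
    """Average size of one page in megabytes (total file size / page count)."""
    filecount = int(gallery["filecount"])  # int() also accepts a numeric string
    filesize = int(gallery["filesize"])    # assumed to be in bytes
    if filecount <= 0:
        return 0.0  # assumption: treat an empty gallery as size 0
    return filesize / filecount / BYTES_PER_MEGABYTE


# Made-up examples: a short doujin and a long anthology with identical page quality.
short_doujin = {"filecount": "24", "filesize": 25165824}
anthology = {"filecount": "200", "filesize": 209715200}
print(filesize_per_image_mb(short_doujin), filesize_per_image_mb(anthology))
```

Both sample galleries come out at the same per-page size even though the anthology is far larger overall, which is the point of dividing by the page count.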
Table 3: Top 20 Most Popular Series Tag Descriptor
Frequency
Touhou Project
7%
Kantai Collection
5%
Idolm@ster
3%
Mahou Shoujo Lyrical Nanoha
1%
Neon Genesis Evangelion
1%
Love Live
1%
Sailor Moon
mean that the art for the doujins is cleaner and that the scanning techniques used represent the doujins better. Older series with older doujins are prone to poor-quality scans of the original pages, or simply poor-quality art to begin with.
Table 4: Top 20 Most Popular Artists
Tag Descriptor
Frequency
Itaba Hiroshi
.36%
.7%
Inochi Wazuka
.24%
Granblue Fantasy
.7%
Crimson
.23%
Free
.7%
Natsuka Q-ya
.20%
To Love-Ru
.6%
Nekogen
.19%
Pokemon
.6%
Ueda Yuu
.18%
Puella Magi Madoka Magica
.57%
Uchi-Uchi Keyaki
.18%
K-On
.55%
Erect Sawaru
.17%
Shingeki no Kyojin
.54%
Saigado
.17%
Touken Ranbu
.5%
Nozarishi Satoru
.17%
Ore no Imouto ga Konna ni Kawaii Wake ga Nai Street Fighter
.49%
Marui Maru
.17%
Manabe Jouji
.16%
.47%
Kawamori Misaki
.16%
Fate/stay Night
.47%
Ken
.15%
Sword Art Online
.46%
Yanagawa Rio
.15%
Kuroko no Basuke
.45%
Koutarou
.15%
Equal
.15%
Zen9
.15%
Takasugi Kou
.15%
Nagiyama
.15%
There is an interesting thing to note here with regard to popularity. Each of these series came out at a different time, so while some have had time to solidify their position in the doujin market, others simply became wildly popular very quickly. Examples of this are Kantai Collection and Granblue Fantasy, both mobile games with a large roster of attractive characters (all female in Kantai Collection, mixed in Granblue Fantasy). The two games came out fairly recently, “Kancolle” in 2013 and Granblue in 2014. Compared to long-running, popular series like Sailor Moon, Pokemon, Neon Genesis Evangelion, or Street Fighter, their frequency in exhentai’s database is significant. Coming out more recently might also
There’s not much to say about artists here, other than the fact that Itaba Hiroshi is ahead by quite a margin. This might mean their works are easier to find and upload to the site, that they produced a large number of works between 2010 and 2017, or that they contributed to multiple tankoubon.
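Frequency tables like Tables 2 through 4 can be produced by counting how many galleries carry each tag and filtering by tag namespace. The following is a minimal sketch, assuming each gallery's metadata has a 'tags' list of strings in the "namespace:value" form (as in “parody:touhou project”); the sample galleries and the artist names in them are made up.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def tag_frequencies(galleries: List[dict]) -> Dict[str, float]:
    """Fraction of galleries in which each tag appears at least once."""
    counts: Counter = Counter()
    for gallery in galleries:
        counts.update(set(gallery["tags"]))  # set(): count a tag once per gallery
    total = len(galleries)
    return {tag: n / total for tag, n in counts.items()}


def top_tags(
    galleries: List[dict], namespace: Optional[str] = None, k: int = 20
) -> List[Tuple[str, float]]:
    """Top-k tags by frequency, optionally restricted to one namespace."""
    freqs = tag_frequencies(galleries)
    if namespace is not None:
        prefix = namespace + ":"
        freqs = {t: f for t, f in freqs.items() if t.startswith(prefix)}
    # Sort by descending frequency, breaking ties alphabetically.
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up sample data for illustration only.
sample = [
    {"tags": ["parody:touhou project", "artist:example a", "full color"]},
    {"tags": ["parody:touhou project", "artist:example b"]},
    {"tags": ["parody:kantai collection", "artist:example a", "full color"]},
    {"tags": ["tankoubon"]},
]
print(top_tags(sample, namespace="parody", k=2))
```

Calling `top_tags(galleries, namespace="artist")` would give the equivalent of the artist table, and omitting the namespace gives the overall tag ranking.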
2. Predictive Task
With sexuality and personal taste differing from person to person, I’d like to find out if there are general factors that make a doujinshi highly rated. These could range from quality-related features such as the image size of each scanned page (e.g. HD vs. standard-definition porn), to content-specific features like a gallery’s tags (i.e. the fetishes the work plays upon), to highly subjective features like the popularity of the series the doujinshi is based on. Essentially, my predictive task is to predict the rating of a doujinshi gallery based on its metadata. This task can be performed using an SVM classifier, as was applied to predicting a beer’s ABV in homework 1 and improved upon in homework 2. Apart from simply testing my model’s performance against the test set, I will be evaluating my model based on its test accuracy compared with other models, each varying in complexity. For instance, if my model performs worse than a naïve classifier, it clearly needs improvement.
I’m applying an SVM classifier here because I’m only really interested in what makes a doujinshi “good”, and so would only need to predict whether the doujinshi’s score lies above a certain threshold. In this case, I’ll be attempting to predict whether a doujin lies above or below the average score for all the galleries in my dataset (~4.28 out of 5.00). Predicting whether a gallery will be a 2 out of 5 or a 3 out of 5 is not relevant here, so my model is only concerned with the doujinshi being above-average. The features I will be using for my model are the following:
1. Filesize-per-image of the gallery: if the images in the image gallery for a doujinshi are of low resolution, it is unlikely to be rated highly. Conversely, if the images in the image gallery are of high resolution (or possibly in color), then the score might be higher. This value can be calculated using the ‘filecount’ and ‘filesize’ properties of the gallery metadata, on the assumption that all of the pages within a gallery are roughly consistent in size (filesize divided by filecount should produce this value).
2. Number of popular tags occurring in the doujin: tag frequencies across the entire dataset may indicate generally what people have a preference for. This can be calculated similarly to the most common unigrams used in Homework 4. The tags are included in the metadata of each gallery file, and can be compared with a list of “top tags” (sorted by frequency of appearance in doujin galleries). This feature also naturally includes the popularity of the artist. If the artist is prolific, they’re probably doing something right. The artist can be found within the tags of the metadata, and can be acquired by separating the artist tags from the rest of the tags. Galleries can have multiple artists if the work is an anthology or some other collaborative work, so that will be taken into consideration as well. Sifting through the popular tags and collecting only tags with ‘artist’ gets me the list of the most popular artists.
For my naïve classifiers to compare with my model, I will be using this feature for one:
1. Popularity of the series: if the source material is popular, that might be an indication of quality (i.e. Game of Thrones porn is probably better than Marvel: Inhumans porn).
And this feature for the other:
1. Popularity of the tags: just going off of all of the tags combined, if the tags in the gallery are among the most frequent tags, they should be favored/desired. This does not consider the filesize-per-image and is used to gauge whether the filesize feature is actually helping.
3. Model
The model I chose was a classifier that runs logistic regression on the features listed above to predict whether a doujin has an above-average rating. I’m using this model because it was effective and relatively simple to implement. The data from the exhentai.org dataset suits this model because of its similarity to the beer review dataset we worked with, and the statistics found during my exploration seemed to indicate that the elements I’m using to predict ratings are informative (or at least differ enough in frequency to make an impact). I constructed the feature vector using the three features listed above for my model (justifications for each feature are given above as well). Calculating the filesize-per-image was simply a division of the total file size by the file count, with the resulting number of bytes converted into megabytes. The tag popularity features were calculated by counting the number of popular tags/artists/etc. that appeared in a given doujin, and incrementing the “popularity value” of the feature by 1
every time a popular artist or tag was encountered. My training-validation-test splits were assembled by first shuffling the doujins along with their scores, then cutting them in a 70-15-15 ratio: 70,000 doujins were used in the training set and 15,000 each in the validation and test sets. Not shuffling the doujin list biases the data because the doujins are ordered from most to least recent. Splitting them chronologically like this results in the classifier training on recent content and testing on old content; the model would overfit to the training set and perform poorly on the test set.
Optimizing the model took some tinkering with the number of “top tags/artists/series/etc.” I was considering. Using only the top 10 or top 50 did not provide enough information to the classifier, so I ended up calculating the feature vectors using the top 100 of each of the tag categories. I use the hyperparameter lambda = 1.0 because effectiveness drops off past this point. My two other models, based on the most popular series and most popular tags, did not perform as well. The series-only model performed worse than the content (tag)-only model, implying that a series’ popularity does not dictate the quality of its doujins. The tag-only model came closest to the performance of my model, but wasn’t able to beat its accuracy on the test set.
The first model (filesize-per-image and popular tags) relies on the idea that people like the content that is popular and that a higher filesize-per-image means that the image resolution and quality of the image are better. While this approach is effective, it relies on the assumption that the gallery is
tagged accurately and that a high-resolution image doesn’t contain flaws in image clarity. Another complication is that the model does not consider censoring in the gallery, as regardless of image quality a censored doujinshi is not as preferable as an uncensored one. The second model (series-only) is pretty naïve and falls flat. The popularity (frequency) of a series just means that more doujins exist for it, and these doujins can vary wildly in quality, especially for older series. The third model (tag-only) served as a test for the first model and carries similar pros and cons. The lack of filesize-per-image means that the classifier simply has less to go on, and the accuracy suffered as a result. The numerical results and the conclusions drawn from them are included in section 5 (Results).
4. Pornographic (Academic) Literature
Academic data-driven research on porn isn’t a subject most scholars are willing to cover, and research on obscure Japanese pornographic comics even less so. In a more general sense, however, data science/analytics literature pertaining to porn does exist. Pornhub, for example, understands its position with regard to the amount of web traffic it receives. The site publishes various articles about what they’ve learned from their extensive dataset of porn and their users’ interactions with the site (https://www.pornhub.com/insights/). Pornhub’s less scholarly articles mainly focus on interesting search trends during certain times of the year or certain events, like holidays or natural disasters. Search terms
also factor into their analysis, delving into what people look for, both sexually and literally, in their pornography. For a more academic approach, Sexualitics is a dedicated collaboration between scholars that “tries to contribute to human sexuality understanding through a big data approach”. They release datasets and papers to help promote more discussion on what is otherwise a bit of a taboo subject. One of their studies from 2014 (http://sexualitics.org/wp-content/uploads/2014/08/mazieres_pornstudies_2014.pdf) comes fairly close to the type of analysis this assignment focuses on, namely frequency and exploration of tags in a porn site’s data (in their case, xHamster). Their study focused on network connectivity and categorizing tags into ethnic groups based on which regions searched for which terms. The conceit was that different types of people prefer different types of porn, and that this difference could be seen at a cultural level. Similar to the community-detection algorithms taught in this course, the Sexualitics study was able to cordon off certain fetishes that could be used to describe different ethnic or regional porn preferences. This type of research could improve the accuracy of targeted marketing on various porn sites and possibly increase revenues for sites that take heed of these lessons. The conclusions gleaned from the Sexualitics study roughly align with my own findings, as there were noticeable groupings within the exhentai.org dataset that could be further fleshed out. Just like in their community network, the metadata for exhentai.org galleries includes tag information that could possibly inform people of the types of tags that might be associated with the doujins created for a certain series of
anime or manga that the doujinshi is based on. For example, the series “Touhou Project”, identifiable among the doujins as those containing the tag “parody:touhou project”, contains a substantial amount of lesbian sex, and could be grouped in a community of other such series whose doujins produce similarly tagged content. Being porn, these doujinshi share a similarity with the videos hosted on xHamster (which were used in the Sexualitics paper), and can thus be used in a similar context for analysis and study.
5. Results
Predictive Task: Given the metadata of the gallery, predict whether the gallery rates above or below the average rating (~4.28/5.0).
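The feature construction and shuffled 70-15-15 split described in section 3, together with the above-average label this task needs, can be sketched as follows. This is a minimal sketch rather than the code behind the reported numbers: the 'rating' key, the leading constant (offset) feature, the fixed random seed, and the use of rounding to size the splits are assumptions, and `popular_tags` stands in for the top-100 tag list.

```python
import random
from typing import List, Sequence, Set, Tuple

AVERAGE_RATING = 4.28  # dataset average reported in Table 1


def popularity_count(tags: Sequence[str], popular_tags: Set[str]) -> int:
    """Add 1 for every tag of the gallery that is on the popular list."""
    return sum(1 for tag in tags if tag in popular_tags)


def feature_vector(gallery: dict, popular_tags: Set[str]) -> List[float]:
    """[offset, filesize-per-image in MB, number of popular tags]."""
    size_mb = int(gallery["filesize"]) / int(gallery["filecount"]) / (1024 * 1024)
    return [1.0, size_mb, float(popularity_count(gallery["tags"], popular_tags))]


def label(gallery: dict, threshold: float = AVERAGE_RATING) -> bool:
    """True if the gallery is rated above the dataset average."""
    return float(gallery["rating"]) > threshold  # 'rating' key is an assumption


def shuffled_split(
    examples: Sequence, seed: int = 0, train: float = 0.70, valid: float = 0.15
) -> Tuple[list, list, list]:
    """Shuffle first (the raw list is ordered by recency), then cut 70-15-15."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

Shuffling a copy with a seeded `random.Random` keeps the split reproducible without disturbing the original, recency-ordered list.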
Model 1 Performance (Filesize-per-image and Top 100 Popular Tags)
Set       Training  Validation  Test
Accuracy  0.737     0.738       0.739

Model 2 Performance (Top 100 Series)
Set       Training  Validation  Test
Accuracy  0.684     0.676       0.681

Model 3 Performance (Top 100 Tags)
Set       Training  Validation  Test
Accuracy  0.707     0.703       0.706
My proposed model (Model 1) came out on top over the two naïve classifiers, so it seems to be doing something right. At the very least, these results mean that my model can predict whether a doujin rates above or below the average with about 74% accuracy. Model 1’s performance over Model 3 indicates that my filesize-per-image feature makes the predictions more accurate, specifically by around 3 percentage points (0.739 vs. 0.706 on the test set). The performance of Model 3 is quite impressive though, as tags alone can predict whether the score is above average with 71% accuracy. If anything, that shows how much tags affect people’s enjoyment and rating of the doujinshi. Model 2 performed the worst, possibly for the reasons mentioned in section 3.
As a whole, Model 1 performed well because its features best matched the reasons someone would rate a doujinshi highly. People’s porn preferences can be highly specific, and if a doujin’s tags “hit the spot”, so to speak, then the doujin is likely to do well. On a general level, people want higher-definition media, and the higher the resolution of the images, the better the doujin can be enjoyed. Adding specific features for artists or series diluted the accuracy of the predictor during testing. These results suggest that (possibly unsurprisingly) what matters most in a doujinshi’s rating are the image quality and the act(s) of sex itself, with the original series and art style coming second.
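The comparisons in this section come down to two small computations: accuracy against a trivial baseline, and the gap between two accuracies expressed in percentage points. The following is a minimal sketch; the majority-label baseline is my own illustration of a "naïve classifier" and is not one of the three models reported above, while the test accuracies in `reported_test` are copied from the tables.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def majority_label(train_labels: Sequence[bool]) -> bool:
    """Trivial baseline: always predict the most common training label."""
    return 2 * sum(train_labels) >= len(train_labels)


def gap_in_points(acc_a: float, acc_b: float) -> float:
    """Difference between two accuracies in percentage points."""
    return round((acc_a - acc_b) * 100, 1)


# Test accuracies as reported in the tables above.
reported_test = {"model_1": 0.739, "model_2": 0.681, "model_3": 0.706}
print(gap_in_points(reported_test["model_1"], reported_test["model_3"]))
print(gap_in_points(reported_test["model_1"], reported_test["model_2"]))
```

A model is only worth keeping if its test accuracy clears the majority-label baseline computed on the same split.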