I'm in the middle of receiving my first data analysis contract :) Just waiting on the red tape getting sorted out so I can get started on the analysis proper. I have also finished some tutorials for something else I wanted to be able to add to my resume.
I recently discovered this neat text analysis tool for Japanese aimed at learners. It does your basic wordcount, kanji count for good measure, and grade level... and it also has an option for user-based readability scores, if you happen to have a list of words you know.
I don't have a complete list (who does?), but I do have a flashcard set that I can export to a CSV file, so I ran it on that. This produced odd results at first - it gave Satton's book a 92% readability score, which is way too low. Luckily, you can also remove words from the output using the same kind of list, so after some clean-up (adding in names, common words I know that don't happen to be in the flashcards, words I don't need to memorize because they're katakana'd English, conversational noises, phrases that the parser just deals with weirdly, etc.), it looked more reasonable.
So I compared Satoko Challenge vs a short story I have, 記憶の森の魔女 (The Witch in the Forest of Memory), vs the first Onmyouji book vs a novel called Skate Boys, which I found while trying to find Satton's book via the bad Kobo search. (Originally I used samples of the latter two, but then I went ahead and bought them anyway, so I switched to the full texts - though the samples had similar results to the full texts.) Results:
Satoko Challenge - 8th grade, 98.31% of all words known, 87.41% of unique words known.
The Witch - 8th grade, 94.37% of all words known, 79.11% of unique words known.
Skate Boys - 9th grade, 93.27% of all words known, 56.85% of unique words known.
Onmyouji - 9th grade, 89.94% of all words known, 51.70% of unique words known.
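For the curious, here's roughly how I imagine those two percentages get calculated. This is only a sketch in Python, not the tool's actual code: the function name `coverage`, the already-split word list, and the way the ignore list works are all my own assumptions (real Japanese text would need a tokenizer to split it into words first).

```python
from collections import Counter

def coverage(tokens, known, ignored=frozenset()):
    """Return (% of all words known, % of unique words known).

    tokens  - words of the text, already split (hypothetical input;
              real Japanese text needs a tokenizer first)
    known   - set of words I know, e.g. from a flashcard export
    ignored - words dropped from the count entirely (names, noises, etc.)
    """
    counts = Counter(t for t in tokens if t not in ignored)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    # Every occurrence of a known word counts toward "all words"
    known_total = sum(n for word, n in counts.items() if word in known)
    # Each distinct known word counts once toward "unique words"
    known_unique = sum(1 for word in counts if word in known)
    return 100 * known_total / total, 100 * known_unique / len(counts)

# Toy example: one frequent known word pulls up the "all words" number
tokens = ["猫", "猫", "猫", "犬", "魔女", "森"]
all_pct, unique_pct = coverage(tokens, known={"猫", "犬"})
print(f"{all_pct:.2f}% of all words, {unique_pct:.2f}% of unique words")
```

If it works anything like this, that would explain why the unique-word numbers are so much lower: the words I know are mostly the ones that show up over and over.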
Now, there's probably still some wiggle room left in all of these results, and Skate Boys doesn't feel too difficult to read, but for reference: I found about 114 pages-worth of text in Satoko Challenge (minus the pictures, English phrases she explains at the back, etc - it's not a long book even in print). When I counted, I had 176 sentences/phrases that I pulled from it, plus some more I'd already made into flashcards, so let's round up to 200. Many sentences/phrases I use for studying have one unfamiliar word or grammar point in them, but sometimes it's a word plus a grammar point or two words, so let's say 1.5 new(-ish) to me things per sentence. That's 200 x 1.5 = 300 things over 114 pages, or about 3 per page of Satoko Challenge.
With Skate Boys, I've read 7.2 pages-worth so far and grabbed 39 new sentences out of that, so with the same estimation math, that's about 8 unknown words per page! A quick check of the first few pages says that's maybe a little high, but not completely off. It doesn't feel like quite that many while reading... which is a good thing, I guess. (And I usually understand enough of the context/kanji to take a stab at the meaning, even if the word is new to me.) Now I'm curious about the numbers for The Witch in the Forest of Memory, but I only highlighted words in that one b/c of my wrist issues, so I don't have a count at the moment. I have also looked through the first bit of Onmyouji and yeah, as expected from a book set in the Heian era, there's an uptick in difficulty.
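Here's that back-of-the-envelope math written out in Python so I can reuse it for the next book. The 1.5 new things per sentence is my own rough guess from above, and the function name is just something I made up for this sketch.

```python
def unknowns_per_page(sentences, pages, new_items_per_sentence=1.5):
    """Estimate unknown words/grammar points per page.

    new_items_per_sentence is a rough guess, not a measured value.
    """
    return sentences * new_items_per_sentence / pages

# Sentence and page counts are the ones from this post
satoko = unknowns_per_page(sentences=200, pages=114)
skate_boys = unknowns_per_page(sentences=39, pages=7.2)

print(f"Satoko Challenge: {satoko:.1f} per page")
print(f"Skate Boys: {skate_boys:.1f} per page")
```

That comes out to about 2.6 for Satoko Challenge and 8.1 for Skate Boys, which I've been rounding to 3 and 8.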
Of course, I could just... look at book samples and use that to decide what's the right level for me rather than going through the hassle of exporting them through Calibre to run a program on them, but I think it's an interesting tool nonetheless. My first thought was that I could download a bunch of fic from Pixiv, or maybe Python docs if I'm feeling boring and want to learn technical words, and use this to decide what's the easiest to read first. Too bad it can't tell me which is the good fic, though.