Considering all of the complaints, I'm curious why someone hasn't put out a "call to programmers," because the easiest way to counter content thieves is to glut the market with unique content protection. Make it so that they need to do far more manual work just to clean everything up. Basically, we should work together with these kinds of goals:

- Make it easy to use, so that even plebeian nonprogrammers can use it. I'm thinking like... Ctrl + Alt + F4 + Home + Numpad-0 kind of easy.
- Bake in a bunch of encoding strategies so that content thieves can't just use a silver bullet to clear everything.
- Make it so that the original text is reproduced as faithfully as possible to the eye. We're not trying to take the translated stuff and Google Translate it back to moon runes.
- Make it so that we're not accountable if it doesn't work. If people complain, shrug them off, since we'd be doing it for free. Programming isn't that easy.
- Convenience should not preempt security. I believe the program shouldn't phone home at all. Without a central figure to guide its development, who's to stop someone malicious from making an innocent "check for updates" send the translator's hard work back instead?
- Other groups have already created their own content protection. Let them keep it proprietary, and we'll roll our own using common protection schemes. The reason will be explained below, and it lets them add extra noise to our obfuscation efforts.

Basically, my idea is this:

- A program which you copy and paste text into, or which reads your clipboard. It'll encrypt the text somehow, and it'll spit out a usable encoded output that you just paste back onto your content form.
- It has to have at least 10 encoding schemes. For instance: random insertion of site messages, div and p swapping, etc.
- Here's the trick: every time the translator goes to publish new content, they'll use this program and it will randomly choose 1 or 2 encoding schemes.
The others will be held in reserve for future publications. Why is this clever? For those of us who have made content scrapers before: you generally mess with regexes and the like. That means we need to specifically tailor a scraper to a website, and we generally don't count on its appearance changing much from post to post. If it changes, it's usually something minor. So if the website keeps changing up its encoding scheme with every post, you either need to figure out a pattern that works with everything, or you need to readjust the scraper every time while also manually cleaning up the scraped content. Now let 20 websites use this strategy and it can quickly become rough, since there are 10 + 45 + 120 = 175 combinations you can come up with if you randomly choose up to 3 of the 10 encoding schemes. Let's make it so that if the thieves plan to content scrape everyone, it becomes a full-time job for them. We're going to run into a bunch of problems:

- We need to figure out what platforms people use to host their translations. Self-hosted would be great, but not everyone can handle a web server, and one of our goals is to flood the market.
- Following from the above, ideally it should be able to use JavaScript and CSS to broaden our possible encryption schemes. It might even go so far as to create images.
- We need a programming language that eventually spits out an application anyone can use. It'd be great if it can integrate into WordPress or the like, since the encryption would then come free.
- Some encoding schemes will conflict with each other. Others are so similar that they can be dealt with together. We need to come up with a bunch of orthogonal schemes that can work together.
- Hasn't someone come up with something like this already? I haven't checked, but it'd be great if we don't have to reinvent the wheel; we'd just need to enhance it.
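A minimal Python sketch of the per-post scheme rotation described above. The scheme names here are made-up placeholders for illustration, not a real library:

```python
# Hedged sketch: rotate encoding schemes per publication.
# Every scheme name below is a hypothetical placeholder.
import math
import random

SCHEMES = [
    "site_message_insertion", "div_p_swap", "font_color_shift",
    "zero_width_noise", "homoglyph_swap", "span_splitting",
    "css_reordering", "decoy_paragraphs", "class_name_shuffle",
    "attribute_noise",
]

# Choosing 1, 2, or 3 schemes out of 10 gives
# C(10,1) + C(10,2) + C(10,3) = 10 + 45 + 120 = 175 combinations.
total_combos = sum(math.comb(len(SCHEMES), k) for k in (1, 2, 3))
print(total_combos)  # 175

def pick_schemes_for_post(low=1, high=2, rng=None):
    """Randomly pick how many schemes this post gets, then which ones."""
    rng = rng or random
    k = rng.randint(low, high)
    return rng.sample(SCHEMES, k)
```

Each post ends up with an unpredictable subset, which is what forces the scraper maintenance cost the post is counting on.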
tldr; Get a bunch of programmers to work together to develop a content-protection program that's widely available, well-featured, and easy for translators to use.
I've done very little with web servers and near zero security (student), but the challenges that stand out to me are: if it doesn't decrypt automatically (in which case it might as well not be there), then you're just driving away tons of readers. Anyone who doesn't use NU definitely wouldn't be in the know, which means you potentially lose all readers using aggregators anyway. Also, a lot of TLers have tried adding stuff, but it usually just loses them loyal readers who now prefer the clean scraped aggregator. I'll watch to see if anything comes out of this, though.
Basically, you're saying: encrypt the data and make the key public. In half the time it takes you to create the encryption software and the app for normal users to decrypt, the aggregators will create code to steal, decrypt, and post automatically. Then you'll have aggregators that are easier to read than the main site. The best advice I can give is to spread the word about NU, because with the existence of this type of tracker, aggregators become futile.
Nope. I say encryption because that's what it essentially boils down to when you mess with the vanilla text. But it's really just adding color tags, switching font colors, inserting extra text. Nothing that requires mathematicians and rocket science, just lots of text manipulation. You'd be surprised how many would volunteer if you just ask and don't demand that they cure cancer and broker world peace while they're at it. Keep it reasonable, an hour at most... It's not encryption in the prime-number-games sense. It's just adding invisible noise that humans won't notice but bots will definitely get hit with.
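To make the "invisible noise" idea concrete, here's a minimal Python sketch (my own illustration, not an existing tool) that sprinkles zero-width Unicode characters into text. Readers see nothing different, but a bot doing exact string matching or naive regex extraction gets different bytes on every run:

```python
# Hedged sketch: insert zero-width characters as human-invisible noise.
import random

ZERO_WIDTH = ("\u200b", "\u200c", "\u200d")  # ZWSP, ZWNJ, ZWJ

def add_noise(text, rate=0.3, rng=None):
    """Append a random zero-width character after some characters.

    `rate` is the per-character probability of inserting noise.
    """
    rng = rng or random.Random()
    out = []
    for ch in text:
        out.append(ch)
        if ch != " " and rng.random() < rate:
            out.append(rng.choice(ZERO_WIDTH))
    return "".join(out)

def strip_noise(text):
    """What a scraper would have to do to clean it back up."""
    for zw in ZERO_WIDTH:
        text = text.replace(zw, "")
    return text
```

Of course, once thieves know this particular trick, stripping zero-width characters is a one-liner, which is exactly why the original post argues for rotating many orthogonal schemes rather than relying on any single one.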
I have no idea about this subject, but it sounds good? I don't know. But it got me wondering: if it's this easy for such sites to protect themselves, then why, for example, are there still pirate websites in China that pirate novels from other platforms? The companies that run those platforms make millions and could pay to implement this. I mean, it's not like they want their novels stolen.
Sounds like it would be too much work for little to no benefit in the long run. It probably wouldn't take too long for a workaround to come out, especially if we were to source it to community developers. Also, the text colour thing is really annoying for any user who tries to view the chapter with the "make site mobile friendly" feature that comes in most mobile browsers.
In China, they have the great firewall. It's easier for the huge companies to shut down the pirate sites, but they pop up as fast as they go down. Or so hearsay told me, anyway.
An "A" is ASCII code 65 no matter whether the font is huge red or small pink; the fact that you think that would affect the data reveals your lack of knowledge.
Because they're struggling to protect themselves. So they put in a ton of money to develop a bigger club. My solution is different in that I'll be giving a bunch of people sticks. It definitely isn't 100% effective. Maybe only 20%. But it also means that the content scrapers will be getting hit by something harder than a pillow so it might bruise.
Have you heard about the thing called OCR? Because it exists, and all this will just make the bots switch to OCR.
It's been tried before. I remember a certain site posted pictures of text with crazy backgrounds, so entire chapters were like CAPTCHA puzzles. In the end it was basically unreadable (and didn't adapt to differently sized windows either.) Didn't work with screen reader software either, so if you were vision impaired you were out of luck. I understand making it hard for ripoff sites, but readers were probably suffering even more. By the way, the black text in your post is almost unreadable in night mode. Not sure if that was intentional or not.
Funnily enough, Qidian (China) itself is hit the hardest by pirates, far more than many other sites xd
*shrug* And to combat OCR, you use watermarks and... surprisingly, text color gradients. Plus it's not perfect; I had to hand-edit quite a bit when I used OCR software. Or are those thieves going to leave their readers holding the bag? I'm impressed. Next, you should learn how to count in binary. It goes 0, 1, 10, 11, 100, 101... you'll be a programmer yet. Anyhow, JavaScript fixes that problem. Go ask GM_Rusaku if you don't believe me.
The Great Firewall doesn't remove sites; it's the Chinese service providers that block access to certain foreign sites. If the site is hosted on a Chinese server, it's removed by the hosting provider after a DMCA notice, not blocked at the firewall.
Aye, that's true. But I find it surprising, because they pirate almost every novel there, and I literally couldn't find the same treatment for any other site; the ones that do get pirated usually stop being pirated after the content goes paid... Maybe I need to dig harder, but hmm.
The thing is, even if we rule out inconvenience for the user... encrypting or doing anything else you want to protect your content is tied to the resources used on the web server. You want strongly encrypted content? Sure, but how many resources does that need? Resources are money. If it needs more resources than a normal WordPress setup, they need to spend more on the server. The question is: can the community afford that? Unless you're backed by big finances like WW or QI, I don't think the benefit of encrypted content can outweigh the cost.
Pretty certain most translators don't even consider this. Those that do might want to figure out what's leeching their bandwidth, because it sure as heck isn't the text. But yeah, I'll ballpark it at like 3 times more resources. However, text content is tiny relative to pictures and such, so you might only see an increase of about 10% bandwidth.